1
|
Lu X, Xie L, Xu L, Mao R, Xu X, Chang S. Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph. Comput Struct Biotechnol J 2024; 23:1666-1679. [PMID: 38680871 PMCID: PMC11046066 DOI: 10.1016/j.csbj.2024.04.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Revised: 04/01/2024] [Accepted: 04/10/2024] [Indexed: 05/01/2024] Open
Abstract
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, mono-modal learning is inherently limited as it relies solely on a single modality of molecular representation, which restricts a comprehensive understanding of drug molecules. To overcome the limitations, we propose a multimodal fused deep learning (MMFDL) model to leverage information from different molecular representations. Specifically, we construct a triple-modal learning model by employing Transformer-Encoder, Bidirectional Gated Recurrent Unit (BiGRU), and graph convolutional network (GCN) to process three modalities of information from chemical language and molecular graph: SMILES-encoded vectors, ECFP fingerprints, and molecular graphs, respectively. We evaluate the proposed triple-modal model using five fusion approaches on six molecule datasets, including Delaney, Llinas2020, Lipophilicity, SAMPL, BACE, and pKa from DataWarrior. The results show that the MMFDL model achieves the highest Pearson coefficients, and stable distribution of Pearson coefficients in the random splitting test, outperforming mono-modal models in accuracy and reliability. Furthermore, we validate the generalization ability of our model in the prediction of binding constants for protein-ligand complex molecules, and assess the resilience capability against noise. Through analysis of feature distributions in chemical space and the assigned contribution of each modal model, we demonstrate that the MMFDL model shows the ability to acquire complementary information by using proper models and suitable fusion approaches. By leveraging diverse sources of bioinformatics information, multimodal deep learning models hold the potential for successful drug discovery.
Collapse
Affiliation(s)
- Xiaohua Lu
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Liangxu Xie
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Rongzhi Mao
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Xiaojun Xu
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Shan Chang
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
| |
Collapse
|
2
|
Oh M, Shen M, Liu R, Stavitskaya L, Shen J. Machine Learned Classification of Ligand Intrinsic Activities at Human μ-Opioid Receptor. ACS Chem Neurosci 2024. [PMID: 38990780 DOI: 10.1021/acschemneuro.4c00212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/13/2024] Open
Abstract
Opioids are small-molecule agonists of μ-opioid receptor (μOR), while reversal agents such as naloxone are antagonists of μOR. Here, we developed machine learning (ML) models to classify the intrinsic activities of ligands at the human μOR based on the SMILES strings and two-dimensional molecular descriptors. We first manually curated a database of 983 small molecules with measured Emax values at the human μOR. Analysis of the chemical space allowed identification of dominant scaffolds and structurally similar agonists and antagonists. Decision tree models and directed message passing neural networks (MPNNs) were then trained to classify agonistic and antagonistic ligands. The hold-out test AUCs (areas under the receiver operator curves) of the extra-tree (ET) and MPNN models are 91.5 ± 3.9% and 91.8 ± 4.4%, respectively. To overcome the challenge of a small data set, a student-teacher learning method called tritraining with disagreement was tested using an unlabeled data set comprised of 15,816 ligands of human, mouse, and rat μOR, κOR, and δOR. We found that the tritraining scheme was able to increase the hold-out AUC of MPNN models to as high as 95.7%. Our work demonstrates the feasibility of developing ML models to accurately predict the intrinsic activities of μOR ligands, even with limited data. We envisage potential applications of these models in evaluating uncharacterized substances for public safety risks and discovering new therapeutic agents to counteract opioid overdoses.
Collapse
Affiliation(s)
- Myongin Oh
- Division of Applied Regulatory Science, Office of Clinical Pharmacology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, Maryland 20993, United States
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
| | - Maximilian Shen
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland 20742, United States
| | - Ruibin Liu
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
| | - Lidiya Stavitskaya
- Division of Applied Regulatory Science, Office of Clinical Pharmacology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, Maryland 20993, United States
| | - Jana Shen
- Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, United States
| |
Collapse
|
3
|
Walter M, Borghardt JM, Humbeck L, Skalic M. Multi-Task ADME/PK prediction at industrial scale: leveraging large and diverse experimentaldatasets. Mol Inform 2024:e202400079. [PMID: 38973777 DOI: 10.1002/minf.202400079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Revised: 04/10/2024] [Accepted: 05/04/2024] [Indexed: 07/09/2024]
Abstract
ADME (Absorption, Distribution, Metabolism, Excretion) properties are key parameters to judge whether a drug candidate exhibits a desired pharmacokinetic (PK) profile. In this study, we tested multi-task machine learning (ML) models to predict ADME and animal PK endpoints trained on in-house data generated at Boehringer Ingelheim. Models were evaluated both at the design stage of a compound (i. e., no experimental data of test compounds available) and at testing stage when a particular assay would be conducted (i. e., experimental data of earlier conducted assays may be available). Using realistic time-splits, we found a clear benefit in performance of multi-task graph-based neural network models over single-task model, which was even stronger when experimental data of earlier assays is available. In an attempt to explain the success of multi-task models, we found that especially endpoints with the largest numbers of data points (physicochemical endpoints, clearance in microsomes) are responsible for increased predictivity in more complex ADME and PK endpoints. In summary, our study provides insight into how data for multiple ADME/PK endpoints in a pharmaceutical company can be best leveraged to optimize predictivity of ML models.
Collapse
Affiliation(s)
- Moritz Walter
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
| | - Jens M Borghardt
- Drug Discovery Sciences Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
| | - Lina Humbeck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
| | - Miha Skalic
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
| |
Collapse
|
4
|
Shi S, Fu L, Yi J, Yang Z, Zhang X, Deng Y, Wang W, Wu C, Zhao W, Hou T, Zeng X, Lyu A, Cao D. ChemFH: an integrated tool for screening frequent false positives in chemical biology and drug discovery. Nucleic Acids Res 2024; 52:W439-W449. [PMID: 38783035 PMCID: PMC11223804 DOI: 10.1093/nar/gkae424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Revised: 04/25/2024] [Accepted: 05/10/2024] [Indexed: 05/25/2024] Open
Abstract
High-throughput screening rapidly tests an extensive array of chemical compounds to identify hit compounds for specific biological targets in drug discovery. However, false-positive results disrupt hit compound screening, leading to wastage of time and resources. To address this, we propose ChemFH, an integrated online platform facilitating rapid virtual evaluation of potential false positives, including colloidal aggregators, spectroscopic interference compounds, firefly luciferase inhibitors, chemical reactive compounds, promiscuous compounds, and other assay interferences. By leveraging a dataset containing 823 391 compounds, we constructed high-quality prediction models using multi-task directed message-passing network (DMPNN) architectures combining uncertainty estimation, yielding an average AUC value of 0.91. Furthermore, ChemFH incorporated 1441 representative alert substructures derived from the collected data and ten commonly used frequent hitter screening rules. ChemFH was validated with an external set of 75 compounds. Subsequently, the virtual screening capability of ChemFH was successfully confirmed through its application to five virtual screening libraries. Furthermore, ChemFH underwent additional validation on two natural products and FDA-approved drugs, yielding reliable and accurate results. ChemFH is a comprehensive, reliable, and computationally efficient screening pipeline that facilitates the identification of true positive results in assays, contributing to enhanced efficiency and success rates in drug discovery. ChemFH is freely available via https://chemfh.scbdd.com/.
Collapse
Affiliation(s)
- Shaohua Shi
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
| | - Li Fu
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Jiacai Yi
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Ziyi Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Xiaochen Zhang
- School of Information Technology, Shangqiu Normal University, Shangqiu, Henan 476000, P.R. China
| | - Youchao Deng
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Wenxuan Wang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Chengkun Wu
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Wentao Zhao
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P.R. China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P.R. China
| | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| |
Collapse
|
5
|
Fu L, Shi S, Yi J, Wang N, He Y, Wu Z, Peng J, Deng Y, Wang W, Wu C, Lyu A, Zeng X, Zhao W, Hou T, Cao D. ADMETlab 3.0: an updated comprehensive online ADMET prediction platform enhanced with broader coverage, improved performance, API functionality and decision support. Nucleic Acids Res 2024; 52:W422-W431. [PMID: 38572755 PMCID: PMC11223840 DOI: 10.1093/nar/gkae236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 03/10/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
ADMETlab 3.0 is the second updated version of the web server that provides a comprehensive and efficient platform for evaluating ADMET-related parameters as well as physicochemical properties and medicinal chemistry characteristics involved in the drug discovery process. This new release addresses the limitations of the previous version and offers broader coverage, improved performance, API functionality, and decision support. For supporting data and endpoints, this version includes 119 features, an increase of 31 compared to the previous version. The updated number of entries is 1.5 times larger than the previous version with over 400 000 entries. ADMETlab 3.0 incorporates a multi-task DMPNN architecture coupled with molecular descriptors, a method that not only guaranteed calculation speed for each endpoint simultaneously, but also achieved a superior performance in terms of accuracy and robustness. In addition, an API has been introduced to meet the growing demand for programmatic access to large amounts of data in ADMETlab 3.0. Moreover, this version includes uncertainty estimates in the prediction results, aiding in the confident selection of candidate compounds for further studies and experiments. ADMETlab 3.0 is publicly for access without the need for registration at: https://admetlab3.scbdd.com.
Collapse
Affiliation(s)
- Li Fu
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Shaohua Shi
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
| | - Jiacai Yi
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Ningning Wang
- Xiangya Hospital of Central South University, Changsha, Hunan 410008, P.R. China
| | - Yuanhang He
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Zhenxing Wu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P.R. China
| | - Jinfu Peng
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Youchao Deng
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Wenxuan Wang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| | - Chengkun Wu
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Kowloon, Hong Kong SAR, 999077, P.R. China
| | - Xiangxiang Zeng
- Department of Computer Science, Hunan University, Changsha, Hunan 410082, P.R. China
| | - Wentao Zhao
- School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, P.R. China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P.R. China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P.R. China
| |
Collapse
|
6
|
McDonald MA, Koscher BA, Canty RB, Jensen KF. Calibration-free reaction yield quantification by HPLC with a machine-learning model of extinction coefficients. Chem Sci 2024; 15:10092-10100. [PMID: 38966367 PMCID: PMC11220585 DOI: 10.1039/d4sc01881h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Accepted: 05/19/2024] [Indexed: 07/06/2024] Open
Abstract
Reaction optimization and characterization depend on reliable measures of reaction yield, often measured by high-performance liquid chromatography (HPLC). Peak areas in HPLC chromatograms are correlated to analyte concentrations by way of calibration standards, typically pure samples of known concentration. Preparing the pure material required for calibration runs can be tedious for low-yielding reactions and technically challenging at small reaction scales. Herein, we present a method to quantify the yield of reactions by HPLC without needing to isolate the product(s) by combining a machine learning model for molar extinction coefficient estimation, and both UV-vis absorption and mass spectra. We demonstrate the method for a variety of reactions important in medicinal and process chemistry, including amide couplings, palladium catalyzed cross-couplings, nucleophilic aromatic substitutions, aminations, and heterocycle syntheses. The reactions were all performed using an automated synthesis and isolation platform. Calibration-free methods such as the presented approach are necessary for such automated platforms to be able to discover, characterize, and optimize reactions automatically.
Collapse
Affiliation(s)
- Matthew A McDonald
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Brent A Koscher
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Richard B Canty
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| | - Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge Massachusetts 02139 USA
| |
Collapse
|
7
|
Avchaciov K, Clay KJ, Denisov K, Burmistrova O, Petrascheck M, Fedichev P. Multiple Targets, One Goal: Compounding life-extending effects through Polypharmacology. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.23.600269. [PMID: 38979167 PMCID: PMC11230182 DOI: 10.1101/2024.06.23.600269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Analysis of lifespan-extending compounds suggested the most effective geroprotectors target multiple biogenic amine receptors. To test this hypothesis, we used graph neural networks to predict such polypharmacological compounds and evaluated them in C. elegans . Over 70% of the selected compounds extended lifespan, with effect sizes in the top 5% compared to the DrugAge database. This reveals that rationally designing polypharmacological compounds enables the design of geroprotectors with exceptional efficacy. Key takeaways The most effective known geroprotectors act by polypharmacological mechanisms.Graph neural networks predicted polypharmacological geroprotectors with a hit rate of 70%.The predicted polypharmacological geroprotectors are exceptionally effective.The predicted polypharmacological mechanism was experimentally confirmed.Rationally designing polypharmacological compounds results in geroprotectors with exceptional efficacy.
Collapse
|
8
|
Arvidsson McShane S, Norinder U, Alvarsson J, Ahlberg E, Carlsson L, Spjuth O. CPSign: conformal prediction for cheminformatics modeling. J Cheminform 2024; 16:75. [PMID: 38943219 PMCID: PMC11214261 DOI: 10.1186/s13321-024-00870-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 06/11/2024] [Indexed: 07/01/2024] Open
Abstract
Conformal prediction has seen many applications in pharmaceutical science, being able to calibrate outputs of machine learning models and producing valid prediction intervals. We here present the open source software CPSign that is a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods are supported via an extension mechanism, e.g. DeepLearning4J models. We also describe features for visualizing results from conformal models including calibration and efficiency plots, as well as features to publish predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches including random forest, and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparative predictive efficiency, with superior runtime and lower hardware requirements compared to neural network based models. CPSign has been used in several studies and is in production-use in multiple organizations. The ability to work directly with chemical input files, perform descriptor calculation and modeling with SVM in the conformal prediction framework, with a single software package having a low footprint and fast execution time makes CPSign a convenient and yet flexible package for training, deploying, and predicting on chemical data. CPSign can be downloaded from GitHub at https://github.com/arosbio/cpsign .Scientific contribution CPSign provides a single software that allows users to perform data preprocessing, modeling and make predictions directly on chemical structures, using conformal and probabilistic prediction. Building and evaluating new models can be achieved at a high abstraction level, without sacrificing flexibility and predictive performance-showcased with a method evaluation against contemporary modeling approaches, where CPSign performs on par with a state-of-the-art deep learning based model.
Collapse
Affiliation(s)
- Staffan Arvidsson McShane
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, 75124, Sweden
| | - Ulf Norinder
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, 75124, Sweden
- Department of Computer and Systems Sciences, Stockholm University, Stockholm, 10587, Sweden
- MTM Research Centre, School of Science and Technology, Örebro University, Örebro, 70182, Sweden
| | - Jonathan Alvarsson
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, 75124, Sweden
| | - Ernst Ahlberg
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, 75124, Sweden
- Department of Computer Science, Royal Holloway University of London, Egham, TW20 0EX, UK
| | - Lars Carlsson
- Department of Computer Science, Royal Holloway University of London, Egham, TW20 0EX, UK
- Department of Computing, Jönköping University, Jönköping, 55111, Sweden
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, 75124, Sweden.
| |
Collapse
|
9
|
Ginex T, Vázquez J, Estarellas C, Luque FJ. Quantum mechanical-based strategies in drug discovery: Finding the pace to new challenges in drug design. Curr Opin Struct Biol 2024; 87:102870. [PMID: 38914031 DOI: 10.1016/j.sbi.2024.102870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2024] [Revised: 06/02/2024] [Accepted: 06/04/2024] [Indexed: 06/26/2024]
Abstract
The expansion of the chemical space to tangible libraries containing billions of synthesizable molecules opens exciting opportunities for drug discovery, but also challenges the power of computer-aided drug design to prioritize the best candidates. This directly hits quantum mechanics (QM) methods, which provide chemically accurate properties, but subject to small-sized systems. Preserving accuracy while optimizing the computational cost is at the heart of many efforts to develop high-quality, efficient QM-based strategies, reflected in refined algorithms and computational approaches. The design of QM-tailored physics-based force fields and the coupling of QM with machine learning, in conjunction with the computing performance of supercomputing resources, will enhance the ability to use these methods in drug discovery. The challenge is formidable, but we will undoubtedly see impressive advances that will define a new era.
Collapse
Affiliation(s)
- Tiziana Ginex
- Pharmacelera, Parc Científic de Barcelona (PCB), Baldiri Reixac 4-8, 08028 Barcelona, Spain
| | - Javier Vázquez
- Pharmacelera, Parc Científic de Barcelona (PCB), Baldiri Reixac 4-8, 08028 Barcelona, Spain; Departament de Nutrició, Ciències de l'Alimentació i Gastronomia, Universitat de Barcelona, Institut de Biomedicina (IBUB), 08921 Santa Coloma de Gramenet, Spain; Institut de Biomedicina (IBUB), 08921 Santa Coloma de Gramenet, Spain
| | - Carolina Estarellas
- Departament de Nutrició, Ciències de l'Alimentació i Gastronomia, Universitat de Barcelona, Institut de Biomedicina (IBUB), 08921 Santa Coloma de Gramenet, Spain; Institut de Química Teòrica i Computacional (IQTCUB), 08921 Santa Coloma de Gramenet, Spain
| | - F Javier Luque
- Departament de Nutrició, Ciències de l'Alimentació i Gastronomia, Universitat de Barcelona, Institut de Biomedicina (IBUB), 08921 Santa Coloma de Gramenet, Spain; Institut de Biomedicina (IBUB), 08921 Santa Coloma de Gramenet, Spain; Institut de Química Teòrica i Computacional (IQTCUB), 08921 Santa Coloma de Gramenet, Spain.
| |
Collapse
|
10
|
Crouzet A, Lopez N, Riss Yaw B, Lepelletier Y, Demange L. The Millennia-Long Development of Drugs Associated with the 80-Year-Old Artificial Intelligence Story: The Therapeutic Big Bang? Molecules 2024; 29:2716. [PMID: 38930784 PMCID: PMC11206022 DOI: 10.3390/molecules29122716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 05/30/2024] [Accepted: 05/31/2024] [Indexed: 06/28/2024] Open
Abstract
The journey of drug discovery (DD) has evolved from ancient practices to modern technology-driven approaches, with Artificial Intelligence (AI) emerging as a pivotal force in streamlining and accelerating the process. Despite the vital importance of DD, it faces challenges such as high costs and lengthy timelines. This review examines the historical progression and current market of DD alongside the development and integration of AI technologies. We analyse the challenges encountered in applying AI to DD, focusing on drug design and protein-protein interactions. The discussion is enriched by presenting models that put forward the application of AI in DD. Three case studies are highlighted to demonstrate the successful application of AI in DD, including the discovery of a novel class of antibiotics and a small-molecule inhibitor that has progressed to phase II clinical trials. These cases underscore the potential of AI to identify new drug candidates and optimise the development process. The convergence of DD and AI embodies a transformative shift in the field, offering a path to overcome traditional obstacles. By leveraging AI, the future of DD promises enhanced efficiency and novel breakthroughs, heralding a new era of medical innovation even though there is still a long way to go.
Collapse
Affiliation(s)
- Aurore Crouzet
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
| | - Nicolas Lopez
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- ENOES, 62 Rue de Miromesnil, 75008 Paris, France
- Unité Mixte de Recherche «Institut de Physique Théorique (IPhT)» CEA-CNRS, UMR 3681, Bat 774, Route de l’Orme des Merisiers, 91191 St Aubin-Gif-sur-Yvette, France
| | - Benjamin Riss Yaw
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
| | - Yves Lepelletier
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- Université Paris Cité, Imagine Institute, 24 Boulevard Montparnasse, 75015 Paris, France
- INSERM UMR 1163, Laboratory of Cellular and Molecular Basis of Normal Hematopoiesis and Hematological Disorders: Therapeutical Implications, 24 Boulevard Montparnasse, 75015 Paris, France
| | - Luc Demange
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
| |
Collapse
|
11
|
Retchin M, Wang Y, Takaba K, Chodera JD. DrugGym: A testbed for the economics of autonomous drug discovery. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.28.596296. [PMID: 38854082 PMCID: PMC11160604 DOI: 10.1101/2024.05.28.596296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2024]
Abstract
Drug discovery is stochastic. The effectiveness of candidate compounds in satisfying design objectives is unknown ahead of time, and the tools used for prioritization-predictive models and assays-are inaccurate and noisy. In a typical discovery campaign, thousands of compounds may be synthesized and tested before design objectives are achieved, with many others ideated but deprioritized. These challenges are well-documented, but assessing potential remedies has been difficult. We introduce DrugGym, a framework for modeling the stochastic process of drug discovery. Emulating biochemical assays with realistic surrogate models, we simulate the progression from weak hits to sub-micromolar leads with viable ADME. We use this testbed to examine how different ideation, scoring, and decision-making strategies impact statistical measures of utility, such as the probability of program success within predefined budgets and the expected costs to achieve target candidate profile (TCP) goals. We also assess the influence of affinity model inaccuracy, chemical creativity, batch size, and multi-step reasoning. Our findings suggest that reducing affinity model inaccuracy from 2 to 0.5 pIC50 units improves budget-constrained success rates tenfold. DrugGym represents a realistic testbed for machine learning methods applied to the hit-to-lead phase. Source code is available at www.drug-gym.org.
Collapse
Affiliation(s)
- Michael Retchin
- Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, Cornell University, New York, NY 10065
| | - Yuanqing Wang
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065
- Simons Center for Computational Chemistry and Center for Data Science, New York University, New York, NY 10004
| | - Kenichiro Takaba
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065
- Pharmaceutical Research Center, Advanced Drug Discovery, Asahi Kasei Pharma Corporation, Shizuoka 410-2321, Japan
| | - John D. Chodera
- Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medical College, Cornell University, New York, NY 10065
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065
| |
Collapse
|
12
|
Bassani D, Parrott NJ, Manevski N, Zhang JD. Another string to your bow: machine learning prediction of the pharmacokinetic properties of small molecules. Expert Opin Drug Discov 2024; 19:683-698. [PMID: 38727016 DOI: 10.1080/17460441.2024.2348157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Accepted: 04/23/2024] [Indexed: 05/22/2024]
Abstract
INTRODUCTION Prediction of pharmacokinetic (PK) properties is crucial for drug discovery and development. Machine-learning (ML) models, which use statistical pattern recognition to learn correlations between input features (such as chemical structures) and target variables (such as PK parameters), are being increasingly used for this purpose. To embed ML models for PK prediction into workflows and to guide future development, a solid understanding of their applicability, advantages, limitations, and synergies with other approaches is necessary. AREAS COVERED This narrative review discusses the design and application of ML models to predict PK parameters of small molecules, especially in light of established approaches including in vitro-in vivo extrapolation (IVIVE) and physiologically based pharmacokinetic (PBPK) models. The authors illustrate scenarios in which the three approaches are used and emphasize how they enhance and complement each other. In particular, they highlight achievements, the state of the art and potentials of applying machine learning for PK prediction through a comphrehensive literature review. EXPERT OPINION ML models, when carefully crafted, regularly updated, and appropriately used, empower users to prioritize molecules with favorable PK properties. Informed practitioners can leverage these models to improve the efficiency of drug discovery and development process.
Collapse
Affiliation(s)
- Davide Bassani
- Pharmaceutical Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Neil John Parrott
- Pharmaceutical Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Nenad Manevski
- Pharmaceutical Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - Jitao David Zhang
- Pharmaceutical Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| |
Collapse
|
13
|
Xiang W, Zhong F, Ni L, Zheng M, Li X, Shi Q, Wang D. Gram matrix: an efficient representation of molecular conformation and learning objective for molecular pretraining. Brief Bioinform 2024; 25:bbae340. [PMID: 38990515 PMCID: PMC11238115 DOI: 10.1093/bib/bbae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2024] [Revised: 06/05/2024] [Accepted: 06/28/2024] [Indexed: 07/12/2024] Open
Abstract
Accurate prediction of molecular properties is fundamental in drug discovery and development, providing crucial guidance for effective drug design. A critical factor in achieving accurate molecular property prediction lies in the appropriate representation of molecular structures. Presently, prevalent deep learning-based molecular representations rely on 2D structure information as the primary molecular representation, often overlooking essential three-dimensional (3D) conformational information due to the inherent limitations of 2D structures in conveying atomic spatial relationships. In this study, we propose employing the Gram matrix as a condensed representation of 3D molecular structures and for efficient pretraining objectives. Subsequently, we leverage this matrix to construct a novel molecular representation model, Pre-GTM, which inherently encapsulates 3D information. The model accurately predicts the 3D structure of a molecule by estimating the Gram matrix. Our findings demonstrate that Pre-GTM model outperforms the baseline Graphormer model and other pretrained models in the QM9 and MoleculeNet quantitative property prediction task. The integration of the Gram matrix as a condensed representation of 3D molecular structure, incorporated into the Pre-GTM model, opens up promising avenues for its potential application across various domains of molecular research, including drug design, materials science, and chemical engineering.
Collapse
Affiliation(s)
| | - Feisheng Zhong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Fujian Key Laboratory of Drug Target Discovery and Structural and Functional Research, School of Pharmacy, Fujian Medical University, Fuzhou 350122, China
| | - Lin Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Qian Shi
- Lingang Laboratory, Shanghai 200031, China
| | | |
Collapse
|
14
|
van Gerwen P, Briling KR, Calvino Alonso Y, Franke M, Corminboeuf C. Benchmarking machine-readable vectors of chemical reactions on computed activation barriers. DIGITAL DISCOVERY 2024; 3:932-943. [PMID: 38756222 PMCID: PMC11094696 DOI: 10.1039/d3dd00175j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/06/2023] [Accepted: 02/28/2024] [Indexed: 05/18/2024]
Abstract
In recent years, there has been a surge of interest in predicting computed activation barriers, to enable the acceleration of the automated exploration of reaction networks. Consequently, various predictive approaches have emerged, ranging from graph-based models to methods based on the three-dimensional structure of reactants and products. In tandem, many representations have been developed to predict experimental targets, which may hold promise for barrier prediction as well. Here, we bring together all of these efforts and benchmark various methods (Morgan fingerprints, the DRFP, the CGR representation-based Chemprop, SLATMd, B2Rl2, EquiReact and language model BERT + RXNFP) for the prediction of computed activation barriers on three diverse datasets.
Collapse
Affiliation(s)
- Puck van Gerwen
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
| | - Ksenia R Briling
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
| | - Yannick Calvino Alonso
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
| | - Malte Franke
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
| | - Clemence Corminboeuf
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
| |
Collapse
|
15
|
Zhang H, Liu X, Cheng W, Wang T, Chen Y. Prediction of drug-target binding affinity based on deep learning models. Comput Biol Med 2024; 174:108435. [PMID: 38608327 DOI: 10.1016/j.compbiomed.2024.108435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 04/05/2024] [Accepted: 04/07/2024] [Indexed: 04/14/2024]
Abstract
The prediction of drug-target binding affinity (DTA) plays an important role in drug discovery. Computerized virtual screening techniques have been used for DTA prediction, greatly reducing the time and economic costs of drug discovery. However, these techniques have not succeeded in reversing the low success rate of new drug development. In recent years, the continuous development of deep learning (DL) technology has brought new opportunities for drug discovery through the DTA prediction. This shift has moved the prediction of DTA from traditional machine learning methods to DL. The DL frameworks used for DTA prediction include convolutional neural networks (CNN), graph convolutional neural networks (GCN), and recurrent neural networks (RNN), and reinforcement learning (RL), among others. This review article summarizes the available literature on DTA prediction using DL models, including DTA quantification metrics and datasets, and DL algorithms used for DTA prediction (including input representation of models, neural network frameworks, valuation indicators, and model interpretability). In addition, the opportunities, challenges, and prospects of the application of DL frameworks for DTA prediction in the field of drug discovery are discussed.
Collapse
Affiliation(s)
- Hao Zhang
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Xiaoqian Liu
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Wenya Cheng
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Tianshi Wang
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Yuanyuan Chen
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
| |
Collapse
|
16
|
Umemori Y, Handa K, Yoshimura S, Kageyama M, Iijima T. Development of a Novel In Silico Classification Model to Assess Reactive Metabolite Formation in the Cysteine Trapping Assay and Investigation of Important Substructures. Biomolecules 2024; 14:535. [PMID: 38785942 PMCID: PMC11117661 DOI: 10.3390/biom14050535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 04/25/2024] [Accepted: 04/26/2024] [Indexed: 05/25/2024] Open
Abstract
Predicting whether a compound can cause drug-induced liver injury (DILI) is difficult due to the complexity of drug mechanism. The cysteine trapping assay is a method for detecting reactive metabolites that bind to microsomes covalently. However, it is cumbersome to use 35S isotope-labeled cysteine for this assay. Therefore, we constructed an in silico classification model for predicting a positive/negative outcome in the cysteine trapping assay. We collected 475 compounds (436 in-house compounds and 39 publicly available drugs) based on experimental data performed in this study, and the composition of the results showed 248 positives and 227 negatives. Using a Message Passing Neural Network (MPNN) and Random Forest (RF) with extended connectivity fingerprint (ECFP) 4, we built machine learning models to predict the covalent binding risk of compounds. In the time-split dataset, AUC-ROC of MPNN and RF were 0.625 and 0.559 in the hold-out test, restrictively. This result suggests that the MPNN model has a higher predictivity than RF in the time-split dataset. Hence, we conclude that the in silico MPNN classification model for the cysteine trapping assay has a better predictive power. Furthermore, most of the substructures that contributed positively to the cysteine trapping assay were consistent with previous results.
Collapse
Affiliation(s)
| | - Koichi Handa
- DMPK Research Department, Teijin Institute for Bio-Medical Research, TEIJIN PHARMA LIMITED, 4-3-2 Asahigaoka, Hino-shi, Tokyo 191-8512, Japan; (Y.U.); (S.Y.); (M.K.); (T.I.)
| | | | | | | |
Collapse
|
17
|
Llompart P, Minoletti C, Baybekov S, Horvath D, Marcou G, Varnek A. Will we ever be able to accurately predict solubility? Sci Data 2024; 11:303. [PMID: 38499581 PMCID: PMC10948805 DOI: 10.1038/s41597-024-03105-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2023] [Accepted: 02/29/2024] [Indexed: 03/20/2024] Open
Abstract
Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performances, but their reliability may be deceiving when used prospectively. This study investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the aqueous solubility dataverse and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to cure aqueous solubility data aiming at producing useful models for bench chemist. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.
Collapse
Affiliation(s)
- P Llompart
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- IDD/CADD, Sanofi, Vitry-Sur-Seine, France
| | | | - S Baybekov
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| | - D Horvath
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| | - G Marcou
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France.
| | - A Varnek
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| |
Collapse
|
18
|
Buterez D, Janet JP, Kiddle SJ, Oglic D, Lió P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat Commun 2024; 15:1517. [PMID: 38409255 DOI: 10.1038/s41467-024-45566-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Accepted: 01/25/2024] [Indexed: 02/28/2024] Open
Abstract
We investigate the potential of graph neural networks for transfer learning and improving molecular property prediction on sparse and expensive to acquire high-fidelity data by leveraging low-fidelity measurements as an inexpensive proxy for a targeted property of interest. This problem arises in discovery processes that rely on screening funnels for trading off the overall costs against throughput and accuracy. Typically, individual stages in these processes are loosely connected and each one generates data at different scale and fidelity. We consider this setup holistically and demonstrate empirically that existing transfer learning techniques for graph neural networks are generally unable to harness the information from multi-fidelity cascades. Here, we propose several effective transfer learning strategies and study them in transductive and inductive settings. Our analysis involves a collection of more than 28 million unique experimental protein-ligand interactions across 37 targets from drug discovery by high-throughput screening and 12 quantum properties from the dataset QMugs. The results indicate that transfer learning can improve the performance on sparse tasks by up to eight times while using an order of magnitude less high-fidelity training data. Moreover, the proposed methods consistently outperform existing transfer learning strategies for graph-structured data on drug discovery and quantum mechanics datasets.
Collapse
Affiliation(s)
- David Buterez
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
| | - Jon Paul Janet
- Molecular AI, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
| | - Steven J Kiddle
- Data Science & Advanced Analytics, Data Science & AI, R&D, AstraZeneca, Cambridge, UK
| | - Dino Oglic
- Centre for AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
| | - Pietro Lió
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| |
Collapse
|
19
|
Chung Y, Green WH. Machine learning from quantum chemistry to predict experimental solvent effects on reaction rates. Chem Sci 2024; 15:2410-2424. [PMID: 38362410 PMCID: PMC10866337 DOI: 10.1039/d3sc05353a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Accepted: 01/04/2024] [Indexed: 02/17/2024] Open
Abstract
Fast and accurate prediction of solvent effects on reaction rates are crucial for kinetic modeling, chemical process design, and high-throughput solvent screening. Despite the recent advance in machine learning, a scarcity of reliable data has hindered the development of predictive models that are generalizable for diverse reactions and solvents. In this work, we generate a large set of data with the COSMO-RS method for over 28 000 neutral reactions and 295 solvents and train a machine learning model to predict the solvation free energy and solvation enthalpy of activation (ΔΔG‡solv, ΔΔH‡solv) for a solution phase reaction. On unseen reactions, the model achieves mean absolute errors of 0.71 and 1.03 kcal mol-1 for ΔΔG‡solv and ΔΔH‡solv, respectively, relative to the COSMO-RS calculations. The model also provides reliable predictions of relative rate constants within a factor of 4 when tested on experimental data. The presented model can provide nearly instantaneous predictions of kinetic solvent effects or relative rate constants for a broad range of neutral closed-shell or free radical reactions and solvents only based on atom-mapped reaction SMILES and solvent SMILES strings.
Collapse
Affiliation(s)
- Yunsie Chung
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology Cambridge MA 02139 USA
| |
Collapse
|
20
|
Baygi SF, Barupal DK. IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra. J Cheminform 2024; 16:8. [PMID: 38238779 PMCID: PMC10797927 DOI: 10.1186/s13321-024-00804-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Accepted: 01/14/2024] [Indexed: 01/22/2024] Open
Abstract
The majority of tandem mass spectrometry (MS/MS) spectra in untargeted metabolomics and exposomics studies lack any annotation. Our deep learning framework, Integrated Data Science Laboratory for Metabolomics and Exposomics-Mass INTerpreter (IDSL_MINT) can translate MS/MS spectra into molecular fingerprint descriptors. IDSL_MINT allows users to leverage the power of the transformer model for mass spectrometry data, similar to the large language models. Models are trained on user-provided reference MS/MS libraries via any customizable molecular fingerprint descriptors. IDSL_MINT was benchmarked using the LipidMaps database and improved the annotation rate of a test study for MS/MS spectra that were not originally annotated using existing mass spectral libraries. IDSL_MINT may improve the overall annotation rates in untargeted metabolomics and exposomics studies. The IDSL_MINT framework and tutorials are available in the GitHub repository at https://github.com/idslme/IDSL_MINT .Scientific contribution statement.Structural annotation of MS/MS spectra from untargeted metabolomics and exposomics datasets is a major bottleneck in gaining new biological insights. Machine learning models to convert spectra into molecular fingerprints can help in the annotation process. Here, we present IDSL_MINT, a new, easy-to-use and customizable deep-learning framework to train and utilize new models to predict molecular fingerprints from spectra for the compound annotation workflows.
Collapse
Affiliation(s)
- Sadjad Fakouri Baygi
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, CAM Building, 3rd Floor, 17 E 102 St, New York, NY, 10029, USA
| | - Dinesh Kumar Barupal
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, CAM Building, 3rd Floor, 17 E 102 St, New York, NY, 10029, USA.
| |
Collapse
|