51
|
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] Open
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Collapse
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Steven Shave
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jie Dong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
| | - Yumiao Gao
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jiancong Xie
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Vincent Blay
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| |
Collapse
|
52
|
Zhong S, Guan X. Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants' Activities and Properties. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:18193-18202. [PMID: 37406199 DOI: 10.1021/acs.est.3c02198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/07/2023]
Abstract
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only qualifies the presence or absence of an atom group but also quantifies its counts in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF to compare them in terms of the model's predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF is dependent on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of data sets calculated by B-MF and C-MF. Model interpretation results show that the C-MF-based model can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a "ContaminaNET" platform to deploy these C-MF-based models for free use.
Collapse
Affiliation(s)
- Shifa Zhong
- Department of Environmental Science, School of Ecological and Environmental Sciences, East China Normal University, Shanghai 200241, P. R. China
| | - Xiaohong Guan
- Department of Environmental Science, School of Ecological and Environmental Sciences, East China Normal University, Shanghai 200241, P. R. China
| |
Collapse
|
53
|
Liu X, Guo Y, Pan W, Xue Q, Fu J, Qu G, Zhang A. Exogenous Chemicals Impact Virus Receptor Gene Transcription: Insights from Deep Learning. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:18038-18047. [PMID: 37186679 DOI: 10.1021/acs.est.2c09837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
Despite the fact that coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been disrupting human life and health worldwide since the outbreak in late 2019, the impact of exogenous substance exposure on the viral infection remains unclear. It is well-known that, during viral infection, organism receptors play a significant role in mediating the entry of viruses to enter host cells. A major receptor of SARS-CoV-2 is the angiotensin-converting enzyme 2 (ACE2). This study proposes a deep learning model based on the graph convolutional network (GCN) that enables, for the first time, the prediction of exogenous substances that affect the transcriptional expression of the ACE2 gene. It outperforms other machine learning models, achieving an area under receiver operating characteristic curve (AUROC) of 0.712 and 0.703 on the validation and internal test set, respectively. In addition, quantitative polymerase chain reaction (qPCR) experiments provided additional supporting evidence for indoor air pollutants identified by the GCN model. More broadly, the proposed methodology can be applied to predict the effect of environmental chemicals on the gene transcription of other virus receptors as well. In contrast to typical deep learning models that are of black box nature, we further highlight the interpretability of the proposed GCN model and how it facilitates deeper understanding of gene change at the structural level.
Collapse
Affiliation(s)
- Xian Liu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
| | - Yunhe Guo
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
| | - Wenxiao Pan
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
| | - Qiao Xue
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
| | - Jianjie Fu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
- School of Environment, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310012, P. R. China
- College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, P. R. China
- Institute of Environment and Health, Jianghan University, Wuhan 430056, P.R. China
| | - Guangbo Qu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
- School of Environment, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310012, P. R. China
- College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, P. R. China
| | - Aiqian Zhang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, P. R. China
- School of Environment, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310012, P. R. China
- College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, P. R. China
- Institute of Environment and Health, Jianghan University, Wuhan 430056, P.R. China
| |
Collapse
|
54
|
Fan F, Wu G, Yang Y, Liu F, Qian Y, Yu Q, Ren H, Geng J. A Graph Neural Network Model with a Transparent Decision-Making Process Defines the Applicability Domain for Environmental Estrogen Screening. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:18236-18245. [PMID: 37749748 DOI: 10.1021/acs.est.3c04571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/27/2023]
Abstract
The application of deep learning (DL) models for screening environmental estrogens (EEs) for the sound management of chemicals has garnered significant attention. However, the currently available DL model for screening EEs lacks both a transparent decision-making process and effective applicability domain (AD) characterization, making the reliability of its prediction results uncertain and limiting its practical applications. To address this issue, a graph neural network (GNN) model was developed to screen EEs, achieving accuracy rates of 88.9% and 92.5% on the internal and external test sets, respectively. The decision-making process of the GNN model was explored through the network-like similarity graphs (NSGs) based on the model features (FT). We discovered that the accuracy of the predictions is dependent on the feature distribution of compounds in NSGs. An AD characterization method called ADFT was proposed, which excludes predictions falling outside of the model's prediction range, leading to a 15% improvement in the F1 score of the GNN model. The GNN model with the AD method may serve as an efficient tool for screening EEs, identifying 800 potential EEs in the Inventory of Existing Chemical Substances of China. Additionally, this study offers new insights into comprehending the decision-making process of DL models.
Collapse
Affiliation(s)
- Fan Fan
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
| | - Gang Wu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
| | - Yining Yang
- School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Fu Liu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
| | - Yuli Qian
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
| | - Qingmiao Yu
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
- Key Laboratory of the Three Gorges Reservoir Region's Eco-Environment, Ministry of Education, Chongqing University, Chongqing 400044, China
| | - Hongqiang Ren
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
| | - Jinju Geng
- State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing 210023, Jiangsu, P. R. China
- Key Laboratory of the Three Gorges Reservoir Region's Eco-Environment, Ministry of Education, Chongqing University, Chongqing 400044, China
| |
Collapse
|
55
|
Cao H, Peng J, Zhou Z, Yang Z, Wang L, Sun Y, Wang Y, Liang Y. Investigation of the Binding Fraction of PFAS in Human Plasma and Underlying Mechanisms Based on Machine Learning and Molecular Dynamics Simulation. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:17762-17773. [PMID: 36282672 DOI: 10.1021/acs.est.2c04400] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
More than 7000 per- and polyfluorinated alkyl substances (PFAS) have been documented in the U.S. Environmental Protection Agency's CompTox Chemicals database. These PFAS can be used in a broad range of industrial and consumer applications but may pose potential environmental issues and health risks. However, little is known about emerging PFAS bioaccumulation to assess their chemical safety. This study focuses specifically on the large and high-quality data set of fluorochemicals from the related environmental and pharmaceutical chemicals databases, and machine learning (ML) models were developed for the classification prediction of the unbound fraction of compounds in plasma. A comprehensive evaluation of the ML models shows that the best blending model yields an accuracy of 0.901 for the test set. The predictions suggest that most PFAS (∼92%) have a high binding fraction in plasma. Introduction of alkaline amino groups is likely to reduce the binding affinities of PFAS with plasma proteins. Molecular dynamics simulations indicate a clear distinction between the high and low binding fractions of PFAS. These computational workflows can be used to predict the bioaccumulation of emerging PFAS and are also helpful for the molecular design of PFAS to prevent the release of high-bioaccumulation compounds into the environment.
Collapse
Affiliation(s)
- Huiming Cao
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Jianhua Peng
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Zhen Zhou
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Zeguo Yang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Ling Wang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Yuzhen Sun
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| | - Yawei Wang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Yong Liang
- Hubei Key Laboratory of Environmental and Health Effects of Persistent Toxic Substances, School of Environment and Health, Jianghan University, Wuhan 430056, China
| |
Collapse
|
56
|
Li H, Zhang R, Min Y, Ma D, Zhao D, Zeng J. A knowledge-guided pre-training framework for improving molecular representation learning. Nat Commun 2023; 14:7568. [PMID: 37989998 PMCID: PMC10663446 DOI: 10.1038/s41467-023-43214-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2023] [Accepted: 11/03/2023] [Indexed: 11/23/2023] Open
Abstract
Learning effective molecular feature representation to facilitate molecular property prediction is of great significance for drug discovery. Recently, there has been a surge of interest in pre-training graph neural networks (GNNs) via self-supervised learning techniques to overcome the challenge of data scarcity in molecular property prediction. However, current self-supervised learning-based methods suffer from two main obstacles: the lack of a well-defined self-supervised learning strategy and the limited capacity of GNNs. Here, we propose Knowledge-guided Pre-training of Graph Transformer (KPGT), a self-supervised learning framework to alleviate the aforementioned issues and provide generalizable and robust molecular representations. The KPGT framework integrates a graph transformer specifically designed for molecular graphs and a knowledge-guided pre-training strategy, to fully capture both structural and semantic knowledge of molecules. Through extensive computational tests on 63 datasets, KPGT exhibits superior performance in predicting molecular properties across various domains. Moreover, the practical applicability of KPGT in drug discovery has been validated by identifying potential inhibitors of two antitumor targets: hematopoietic progenitor kinase 1 (HPK1) and fibroblast growth factor receptor 1 (FGFR1). Overall, KPGT can provide a powerful and useful tool for advancing the artificial intelligence (AI)-aided drug discovery process.
Collapse
Affiliation(s)
- Han Li
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084, Beijing, China
| | - Ruotian Zhang
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084, Beijing, China
| | - Yaosen Min
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084, Beijing, China
| | - Dacheng Ma
- Research Center for Biological Computation, Zhejiang Province, Zhejiang Laboratory, 311100, Hangzhou, China
| | - Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084, Beijing, China.
| | - Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084, Beijing, China.
- School of Engineering, Westlake University, Zhejiang Province, 310030, Hangzhou, China.
| |
Collapse
|
57
|
Wu W, Qian J, Liang C, Yang J, Ge G, Zhou Q, Guan X. GeoDILI: A Robust and Interpretable Model for Drug-Induced Liver Injury Prediction Using Graph Neural Network-Based Molecular Geometric Representation. Chem Res Toxicol 2023; 36:1717-1730. [PMID: 37839069 DOI: 10.1021/acs.chemrestox.3c00199] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2023]
Abstract
Drug-induced liver injury (DILI) is a significant cause of drug failure and withdrawal due to liver damage. Accurate prediction of hepatotoxic compounds is crucial for safe drug development. Several DILI prediction models have been published, but they are built on different data sets, making it difficult to compare model performance. Moreover, most existing models are based on molecular fingerprints or descriptors, neglecting molecular geometric properties and lacking interpretability. To address these limitations, we developed GeoDILI, an interpretable graph neural network that uses a molecular geometric representation. First, we utilized a geometry-based pretrained molecular representation and optimized it on the DILI data set to improve predictive performance. Second, we leveraged gradient information to obtain high-precision atomic-level weights and deduce the dominant substructure. We benchmarked GeoDILI against recently published DILI prediction models, as well as popular GNN models and fingerprint-based machine learning models using the same data set, showing superior predictive performance of our proposed model. We applied the interpretable method in the DILI data set and derived seven precise and mechanistically elucidated structural alerts. Overall, GeoDILI provides a promising approach for accurate and interpretable DILI prediction with potential applications in drug discovery and safety assessment. The data and source code are available at GitHub repository (https://github.com/CSU-QJY/GeoDILI).
Collapse
Affiliation(s)
- Wenxuan Wu
- Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
| | - Jiayu Qian
- School of Mathematics and Statistics, Central South University, Changsha, Hunan 410083, China
| | - Changjie Liang
- Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
| | - Jingya Yang
- School of Mathematics and Statistics, Central South University, Changsha, Hunan 410083, China
| | - Guangbo Ge
- Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
| | - Qingping Zhou
- School of Mathematics and Statistics, Central South University, Changsha, Hunan 410083, China
| | - Xiaoqing Guan
- Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China
| |
Collapse
|
58
|
Xia S, Chen E, Zhang Y. Integrated Molecular Modeling and Machine Learning for Drug Design. J Chem Theory Comput 2023; 19:7478-7495. [PMID: 37883810 PMCID: PMC10653122 DOI: 10.1021/acs.jctc.3c00814] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Modern therapeutic development often involves several stages that are interconnected, and multiple iterations are usually required to bring a new drug to the market. Computational approaches have increasingly become an indispensable part of helping reduce the time and cost of the research and development of new drugs. In this Perspective, we summarize our recent efforts on integrating molecular modeling and machine learning to develop computational tools for modulator design, including a pocket-guided rational design approach based on AlphaSpace to target protein-protein interactions, delta machine learning scoring functions for protein-ligand docking as well as virtual screening, and state-of-the-art deep learning models to predict calculated and experimental molecular properties based on molecular mechanics optimized geometries. Meanwhile, we discuss remaining challenges and promising directions for further development and use a retrospective example of FDA approved kinase inhibitor Erlotinib to demonstrate the use of these newly developed computational tools.
Collapse
Affiliation(s)
- Song Xia
- Department
of Chemistry, New York University, New York, New York 10003, United States
| | - Eric Chen
- Department
of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department
of Chemistry, New York University, New York, New York 10003, United States
- Simons
Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU
Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
Collapse
|
59
|
Sanchez-Fernandez A, Rumetshofer E, Hochreiter S, Klambauer G. CLOOME: contrastive learning unlocks bioimaging databases for queries with chemical structures. Nat Commun 2023; 14:7339. [PMID: 37957207 PMCID: PMC10643690 DOI: 10.1038/s41467-023-42328-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2023] [Accepted: 10/06/2023] [Indexed: 11/15/2023] Open
Abstract
The field of bioimage analysis is currently impacted by a profound transformation, driven by the advancements in imaging technologies and artificial intelligence. The emergence of multi-modal AI systems could allow extracting and utilizing knowledge from bioimaging databases based on information from other data modalities. We leverage the multi-modal contrastive learning paradigm, which enables the embedding of both bioimages and chemical structures into a unified space by means of bioimage and molecular structure encoders. This common embedding space unlocks the possibility of querying bioimaging databases with chemical structures that induce different phenotypic effects. Concretely, in this work we show that a retrieval system based on multi-modal contrastive learning is capable of identifying the correct bioimage corresponding to a given chemical structure from a database of ~2000 candidate images with a top-1 accuracy >70 times higher than a random baseline. Additionally, the bioimage encoder demonstrates remarkable transferability to various further prediction tasks within the domain of drug discovery, such as activity prediction, molecule classification, and mechanism of action identification. Thus, our approach not only addresses the current limitations of bioimaging databases but also paves the way towards foundation models for microscopy images.
Collapse
Affiliation(s)
- Ana Sanchez-Fernandez
- ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
| | - Elisabeth Rumetshofer
- ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
| | - Sepp Hochreiter
- ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
- Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria
| | - Günter Klambauer
- ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria.
| |
Collapse
|
60
|
Komura H, Watanabe R, Mizuguchi K. The Trends and Future Prospective of In Silico Models from the Viewpoint of ADME Evaluation in Drug Discovery. Pharmaceutics 2023; 15:2619. [PMID: 38004597 PMCID: PMC10675155 DOI: 10.3390/pharmaceutics15112619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Revised: 11/05/2023] [Accepted: 11/07/2023] [Indexed: 11/26/2023] Open
Abstract
Drug discovery and development are aimed at identifying new chemical molecular entities (NCEs) with desirable pharmacokinetic profiles for high therapeutic efficacy. The plasma concentrations of NCEs are a biomarker of their efficacy and are governed by pharmacokinetic processes such as absorption, distribution, metabolism, and excretion (ADME). Poor ADME properties of NCEs are a major cause of attrition in drug development. ADME screening is used to identify and optimize lead compounds in the drug discovery process. Computational models predicting ADME properties have been developed with evolving model-building technologies from a simplified relationship between ADME endpoints and physicochemical properties to machine learning, including support vector machines, random forests, and convolution neural networks. Recently, in the field of in silico ADME research, there has been a shift toward evaluating the in vivo parameters or plasma concentrations of NCEs instead of using predictive results to guide chemical structure design. Another research hotspot is the establishment of a computational prediction platform to strengthen academic drug discovery. Bioinformatics projects have produced a series of in silico ADME models using free software and open-access databases. In this review, we introduce prediction models for various ADME parameters and discuss the currently available academic drug discovery platforms.
Collapse
Affiliation(s)
- Hiroshi Komura
- University Research Administration Center, Osaka Metropolitan University, 1-2-7 Asahimachi, Abeno-ku, Osaka 545-0051, Osaka, Japan
| | - Reiko Watanabe
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita 565-0871, Osaka, Japan; (R.W.); (K.M.)
- Artificial Intelligence Centre for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health, and Nutrition (NIBIOHN), 3-17 Senrioka-shinmachi, Settu 566-0002, Osaka, Japan
| | - Kenji Mizuguchi
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita 565-0871, Osaka, Japan; (R.W.); (K.M.)
- Artificial Intelligence Centre for Health and Biomedical Research, National Institutes of Biomedical Innovation, Health, and Nutrition (NIBIOHN), 3-17 Senrioka-shinmachi, Settu 566-0002, Osaka, Japan
| |
Collapse
|
61
|
Wang H, Liu W, Chen J, Wang Z. Applicability Domains Based on Molecular Graph Contrastive Learning Enable Graph Attention Network Models to Accurately Predict 15 Environmental End Points. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2023; 57:16906-16917. [PMID: 37897806 DOI: 10.1021/acs.est.3c03860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/30/2023]
Abstract
In silico models for predicting physicochemical properties and environmental fate parameters are necessary for the sound management of chemicals. This study employed graph attention network (GAT) algorithms to construct such models on 15 end points. The results showed that the GAT models outperformed the previous state-of-the-art models, and their performance was not influenced by the presence or absence of compounds with certain structures. Molecular similarity density (ρs) was found to be a key metrics characterizing data set modelability, in addition to the proportion of compounds at activity cliffs. By introducing molecular graph (MG) contrastive learning, MG-based ρs and molecular inconsistency in activities (IA) were calculated and employed for characterizing the structure-activity landscape (SAL)-based applicability domain ADSAL{ρs, IA}. The GAT models coupled with ADSAL{ρs, IA} significantly improved the prediction coefficient of determination (R2) on all the end points by an average of 14.4% and enabled all the end points to have R2 > 0.9, which could hardly be achieved previously. The models were employed to screen persistent, mobile, and/or bioaccumulative chemicals from inventories consisting of about 106 chemicals. Given the current state-of-the-art model performance and coverage of the various environmental end points, the constructed models with ADSAL{ρs, IA} may serve as benchmarks for future efforts to improve modeling efficacy.
Collapse
Affiliation(s)
- Haobo Wang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Wenjia Liu
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jingwen Chen
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhongyu Wang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| |
Collapse
|
62
|
Zhao X, Kong Y, Ji Y, Xin X, Chen L, Chen G, Yu C. Classification models for predicting the bioactivity of pan-TRK inhibitors and SAR analysis. Mol Divers 2023:10.1007/s11030-023-10735-2. [PMID: 37910346 DOI: 10.1007/s11030-023-10735-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Accepted: 09/22/2023] [Indexed: 11/03/2023]
Abstract
Tropomyosin receptor kinases (TRKs) are important broad-spectrum anticancer targets. The oncogenic rearrangement of the NTRK gene disrupts the extracellular structural domain and epitopes for therapeutic antibodies, making small-molecule inhibitors essential for treating NTRK fusion-driven tumors. In this work, several algorithms were used to construct descriptor-based and nondescriptor-based models, and the models were evaluated by outer 10-fold cross-validation. To find a model with good generalization ability, the dataset was partitioned by random and cluster-splitting methods to construct in- and cross-domain models, respectively. Among the 48 models built, the model with the combination of the deep neural network (DNN) algorithm and extended connectivity fingerprints 4 (ECFP4) descriptors achieved excellent performance in both dataset divisions. The results indicate that the DNN algorithm has a strong generalization prediction ability, and the richness of features plays a vital role in predicting unknown spatial molecules. Additionally, we combined the clustering results and decision tree models of fingerprint descriptors to perform structure-activity relationship analysis. It was found that nitrogen-containing aromatic heterocyclic and benzo heterocyclic structures play a crucial role in enhancing the activity of TRK inhibitors. Workflow for generating predictive models for TRK inhibitors.
Collapse
Affiliation(s)
- Xiaoman Zhao
- College of Life Science and Technology, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing, 100029, People's Republic of China
- College of Bio engineering, No. 9 Liangshuihe 1st Street, Beijing, 100176, People's Republic of China
| | - Yue Kong
- College of Life Science and Technology, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing, 100029, People's Republic of China
| | - Yueshan Ji
- College of Life Science and Technology, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing, 100029, People's Republic of China
| | - Xiulan Xin
- College of Bio engineering, No. 9 Liangshuihe 1st Street, Beijing, 100176, People's Republic of China
| | - Liang Chen
- College of Bio engineering, No. 9 Liangshuihe 1st Street, Beijing, 100176, People's Republic of China
| | - Guang Chen
- College of Life Science and Technology, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing, 100029, People's Republic of China
| | - Changyuan Yu
- College of Life Science and Technology, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing, 100029, People's Republic of China.
| |
Collapse
|
63
|
Yin X, Wang X, Li Y, Wang J, Wang Y, Deng Y, Hou T, Liu H, Luo P, Yao X. CODD-Pred: A Web Server for Efficient Target Identification and Bioactivity Prediction of Small Molecules. J Chem Inf Model 2023; 63:6169-6176. [PMID: 37820365 DOI: 10.1021/acs.jcim.3c00685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Target identification and bioactivity prediction are critical steps in the drug discovery process. Here we introduce CODD-Pred (COmprehensive Drug Design Predictor), an online web server with well-curated data sets from the GOSTAR database, which is designed with a dual purpose of predicting potential protein drug targets and computing bioactivity values of small molecules. We first designed a double molecular graph perception (DMGP) framework for target prediction based on a large library of 646 498 small molecules interacting with 640 human targets. The framework achieved a top-5 accuracy of over 80% for hitting at least one target on both external validation sets. Additionally, its performance on the external validation set comprising 200 molecules surpassed that of four existing target prediction servers. Second, we collected 56 targets closely related to the occurrence and development of cancer, metabolic diseases, and inflammatory immune diseases and developed a multi-model self-validation activity prediction (MSAP) framework that enables accurate bioactivity quantification predictions for small-molecule ligands of these 56 targets. CODD-Pred is a handy tool for rapid evaluation and optimization of small molecules with specific target activity. CODD-Pred is freely accessible at http://codd.iddd.group/.
Collapse
Affiliation(s)
- Xiaodan Yin
- Dr. Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- Carbon-Silicon AI Technology Co., Ltd, Zhejiang, Hangzhou 310018, China
| | - Xiaorui Wang
- Dr. Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
- Carbon-Silicon AI Technology Co., Ltd, Zhejiang, Hangzhou 310018, China
| | - Yuquan Li
- College of Chemistry and Chemical Engineering, Lanzhou University, Lanzhou, 730000, China
| | - Jike Wang
- College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou, 310058, China
| | - Yuwei Wang
- College of Pharmacy, Shaanxi University of Chinese Medicine, Xianyang, 712000, China
| | - Yafeng Deng
- Carbon-Silicon AI Technology Co., Ltd, Zhejiang, Hangzhou 310018, China
| | - Tingjun Hou
- College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou, 310058, China
| | - Huanxiang Liu
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| | - Pei Luo
- Dr. Neher's Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, 999078, China
| | - Xiaojun Yao
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China
| |
Collapse
|
64
|
Tayyebi A, Alshami AS, Rabiei Z, Yu X, Ismail N, Talukder MJ, Power J. Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models. J Cheminform 2023; 15:99. [PMID: 37853492 PMCID: PMC10583449 DOI: 10.1186/s13321-023-00752-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2022] [Accepted: 08/25/2023] [Indexed: 10/20/2023] Open
Abstract
A reliable and practical determination of a chemical species' solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies to adequately predict various species' solubility using data for over 8400 compounds. Molecular-descriptors, the most used method in previous studies, and Morgan fingerprint, a circular-based hash of the molecules' structures, were applied to produce water solubility estimates. We trained all models on 80% of the total datasets using the Random Forest (RFs) technique as the regressor and tested the prediction performance using the remaining 20%, resulting in coefficient of determination (R2) test values of 0.88 and 0.81 and root-mean-square deviation (RMSE) test values 0.64 and 0.80 for the descriptors and circular fingerprint methods, respectively. We interpreted the produced ML models and reported the most effective features for aqueous solubility measures using the Shapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, ability to investigate the molecular-level interactions, and compatibility with thermodynamic quantities made the fingerprint method a distinct model compared to other available computational tools. However, it is worth emphasizing that physicochemical descriptor model outperformed the fingerprint model in achieving better predictive accuracy for the given test set.
Collapse
Affiliation(s)
- Arash Tayyebi
- University of North Dakota, Chemical Engineering, Grand Forks, ND, 58201, USA
| | - Ali S Alshami
- University of North Dakota, Chemical Engineering, Grand Forks, ND, 58201, USA.
| | - Zeinab Rabiei
- Chemistry Department, University of North Dakota, Grand Forks, ND, 58202, USA
| | - Xue Yu
- Energy & Environmental Research Center, University of North Dakota, Grand Forks, ND, 58202, USA
| | - Nadhem Ismail
- University of North Dakota, Chemical Engineering, Grand Forks, ND, 58201, USA
| | | | - Jason Power
- University of North Dakota, Biomedical Sciences, Grand Forks, ND, 58202, USA
| |
Collapse
|
65
|
Chen Z, Liu X, Wang W, Zhang L, Ling W, Wang C, Jiang J, Song J, Liu Y, Lu D, Liu F, Zhang A, Liu Q, Zhang J, Jiang G. Machine learning-aided metallomic profiling in serum and urine of thyroid cancer patients and its environmental implications. THE SCIENCE OF THE TOTAL ENVIRONMENT 2023; 895:165100. [PMID: 37356765 DOI: 10.1016/j.scitotenv.2023.165100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/16/2023] [Revised: 05/17/2023] [Accepted: 06/21/2023] [Indexed: 06/27/2023]
Abstract
The incidence rate of thyroid cancer has been growing worldwide. Thyroid health is closely related with multiple trace metals, and the nutrients are essential in maintaining thyroid function while the contaminants can disturb thyroid morphology and homeostasis. In this study, we conducted metallomic analysis in thyroid cancer patients (n = 40) and control subjects (n = 40) recruited in Shenzhen, China with a high incidence of thyroid cancer. We found significant alterations in serumal and urinary metallomic profiling (including Cr, Mn, Fe, Co, Ni, Cu, Zn, As, Sr, Cd, I, Ba, Tl, and Pb) and elemental correlative patterns between thyroid cancer patients and controls. Additionally, we also measured the serum Cu isotopic composition and found a multifaceted disturbance in Cu metabolism in thyroid disease patients. Based on the metallome variations, we built and assessed the thyroid cancer-predictive performance of seven machine learning algorithms. Among them, the Random Forest model performed the best with the accuracy of 1.000, 0.858, and 0.813 on the training, 5-fold cross-validation, and test set, respectively. The high performance of machine learning has demonstrated the great promise of metallomic analysis in the identification of thyroid cancer. Then, the Shapley Additive exPlanations approach was used to further interpret the variable contributions of the model and it showed that serum Pb contributed the most in the identification process. To the best of our knowledge, this is the first study that combines machine learning and metallome data for cancer identification, and it supports the indication of environmental heavy metal-related thyroid cancer etiology.
Collapse
Affiliation(s)
- Zigu Chen
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China
| | - Xian Liu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China.
| | - Weichao Wang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Luyao Zhang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China
| | - Weibo Ling
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China
| | - Chao Wang
- Shenzhen Center for Disease Control and Prevention, Shenzhen 518055, China
| | - Jie Jiang
- Shenzhen Center for Disease Control and Prevention, Shenzhen 518055, China
| | - Jiayi Song
- Shenzhen Center for Disease Control and Prevention, Shenzhen 518055, China
| | - Yuan Liu
- Shenzhen Center for Disease Control and Prevention, Shenzhen 518055, China
| | - Dawei Lu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
| | - Fen Liu
- The First Hospital of Changsha, Changsha 410005, China
| | - Aiqian Zhang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China
| | - Qian Liu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China; Institute of Environment and Health, Jianghan University, Wuhan 430056, China.
| | - Jianqing Zhang
- Shenzhen Center for Disease Control and Prevention, Shenzhen 518055, China.
| | - Guibin Jiang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; University of Chinese Academy of Sciences, Beijing 100190, China
| |
Collapse
|
66
|
Deng J, Yang Z, Wang H, Ojima I, Samaras D, Wang F. A systematic study of key elements underlying molecular property prediction. Nat Commun 2023; 14:6395. [PMID: 37833262 PMCID: PMC10575948 DOI: 10.1038/s41467-023-41948-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
Artificial intelligence (AI) has been widely applied in drug discovery with a major task as molecular property prediction. Despite booming techniques in molecular representation learning, key elements underlying molecular property prediction remain largely unexplored, which impedes further advancements in this field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioids-related datasets and two additional activity datasets from the literature. To investigate the predictive power in low-data and high-data space, a series of descriptors datasets of varying sizes are also assembled to evaluate the models. In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4200 models on SMILES sequences and 8400 models on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction in most datasets. Besides, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore into potential causes why representation learning models can fail and show that dataset size is essential for representation learning models to excel.
Collapse
Affiliation(s)
- Jianyuan Deng
- Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA
| | - Zhibo Yang
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
| | - Hehe Wang
- Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
| | - Iwao Ojima
- Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
| | - Dimitris Samaras
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
| | - Fusheng Wang
- Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA.
- Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA.
| |
Collapse
|
67
|
Riedl M, Mukherjee S, Gauthier M. Descriptor-Free Deep Learning QSAR Model for the Fraction Unbound in Human Plasma. Mol Pharm 2023; 20:4984-4993. [PMID: 37656906 DOI: 10.1021/acs.molpharmaceut.3c00129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/03/2023]
Abstract
Chemical-specific parameters are either measured in vitro or estimated using quantitative structure-activity relationship (QSAR) models. The existing body of QSAR work relies on extracting a set of descriptors or fingerprints, subset selection, and training a machine learning model. In this work, we used a state-of-the-art natural language processing model, Bidirectional Encoder Representations from Transformers, which allowed us to circumvent the need for calculation of these chemical descriptors. In this approach, simplified molecular-input line-entry system (SMILES) strings were embedded in a high-dimensional space using a two-stage training approach. The model was first pre-trained on a masked SMILES token task and then fine-tuned on a QSAR prediction task. The pre-training task learned meaningful high-dimensional embeddings based upon the relationships between the chemical tokens in the SMILES strings derived from the "in-stock" portion of the ZINC 15 dataset─a large dataset of commercially available chemicals. The fine-tuning task then perturbed the pre-trained embeddings to facilitate prediction of a specific QSAR endpoint of interest. The power of this model stems from the ability to reuse the pre-trained model for multiple different fine-tuning tasks, reducing the computational burden of developing multiple models for different endpoints. We used our framework to develop a predictive model for fraction unbound in human plasma (fu,p). This approach is flexible, requires minimum domain expertise, and can be generalized for other parameters of interest for rapid and accurate estimation of absorption, distribution, metabolism, excretion, and toxicity.
Collapse
|
68
|
Zhang X, Shen C, Wang T, Deng Y, Kang Y, Li D, Hou T, Pan P. ML-PLIC: a web platform for characterizing protein-ligand interactions and developing machine learning-based scoring functions. Brief Bioinform 2023; 24:bbad295. [PMID: 37738401 DOI: 10.1093/bib/bbad295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Revised: 07/17/2023] [Accepted: 07/31/2023] [Indexed: 09/24/2023] Open
Abstract
Cracking the entangling code of protein-ligand interaction (PLI) is of great importance to structure-based drug design and discovery. Different physical and biochemical representations can be used to describe PLI such as energy terms and interaction fingerprints, which can be analyzed by machine learning (ML) algorithms to create ML-based scoring functions (MLSFs). Here, we propose the ML-based PLI capturer (ML-PLIC), a web platform that automatically characterizes PLI and generates MLSFs to identify the potential binders of a specific protein target through virtual screening (VS). ML-PLIC comprises five modules, including Docking for ligand docking, Descriptors for PLI generation, Modeling for MLSF training, Screening for VS and Pipeline for the integration of the aforementioned functions. We validated the MLSFs constructed by ML-PLIC in three benchmark datasets (Directory of Useful Decoys-Enhanced, Active as Decoys and TocoDecoy), demonstrating accuracy outperforming traditional docking tools and competitive performance to the deep learning-based SF, and provided a case study of the Serine/threonine-protein kinase WEE1 in which MLSFs were developed by using the ML-based VS pipeline in ML-PLIC. Underpinning the latest version of ML-PLIC is a powerful platform that incorporates physical and biological knowledge about PLI, leveraging PLI characterization and MLSF generation into the design of structure-based VS pipeline. The ML-PLIC web platform is now freely available at http://cadd.zju.edu.cn/plic/.
Collapse
Affiliation(s)
- Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Hangzhou Carbonsilicon AI Technology Co., Ltd, Hangzhou 310018, Zhejiang, China
| | - Tianyue Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Yafeng Deng
- Hangzhou Carbonsilicon AI Technology Co., Ltd, Hangzhou 310018, Zhejiang, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Dan Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
Collapse
|
69
|
Wojtuch A, Danel T, Podlewska S, Maziarka Ł. Extended study on atomic featurization in graph neural networks for molecular property prediction. J Cheminform 2023; 15:81. [PMID: 37726841 PMCID: PMC10507875 DOI: 10.1186/s13321-023-00751-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Accepted: 08/23/2023] [Indexed: 09/21/2023] Open
Abstract
Graph neural networks have recently become a standard method for analyzing chemical compounds. In the field of molecular property prediction, the emphasis is now on designing new model architectures, and the importance of atom featurization is oftentimes belittled. When contrasting two graph neural networks, the use of different representations possibly leads to incorrect attribution of the results solely to the network architecture. To better understand this issue, we compare multiple atom representations by evaluating them on the prediction of free energy, solubility, and metabolic stability using graph convolutional networks. We discover that the choice of atom representation has a significant impact on model performance and that the optimal subset of features is task-specific. Additional experiments involving more sophisticated architectures, including graph transformers, support these findings. Moreover, we demonstrate that some commonly used atom features, such as the number of neighbors or the number of hydrogens, can be easily predicted using only information about bonds and atom type, yet their explicit inclusion in the representation has a positive impact on model performance. Finally, we explain the predictions of the best-performing models to better understand how they utilize the available atomic features.
Collapse
Affiliation(s)
- Agnieszka Wojtuch
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland.
| | - Tomasz Danel
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland
| | - Sabina Podlewska
- Maj Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343, Kraków, Poland
| | - Łukasz Maziarka
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland
| |
Collapse
|
70
|
Sun H, Yang Q, Yu X, Huang M, Ding M, Li W, Tang Y, Liu G. Prediction of IDO1 Inhibitors by a Fingerprint-Based Stacking Ensemble Model Named IDO1Stack. ChemMedChem 2023; 18:e202300151. [PMID: 37340939 DOI: 10.1002/cmdc.202300151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Revised: 06/16/2023] [Accepted: 06/20/2023] [Indexed: 06/22/2023]
Abstract
Indoleamine 2,3-dioxygenase 1 (IDO1) is viewed as an extremely promising target for cancer immunotherapy. Here, we proposed a two-layer stacking ensemble model, IDO1Stack, that can efficiently predict IDO1 inhibitors. First, we constructed a series of classification models based on five machine learning algorithms and eight molecular characterization methods. Then, a stacking ensemble model was built using the top five models as the base classifier and logistic regression as the meta-classifier. The areas under the receiver operating characteristic curve (AUC) of IDO1Stack on the test set and external validation set were 0.952 and 0.918, respectively. Furthermore, we computed the applicability domain and privileged substructures of the model and interpreted the model using SHapley Additive exPlanations (SHAP). It is expected that IDO1Stack can well study the interaction between target and ligand, providing practitioners with a reliable tool for rapid screening and discovery of IDO1 inhibitors.
Collapse
Affiliation(s)
- Huimin Sun
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Qing Yang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, 2005 Songhu Road, Shanghai, 200438, China
| | - Xinxin Yu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Mengting Huang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Meng Ding
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Weihua Li
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Yun Tang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| | - Guixia Liu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai, 200237, China
| |
Collapse
|
71
|
Boldini D, Grisoni F, Kuhn D, Friedrich L, Sieber SA. Practical guidelines for the use of gradient boosting for molecular property prediction. J Cheminform 2023; 15:73. [PMID: 37641120 PMCID: PMC10464382 DOI: 10.1186/s13321-023-00743-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 08/09/2023] [Indexed: 08/31/2023] Open
Abstract
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
Collapse
Affiliation(s)
- Davide Boldini
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany
| | - Francesca Grisoni
- Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
- Centre for Living Technologies, Alliance TU/E, WUR, UU, UMC Utrecht, Utrecht, The Netherlands
| | | | | | - Stephan A Sieber
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany.
| |
Collapse
|
72
|
Ng SS, Lu Y. Evaluating the Use of Graph Neural Networks and Transfer Learning for Oral Bioavailability Prediction. J Chem Inf Model 2023; 63:5035-5044. [PMID: 37582507 PMCID: PMC10467575 DOI: 10.1021/acs.jcim.3c00554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2023] [Indexed: 08/17/2023]
Abstract
Oral bioavailability is a pharmacokinetic property that plays an important role in drug discovery. Recently developed computational models involve the use of molecular descriptors, fingerprints, and conventional machine-learning models. However, determining the type of molecular descriptors requires domain expert knowledge and time for feature selection. With the emergence of the graph neural network (GNN), models can be trained to automatically extract features that they deem important. In this article, we exploited the automatic feature selection of GNN to predict oral bioavailability. To enhance the prediction performance of GNN, we utilized transfer learning by pre-training a model to predict solubility and obtained a final average accuracy of 0.797, an F1 score of 0.840, and an AUC-ROC of 0.867, which outperformed previous studies on predicting oral bioavailability with the same test data set.
Collapse
Affiliation(s)
- Sherwin
S. S. Ng
- School of Chemistry, Chemistry Engineering
and Biotechnology, Nanyang Technological
University, 21 Nanyang Link, Singapore 637371, Singapore
| | - Yunpeng Lu
- School of Chemistry, Chemistry Engineering
and Biotechnology, Nanyang Technological
University, 21 Nanyang Link, Singapore 637371, Singapore
| |
Collapse
|
73
|
Miao Y, Ma H, Huang J. Recent Advances in Toxicity Prediction: Applications of Deep Graph Learning. Chem Res Toxicol 2023; 36:1206-1226. [PMID: 37562046 DOI: 10.1021/acs.chemrestox.2c00384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/12/2023]
Abstract
The development of new drugs is time-consuming and expensive, and as such, accurately predicting the potential toxicity of a drug candidate is crucial in ensuring its safety and efficacy. Recently, deep graph learning has become prevalent in this field due to its computational power and cost efficiency. Many novel deep graph learning methods aid toxicity prediction and further prompt drug development. This review aims to connect fundamental knowledge with burgeoning deep graph learning methods. We first summarize the essential components of deep graph learning models for toxicity prediction, including molecular descriptors, molecular representations, evaluation metrics, validation methods, and data sets. Furthermore, based on various graph-related representations of molecules, we introduce several representative studies and methods for toxicity prediction from the perspective of GNN architectures and graph pretrained models. Compared to other types of models, deep graph models not only advance in higher accuracy and efficiency but also provide more intuitive insights, which is significant in the development of model interpretation and generalization ability. The graph pretrained models are emerging as they can extract prominent features from large-scale unlabeled molecular graph data and improve the performance of downstream toxicity prediction tasks. We hope this survey can serve as a handbook for individuals interested in exploring deep graph learning for toxicity prediction.
Collapse
Affiliation(s)
- Yuwei Miao
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas 76019, United States
| | - Hehuan Ma
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas 76019, United States
| | - Junzhou Huang
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas 76019, United States
| |
Collapse
|
74
|
Malhotra P, Biswas S, Sharma GD. Directed Message Passing Neural Network for Predicting Power Conversion Efficiency in Organic Solar Cells. ACS APPLIED MATERIALS & INTERFACES 2023; 15:37741-37747. [PMID: 37490851 DOI: 10.1021/acsami.3c08068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/27/2023]
Abstract
Organic solar cells (OSCs) have emerged as a promising technology for renewable energy generation, and researchers are constantly exploring ways to improve their efficiency. For prediction of photovoltaic properties in OSCs, many machine learning models have been used in the past. All the models are used with fixed molecular descriptors and molecular fingerprints as input for power conversion efficiency (PCE) prediction. Recently, the graph neural network (GNN), which can model graph structures of the molecule, has received increasing attention as a method that could potentially overcome the limitations of fixed descriptors by learning the task-specific representations using graph convolutions. In this study, we have used the directed message passing neural network (D-MPNN), an emerging type of GNN for predicting PCE of organic solar cells, and the results are compared for the same train and test set with fixed descriptors and fingerprints. The excellent performance demonstrated by the D-MPNN model in this investigation highlights its potential for predicting PCE, surpassing the limitations of conventional fixed descriptors.
Collapse
Affiliation(s)
- Prateek Malhotra
- Department of Physics, The LNM Institute of Information Technology, Jamdoli, Jaipur, Rajasthan 302031, India
| | - Subhayan Biswas
- Department of Physics, The LNM Institute of Information Technology, Jamdoli, Jaipur, Rajasthan 302031, India
| | - Ganesh D Sharma
- Department of Physics, The LNM Institute of Information Technology, Jamdoli, Jaipur, Rajasthan 302031, India
| |
Collapse
|
75
|
Chaka M, Geffe CA, Rodriguez A, Seriani N, Wu Q, Mekonnen YS. High-Throughput Screening of Promising Redox-Active Molecules with MolGAT. ACS OMEGA 2023; 8:24268-24278. [PMID: 37457475 PMCID: PMC10339396 DOI: 10.1021/acsomega.3c01295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/26/2023] [Accepted: 06/12/2023] [Indexed: 07/18/2023]
Abstract
Redox flow batteries (RFBs) have emerged as a promising option for large-scale energy storage, owing to their high energy density, low cost, and environmental benefits. However, the identification of organic compounds with high redox activity, aqueous solubility, stability, and fast redox kinetics is a crucial and challenging step in developing an RFB technology. Density functional theory-based computational materials prediction and screening is a time-consuming and computationally expensive technique, yet it has a high success rate. To speed up the discovery of new materials with desired properties, machine-learning-based models can be trained on large data sets. Graph neural networks (GNNs) are particularly well-suited for non-Euclidean data and can model complex relationships, making them ideal for accelerating the discovery of novel materials. In this study, a GNN-based model called MolGAT was developed to predict the redox potential of organic molecules using molecular structures, atomic properties, and bond attributes. The model was trained on a data set of over 15,000 compounds with redox potentials ranging from -4.11 to 2.56. MolGAT outperformed other GNN variants, such as the Graph Attention Network, Graph Convolution Network, and AttentiveFP models. The trained model was used to screen a vast chemical data set comprising 581,014 molecules, namely OMDB, QM9, ZINC, CHEMBL, and DELANEY, and identified 23,467 potential redox-active compounds for use in redox flow batteries. Of those, 20,716 molecules were identified as potential catholytes with predicted redox potentials up to 2.87 V, while 2,751 molecules were deemed potential anolytes with predicted redox potentials as low as -2.88 V. This work demonstrates the capabilities of graph neural networks in condensed matter physics and materials science to screen promising redox-active species for further electronic structure calculations and experimental testing.
Collapse
Affiliation(s)
- Mesfin
Diro Chaka
- Department
of Physics, College of Natural and Computational Sciences, Addis Ababa University, P.O. Box 1176, Addis Ababa 1176, Ethiopia
- Computational
Data Science, College of Natural and Computational Sciences, Addis Ababa University, P.O. Box 1176, Addis Ababa 1176, Ethiopia
| | - Chernet Amente Geffe
- Department
of Physics, College of Natural and Computational Sciences, Addis Ababa University, P.O. Box 1176, Addis Ababa 1176, Ethiopia
| | - Alex Rodriguez
- The Abdus
Salam International Centre for Theoretical Physics(ICTP) Condensed Matter and Statistical Physics Section, 34100 Trieste, Italy
| | - Nicola Seriani
- The Abdus
Salam International Centre for Theoretical Physics(ICTP) Condensed Matter and Statistical Physics Section, 34100 Trieste, Italy
| | - Qin Wu
- Brookhaven
National Laboratory, Center for Functional Nanomaterials, Upton New York 11973, United States
| | - Yedilfana Setarge Mekonnen
- Center for
Environmental Science, College of Natural and Computational Sciences, Addis Ababa University, P.O. Box 1176, Addis Ababa 1176, Ethiopia
| |
Collapse
|
76
|
Walke D, Micheel D, Schallert K, Muth T, Broneske D, Saake G, Heyer R. The importance of graph databases and graph learning for clinical applications. Database (Oxford) 2023; 2023:baad045. [PMID: 37428679 PMCID: PMC10332447 DOI: 10.1093/database/baad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Revised: 05/26/2023] [Accepted: 06/16/2023] [Indexed: 07/12/2023]
Abstract
The increasing amount and complexity of clinical data require an appropriate way of storing and analyzing those data. Traditional approaches use a tabular structure (relational databases) for storing data and thereby complicate storing and retrieving interlinked data from the clinical domain. Graph databases provide a great solution for this by storing data in a graph as nodes (vertices) that are connected by edges (links). The underlying graph structure can be used for the subsequent data analysis (graph learning). Graph learning consists of two parts: graph representation learning and graph analytics. Graph representation learning aims to reduce high-dimensional input graphs to low-dimensional representations. Then, graph analytics uses the obtained representations for analytical tasks like visualization, classification, link prediction and clustering which can be used to solve domain-specific problems. In this survey, we review current state-of-the-art graph database management systems, graph learning algorithms and a variety of graph applications in the clinical domain. Furthermore, we provide a comprehensive use case for a clearer understanding of complex graph learning algorithms. Graphical abstract.
Collapse
Affiliation(s)
- Daniel Walke
- Bioprocess Engineering, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
| | - Daniel Micheel
- Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
| | - Kay Schallert
- Multidimensional Omics Analyses Group, Leibniz-Institut für Analytische Wissenschaften—ISAS—e.V., Bunsen-Kirchhoff-Straße 11, Dortmund 44139, Germany
| | - Thilo Muth
- Section eScience (S.3), Federal Institute for Materials Research and Testing (BAM), Unter den Eichen 87, Berlin 12205, Germany
| | - David Broneske
- Infrastructure and Methods, German Center for Higher Education Research and Science Studies (DZHW), Lange Laube 12, Hannover 30159, Germany
| | - Gunter Saake
- Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
| | - Robert Heyer
- Multidimensional Omics Analyses Group, Leibniz-Institut für Analytische Wissenschaften—ISAS—e.V., Bunsen-Kirchhoff-Straße 11, Dortmund 44139, Germany
- Faculty of Technology, Bielefeld University, Universitätsstraße 25, Bielefeld 33615, Germany
| |
Collapse
|
77
|
Qureshi R, Irfan M, Gondal TM, Khan S, Wu J, Hadi MU, Heymach J, Le X, Yan H, Alam T. AI in drug discovery and its clinical relevance. Heliyon 2023; 9:e17575. [PMID: 37396052 PMCID: PMC10302550 DOI: 10.1016/j.heliyon.2023.e17575] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Revised: 06/17/2023] [Accepted: 06/21/2023] [Indexed: 07/04/2023] Open
Abstract
The COVID-19 pandemic has emphasized the need for novel drug discovery process. However, the journey from conceptualizing a drug to its eventual implementation in clinical settings is a long, complex, and expensive process, with many potential points of failure. Over the past decade, a vast growth in medical information has coincided with advances in computational hardware (cloud computing, GPUs, and TPUs) and the rise of deep learning. Medical data generated from large molecular screening profiles, personal health or pathology records, and public health organizations could benefit from analysis by Artificial Intelligence (AI) approaches to speed up and prevent failures in the drug discovery pipeline. We present applications of AI at various stages of drug discovery pipelines, including the inherently computational approaches of de novo design and prediction of a drug's likely properties. Open-source databases and AI-based software tools that facilitate drug design are discussed along with their associated problems of molecule representation, data collection, complexity, labeling, and disparities among labels. How contemporary AI methods, such as graph neural networks, reinforcement learning, and generated models, along with structure-based methods, (i.e., molecular dynamics simulations and molecular docking) can contribute to drug discovery applications and analysis of drug responses is also explored. Finally, recent developments and investments in AI-based start-up companies for biotechnology, drug design and their current progress, hopes and promotions are discussed in this article.
Collapse
Affiliation(s)
- Rizwan Qureshi
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Department of Imaging Physics, MD Anderson Cancer Center, The University of Texas, Houston, USA
| | - Muhammad Irfan
- Faculty of Electrical Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Swabi, Pakistan
| | | | - Sheheryar Khan
- School of Professional Education & Executive Development, The Hong Kong Polytechnic University, Hong Kong
| | - Jia Wu
- Department of Imaging Physics, MD Anderson Cancer Center, The University of Texas, Houston, USA
| | | | - John Heymach
- Department of Thoracic Head and Neck Medical Oncology, Division of Cancer Medicine, The University of Texas, MD Anderson Cancer Center, Houston, USA
| | - Xiuning Le
- Department of Thoracic Head and Neck Medical Oncology, Division of Cancer Medicine, The University of Texas, MD Anderson Cancer Center, Houston, USA
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong
| | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| |
Collapse
|
78
|
Zhang S, Jin Y, Liu T, Wang Q, Zhang Z, Zhao S, Shan B. SS-GNN: A Simple-Structured Graph Neural Network for Affinity Prediction. ACS OMEGA 2023; 8:22496-22507. [PMID: 37396234 PMCID: PMC10308598 DOI: 10.1021/acsomega.3c00085] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/02/2023] [Accepted: 06/01/2023] [Indexed: 07/04/2023]
Abstract
Efficient and effective drug-target binding affinity (DTBA) prediction is a challenging task due to the limited computational resources in practical applications and is a crucial basis for drug screening. Inspired by the good representation ability of graph neural networks (GNNs), we propose a simple-structured GNN model named SS-GNN to accurately predict DTBA. By constructing a single undirected graph based on a distance threshold to represent protein-ligand interactions, the scale of the graph data is greatly reduced. Moreover, ignoring covalent bonds in the protein further reduces the computational cost of the model. The graph neural network-multilayer perceptron (GNN-MLP) module takes the latent feature extraction of atoms and edges in the graph as two mutually independent processes. We also develop an edge-based atom-pair feature aggregation method to represent complex interactions and a graph pooling-based method to predict the binding affinity of the complex. We achieve state-of-the-art prediction performance using a simple model (with only 0.6 M parameters) without introducing complicated geometric feature descriptions. SS-GNN achieves Pearson's Rp = 0.853 on the PDBbind v2016 core set, outperforming state-of-the-art GNN-based methods by 5.2%. Moreover, the simplified model structure and concise data processing procedure improve the prediction efficiency of the model. For a typical protein-ligand complex, affinity prediction takes only 0.2 ms. All codes are freely accessible at https://github.com/xianyuco/SS-GNN.
Collapse
Affiliation(s)
- Shuke Zhang
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- Shijiazhuang
Xianyu Digital Biotechnology Co., Ltd, Shijiazhuang 050024, China
| | - Yanzhao Jin
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- Shijiazhuang
Xianyu Digital Biotechnology Co., Ltd, Shijiazhuang 050024, China
| | - Tianmeng Liu
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- Shijiazhuang
Xianyu Digital Biotechnology Co., Ltd, Shijiazhuang 050024, China
| | - Qi Wang
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- Shijiazhuang
Xianyu Digital Biotechnology Co., Ltd, Shijiazhuang 050024, China
| | - Zhaohui Zhang
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- College
of Computer and Cyber Security, Hebei Normal
University, Shijiazhuang 050024, China
| | - Shuliang Zhao
- College
of Computer and Cyber Security, Hebei Normal
University, Shijiazhuang 050024, China
- Hebei
Provincial Key Laboratory of Network and Information Security, Shijiazhuang 050024, China
- Hebei
Provincial Engineering Research Center for Supply Chain Big Data Analytics
& Data Security, Shijiazhuang 050024, China
| | - Bo Shan
- Software
College, Hebei Normal University, Shijiazhuang 050024, China
- Shijiazhuang
Xianyu Digital Biotechnology Co., Ltd, Shijiazhuang 050024, China
| |
Collapse
|
79
|
Li J, Fang L, Lou JG. RetroRanker: leveraging reaction changes to improve retrosynthesis prediction through re-ranking. J Cheminform 2023; 15:58. [PMID: 37291642 DOI: 10.1186/s13321-023-00727-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Accepted: 05/28/2023] [Indexed: 06/10/2023] Open
Abstract
Retrosynthesis is an important task in organic chemistry. Recently, numerous data-driven approaches have achieved promising results in this task. However, in practice, these data-driven methods might lead to sub-optimal outcomes by making predictions based on the training data distribution, a phenomenon we refer as frequency bias. For example, in template-based approaches, low-ranked predictions are typically generated by less common templates with low confidence scores which might be too low to be comparable, and it is observed that recorded reactants can be among these low-ranked predictions. In this work, we introduce RetroRanker, a ranking model built upon graph neural networks, designed to mitigate the frequency bias in predictions of existing retrosynthesis models through re-ranking. RetroRanker incorporates potential reaction changes of each set of predicted reactants in obtaining the given product to lower the rank of chemically unreasonable predictions. The predicted re-ranked results on publicly available retrosynthesis benchmarks demonstrate that we can achieve improvement on most state-of-the-art models with RetroRanker. Our preliminary studies also indicate that RetroRanker can enhance the performance of multi-step retrosynthesis.
Collapse
Affiliation(s)
- Junren Li
- College of Chemistry and Molecular Engineering, Peking University, No. 5 Yiheyuan Road, Beijing, 100871, China
| | - Lei Fang
- Microsoft Research Asia, Building 2, No. 5 Dan Ling Street, Beijing, 100080, China.
| | - Jian-Guang Lou
- Microsoft Research Asia, Building 2, No. 5 Dan Ling Street, Beijing, 100080, China
| |
Collapse
|
80
|
Xu H, Lin J, Zhang D, Mo F. Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network. Nat Commun 2023; 14:3095. [PMID: 37248214 DOI: 10.1038/s41467-023-38853-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2022] [Accepted: 05/17/2023] [Indexed: 05/31/2023] Open
Abstract
The enantioseparation of chiral molecules is a crucial and challenging task in the field of experimental chemistry, often requiring extensive trial and error with different experimental settings. To overcome this challenge, here we show a research framework that employs machine learning techniques to predict retention times of enantiomers and facilitate chromatographic enantioseparation. A documentary dataset of chiral molecular retention times in high-performance liquid chromatography (CMRT dataset) is established to handle the challenge of data acquisition. A quantile geometry-enhanced graph neural network is proposed to learn the molecular structure-retention time relationship, which shows a satisfactory predictive ability for enantiomers. The domain knowledge of chromatography is incorporated into the machine learning model to achieve multi-column prediction, which paves the way for chromatographic enantioseparation prediction by calculating the separation probability. The proposed research framework works well in retention time prediction and chromatographic enantioseparation facilitation, which sheds light on the application of machine learning techniques to the experimental scene and improves the efficiency of experimenters to speed up scientific discovery.
Collapse
Affiliation(s)
- Hao Xu
- School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China
- BIC-ESAT, ERE, and SKLTCS, College of Engineering, Peking University, 100871, Beijing, P. R. China
| | - Jinglong Lin
- School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China
| | - Dongxiao Zhang
- Eastern Institute for Advanced Study, Eastern Institute of Technology, 315200, Ningbo, Zhejiang, P. R. China.
- Department of Mathematics and Theories, Peng Cheng Laboratory, 518000, Shenzhen, Guangdong, P. R. China.
| | - Fanyang Mo
- School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China.
- AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, 518055, Shenzhen, P. R. China.
| |
Collapse
|
81
|
Fang C, Wang Y, Grater R, Kapadnis S, Black C, Trapa P, Sciabola S. Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective. J Chem Inf Model 2023. [PMID: 37216672 DOI: 10.1021/acs.jcim.3c00160] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
Absorption, distribution, metabolism, and excretion (ADME), which collectively define the concentration profile of a drug at the site of action, are of critical importance to the success of a drug candidate. Recent advances in machine learning algorithms and the availability of larger proprietary as well as public ADME data sets have generated renewed interest within the academic and pharmaceutical science communities in predicting pharmacokinetic and physicochemical endpoints in early drug discovery. In this study, we collected 120 internal prospective data sets over 20 months across six ADME in vitro endpoints: human and rat liver microsomal stability, MDR1-MDCK efflux ratio, solubility, and human and rat plasma protein binding. A variety of machine learning algorithms in combination with different molecular representations were evaluated. Our results suggest that gradient boosting decision tree and deep learning models consistently outperformed random forest over time. We also observed better performance when models were retrained on a fixed schedule, and the more frequent retraining generally resulted in increased accuracy, while hyperparameters tuning only improved the prospective predictions marginally.
Collapse
Affiliation(s)
- Cheng Fang
- Medicinal Chemistry, Biogen, Cambridge, Massachusetts 02142, United States
| | - Ye Wang
- Medicinal Chemistry, Biogen, Cambridge, Massachusetts 02142, United States
| | - Richard Grater
- DMPK, Biogen, Cambridge, Massachusetts 02142, United States
| | | | - Cheryl Black
- DMPK, Biogen, Cambridge, Massachusetts 02142, United States
| | - Patrick Trapa
- DMPK, Biogen, Cambridge, Massachusetts 02142, United States
| | - Simone Sciabola
- Medicinal Chemistry, Biogen, Cambridge, Massachusetts 02142, United States
| |
Collapse
|
82
|
Guha R, Velegol D. Harnessing Shannon entropy-based descriptors in machine learning models to enhance the prediction accuracy of molecular properties. J Cheminform 2023; 15:54. [PMID: 37211605 DOI: 10.1186/s13321-023-00712-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2022] [Accepted: 03/18/2023] [Indexed: 05/23/2023] Open
Abstract
Accurate prediction of molecular properties is essential in the screening and development of drug molecules and other functional materials. Traditionally, property-specific molecular descriptors are used in machine learning models. This in turn requires the identification and development of target or problem-specific descriptors. Additionally, an increase in the prediction accuracy of the model is not always feasible from the standpoint of targeted descriptor usage. We explored the accuracy and generalizability issues using a framework of Shannon entropies, based on SMILES, SMARTS and/or InChiKey strings of respective molecules. Using various public databases of molecules, we showed that the accuracy of the prediction of machine learning models could be significantly enhanced simply by using Shannon entropy-based descriptors evaluated directly from SMILES. Analogous to partial pressures and total pressure of gases in a mixture, we used atom-wise fractional Shannon entropy in combination with total Shannon entropy from respective tokens of the string representation to model the molecule efficiently. The proposed descriptor was competitive in performance with standard descriptors such as Morgan fingerprints and SHED in regression models. Additionally, we found that either a hybrid descriptor set containing the Shannon entropy-based descriptors or an optimized, ensemble architecture of multilayer perceptrons and graph neural networks using the Shannon entropies was synergistic to improve the prediction accuracy. This simple approach of coupling the Shannon entropy framework to other standard descriptors and/or using it in ensemble models could find applications in boosting the performance of molecular property predictions in chemistry and material science.
Collapse
Affiliation(s)
- Rajarshi Guha
- Intel Corporation, 2501 NE Century Blvd, Hillsboro, OR, 97124, USA.
| | - Darrell Velegol
- Department of Chemical Engineering, Pennsylvania State University, University Park, PA, 16802, USA
| |
Collapse
|
83
|
Liu W, Wang Z, Chen J, Tang W, Wang H. Machine Learning Model for Screening Thyroid Stimulating Hormone Receptor Agonists Based on Updated Datasets and Improved Applicability Domain Metrics. Chem Res Toxicol 2023. [PMID: 37209109 DOI: 10.1021/acs.chemrestox.3c00074] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
Machine learning (ML) models for screening endocrine-disrupting chemicals (EDCs), such as thyroid stimulating hormone receptor (TSHR) agonists, are essential for sound management of chemicals. Previous models for screening TSHR agonists were built on imbalanced datasets and lacked applicability domain (AD) characterization essential for regulatory application. Herein, an updated TSHR agonist dataset was built, for which the ratio of active to inactive compounds greatly increased to 1:2.6, and chemical spaces of structure-activity landscapes (SALs) were enhanced. Resulting models based on 7 molecular representations and 4 ML algorithms were proven to outperform previous ones. Weighted similarity density (ρs) and weighted inconsistency of activities (IA) were proposed to characterize the SALs, and a state-of-the-art AD characterization methodology ADSAL{ρs, IA} was established. An optimal classifier developed with PubChem fingerprints and the random forest algorithm, coupled with ADSAL{ρs ≥ 0.15, IA ≤ 0.65}, exhibited good performance on the validation set with the area under the receiver operating characteristic curve being 0.984 and balanced accuracy being 0.941 and identified 90 TSHR agonist classes that could not be found previously. The classifier together with the ADSAL{ρs, IA} may serve as efficient tools for screening EDCs, and the AD characterization methodology may be applied to other ML models.
Collapse
Affiliation(s)
- Wenjia Liu
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhongyu Wang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jingwen Chen
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Weihao Tang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Haobo Wang
- Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
| |
Collapse
|
84
|
Hajiabolhassan H, Taheri Z, Hojatnia A, Yeganeh YT. FunQG: Molecular Representation Learning via Quotient Graphs. J Chem Inf Model 2023. [PMID: 37186874 DOI: 10.1021/acs.jcim.3c00445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
To accurately predict molecular properties, it is important to learn expressive molecular representations. Graph neural networks (GNNs) have made significant advances in this area, but they often face limitations like neighbors-explosion, under-reaching, oversmoothing, and oversquashing. Additionally, GNNs tend to have high computational costs due to their large number of parameters. These limitations emerge or increase when dealing with larger graphs or deeper GNN models. One potential solution is to simplify the molecular graph into a smaller, richer, and more informative one that is easier to train GNNs. Our proposed molecular graph coarsening framework called FunQG, uses Functional groups as building blocks to determine a molecule's properties, based on a graph-theoretic concept called Quotient Graph. We show through experiments that the resulting informative graphs are much smaller than the original molecular graphs and are thus more suitable for training GNNs. We apply FunQG to popular molecular property prediction benchmarks and compare the performance of popular baseline GNNs on the resulting data sets to that of state-of-the-art baselines on the original data sets. Our experiments demonstrate that FunQG yields notable results on various data sets while dramatically reducing the number of parameters and computational costs. By utilizing functional groups, we can achieve an interpretable framework that indicates their significant role in determining the properties of molecular quotient graphs. Consequently, FunQG is a straightforward, computationally efficient, and generalizable solution for addressing the molecular representation learning problem.
Collapse
Affiliation(s)
- Hossein Hajiabolhassan
- Department of Mathematics and Information Technology, Chair of Information Technology, Montanuniversität Leoben, Franz-Josef-Strasse 18, A-8700 Leoben, Austria
- Machine Learning and Graph Mining Lab, Department of Applied Mathematics, Faculty of Mathematical Sciences, Shahid Beheshti University, 19839-69411 Tehran, Iran
| | - Zahra Taheri
- Machine Learning and Graph Mining Lab, Department of Applied Mathematics, Faculty of Mathematical Sciences, Shahid Beheshti University, 19839-69411 Tehran, Iran
| | - Ali Hojatnia
- Machine Learning and Graph Mining Lab, Department of Applied Mathematics, Faculty of Mathematical Sciences, Shahid Beheshti University, 19839-69411 Tehran, Iran
| | - Yavar Taheri Yeganeh
- Machine Learning and Graph Mining Lab, Department of Applied Mathematics, Faculty of Mathematical Sciences, Shahid Beheshti University, 19839-69411 Tehran, Iran
| |
Collapse
|
85
|
Stewart M, Martin ST. Machine Learning for Ionization Potentials and Photoionization Cross Sections of Volatile Organic Compounds. ACS EARTH & SPACE CHEMISTRY 2023; 7:863-875. [PMID: 37152449 PMCID: PMC10152554 DOI: 10.1021/acsearthspacechem.3c00009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/11/2023] [Revised: 03/16/2023] [Accepted: 03/17/2023] [Indexed: 05/09/2023]
Abstract
Molecular ionization potentials (IP) and photoionization cross sections (σ) can affect the sensitivity of photoionization detectors (PIDs) and other sensors for gaseous species. This study employs several methods of machine learning (ML) to predict IP and σ values at 10.6 eV (117 nm) for a dataset of 1251 gaseous organic species. The explicitness of the treatment of the species electronic structure progressively increases among the methods. The study compares the ML predictions of the IP and σ values to those obtained by quantum chemical calculations. The ML predictions are comparable in performance to those of the quantum calculations when evaluated against measurements. Pretraining further reduces the mean absolute errors (ε) compared to the measurements. The graph-based attentive fingerprint model was most accurate, for which εIP = 0.23 ± 0.01 eV and εσ = 2.8 ± 0.2 Mb compared to measurements and computed cross sections, respectively. The ML predictions for IP correlate well with both the measured IPs (R 2 = 0.88) and with IPs computed at the level of M06-2X/aug-cc-pVTZ (R 2 = 0.82). The ML predictions for σ correlated reasonably well with computed cross sections (R 2 = 0.66). The developed ML methods for IP and σ values, representing the properties of a generalizable set of volatile organic compounds (VOCs) relevant to industrial applications and atmospheric chemistry, can be used to quantitatively describe the species-dependent sensitivity of chemical sensors that use ionizing radiation as part of the sensing mechanism, such as photoionization detectors.
Collapse
Affiliation(s)
- Matthew
P. Stewart
- School
of Engineering and Applied Sciences, Harvard
University, Cambridge, Massachusetts 02138, United States
| | - Scot T. Martin
- School
of Engineering and Applied Sciences, Harvard
University, Cambridge, Massachusetts 02138, United States
- Department
of Earth and Planetary Sciences, Harvard
University, Cambridge, Massachusetts 02138, United States
| |
Collapse
|
86
|
Dablander M, Hanser T, Lambiotte R, Morris GM. Exploring QSAR models for activity-cliff prediction. J Cheminform 2023; 15:47. [PMID: 37069675 PMCID: PMC10107580 DOI: 10.1186/s13321-023-00708-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2023] [Accepted: 03/10/2023] [Indexed: 04/19/2023] Open
Abstract
INTRODUCTION AND METHODOLOGY Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that QSAR models struggle to predict ACs and that ACs thus form a major source of prediction error. However, the AC-prediction power of modern QSAR methods and its quantitative relationship to general QSAR-prediction performance is still underexplored. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. RESULTS AND CONCLUSIONS Our results provide strong support for the hypothesis that indeed QSAR models frequently fail to predict ACs. We observe low AC-sensitivity amongst the evaluated models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance amongs the tested input representations. A potential future pathway to improve QSAR-modelling performance might be the development of techniques to increase AC-sensitivity.
Collapse
Affiliation(s)
- Markus Dablander
- Mathematical Institute, University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter (550), Woodstock Road, Oxford, OX2 6GG, UK
| | - Thierry Hanser
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, UK
| | - Renaud Lambiotte
- Mathematical Institute, University of Oxford, Andrew Wiles Building, Radcliffe Observatory Quarter (550), Woodstock Road, Oxford, OX2 6GG, UK
| | - Garrett M Morris
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK.
| |
Collapse
|
87
|
Ektefaie Y, Dasoulas G, Noori A, Farhat M, Zitnik M. Multimodal learning with graphs. NAT MACH INTELL 2023; 5:340-350. [PMID: 38076673 PMCID: PMC10704992 DOI: 10.1038/s42256-023-00624-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2022] [Accepted: 02/01/2023] [Indexed: 04/05/2023]
Abstract
Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases-the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training. Learning on multimodal datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study existing methods and provide guidelines to design new models.
Collapse
Affiliation(s)
- Yasha Ektefaie
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA
| | - George Dasoulas
- Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA
- Harvard Data Science Initiative, Cambridge, MA 02138, USA
| | - Ayush Noori
- Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA
- Harvard College, Cambridge, MA 02138, USA
| | - Maha Farhat
- Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA
- Division of Pulmonary and Critical Care, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Harvard Data Science Initiative, Cambridge, MA 02138, USA
| |
Collapse
|
88
|
Sharma B, Chenthamarakshan V, Dhurandhar A, Pereira S, Hendler JA, Dordick JS, Das P. Accurate clinical toxicity prediction using multi-task deep neural nets and contrastive molecular explanations. Sci Rep 2023; 13:4908. [PMID: 36966203 PMCID: PMC10039880 DOI: 10.1038/s41598-023-31169-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2022] [Accepted: 03/07/2023] [Indexed: 03/27/2023] Open
Abstract
Explainable machine learning for molecular toxicity prediction is a promising approach for efficient drug development and chemical safety. A predictive ML model of toxicity can reduce experimental cost and time while mitigating ethical concerns by significantly reducing animal and clinical testing. Herein, we use a deep learning framework for simultaneously modeling in vitro, in vivo, and clinical toxicity data. Two different molecular input representations are used; Morgan fingerprints and pre-trained SMILES embeddings. A multi-task deep learning model accurately predicts toxicity for all endpoints, including clinical, as indicated by the area under the Receiver Operator Characteristic curve and balanced accuracy. In particular, pre-trained molecular SMILES embeddings as input to the multi-task model improved clinical toxicity predictions compared to existing models in MoleculeNet benchmark. Additionally, our multitask approach is comprehensive in the sense that it is comparable to state-of-the-art approaches for specific endpoints in in vitro, in vivo and clinical platforms. Through both the multi-task model and transfer learning, we were able to indicate the minimal need of in vivo data for clinical toxicity predictions. To provide confidence and explain the model's predictions, we adapt a post-hoc contrastive explanation method that returns pertinent positive and negative features, which correspond well to known mutagenic and reactive toxicophores, such as unsubstituted bonded heteroatoms, aromatic amines, and Michael receptors. Furthermore, toxicophore recovery by pertinent feature analysis captures more of the in vitro (53%) and in vivo (56%), rather than of the clinical (8%), endpoints, and indeed uncovers a preference in known toxicophore data towards in vitro and in vivo experimental data. To our knowledge, this is the first contrastive explanation, using both present and absent substructures, for predictions of clinical and in vivo molecular toxicity.
Collapse
Affiliation(s)
| | | | | | - Shiranee Pereira
- ICARE, International Center for Alternatives in Research and Education, Chennai, India
| | | | | | - Payel Das
- IBM Research, Yorktown Heights, NY, USA.
| |
Collapse
|
89
|
Huang M, Zhu K, Wang Y, Lou C, Sun H, Li W, Tang Y, Liu G. In Silico Prediction of Metabolic Reaction Catalyzed by Human Aldehyde Oxidase. Metabolites 2023; 13:metabo13030449. [PMID: 36984889 PMCID: PMC10059660 DOI: 10.3390/metabo13030449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 03/17/2023] [Accepted: 03/17/2023] [Indexed: 03/30/2023] Open
Abstract
Aldehyde oxidase (AOX) plays an important role in drug metabolism. Human AOX (hAOX) is widely distributed in the body, and there are some differences between species. Currently, animal models cannot accurately predict the metabolism of hAOX. Therefore, more and more in silico models have been constructed for the prediction of the hAOX metabolism. These models are based on molecular docking and quantum chemistry theory, which are time-consuming and difficult to automate. Therefore, in this study, we compared traditional machine learning methods, graph convolutional neural network methods, and sequence-based methods with limited data, and proposed a ligand-based model for the metabolism prediction catalyzed by hAOX. Compared with the published models, our model achieved better performance (ACC = 0.91, F1 = 0.77). What's more, we built a web server to predict the sites of metabolism (SOMs) for hAOX. In summary, this study provides a convenient and automatable model and builds a web server named Meta-hAOX for accelerating the drug design and optimization stage.
Collapse
Affiliation(s)
- Mengting Huang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Keyun Zhu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Yimeng Wang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Chaofeng Lou
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Huimin Sun
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Weihua Li
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Yun Tang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Guixia Liu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| |
Collapse
|
90
|
Win ZM, Cheong AMY, Hopkins WS. Using Machine Learning To Predict Partition Coefficient (Log P) and Distribution Coefficient (Log D) with Molecular Descriptors and Liquid Chromatography Retention Time. J Chem Inf Model 2023; 63:1906-1913. [PMID: 36926888 DOI: 10.1021/acs.jcim.2c01373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/18/2023]
Abstract
During preclinical evaluations of drug candidates, several physicochemical (p-chem) properties are measured and employed as metrics to estimate drug efficacy in vivo. Two such p-chem properties are the octanol-water partition coefficient, Log P, and distribution coefficient, Log D, which are useful in estimating the distribution of drugs within the body. Log P and Log D are traditionally measured using the shake-flask method and high-performance liquid chromatography. However, it is challenging to measure these properties for species that are very hydrophobic (or hydrophilic) owing to the very low equilibrium concentrations partitioned into octanol (or aqueous) phases. Moreover, the shake-flask method is relatively time-consuming and can require multistep dilutions as the range of analyte concentrations can differ by several orders of magnitude. Here, we circumvent these limitations by using machine learning (ML) to correlate Log P and Log D with liquid chromatography (LC) retention time (RT). Predictive models based on four ML algorithms, which used molecular descriptors and LC RTs as features, were extensively tested and compared. The inclusion of RT as an additional descriptor improves model performance (MAE = 0.366 and R2 = 0.89), and Shapley additive explanations analysis indicates that RT has the highest impact on model accuracy.
Collapse
Affiliation(s)
- Zaw-Myo Win
- Centre for Eye and Vision Research, Hong Kong Science Park, New Territories 999077, Hong Kong.,School of Optometry, The Hong Kong Polytechnic University, Kowloon 999077, Hong Kong.,Department of Chemistry, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada
| | - Allen M Y Cheong
- Centre for Eye and Vision Research, Hong Kong Science Park, New Territories 999077, Hong Kong.,School of Optometry, The Hong Kong Polytechnic University, Kowloon 999077, Hong Kong
| | - W Scott Hopkins
- Centre for Eye and Vision Research, Hong Kong Science Park, New Territories 999077, Hong Kong.,Department of Chemistry, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.,Waterloo Institute for Nanotechnology, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.,WaterMine Innovation, Inc., Waterloo, Ontario N0B 2T0, Canada
| |
Collapse
|
91
|
Lameiro RF, Montanari CA. Investigating the Lack of Translation from Cruzain Inhibition to Trypanosoma cruzi Activity with Machine Learning and Chemical Space Analyses. ChemMedChem 2023; 18:e202200434. [PMID: 36692246 DOI: 10.1002/cmdc.202200434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2022] [Revised: 01/17/2023] [Accepted: 01/17/2023] [Indexed: 01/25/2023]
Abstract
Chagas disease is a neglected tropical disease caused by the protozoa Trypanosoma cruzi. Cruzain, its main cysteine protease, is commonly targeted in drug discovery efforts to find new treatments for this disease. Even though the essentiality of this enzyme for the parasite has been established, many cruzain inhibitors fail as trypanocidal agents. This lack of translation from biochemical to biological assays can involve several factors, including suboptimal physicochemical properties. In this work, we aim to rationalize this phenomenon through chemical space analyses of calculated molecular descriptors. These include statistical tests, visualization of projections, scaffold analysis, and creation of machine learning models coupled with interpretability methods. Our results demonstrate a significant difference between the chemical spaces of cruzain and T. cruzi inhibitors, with compounds with more hydrogen bond donors and rotatable bonds being more likely to be good cruzain inhibitors, but less likely to be active on T. cruzi. In addition, cruzain inhibitors seem to occupy specific regions of the chemical space that cannot be easily correlated with T. cruzi activity, which means that using predictive modeling to determine whether cruzain inhibitors will be trypanocidal is not a straightforward task. We believe that the conclusions from this work might be of interest for future projects that aim to develop novel trypanocidal compounds.
Collapse
Affiliation(s)
- Rafael F Lameiro
- Medicinal and Biological Chemistry Group, São Carlos Institute of Chemistry, University of São Paulo, Trabalhador São-Carlense Avenue 400, São Carlos, Brazil
| | - Carlos A Montanari
- Medicinal and Biological Chemistry Group, São Carlos Institute of Chemistry, University of São Paulo, Trabalhador São-Carlense Avenue 400, São Carlos, Brazil
| |
Collapse
|
92
|
Bian Y, Kwon JJ, Liu C, Margiotta E, Shekhar M, Gould AE. Target-driven machine learning-enabled virtual screening (TAME-VS) platform for early-stage hit identification. Front Mol Biosci 2023; 10:1163536. [PMID: 36994428 PMCID: PMC10040869 DOI: 10.3389/fmolb.2023.1163536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2023] [Accepted: 02/28/2023] [Indexed: 03/15/2023] Open
Abstract
High-throughput screening (HTS) methods enable the empirical evaluation of a large scale of compounds and can be augmented by virtual screening (VS) techniques to save time and money by using potential active compounds for experimental testing. Structure-based and ligand-based virtual screening approaches have been extensively studied and applied in drug discovery practice with proven outcomes in advancing candidate molecules. However, the experimental data required for VS are expensive, and hit identification in an effective and efficient manner is particularly challenging during early-stage drug discovery for novel protein targets. Herein, we present our TArget-driven Machine learning-Enabled VS (TAME-VS) platform, which leverages existing chemical databases of bioactive molecules to modularly facilitate hit finding. Our methodology enables bespoke hit identification campaigns through a user-defined protein target. The input target ID is used to perform a homology-based target expansion, followed by compound retrieval from a large compilation of molecules with experimentally validated activity. Compounds are subsequently vectorized and adopted for machine learning (ML) model training. These machine learning models are deployed to perform model-based inferential virtual screening, and compounds are nominated based on predicted activity. Our platform was retrospectively validated across ten diverse protein targets and demonstrated clear predictive power. The implemented methodology provides a flexible and efficient approach that is accessible to a wide range of users. The TAME-VS platform is publicly available at https://github.com/bymgood/Target-driven-ML-enabled-VS to facilitate early-stage hit identification.
Collapse
Affiliation(s)
- Yuemin Bian
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- *Correspondence: Yuemin Bian,
| | - Jason J. Kwon
- Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
| | - Cong Liu
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| | - Enrico Margiotta
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| | - Mrinal Shekhar
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| | - Alexandra E. Gould
- Center for the Development of Therapeutics, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| |
Collapse
|
93
|
Di Lascio E, Gerebtzoff G, Rodríguez-Pérez R. Systematic Evaluation of Local and Global Machine Learning Models for the Prediction of ADME Properties. Mol Pharm 2023; 20:1758-1767. [PMID: 36745394 DOI: 10.1021/acs.molpharmaceut.2c00962] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Machine learning (ML) has become an indispensable tool to predict absorption, distribution, metabolism, and excretion (ADME) properties in pharmaceutical research. ML algorithms are trained on molecular structures and corresponding ADME assay data to develop quantitative structure-property relationship (QSPR) models. Traditional QSPR models were trained on compound sets of limited size. With the advent of more complex ML algorithms and data availability, training sets have become larger and more diverse. Most common training approaches consist in either training a model with a small set of similar compounds, namely, compounds designed for the same drug discovery project or chemical series (local model approach) or with a larger set of diverse compounds (global model approach). Global models are built with all experimental data available for an assay, combining compound data from different projects and disease areas. Despite the ML progress made so far, the choice of the appropriate data composition for building ML models is still unclear. Herein, a systematic evaluation of local and global ML models was performed for 10 different experimental assays and 112 drug discovery projects. Results show a consistent superior performance of global models for ADME property predictions. Diagnostic analyses were also carried out to investigate the influence of training set size, structural diversity, and data shift in the relative performance of local and global ML models. Training set and structural diversity did not have an impact in the relative performance on the methods. Instead, data shift helped to identify the projects with larger performance differences between local and global models. Results presented in this work can be leveraged to improve ML-based ADME properties predictions and thus decision-making in drug discovery projects.
Collapse
Affiliation(s)
- Elena Di Lascio
- Novartis Institutes for Biomedical Research, Novartis Campus, BaselCH-4002, Switzerland
| | - Grégori Gerebtzoff
- Novartis Institutes for Biomedical Research, Novartis Campus, BaselCH-4002, Switzerland
| | | |
Collapse
|
94
|
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, Zeng X. Modality-DTA: Multimodality Fusion Strategy for Drug-Target Affinity Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1200-1210. [PMID: 36083952 DOI: 10.1109/tcbb.2022.3205282] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Prediction of the drug-target affinity (DTA) plays an important role in drug discovery. Existing deep learning methods for DTA prediction typically leverage a single modality, namely simplified molecular input line entry specification (SMILES) or amino acid sequence to learn representations. SMILES or amino acid sequences can be encoded into different modalities. Multimodality data provide different kinds of information, with complementary roles for DTA prediction. We propose Modality-DTA, a novel deep learning method for DTA prediction that leverages the multimodality of drugs and targets. A group of backward propagation neural networks is applied to ensure the completeness of the reconstruction process from the latent feature representation to original multimodality data. The tag between the drug and target is used to reduce the noise information in the latent representation from multimodality data. Experiments on three benchmark datasets show that our Modality-DTA outperforms existing methods in all metrics. Modality-DTA reduces the mean square error by 15.7% and improves the area under the precisionrecall curve by 12.74% in the Davis dataset. We further find that the drug modality Morgan fingerprint and the target modality generated by one-hot-encoding play the most significant roles. To the best of our knowledge, Modality-DTA is the first method to explore multimodality for DTA prediction.
Collapse
|
95
|
Zhan H, Zhu X, Qiao Z, Hu J. Graph Neural Tree: A novel and interpretable deep learning-based framework for accurate molecular property predictions. Anal Chim Acta 2023; 1244:340558. [PMID: 36737143 DOI: 10.1016/j.aca.2022.340558] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
Determining various properties of molecules is a critical step in drug discovery. Recently, with the improvement of large heterogeneous datasets and the development of deep learning approaches, more and more scientists have turned their attention to neural network-based virtual preliminary screening to reduce the time and monetary cost of drug discovery. However, the poor interpretability of deep learning masks causality, so models' conclusions are often beyond the comprehension of human users, which reduces the credibility of the model and makes it difficult for chemists to further narrow the huge chemical space based on models' results. Thus, this study develops a novel framework consisting of Graph Neural Networks for feature extraction, Curriculum-Based Learning Strategies for optimization, and a Learning Binary Neural Tree (LBNT) for prediction, to improve the performance of neural networks and reveal their decision-making process to chemists. The framework encodes molecular graph data with graph neural networks (GNNs), then retrains the encoder with curriculum-based learning strategies to reduce uncertainty and improve accuracy, and finally uses LBNT as the predictor, which joint retrains with the encoder after independently training, for prediction and visualization. The framework is validated on the public datasets and compared to single GNNs with normal training strategies as well as GNN encoders with common machine learning predictors instead of the LBNT predictor. The result reveals that the proposed framework enhances the point prediction accuracy of the completely trained GNN and reduces its uncertainty through curriculum-based learning, and further improves the accuracy by combining LBNT. Besides, compared with common machine learning tools, the LBNT predictor generally has the best performance because of joint retraining with the GNN encoder. The decision-making process of LBNT is also better and easier to explain than that of other models.
Collapse
Affiliation(s)
- Haolin Zhan
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China; College of Economics and Statistics, Guangzhou University, Guangzhou, China
| | - Xin Zhu
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China.
| | - Zhiwei Qiao
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China; Joint Institute of Guangzhou University & Institute of Corrosion Science and Technology, Guangzhou University, Guangzhou, 510006, China.
| | - Jianming Hu
- College of Economics and Statistics, Guangzhou University, Guangzhou, China.
| |
Collapse
|
96
|
Partin A, Brettin TS, Zhu Y, Narykov O, Clyde A, Overbeek J, Stevens RL. Deep learning methods for drug response prediction in cancer: Predominant and emerging trends. Front Med (Lausanne) 2023; 10:1086097. [PMID: 36873878 PMCID: PMC9975164 DOI: 10.3389/fmed.2023.1086097] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Accepted: 01/23/2023] [Indexed: 02/17/2023] Open
Abstract
Cancer claims millions of lives yearly worldwide. While many therapies have been made available in recent years, by in large cancer remains unsolved. Exploiting computational predictive models to study and treat cancer holds great promise in improving drug development and personalized design of treatment plans, ultimately suppressing tumors, alleviating suffering, and prolonging lives of patients. A wave of recent papers demonstrates promising results in predicting cancer response to drug treatments while utilizing deep learning methods. These papers investigate diverse data representations, neural network architectures, learning methodologies, and evaluations schemes. However, deciphering promising predominant and emerging trends is difficult due to the variety of explored methods and lack of standardized framework for comparing drug response prediction models. To obtain a comprehensive landscape of deep learning methods, we conducted an extensive search and analysis of deep learning models that predict the response to single drug treatments. A total of 61 deep learning-based models have been curated, and summary plots were generated. Based on the analysis, observable patterns and prevalence of methods have been revealed. This review allows to better understand the current state of the field and identify major challenges and promising solution paths.
Collapse
Affiliation(s)
- Alexander Partin
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Thomas S. Brettin
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Yitan Zhu
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Oleksandr Narykov
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Austin Clyde
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Jamie Overbeek
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
| | - Rick L. Stevens
- Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, United States
- Department of Computer Science, The University of Chicago, Chicago, IL, United States
| |
Collapse
|
97
|
Aouichaoui ARN, Fan F, Mansouri SS, Abildskov J, Sin G. Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models. J Chem Inf Model 2023; 63:725-744. [PMID: 36716461 DOI: 10.1021/acs.jcim.2c01091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Abstract
Quantitative structure-property relationships (QSPRs) are important tools to facilitate and accelerate the discovery of compounds with desired properties. While many QSPRs have been developed, they are associated with various shortcomings such as a lack of generalizability and modest accuracy. Albeit various machine-learning and deep-learning techniques have been integrated into such models, another shortcoming has emerged in the form of a lack of transparency and interpretability of such models. In this work, two interpretable graph neural network (GNN) models (attentive group-contribution (AGC) and group-contribution-based graph attention (GroupGAT)) are developed by integrating fundamentals using the concept of group contributions (GC). The interpretability consists of highlighting the substructure with the highest attention weights in the latent representation of the molecules using the attention mechanism. The proposed models showcased better performance compared to classical group-contribution models, as well as against various other GNN models describing the aqueous solubility, melting point, and enthalpies of formation, combustion, and fusion of organic compounds. The insights provided are consistent with insights obtained from the semiempirical GC models confirming that the proposed framework allows highlighting the important substructures of the molecules for a specific property.
Collapse
Affiliation(s)
- Adem R N Aouichaoui
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, Kgs. LyngbyDK-2800, Denmark
| | - Fan Fan
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, Kgs. LyngbyDK-2800, Denmark
| | - Seyed Soheil Mansouri
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, Kgs. LyngbyDK-2800, Denmark
| | - Jens Abildskov
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, Kgs. LyngbyDK-2800, Denmark
| | - Gürkan Sin
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, Kgs. LyngbyDK-2800, Denmark
| |
Collapse
|
98
|
Liu J, Lei X, Zhang Y, Pan Y. The prediction of molecular toxicity based on BiGRU and GraphSAGE. Comput Biol Med 2023; 153:106524. [PMID: 36623439 DOI: 10.1016/j.compbiomed.2022.106524] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2022] [Revised: 12/10/2022] [Accepted: 12/31/2022] [Indexed: 01/04/2023]
Abstract
The prediction of molecules toxicity properties plays an crucial role in the realm of the drug discovery, since it can swiftly screen out the expected drug moleculars. The conventional method for predicting toxicity is to use some in vivo or in vitro biological experiments in the laboratory, which can easily pose a threat significant time and financial waste and even ethical issues. Therefore, using computational approaches to predict molecular toxicity has become a common strategy in modern drug discovery. In this article, we propose a novel model named MTBG, which primarily makes use of both SMILES (Simplified molecular input line entry system) strings and graph structures of molecules to extract drug molecular feature in the field of drug molecular toxicity prediction. To verify the performance of the MTBG model, we opt the Tox21 dataset and several widely used baseline models. Experimental results demonstrate that our model can perform better than these baseline models.
Collapse
Affiliation(s)
- Jianping Liu
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
| | - Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China.
| | - Yuchen Zhang
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China
| | - Yi Pan
- Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| |
Collapse
|
99
|
Wang D, Wu Z, Shen C, Bao L, Luo H, Wang Z, Yao H, Kong DX, Luo C, Hou T. Learning with uncertainty to accelerate the discovery of histone lysine-specific demethylase 1A (KDM1A/LSD1) inhibitors. Brief Bioinform 2023; 24:6961473. [PMID: 36573494 DOI: 10.1093/bib/bbac592] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2022] [Revised: 12/01/2022] [Accepted: 12/02/2022] [Indexed: 12/28/2022] Open
Abstract
Machine learning including modern deep learning models has been extensively used in drug design and screening. However, reliable prediction of molecular properties is still challenging when exploring out-of-domain regimes, even for deep neural networks. Therefore, it is important to understand the uncertainty of model predictions, especially when the predictions are used to guide further experiments. In this study, we explored the utility and effectiveness of evidential uncertainty in compound screening. The evidential Graphormer model was proposed for uncertainty-guided discovery of KDM1A/LSD1 inhibitors. The benchmarking results illustrated that (i) Graphormer exhibited comparative predictive power to state-of-the-art models, and (ii) evidential regression enabled well-ranked uncertainty estimates and calibrated predictions. Subsequently, we leveraged time-splitting on the curated KDM1A/LSD1 dataset to simulate out-of-distribution predictions. The retrospective virtual screening showed that the evidential uncertainties helped reduce false positives among the top-acquired compounds and thus enabled higher experimental validation rates. The trained model was then used to virtually screen an independent in-house compound set. The top 50 compounds ranked by two different ranking strategies were experimentally validated, respectively. In general, our study highlighted the importance to understand the uncertainty in prediction, which can be recognized as an interpretable dimension to model predictions.
Collapse
Affiliation(s)
- Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| | - Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China.,CarbonSilicon AI Technology Co., Ltd, Hangzhou 310018, Zhejiang, China
| | - Lingjie Bao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| | - Hao Luo
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| | - Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| | - Hucheng Yao
- State Key Laboratory of Agricultural Microbiology, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - De-Xin Kong
- State Key Laboratory of Agricultural Microbiology, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Cheng Luo
- The Center for Chemical Biology, Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203 China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058 Zhejiang, China
| |
Collapse
|
100
|
Skoraczyński G, Kitlas M, Miasojedow B, Gambin A. Critical assessment of synthetic accessibility scores in computer-assisted synthesis planning. J Cheminform 2023; 15:6. [PMID: 36641473 PMCID: PMC9840255 DOI: 10.1186/s13321-023-00678-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2022] [Accepted: 01/04/2023] [Indexed: 01/15/2023] Open
Abstract
Modern computer-assisted synthesis planning tools provide strong support for this problem. However, they are still limited by computational complexity. This limitation may be overcome by scoring the synthetic accessibility as a pre-retrosynthesis heuristic. A wide range of machine learning scoring approaches is available, however, their applicability and correctness were studied to a limited extent. Moreover, there is a lack of critical assessment of synthetic accessibility scores with common test conditions.In the present work, we assess if synthetic accessibility scores can reliably predict the outcomes of retrosynthesis planning. Using a specially prepared compounds database, we examine the outcomes of the retrosynthetic tool AiZynthFinder. We test whether synthetic accessibility scores: SAscore, SYBA, SCScore, and RAscore accurately predict the results of retrosynthesis planning. Furthermore, we investigate if synthetic accessibility scores can speed up retrosynthesis planning by better prioritizing explored partial synthetic routes and thus reducing the size of the search space. For that purpose, we analyze the AiZynthFinder partial solutions search trees, their structure, and complexity parameters, such as the number of nodes, or treewidth.We confirm that synthetic accessibility scores in most cases well discriminate feasible molecules from infeasible ones and can be potential boosters of retrosynthesis planning tools. Moreover, we show the current challenges of designing computer-assisted synthesis planning tools. We conclude that hybrid machine learning and human intuition-based synthetic accessibility scores can efficiently boost the effectiveness of computer-assisted retrosynthesis planning, however, they need to be carefully crafted for retrosynthesis planning algorithms.The source code of this work is publicly available at https://github.com/grzsko/ASAP .
Collapse
Affiliation(s)
- Grzegorz Skoraczyński
- grid.12847.380000 0004 1937 1290Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Stefana Banacha 2, Warsaw, Poland
| | - Mateusz Kitlas
- grid.12847.380000 0004 1937 1290Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Stefana Banacha 2, Warsaw, Poland
| | - Błażej Miasojedow
- grid.12847.380000 0004 1937 1290Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Stefana Banacha 2, Warsaw, Poland
| | - Anna Gambin
- grid.12847.380000 0004 1937 1290Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Stefana Banacha 2, Warsaw, Poland
| |
Collapse
|