1
Zhao Z, Zhou Q, Wu C, Su R, Xiong W. Boosting the performance of molecular property prediction via graph-text alignment and multi-granularity representation enhancement. J Mol Graph Model 2024; 132:108843. [PMID: 39173218 DOI: 10.1016/j.jmgm.2024.108843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 05/18/2024] [Accepted: 08/02/2024] [Indexed: 08/24/2024]
Abstract
Deep learning is playing an increasingly important role in accurate prediction of molecular properties. Prior to being processed by a deep learning model, a molecule is typically represented in the form of a text or a graph. While some methods attempt to integrate these two forms of molecular representations, the misalignment of graph and text embeddings presents a significant challenge to fuse two modalities. To solve this problem, we propose a method that aligns and fuses graph and text features in the embedding space by using contrastive loss and cross attentions. Additionally, we enhance the molecular representation by incorporating multi-granularity information of molecules on the levels of atoms, functional groups, and molecules. Extensive experiments show that our model outperforms state-of-the-art models in downstream tasks of molecular property prediction, achieving superior performance with less pretraining data. The source codes and data are available at https://github.com/zzr624663649/multimodal_molecular_property.
Affiliation(s)
- Zhuoran Zhao
  - College of Computer Science, Chongqing University, Chongqing 400044, China
- Qing Zhou
  - College of Computer Science, Chongqing University, Chongqing 400044, China
- Chengkai Wu
  - Department of Ultrasound, Xinxiang Medical University Henan Provincial People's Hospital, Zhengzhou 450003, China
- Renbin Su
  - Central China Branch of State Grid Corporation of China, Wuhan 430000, China
- Weihong Xiong
  - Central China Branch of State Grid Corporation of China, Wuhan 430000, China
2
Achar J, Firman JW, Tran C, Kim D, Cronin MTD, Öberg G. Analysis of implicit and explicit uncertainties in QSAR prediction of chemical toxicity: A case study of neurotoxicity. Regul Toxicol Pharmacol 2024; 154:105716. [PMID: 39393519 DOI: 10.1016/j.yrtph.2024.105716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2024] [Revised: 09/24/2024] [Accepted: 10/08/2024] [Indexed: 10/13/2024]
Abstract
Although uncertainties expressed in texts within QSAR studies can guide quantitative uncertainty estimations, they are often overlooked during uncertainty analysis. Using neurotoxicity as an example, this study developed a method to support analysis of implicitly and explicitly expressed uncertainties in QSAR modeling studies. Text content analysis was employed to identify implicit and explicit uncertainty indicators, whereafter uncertainties within the indicator-containing sentences were identified and systematically categorized according to 20 uncertainty sources. Our results show that implicit uncertainty was more frequent within most uncertainty sources (13/20), while explicit uncertainty was more frequent in only three sources, indicating that uncertainty is predominantly expressed implicitly in the field. The most highly cited sources included Mechanistic plausibility, Model relevance and Model performance, suggesting they constitute sources of most concern. The fact that other sources like Data balance were not mentioned, although it is recognized in the broader QSAR literature as an area of concern, demonstrates that the output from the type of analysis conducted here must be interpreted in the context of the broader QSAR literature before conclusions are drawn. Overall, the method established here can be applied in other QSAR modeling contexts and ultimately guide efforts targeted towards addressing the identified uncertainty sources.
Affiliation(s)
- Jerry Achar
  - Institute for Resources, Environment and Sustainability, The University of British Columbia, 2202 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- James W Firman
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
- Chantelle Tran
  - Department of Microbiology and Immunology, The University of British Columbia, 2350 Health Sciences Mall, Vancouver, BC, V6T 1Z4, Canada
- Daniella Kim
  - Department of Earth, Ocean, and Atmospheric Sciences, The University of British Columbia, 2207 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Mark T D Cronin
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
- Gunilla Öberg
  - Institute for Resources, Environment and Sustainability, The University of British Columbia, 2202 Main Mall, Vancouver, BC, V6T 1Z4, Canada
3
Komissarov L, Manevski N, Groebke Zbinden K, Schindler T, Zitnik M, Sach-Peltason L. Actionable Predictions of Human Pharmacokinetics at the Drug Design Stage. Mol Pharm 2024; 21:4356-4371. [PMID: 39132855 DOI: 10.1021/acs.molpharmaceut.4c00311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/13/2024]
Abstract
We present a novel computational approach for predicting human pharmacokinetics (PK) that addresses the challenges of early stage drug design. Our study introduces and describes a large-scale data set of 11 clinical PK end points, encompassing over 2700 unique chemical structures to train machine learning models. To that end multiple advanced training strategies are compared, including the integration of in vitro data and a novel self-supervised pretraining task. In addition to the predictions, our final model provides meaningful epistemic uncertainties for every data point. This allows us to successfully identify regions of exceptional predictive performance, with an absolute average fold error (AAFE/geometric mean fold error) of less than 2.5 across multiple end points. Together, these advancements represent a significant leap toward actionable PK predictions, which can be utilized early on in the drug design process to expedite development and reduce reliance on nonclinical studies.
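For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of the absolute average fold error under its conventional definition, AAFE = 10^(mean of |log10(predicted / observed)|). The function name `aafe` and the example values are illustrative assumptions and are not taken from the paper.

```python
import math


def aafe(predicted, observed):
    """Absolute average fold error for paired positive values.

    Conventional definition: 10 ** mean(|log10(predicted / observed)|).
    A value of 1.0 means perfect agreement; 2.0 means predictions are,
    on average, within twofold of the observed values.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Fold errors are symmetric on the log scale, so take absolute logs.
    abs_logs = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(abs_logs) / len(abs_logs))


# Hypothetical example: one twofold over-prediction, one twofold under-prediction.
example = aafe([2.0, 0.5], [1.0, 1.0])
```

Under this definition the two hypothetical predictions above, each off by a factor of two in opposite directions, give an AAFE of about 2, which would fall below the 2.5 threshold quoted in the abstract.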
Affiliation(s)
- Leonid Komissarov
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel 4070, Switzerland
- Nenad Manevski
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel 4070, Switzerland
- Katrin Groebke Zbinden
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel 4070, Switzerland
- Torsten Schindler
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel 4070, Switzerland
- Marinka Zitnik
  - Harvard Medical School, Department of Biomedical Informatics, Boston, Massachusetts 02115, United States
- Lisa Sach-Peltason
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel 4070, Switzerland
4
Liu Y, Zhang R, Yuan Y, Ma J, Li T, Yu Z. A Multi-view Molecular Pre-training with Generative Contrastive Learning. Interdiscip Sci 2024; 16:741-754. [PMID: 38710957 DOI: 10.1007/s12539-024-00632-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 03/20/2024] [Accepted: 04/06/2024] [Indexed: 05/08/2024]
Abstract
Molecular representation learning can preserve meaningful molecular structures as embedding vectors, which is a necessary prerequisite for molecular property prediction. Yet, learning how to accurately represent molecules remains challenging. Previous approaches to learning molecular representations in an end-to-end manner potentially suffered information loss while neglecting the utilization of molecular generative representations. To obtain rich molecular feature information, the pre-training molecular representation model utilized different molecular representations to reduce information loss caused by a single molecular representation. Therefore, we provide the MVGC, a unique multi-view generative contrastive learning pre-training model. Our pre-training framework specifically acquires knowledge of three fundamental feature representations of molecules and effectively integrates them to predict molecular properties on benchmark datasets. Comprehensive experiments on seven classification tasks and three regression tasks demonstrate that our proposed MVGC model surpasses the majority of state-of-the-art approaches. Moreover, we explore the potential of the MVGC model to learn the representation of molecules with chemical significance.
Affiliation(s)
- Yunwu Liu
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
- Ruisheng Zhang
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
- Yongna Yuan
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
- Jun Ma
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
- Tongfeng Li
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
- Zhixuan Yu
  - School of Information Science and Engineering, Lanzhou University, Lanzhou, 730000, China
5
Heid E, Schörghuber J, Wanzenböck R, Madsen GKH. Spatially Resolved Uncertainties for Machine Learning Potentials. J Chem Inf Model 2024; 64:6377-6387. [PMID: 39110874 DOI: 10.1021/acs.jcim.4c00904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Machine learning potentials have become an essential tool for atomistic simulations, yielding results close to ab initio simulations at a fraction of computational cost. With recent improvements on the achievable accuracies, the focus has now shifted on the data set composition itself. The reliable identification of erroneously predicted configurations to extend a given data set is therefore of high priority. Yet, uncertainty estimation techniques have achieved mixed results for machine learning potentials. Consequently, a general and versatile method to correlate energy or atomic force uncertainties with the model error has remained elusive to date. In the current work, we show that epistemic uncertainty cannot correlate with model error by definition but can be aggregated over groups of atoms to yield a strong correlation. We demonstrate that our method correctly estimates prediction errors both globally per structure and locally resolved per atom. The direct correlation of local uncertainty and local error is used to design an active learning framework based on identifying local subregions of a large simulation cell and performing ab initio calculations only for the subregion subsequently. We successfully utilized this method to perform active learning in the low-data regime for liquid water.
Affiliation(s)
- Esther Heid
  - Institute of Materials Chemistry, TU Wien, A-1060 Vienna, Austria
- Ralf Wanzenböck
  - Institute of Materials Chemistry, TU Wien, A-1060 Vienna, Austria
- Georg K H Madsen
  - Institute of Materials Chemistry, TU Wien, A-1060 Vienna, Austria
6
Xu Y, Liaw A, Sheridan RP, Svetnik V. Development and Evaluation of Conformal Prediction Methods for Quantitative Structure-Activity Relationship. ACS OMEGA 2024; 9:29478-29490. [PMID: 39005801 PMCID: PMC11238240 DOI: 10.1021/acsomega.4c02017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 06/10/2024] [Accepted: 06/12/2024] [Indexed: 07/16/2024]
Abstract
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting the biological activities of compounds using their molecular descriptors. Besides accurate activity estimation, obtaining a prediction uncertainty metric like a prediction interval is highly desirable. Quantifying prediction uncertainty is an active research area in statistical and machine learning (ML), but the implementation for QSAR remains challenging. However, most ML algorithms with high predictive performance require add-on companions for estimating the uncertainty of their prediction. Conformal prediction (CP) is a promising approach as its main components are agnostic to the prediction modes, and it produces valid prediction intervals under weak assumptions on the data distribution. We proposed computationally efficient CP algorithms tailored to the most widely used ML models, including random forests, deep neural networks, and gradient boosting. The algorithms use a novel approach to the derivation of nonconformity scores from the estimates of prediction uncertainty generated by the ensembles of point predictions. The validity and efficiency of proposed algorithms are demonstrated on a diverse collection of QSAR data sets as well as simulation studies. The provided software implementing our algorithms can be used as stand-alone or easily incorporated into other ML software packages for QSAR modeling.
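As a rough illustration of how conformal prediction turns point predictions into intervals, here is a minimal sketch of generic split conformal regression using absolute residuals as nonconformity scores. This is the textbook variant, not the ensemble-based nonconformity scores the abstract describes; the function names, the calibration data, and the `alpha` values are all illustrative assumptions.

```python
import math


def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Half-width of a (1 - alpha) prediction interval from a calibration set.

    Nonconformity score: absolute residual |y - y_hat| (an assumption here).
    The half-width is the k-th smallest score with k = ceil((n + 1)(1 - alpha)).
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[k - 1]


def predict_interval(point_pred, halfwidth):
    """Symmetric interval around a new point prediction."""
    return (point_pred - halfwidth, point_pred + halfwidth)


# Hypothetical calibration data: observed values and model predictions.
q = split_conformal_halfwidth([10.0, 20.0, 30.0], [11.0, 18.0, 33.0], alpha=0.5)
interval = predict_interval(5.0, q)
```

The interval width here is constant across compounds; deriving the scores from per-compound uncertainty estimates, as the abstract describes, is one way to make the width adapt to each prediction.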
Affiliation(s)
- Yuting Xu
  - Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Andy Liaw
  - Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Robert P. Sheridan
  - Modeling and Informatics, Merck & Co., Inc., Rahway, New Jersey 07033, United States
- Vladimir Svetnik
  - Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
7
Dutschmann TM, Schlenker V, Baumann K. Chemoinformatic regression methods and their applicability domain. Mol Inform 2024; 43:e202400018. [PMID: 38803302 DOI: 10.1002/minf.202400018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/25/2024] [Indexed: 05/29/2024]
Abstract
The growing interest in chemoinformatic model uncertainty calls for a summary of the most widely used regression techniques and how to estimate their reliability. Regression models learn a mapping from the space of explanatory variables to the space of continuous output values. Among other limitations, the predictive performance of the model is restricted by the training data used for model fitting. Identification of unusual objects by outlier detection methods can improve model performance. Additionally, proper model evaluation necessitates defining the limitations of the model, often called the applicability domain. Comparable to certain classifiers, some regression techniques come with built-in methods or augmentations to quantify their (un)certainty, while others rely on generic procedures. The theoretical background of their working principles and how to deduce specific and general definitions for their domain of applicability shall be explained.
Affiliation(s)
- Thomas-Martin Dutschmann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Valerie Schlenker
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
- Knut Baumann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, 38106, Braunschweig, Germany
8
Fan Z, Yu J, Zhang X, Chen Y, Sun S, Zhang Y, Chen M, Xiao F, Wu W, Li X, Zheng M, Luo X, Wang D. Reducing overconfident errors in molecular property classification using Posterior Network. PATTERNS (NEW YORK, N.Y.) 2024; 5:100991. [PMID: 39005492 PMCID: PMC11240180 DOI: 10.1016/j.patter.2024.100991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/20/2023] [Accepted: 04/15/2024] [Indexed: 07/16/2024]
Abstract
Deep-learning-based classification models are increasingly used for predicting molecular properties in drug development. However, traditional classification models using the Softmax function often give overconfident mispredictions for out-of-distribution samples, highlighting a critical lack of accurate uncertainty estimation. Such limitations can result in substantial costs and should be avoided during drug development. Inspired by advances in evidential deep learning and Posterior Network, we replaced the Softmax function with a normalizing flow to enhance the uncertainty estimation ability of the model in molecular property classification. The proposed strategy was evaluated across diverse scenarios, including simulated experiments based on a synthetic dataset, ADMET predictions, and ligand-based virtual screening. The results demonstrate that compared with the vanilla model, the proposed strategy effectively alleviates the problem of giving overconfident but incorrect predictions. Our findings support the promising application of evidential deep learning in drug development and offer a valuable framework for further research.
Affiliation(s)
- Zhehuan Fan
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Jie Yu
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Xiang Zhang
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Yijie Chen
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Shihui Sun
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Yuanyuan Zhang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Mingan Chen
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Physical Science and Technology, ShanghaiTech University, Shanghai 201210, China
  - Lingang Laboratory, Shanghai 200031, China
- Fu Xiao
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Wenyong Wu
  - Lingang Laboratory, Shanghai 200031, China
- Xutong Li
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Mingyue Zheng
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Xiaomin Luo
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
9
Michael R, Kæstel-Hansen J, Mørch Groth P, Bartels S, Salomon J, Tian P, Hatzakis NS, Boomsma W. A systematic analysis of regression models for protein engineering. PLoS Comput Biol 2024; 20:e1012061. [PMID: 38701099 PMCID: PMC11095727 DOI: 10.1371/journal.pcbi.1012061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/15/2024] [Accepted: 04/10/2024] [Indexed: 05/05/2024] Open
Abstract
To optimize proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine Learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric, and how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.
Affiliation(s)
- Richard Michael
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
- Simon Bartels
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Pengfei Tian
  - Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
- Nikos S. Hatzakis
  - Department of Chemistry, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
10
Ramos MC, White AD. Predicting small molecules solubility on endpoint devices using deep ensemble neural networks. DIGITAL DISCOVERY 2024; 3:786-795. [PMID: 38638648 PMCID: PMC11022985 DOI: 10.1039/d3dd00217a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Accepted: 03/07/2024] [Indexed: 04/20/2024]
Abstract
Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is useable at https://mol.dev.
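To make the idea of predictive uncertainty from a deep ensemble concrete, the sketch below combines the outputs of several ensemble members into one mean and variance, treating the ensemble as a uniform mixture of Gaussians. It assumes each member returns a predicted mean and variance; that setup, the function name, and the numbers are illustrative assumptions and are not taken from the paper or its repository.

```python
def ensemble_mean_and_variance(means, variances):
    """Combine per-member Gaussian predictions into one mean and variance.

    Uniform-mixture formulas:
        mu    = mean(mu_i)
        sigma2 = mean(sigma2_i + mu_i ** 2) - mu ** 2
    The combined variance grows both with each member's own variance
    and with disagreement between members.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(means)
    mu = sum(means) / m
    second_moment = sum(v + mu_i ** 2 for mu_i, v in zip(means, variances)) / m
    return mu, second_moment - mu ** 2


# Hypothetical two-member ensemble predicting a log-solubility value.
mu, var = ensemble_mean_and_variance([1.0, 3.0], [0.5, 0.5])
```

In this hypothetical case the members disagree by two units, so the combined variance (1.5) is well above each member's own variance (0.5), which is the behaviour that makes ensemble spread useful as an uncertainty signal.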
Affiliation(s)
- Mayk Caldas Ramos
  - Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
- Andrew D White
  - Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
11
Roth JP, Bajorath J. Relationship between prediction accuracy and uncertainty in compound potency prediction using deep neural networks and control models. Sci Rep 2024; 14:6536. [PMID: 38503823 PMCID: PMC10950896 DOI: 10.1038/s41598-024-57135-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2023] [Accepted: 03/14/2024] [Indexed: 03/21/2024] Open
Abstract
The assessment of prediction variance or uncertainty contributes to the evaluation of machine learning models. In molecular machine learning, uncertainty quantification is an evolving area of research where currently no standard approaches or general guidelines are available. We have carried out a detailed analysis of deep neural network variants and simple control models for compound potency prediction to study relationships between prediction accuracy and uncertainty. For comparably accurate predictions obtained with models of different complexity, highly variable prediction uncertainties were detected using different metrics. Furthermore, a strong dependence of prediction characteristics and uncertainties on potency levels of test compounds was observed, often leading to over- or under-confident model decisions with respect to the expected variance of predictions. Moreover, neural network models responded very differently to training set modifications. Taken together, our findings indicate that there is only little, if any correlation between compound potency prediction accuracy and uncertainty, especially for deep neural network models, when predictions are assessed on the basis of currently used metrics for uncertainty quantification.
Affiliation(s)
- Jannik P Roth
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
12
Sandström H, Rissanen M, Rousu J, Rinke P. Data-Driven Compound Identification in Atmospheric Mass Spectrometry. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2306235. [PMID: 38095508 PMCID: PMC10885664 DOI: 10.1002/advs.202306235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 11/04/2023] [Indexed: 02/24/2024]
Abstract
Aerosol particles found in the atmosphere affect the climate and worsen air quality. To mitigate these adverse impacts, aerosol particle formation and aerosol chemistry in the atmosphere need to be better mapped out and understood. Currently, mass spectrometry is the single most important analytical technique in atmospheric chemistry and is used to track and identify compounds and processes. Large amounts of data are collected in each measurement of current time-of-flight and orbitrap mass spectrometers using modern rapid data acquisition practices. However, compound identification remains a major bottleneck during data analysis due to lacking reference libraries and analysis tools. Data-driven compound identification approaches could alleviate the problem, yet remain rare to non-existent in atmospheric science. In this perspective, the authors review the current state of data-driven compound identification with mass spectrometry in atmospheric science and discuss current challenges and possible future steps toward a digital era for atmospheric mass spectrometry.
Affiliation(s)
- Hilda Sandström
  - Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
- Matti Rissanen
  - Aerosol Physics Laboratory, Tampere University, FI-33720, Tampere, Finland
  - Department of Chemistry, University of Helsinki, P.O. Box 55, A.I. Virtasen aukio 1, FI-00560, Helsinki, Finland
- Juho Rousu
  - Department of Computer Science, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
- Patrick Rinke
  - Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076, Aalto, Espoo, Finland
13
Heid E, Greenman KP, Chung Y, Li SC, Graff DE, Vermeire FH, Wu H, Green WH, McGill CJ. Chemprop: A Machine Learning Package for Chemical Property Prediction. J Chem Inf Model 2024; 64:9-17. [PMID: 38147829 PMCID: PMC10777403 DOI: 10.1021/acs.jcim.3c01250] [Citation(s) in RCA: 38] [Impact Index Per Article: 38.0] [Received: 08/08/2023] [Revised: 12/04/2023] [Accepted: 12/05/2023] [Indexed: 12/28/2023]
Abstract
Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, thus creating a need for open-source and versatile software solutions that can be operated by nonexperts. Among the current approaches, directed message-passing neural networks (D-MPNNs) have proven to perform well on a variety of property prediction tasks. The software package Chemprop implements the D-MPNN architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra. Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features. We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra. Chemprop enables out-of-the-box training of D-MPNN models for a variety of problem settings in fast, user-friendly, and open-source software.
Collapse
Affiliation(s)
- Esther Heid
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria
| | - Kevin P. Greenman
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Yunsie Chung
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Shih-Cheng Li
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical Engineering, National Taiwan University, Taipei 10617, Taiwan
| | - David E. Graff
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States
| | - Florence H. Vermeire
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical Engineering, KU Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
| | - Haoyang Wu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - William H. Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Charles J. McGill
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, United States
| |
|
14
|
Koscher BA, Canty RB, McDonald MA, Greenman KP, McGill CJ, Bilodeau CL, Jin W, Wu H, Vermeire FH, Jin B, Hart T, Kulesza T, Li SC, Jaakkola TS, Barzilay R, Gómez-Bombarelli R, Green WH, Jensen KF. Autonomous, multiproperty-driven molecular discovery: From predictions to measurements and back. Science 2023; 382:eadi1407. [PMID: 38127734 DOI: 10.1126/science.adi1407] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/05/2023] [Accepted: 11/09/2023] [Indexed: 12/23/2023]
Abstract
A closed-loop, autonomous molecular discovery platform driven by integrated machine learning tools was developed to accelerate the design of molecules with desired properties. We demonstrated two case studies on dye-like molecules, targeting absorption wavelength, lipophilicity, and photooxidative stability. In the first study, the platform experimentally realized 294 unreported molecules across three automatic iterations of molecular design-make-test-analyze cycles while exploring the structure-function space of four rarely reported scaffolds. In each iteration, the property prediction models that guided exploration learned the structure-property space of diverse scaffold derivatives, which were realized with multistep syntheses and a variety of reactions. The second study exploited property models trained on the explored chemical space and previously reported molecules to discover nine top-performing molecules within a lightly explored structure-property space.
Affiliation(s)
- Brent A Koscher
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Richard B Canty
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Matthew A McDonald
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Kevin P Greenman
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Charles J McGill
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Camille L Bilodeau
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Wengong Jin
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Haoyang Wu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Florence H Vermeire
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Brooke Jin
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Travis Hart
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Timothy Kulesza
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Shih-Cheng Li
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Tommi S Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Rafael Gómez-Bombarelli
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| |
|
15
|
Rasmussen MH, Duan C, Kulik HJ, Jensen JH. Uncertain of uncertainties? A comparison of uncertainty quantification metrics for chemical data sets. J Cheminform 2023; 15:121. [PMID: 38111020 PMCID: PMC10729461 DOI: 10.1186/s13321-023-00790-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 11/28/2023] [Indexed: 12/20/2023] Open
Abstract
With the increasingly important role of machine learning (ML) models in chemical research, the need to attach a level of confidence to model predictions naturally arises. Several methods for obtaining uncertainty estimates have been proposed in recent years, but consensus on how to evaluate them has yet to be established, and different studies on uncertainties generally use different metrics. We compare three of the most popular validation metrics (Spearman's rank correlation coefficient, the negative log likelihood (NLL) and the miscalibration area) to the error-based calibration introduced by Levi et al. (Sensors 2022, 22, 5540). Importantly, metrics such as the NLL and Spearman's rank correlation coefficient bear little information in themselves. We therefore introduce reference values obtained through errors simulated directly from the uncertainty distribution. The different metrics target different properties and we show how to interpret them, but we generally find the best overall validation to be done based on the error-based calibration plot introduced by Levi et al. Finally, we illustrate the sensitivity of ranking-based methods (e.g. Spearman's rank correlation coefficient) towards test set design by using the same toy model on different test sets and obtaining vastly different metrics (0.05 vs. 0.65).
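As a rough illustration of two of the metrics named in this abstract, the sketch below computes Spearman's rank correlation between predicted uncertainties and absolute errors, and the average Gaussian NLL. It is not taken from the paper: the function names (`ranks`, `spearman`, `gaussian_nll`) are hypothetical, and the NLL assumes each prediction error is modelled as a zero-mean Gaussian with the predicted standard deviation.

```python
import math

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def gaussian_nll(errors, sigmas):
    """Mean negative log likelihood of errors under N(0, sigma^2) (assumed model)."""
    terms = [0.5 * math.log(2 * math.pi * s * s) + e * e / (2 * s * s)
             for e, s in zip(errors, sigmas)]
    return sum(terms) / len(terms)

# Toy data: predicted uncertainties and the absolute errors actually observed.
sigmas = [0.1, 0.2, 0.3, 0.4]
abs_errors = [0.05, 0.15, 0.20, 0.50]
rho = spearman(sigmas, abs_errors)
nll = gaussian_nll(abs_errors, sigmas)
```

Neither number is meaningful in isolation, which is the abstract's point: a reference value, for example from errors simulated from the predicted uncertainty distribution, is needed to judge them.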
Affiliation(s)
- Maria H Rasmussen
- Department of Chemistry, University of Copenhagen, Copenhagen, Denmark.
| | - Chenru Duan
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, USA
| | - Heather J Kulik
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, USA
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, USA
| | - Jan H Jensen
- Department of Chemistry, University of Copenhagen, Copenhagen, Denmark.
| |
|
16
|
Seoni S, Jahmunah V, Salvi M, Barua PD, Molinari F, Acharya UR. Application of uncertainty quantification to artificial intelligence in healthcare: A review of last decade (2013-2023). Comput Biol Med 2023; 165:107441. [PMID: 37683529 DOI: 10.1016/j.compbiomed.2023.107441] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 08/04/2023] [Revised: 08/27/2023] [Accepted: 08/29/2023] [Indexed: 09/10/2023]
Abstract
Uncertainty estimation in healthcare involves quantifying and understanding the inherent uncertainty or variability associated with medical predictions, diagnoses, and treatment outcomes. In this era of Artificial Intelligence (AI) models, uncertainty estimation becomes vital to ensure safe decision-making in the medical field. Therefore, this review focuses on the application of uncertainty techniques to machine and deep learning models in healthcare. A systematic literature review was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Our analysis revealed that Bayesian methods were the predominant technique for uncertainty quantification in machine learning models, with Fuzzy systems being the second most used approach. Regarding deep learning models, Bayesian methods emerged as the most prevalent approach, finding application in nearly all aspects of medical imaging. Most of the studies reported in this paper focused on medical images, highlighting the prevalent application of uncertainty quantification techniques using deep learning models compared to machine learning models. Interestingly, we observed a scarcity of studies applying uncertainty quantification to physiological signals. Thus, future research on uncertainty quantification should prioritize investigating the application of these techniques to physiological signals. Overall, our review highlights the significance of integrating uncertainty techniques in healthcare applications of machine learning and deep learning models. This can provide valuable insights and practical solutions to manage uncertainty in real-world medical data, ultimately improving the accuracy and reliability of medical diagnoses and treatment recommendations.
Affiliation(s)
- Silvia Seoni
- Biolab, PolitoBIOMedLab, Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy
| | - Massimo Salvi
- Biolab, PolitoBIOMedLab, Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy
| | - Prabal Datta Barua
- School of Business (Information System), University of Southern Queensland, Toowoomba, QLD, 4350, Australia; Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, 2007, Australia
| | - Filippo Molinari
- Biolab, PolitoBIOMedLab, Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy.
| | - U Rajendra Acharya
- School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield, Australia
| |
|
17
|
Busk J, Schmidt MN, Winther O, Vegge T, Jørgensen PB. Graph neural network interatomic potential ensembles with calibrated aleatoric and epistemic uncertainty on energy and forces. Phys Chem Chem Phys 2023; 25:25828-25837. [PMID: 37724552 DOI: 10.1039/d3cp02143b] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/21/2023]
Abstract
Inexpensive machine learning (ML) potentials are increasingly being used to speed up structural optimization and molecular dynamics simulations of materials by iteratively predicting and applying interatomic forces. In these settings, it is crucial to detect when predictions are unreliable to avoid wrong or misleading results. Here, we present a complete framework for training and recalibrating graph neural network ensemble models to produce accurate predictions of energy and forces with calibrated uncertainty estimates. The proposed method considers both epistemic and aleatoric uncertainty and the total uncertainties are recalibrated post hoc using a nonlinear scaling function to achieve good calibration on previously unseen data, without loss of predictive accuracy. The method is demonstrated and evaluated on two challenging, publicly available datasets, ANI-1x (Smith et al. J. Chem. Phys., 2018, 148, 241733.) and Transition1x (Schreiner et al. Sci. Data, 2022, 9, 779.), both containing diverse conformations far from equilibrium. A detailed analysis of the predictive performance and uncertainty calibration is provided. In all experiments, the proposed method achieved low prediction error and good uncertainty calibration, with predicted uncertainty correlating with expected error, on energy and forces. To the best of our knowledge, the method presented in this paper is the first to consider a complete framework for obtaining calibrated epistemic and aleatoric uncertainty predictions on both energy and forces in ML potentials.
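To make the aleatoric/epistemic split concrete, here is a minimal sketch of one common way to combine the outputs of an ensemble whose members each predict a mean and a variance. It uses the law of total variance; whether the paper applies exactly this combination is an assumption here, and `decompose_uncertainty` is a hypothetical name. The post hoc nonlinear recalibration step described in the abstract is not shown.

```python
import statistics

def decompose_uncertainty(member_means, member_variances):
    """Split ensemble uncertainty for one input into two parts.

    aleatoric: average of the variances predicted by the members (data noise)
    epistemic: variance of the member means (disagreement between members)
    The total is their sum, following the law of total variance.
    """
    aleatoric = statistics.fmean(member_variances)
    epistemic = statistics.pvariance(member_means)
    return aleatoric, epistemic, aleatoric + epistemic

# Two hypothetical ensemble members predicting an energy for the same structure.
aleatoric, epistemic, total = decompose_uncertainty([1.0, 3.0], [0.5, 1.5])
```

In this toy case the members disagree strongly (means of 1.0 and 3.0), so the epistemic term contributes as much to the total as the predicted noise does.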
Affiliation(s)
- Jonas Busk
- Department of Energy Conversion and Storage, Technical University of Denmark, Kongens Lyngby, Denmark.
| | - Mikkel N Schmidt
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark.
| | - Ole Winther
- Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark.
- Center for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Denmark
- Bioinformatics Centre, Department of Biology, University of Copenhagen, Denmark
| | - Tejs Vegge
- Department of Energy Conversion and Storage, Technical University of Denmark, Kongens Lyngby, Denmark.
| | - Peter Bjørn Jørgensen
- Department of Energy Conversion and Storage, Technical University of Denmark, Kongens Lyngby, Denmark.
| |
|
18
|
Zhang Y, Menke J, He J, Nittinger E, Tyrchan C, Koch O, Zhao H. Similarity-based pairing improves efficiency of siamese neural networks for regression tasks and uncertainty quantification. J Cheminform 2023; 15:75. [PMID: 37649050 PMCID: PMC10469421 DOI: 10.1186/s13321-023-00744-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 08/10/2023] [Indexed: 09/01/2023] Open
Abstract
Siamese networks, representing a novel class of neural networks, consist of two identical subnetworks sharing weights but receiving different inputs. Here we present a similarity-based pairing method for generating compound pairs to train Siamese neural networks for regression tasks. In comparison with conventional exhaustive pairing, it reduces the algorithm complexity from O(n^2) to O(n). It also consistently yields better prediction performance on the three physicochemical datasets, using a multilayer perceptron with the circular fingerprint as a proof of concept. We further incorporate the transformer-based Chemformer into a Siamese neural network; it extracts task-specific features from the simplified molecular-input line-entry system (SMILES) representation of compounds. Additionally, we propose a means to measure prediction uncertainty by utilizing the variance in predictions from a set of reference compounds. Our results demonstrate that high prediction accuracy correlates with high confidence. Finally, we investigate implications of the similarity property principle in machine learning.
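The difference in the number of training pairs can be sketched as follows. This is only an illustration under stated assumptions: fingerprints are represented as sets of "on" bit indices, similarity is the Tanimoto coefficient, and each compound is paired with its single most similar neighbour. The function names are hypothetical and the paper's actual pairing rule may differ.

```python
from itertools import product

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def exhaustive_pairs(n):
    """All ordered pairs (i, j) of n compounds: n * n training pairs."""
    return list(product(range(n), repeat=2))

def similarity_pairs(fingerprints):
    """Pair each compound with its most similar other compound: n training pairs.

    Needs at least two compounds. The neighbour search here is a naive scan;
    only the number of resulting pairs is meant to illustrate the O(n) claim.
    """
    pairs = []
    for i, fp in enumerate(fingerprints):
        others = (j for j in range(len(fingerprints)) if j != i)
        best = max(others, key=lambda j: tanimoto(fp, fingerprints[j]))
        pairs.append((i, best))
    return pairs

# Four toy fingerprints forming two obvious similarity clusters.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {7, 8, 9}]
all_pairs = exhaustive_pairs(len(fps))
sim_pairs = similarity_pairs(fps)
```

For these four compounds, exhaustive pairing yields 16 pairs while similarity-based pairing yields 4, one per compound.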
Affiliation(s)
- Yumeng Zhang
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
- Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
| | - Janosch Menke
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden.
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany.
| | - Jiazhen He
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Eva Nittinger
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Christian Tyrchan
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden
| | - Oliver Koch
- Institute of Pharmaceutical and Medicinal Chemistry, Westfälische Wilhelms-Universität Münster, 48149, Münster, Germany
| | - Hongtao Zhao
- Medicinal Chemistry, Research and Early Development, Respiratory and Immunology (R&I), BioPharmaceuticals R&D, AstraZeneca, 43183, Gothenburg, Sweden.
| |
|
19
|
Biswas S, Chung Y, Ramirez J, Wu H, Green WH. Predicting Critical Properties and Acentric Factors of Fluids Using Multitask Machine Learning. J Chem Inf Model 2023; 63:4574-4588. [PMID: 37487557 DOI: 10.1021/acs.jcim.3c00546] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 07/26/2023]
Abstract
Knowledge of critical properties, such as critical temperature, pressure, density, as well as acentric factor, is essential to calculate thermo-physical properties of chemical compounds. Experiments to determine critical properties and acentric factors are expensive and time intensive; therefore, we developed a machine learning (ML) model that can predict these molecular properties given the SMILES representation of a chemical species. We explored directed message passing neural network (D-MPNN) and graph attention network as ML architecture choices. Additionally, we investigated featurization with additional atomic and molecular features, multitask training, and pretraining using estimated data to optimize model performance. Our final model utilizes a D-MPNN layer to learn the molecular representation and is supplemented by Abraham parameters. A multitask training scheme was used to train a single model to predict all the critical properties and acentric factors along with boiling point, melting point, enthalpy of vaporization, and enthalpy of fusion. The model was evaluated on both random and scaffold splits where it shows state-of-the-art accuracies. The extensive data set of critical properties and acentric factors contains 1144 chemical compounds and is made available in the public domain together with the source code that can be used for further exploration.
Affiliation(s)
- Sayandeep Biswas
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Yunsie Chung
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Josephine Ramirez
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Haoyang Wu
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
|
20
|
Heid E, McGill CJ, Vermeire FH, Green WH. Characterizing Uncertainty in Machine Learning for Chemistry. J Chem Inf Model 2023; 63:4012-4029. [PMID: 37338239 PMCID: PMC10336963 DOI: 10.1021/acs.jcim.3c00373] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/08/2023] [Indexed: 06/21/2023]
Abstract
Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property predictions, where the diverse nature of target properties and the vast chemical space give rise to many distinct sources of prediction error. We demonstrate that different sources of error can each be significant in different contexts and must be individually addressed during model development. Through controlled experiments on data sets of molecular properties, we show important trends in model performance associated with the level of noise in the data set, size of the data set, model architecture, molecule representation, ensemble size, and data set splitting. In particular, we show that 1) noise in the test set can limit a model's observed performance when the actual performance is much better, 2) using size-extensive model aggregation structures is crucial for extensive property prediction, and 3) ensembling is a reliable tool for uncertainty quantification and improvement specifically for the contribution of model variance. We develop general guidelines on how to improve an underperforming model when falling into different uncertainty contexts.
Affiliation(s)
- Esther Heid
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria
| | - Charles J. McGill
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, United States
| | - Florence H. Vermeire
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Department of Chemical Engineering, KU Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
| | - William H. Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
|
21
|
Fan YJ, Allen JE, McLoughlin KS, Shi D, Bennion BJ, Zhang X, Lightstone FC. Evaluating point-prediction uncertainties in neural networks for protein-ligand binding prediction. Artificial Intelligence Chemistry 2023; 1:100004. [PMID: 37583465 PMCID: PMC10426331 DOI: 10.1016/j.aichem.2023.100004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2023]
Abstract
Neural Network (NN) models provide potential to speed up the drug discovery process and reduce its failure rates. The success of NN models requires uncertainty quantification (UQ) as drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Some methods require changing the NN architecture or training procedure, limiting the selection of NN models. Moreover, predictive uncertainty can come from different sources. It is important to have the ability to separately model different types of predictive uncertainty, as the model can take assorted actions depending on the source of uncertainty. In this paper, we examine UQ methods that estimate different sources of predictive uncertainty for NN models aiming at protein-ligand binding prediction. We use our prior knowledge on chemical compounds to design the experiments. By utilizing a visualization method we create non-overlapping and chemically diverse partitions from a collection of chemical compounds. These partitions are used as training and test set splits to explore NN model uncertainty. We demonstrate how the uncertainties estimated by the selected methods describe different sources of uncertainty under different partitions and featurization schemes and the relationship to prediction error.
Affiliation(s)
- Ya Ju Fan
- Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 7000 East Ave., Livermore, CA, USA
| | - Jonathan E. Allen
- Biological Science and Security Center, Lawrence Livermore National Laboratory, Livermore, CA, USA
| | - Kevin S. McLoughlin
- Biological Science and Security Center, Lawrence Livermore National Laboratory, Livermore, CA, USA
| | - Da Shi
- Biomedical Informatics and Data Science Directorate, Frederick National Laboratory for Cancer Research, Frederick, MD, USA
| | - Brian J. Bennion
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
| | - Xiaohua Zhang
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
| | - Felice C. Lightstone
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
| |
|
22
|
Dutschmann TM, Kinzel L, Ter Laak A, Baumann K. Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation. J Cheminform 2023; 15:49. [PMID: 37118768 PMCID: PMC10142532 DOI: 10.1186/s13321-023-00709-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/03/2022] [Accepted: 03/10/2023] [Indexed: 04/30/2023] Open
Abstract
It is insightful to report, in addition to the prediction itself, an estimator that describes how certain a model is in that prediction. For regression tasks, most approaches implement a variation of the ensemble method, with few exceptions. Instead of a single estimator, a group of estimators yields several predictions for an input. The uncertainty can then be quantified by measuring the disagreement between the predictions, for example by the standard deviation. In theory, ensembles should not only provide uncertainties but also boost predictive performance by reducing errors arising from variance. Despite the development of novel methods, ensembles are still considered the "gold standard" for quantifying the uncertainty of regression models. Subsampling-based methods to obtain ensembles can be applied to all models, regardless of whether they are related to deep learning or traditional machine learning. However, little attention has been given to the question of whether the ensemble method is applicable to virtually all scenarios occurring in the field of cheminformatics. In a widespread and diversified attempt, ensembles are evaluated for 32 datasets of different sizes and modeling difficulty, ranging from physicochemical properties to biological activities. For increasing ensemble sizes with up to 200 members, the predictive performance as well as the applicability as uncertainty estimator are shown for all combinations of five modeling techniques and four molecular featurizations. Useful recommendations were derived for practitioners regarding the success and minimum size of ensembles, depending on whether predictive performance or uncertainty quantification is of more importance for the task at hand.
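The ensemble idea described in this abstract reduces to a few lines once the member predictions are available. The sketch below assumes the members have already been trained (for example on different subsamples) and only shows the aggregation step; `ensemble_prediction` is a hypothetical name, and the sample standard deviation is one of several possible disagreement measures.

```python
import statistics

def ensemble_prediction(member_predictions):
    """Aggregate the predictions of all ensemble members for one input.

    Returns the mean as the point prediction and the sample standard
    deviation as the uncertainty estimate. Needs at least two members.
    """
    mean = statistics.fmean(member_predictions)
    spread = statistics.stdev(member_predictions)
    return mean, spread

# Four hypothetical members predicting the same property for one compound.
mean, spread = ensemble_prediction([2.0, 2.5, 3.0, 2.5])
```

A larger spread signals stronger disagreement between members and therefore a less certain prediction.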
Affiliation(s)
- Thomas-Martin Dutschmann
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
| | - Lennart Kinzel
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
| | - Antonius Ter Laak
- Bayer AG, Research & Development, Pharmaceuticals, Muellerstrasse 178, 13353, Berlin, Germany
| | - Knut Baumann
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany.
| |
|
23
|
Zhan H, Zhu X, Qiao Z, Hu J. Graph Neural Tree: A novel and interpretable deep learning-based framework for accurate molecular property predictions. Anal Chim Acta 2023; 1244:340558. [PMID: 36737143 DOI: 10.1016/j.aca.2022.340558] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/10/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
Determining various properties of molecules is a critical step in drug discovery. Recently, with the improvement of large heterogeneous datasets and the development of deep learning approaches, more and more scientists have turned their attention to neural network-based virtual preliminary screening to reduce the time and monetary cost of drug discovery. However, the poor interpretability of deep learning masks causality, so models' conclusions are often beyond the comprehension of human users, which reduces the credibility of the model and makes it difficult for chemists to further narrow the huge chemical space based on models' results. Thus, this study develops a novel framework consisting of Graph Neural Networks for feature extraction, Curriculum-Based Learning Strategies for optimization, and a Learning Binary Neural Tree (LBNT) for prediction, to improve the performance of neural networks and reveal their decision-making process to chemists. The framework encodes molecular graph data with graph neural networks (GNNs), then retrains the encoder with curriculum-based learning strategies to reduce uncertainty and improve accuracy, and finally uses LBNT as the predictor for prediction and visualization; the LBNT is first trained independently and then jointly retrained with the encoder. The framework is validated on public datasets and compared to single GNNs with normal training strategies as well as GNN encoders with common machine learning predictors instead of the LBNT predictor. The results reveal that the proposed framework enhances the point prediction accuracy of the completely trained GNN and reduces its uncertainty through curriculum-based learning, and further improves the accuracy by combining it with LBNT. Moreover, compared with common machine learning tools, the LBNT predictor generally has the best performance because of joint retraining with the GNN encoder. The decision-making process of LBNT is also easier to explain than that of other models.
Affiliation(s)
- Haolin Zhan
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China; College of Economics and Statistics, Guangzhou University, Guangzhou, China
| | - Xin Zhu
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China.
| | - Zhiwei Qiao
- Guangzhou Key Laboratory for New Energy and Green Catalysis, School of Chemistry and Chemical Engineering, Guangzhou University, Guangzhou, China; Joint Institute of Guangzhou University & Institute of Corrosion Science and Technology, Guangzhou University, Guangzhou, 510006, China.
| | - Jianming Hu
- College of Economics and Statistics, Guangzhou University, Guangzhou, China.
| |
|
24
|
Aouichaoui ARN, Fan F, Mansouri SS, Abildskov J, Sin G. Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models. J Chem Inf Model 2023; 63:725-744. [PMID: 36716461 DOI: 10.1021/acs.jcim.2c01091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2023]
Abstract
Quantitative structure-property relationships (QSPRs) are important tools to facilitate and accelerate the discovery of compounds with desired properties. While many QSPRs have been developed, they are associated with various shortcomings such as a lack of generalizability and modest accuracy. Although various machine-learning and deep-learning techniques have been integrated into such models, another shortcoming has emerged in the form of a lack of transparency and interpretability. In this work, two interpretable graph neural network (GNN) models (attentive group-contribution (AGC) and group-contribution-based graph attention (GroupGAT)) are developed by integrating fundamentals using the concept of group contributions (GC). The interpretability consists of highlighting the substructure with the highest attention weights in the latent representation of the molecules using the attention mechanism. The proposed models showed better performance than classical group-contribution models, as well as various other GNN models, in describing the aqueous solubility, melting point, and enthalpies of formation, combustion, and fusion of organic compounds. The insights provided are consistent with insights obtained from the semiempirical GC models, confirming that the proposed framework allows highlighting the important substructures of the molecules for a specific property.
Affiliation(s)
- Adem R N Aouichaoui
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| | - Fan Fan
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| | - Seyed Soheil Mansouri
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| | - Jens Abildskov
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| | - Gürkan Sin
- Process and Systems Engineering Center (PROSYS), Department of Chemical and Biochemical Engineering, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
| |
|
25
|
Yang CI, Li YP. Explainable uncertainty quantifications for deep learning-based molecular property prediction. J Cheminform 2023; 15:13. [PMID: 36737786 PMCID: PMC9898940 DOI: 10.1186/s13321-023-00682-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/10/2022] [Accepted: 01/15/2023] [Indexed: 02/05/2023] Open
Abstract
Quantifying uncertainty in machine learning is important in new research areas with scarce high-quality data. In this work, we develop an explainable uncertainty quantification method for deep learning-based molecular property prediction. This method can capture aleatoric and epistemic uncertainties separately and attribute the uncertainties to atoms present in the molecule. The atom-based uncertainty method provides an extra layer of chemical insight to the estimated uncertainties, i.e., one can analyze individual atomic uncertainty values to diagnose the chemical component that introduces uncertainty to the prediction. Our experiments suggest that atomic uncertainty can detect unseen chemical structures and identify chemical species whose data are potentially associated with significant noise. Furthermore, we propose a post-hoc calibration method to refine the uncertainty quantified by ensemble models for better confidence interval estimates. This work improves uncertainty calibration and provides a framework for assessing whether and why a prediction should be considered unreliable.
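The separation of aleatoric and epistemic uncertainty mentioned in this abstract is often computed from an ensemble in which each member predicts a mean and a variance. The following is a minimal sketch of that common decomposition, not the paper's atom-attribution method; the function name `decompose_uncertainty` and its inputs are illustrative assumptions.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split ensemble predictive variance for a single molecule.

    means[i] and variances[i] are the mean and variance predicted by
    ensemble member i (hypothetical inputs, not tied to any library).
    Returns (prediction, aleatoric, epistemic).
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one (mean, variance) pair per ensemble member")
    prediction = fmean(means)
    aleatoric = fmean(variances)   # average data noise predicted by the members
    epistemic = pvariance(means)   # disagreement between the members
    return prediction, aleatoric, epistemic


pred, alea, epis = decompose_uncertainty([1.0, 2.0, 3.0], [0.1, 0.2, 0.3])
print(pred, alea, epis)
```

The total predictive variance under this scheme is the sum of the two terms; how it is attributed to individual atoms is specific to the paper and is not reproduced here.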
Affiliation(s)
- Chu-I Yang
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan
| | - Yi-Pei Li
- Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617 Taiwan; Taiwan International Graduate Program (TIGP), Academia Sinica, No. 128, Sec. 2, Academia Road, Taipei, 11529 Taiwan
| |
|
26
|
Stoyanova R, Katzberger PM, Komissarov L, Khadhraoui A, Sach-Peltason L, Groebke Zbinden K, Schindler T, Manevski N. Computational Predictions of Nonclinical Pharmacokinetics at the Drug Design Stage. J Chem Inf Model 2023; 63:442-458. [PMID: 36595708 DOI: 10.1021/acs.jcim.2c01134] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 01/05/2023]
Abstract
Although computational predictions of pharmacokinetics (PK) are desirable at the drug design stage, existing approaches are often limited by prediction accuracy and human interpretability. Using a discovery data set of mouse and rat PK studies at Roche (9,685 unique compounds), we performed a proof-of-concept study to predict key PK properties from chemical structure alone, including plasma clearance (CLp), volume of distribution at steady-state (Vss), and oral bioavailability (F). Ten machine learning (ML) models were evaluated, including Single-Task, Multitask, and transfer learning approaches (i.e., pretraining with in vitro data). In addition to prediction accuracy, we emphasized human interpretability of outcomes, especially the quantification of uncertainty, applicability domains, and explanations of predictions in terms of molecular features. Results show that intravenous (IV) PK properties (CLp and Vss) can be predicted with good precision (average absolute fold error, AAFE of 1.96-2.84 depending on data split) and low bias (average fold error, AFE of 0.98-1.36), with AutoGluon, Gaussian Process Regressor (GP), and ChemProp displaying the best performance. Driven by higher complexity of oral PK studies, predictions of F were more challenging, with the best AAFE values of 2.35-2.60 and higher overprediction bias (AFE of 1.45-1.62). Multi-Task approaches and pretraining of ChemProp neural networks with in vitro data showed similar precision to Single-Task models but helped reduce the bias and increase correlations between observations and predictions. A combination of GP-computed prediction variance, molecular clustering, and dimensionality-reduction provided valuable quantitative insights into prediction uncertainty and applicability domains. SHAPley Additive exPlanations (SHAPs) highlighted molecular features contributing to prediction outcomes of Vss, providing explanations that could aid drug design. 
Combined results show that computational predictions of PK are feasible at the drug design stage, with several ML technologies converging to successfully leverage historical PK data sets. Further studies are needed to unlock the full potential of this approach, especially with respect to data set sizes and quality, transfer learning between in vitro and in vivo data sets, model-independent quantification of uncertainty, and explainability of predictions.
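The fold-error metrics quoted in this abstract (AFE for bias, AAFE for precision) are usually defined as geometric means of the predicted/observed ratio. The sketch below assumes that usual definition; the abstract itself does not spell out the formula, and the function name `fold_errors` is illustrative.

```python
import math


def fold_errors(predicted, observed):
    """Return (AFE, AAFE) for paired positive predictions and observations.

    AFE  = 10 ** mean(log10(pred / obs))      (bias; 1.0 means unbiased)
    AAFE = 10 ** mean(|log10(pred / obs)|)    (precision; 1.0 means perfect)
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need equally sized, non-empty inputs")
    if any(v <= 0 for v in list(predicted) + list(observed)):
        raise ValueError("fold errors are only defined for positive values")
    logs = [math.log10(p / o) for p, o in zip(predicted, observed)]
    n = len(logs)
    afe = 10 ** (sum(logs) / n)
    aafe = 10 ** (sum(abs(x) for x in logs) / n)
    return afe, aafe


# One two-fold overprediction and one two-fold underprediction cancel in
# the bias (AFE) but not in the precision (AAFE).
print(fold_errors([2.0, 0.5], [1.0, 1.0]))
```

Under this definition an AAFE of about 2 means predictions are, on average, within a factor of two of the observation, and an AFE above 1 indicates overprediction.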
Affiliation(s)
- Raya Stoyanova
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Paul Maximilian Katzberger
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Leonid Komissarov
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Aous Khadhraoui
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Lisa Sach-Peltason
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Katrin Groebke Zbinden
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Torsten Schindler
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| | - Nenad Manevski
- Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, 4070 Basel, Switzerland
| |
|
27
|
Wang D, Wu Z, Shen C, Bao L, Luo H, Wang Z, Yao H, Kong DX, Luo C, Hou T. Learning with uncertainty to accelerate the discovery of histone lysine-specific demethylase 1A (KDM1A/LSD1) inhibitors. Brief Bioinform 2023; 24:6961473. [PMID: 36573494 DOI: 10.1093/bib/bbac592] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/05/2022] [Revised: 12/01/2022] [Accepted: 12/02/2022] [Indexed: 12/28/2022] Open
Abstract
Machine learning, including modern deep learning models, has been extensively used in drug design and screening. However, reliable prediction of molecular properties is still challenging when exploring out-of-domain regimes, even for deep neural networks. Therefore, it is important to understand the uncertainty of model predictions, especially when the predictions are used to guide further experiments. In this study, we explored the utility and effectiveness of evidential uncertainty in compound screening. The evidential Graphormer model was proposed for uncertainty-guided discovery of KDM1A/LSD1 inhibitors. The benchmarking results illustrated that (i) Graphormer exhibited predictive power comparable to state-of-the-art models, and (ii) evidential regression enabled well-ranked uncertainty estimates and calibrated predictions. Subsequently, we leveraged time-splitting on the curated KDM1A/LSD1 dataset to simulate out-of-distribution predictions. The retrospective virtual screening showed that the evidential uncertainties helped reduce false positives among the top-acquired compounds and thus enabled higher experimental validation rates. The trained model was then used to virtually screen an independent in-house compound set. The top 50 compounds ranked by each of two different ranking strategies were experimentally validated. In general, our study highlighted the importance of understanding the uncertainty in prediction, which can be recognized as an interpretable dimension of model predictions.
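Evidential regression, as commonly formulated, has the network output the four parameters of a Normal-Inverse-Gamma distribution, from which a prediction and two uncertainty terms follow in closed form. The sketch below assumes that standard formulation; whether the evidential Graphormer uses exactly these expressions is not stated in the abstract, and the function name is illustrative.

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Closed-form outputs of a Normal-Inverse-Gamma(gamma, nu, alpha, beta) head.

    prediction = gamma
    aleatoric  = E[sigma^2] = beta / (alpha - 1)
    epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    """
    if nu <= 0 or beta <= 0 or alpha <= 1:
        raise ValueError("requires nu > 0, beta > 0 and alpha > 1")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic


print(evidential_uncertainty(gamma=1.0, nu=2.0, alpha=3.0, beta=4.0))
```

A larger `nu` (more "virtual evidence") shrinks only the epistemic term, which is the quantity one would typically rank compounds by when filtering likely false positives.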
Affiliation(s)
- Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China; CarbonSilicon AI Technology Co., Ltd, Hangzhou 310018, Zhejiang, China
| | - Lingjie Bao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Hao Luo
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Hucheng Yao
- State Key Laboratory of Agricultural Microbiology, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - De-Xin Kong
- State Key Laboratory of Agricultural Microbiology, Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
| | - Cheng Luo
- The Center for Chemical Biology, Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
|
28
|
Xia S, Zhang D, Zhang Y. Multitask Deep Ensemble Prediction of Molecular Energetics in Solution: From Quantum Mechanics to Experimental Properties. J Chem Theory Comput 2023; 19:10.1021/acs.jctc.2c01024. [PMID: 36607141 PMCID: PMC10323048 DOI: 10.1021/acs.jctc.2c01024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
The past few years have witnessed significant advances in developing machine learning methods for molecular energetics predictions, including calculated electronic energies with high-level quantum mechanical methods and experimental properties, such as solvation free energy and logP. Typically, task-specific machine learning models are developed for distinct prediction tasks. In this work, we present a multitask deep ensemble model, sPhysNet-MT-ens5, which can simultaneously and accurately predict electronic energies of molecules in gas, water, and octanol phases, as well as transfer free energies at both calculated and experimental levels. On the calculated data set Frag20-solv-678k, which is developed in this work and contains 678,916 molecular conformations, up to 20 heavy atoms, and their properties calculated at B3LYP/6-31G* level of theory with continuum solvent models, sPhysNet-MT-ens5 predicts density functional theory (DFT)-level electronic energies directly from force field-optimized geometry within chemical accuracy. On the experimental data sets, sPhysNet-MT-ens5 achieves state-of-the-art performances, which predict both experimental hydration free energy with a RMSE of 0.620 kcal/mol on the FreeSolv data set and experimental logP with a RMSE of 0.393 on the PHYSPROP data set. Furthermore, sPhysNet-MT-ens5 also provides a reasonable estimation of model uncertainty which shows correlations with prediction error. Finally, by analyzing the atomic contributions of its predictions, we find that the developed deep learning model is aware of the chemical environment of each atom by assigning reasonable atomic contributions consistent with our chemical knowledge.
Affiliation(s)
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Dongdong Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
|
29
|
Liu Y, Zhang R, Li T, Jiang J, Ma J, Wang P. MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction. J Mol Graph Model 2023; 118:108344. [PMID: 36242862 DOI: 10.1016/j.jmgm.2022.108344] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 05/15/2022] [Revised: 09/21/2022] [Accepted: 09/21/2022] [Indexed: 11/28/2022]
Abstract
Molecular property prediction is a significant task in drug discovery. Most deep learning-based computational methods either develop a unique chemical representation or combine complex models. However, researchers have paid less attention to the possible advantages of enormous quantities of unlabeled molecular data. Because the amount of labeled data available is limited, this task becomes more difficult. In some senses, the SMILES of a drug molecule may be regarded as a language for chemistry, taking inspiration from natural language processing research and current advances in pretrained models. In this paper, we incorporated Rotary Position Embedding (RoPE) to efficiently encode the position information of SMILES sequences, ultimately enhancing the capability of the BERT pretrained model to extract potential molecular substructure information for molecular property prediction. We proposed the MolRoPE-BERT framework, a new end-to-end deep learning framework that integrates an efficient position coding approach for capturing sequence position information with a pretrained BERT model for molecular property prediction. To generate useful molecular substructure embeddings, we first exclusively train the MolRoPE-BERT on four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, we conduct a series of experiments to evaluate the performance of our proposed MolRoPE-BERT on four well-studied datasets. Compared with conventional and state-of-the-art baselines, our experiments demonstrated comparable or superior performance.
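Rotary position embedding encodes a token's position by rotating consecutive pairs of vector components through position-dependent angles, so that the dot product between a query and a key depends only on their relative offset. The sketch below is a plain-Python illustration of that idea using the interleaved-pair convention and a base of 10000; both are assumptions, and the function names are not taken from the MolRoPE-BERT code.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by position * base ** (-i / d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.1]
# Same relative offset (2 tokens apart) at two different absolute positions.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes the scheme attractive for sequences such as SMILES strings.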
Affiliation(s)
- Yunwu Liu
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| | - Ruisheng Zhang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| | - Tongfeng Li
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| | - Jing Jiang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| | - Jun Ma
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| | - Ping Wang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou city, 730000, Lanzhou, China.
| |
|
30
|
Vazquez-Salazar LI, Boittier ED, Meuwly M. Uncertainty quantification for predictions of atomistic neural networks. Chem Sci 2022; 13:13068-13084. [PMID: 36425481 PMCID: PMC9667919 DOI: 10.1039/d2sc04056e] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/20/2022] [Accepted: 10/16/2022] [Indexed: 12/31/2023] Open
Abstract
The value of uncertainty quantification on predictions for trained neural networks (NNs) on quantum chemical reference data is quantitatively explored. For this, the architecture of the PhysNet NN was suitably modified and the resulting model (PhysNet-DER) was evaluated with different metrics to quantify its calibration, the quality of its predictions, and whether prediction error and the predicted uncertainty can be correlated. Training on the QM9 database and evaluating data in the test set within and outside the distribution indicate that error and uncertainty are not linearly related. However, the observed variance provides insight into the quality of the data used for training. Additionally, the influence of the chemical space covered by the training data set was studied by using a biased database. The results clarify that noise and redundancy complicate property prediction for molecules even in cases for which changes - such as double bond migration in two otherwise identical molecules - are small. The model was also applied to a real database of tautomerization reactions. Analysis of the distance between members in feature space in combination with other parameters shows that redundant information in the training dataset can lead to large variances and small errors whereas the presence of similar but unspecific information returns large errors but small variances. This was, e.g., observed for nitro-containing aliphatic chains for which predictions were difficult although the training set contained several examples for nitro groups bound to aromatic molecules. The finding underlines the importance of the composition of the training data and provides chemical insight into how this affects the prediction capabilities of a ML model. Finally, the presented method can be used for information-based improvement of chemical databases for target applications through active learning optimization.
Affiliation(s)
| | - Eric D Boittier
- Department of Chemistry, University of Basel, Basel, Switzerland
| | - Markus Meuwly
- Department of Chemistry, University of Basel, Basel, Switzerland
- Department of Chemistry, Brown University, USA
| |
|
31
|
Korolev V, Nevolin I, Protsenko P. A universal similarity based approach for predictive uncertainty quantification in materials science. Sci Rep 2022; 12:14931. [PMID: 36056050 PMCID: PMC9440040 DOI: 10.1038/s41598-022-19205-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Accepted: 08/25/2022] [Indexed: 11/08/2022] Open
Abstract
Immense effort has been exerted in the materials informatics community towards enhancing the accuracy of machine learning (ML) models; however, the uncertainty quantification (UQ) of state-of-the-art algorithms also demands further development. Most prominent UQ methods are model-specific or are related to the ensembles of models; therefore, there is a need to develop a universal technique that can be readily applied to a single model from a diverse set of ML algorithms. In this study, we suggest a new UQ measure known as the Δ-metric to address this issue. The presented quantitative criterion was inspired by the k-nearest neighbor approach adopted for applicability domain estimation in chemoinformatics. It surpasses several UQ methods in accurately ranking the predictive errors and could be considered a low-cost option for a more advanced deep ensemble strategy. We also evaluated the performance of the presented UQ measure on various classes of materials, ML algorithms, and types of input features, thus demonstrating its universality.
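The k-nearest neighbor idea that this abstract cites as inspiration can be illustrated with the simplest applicability-domain score: the mean distance from a query to its k closest training points. The sketch below shows only that baseline and is not the Δ-metric itself, whose definition the abstract does not give; the function name and the Euclidean distance are assumptions.

```python
import math


def knn_distance_score(query, training_set, k=3):
    """Mean Euclidean distance from `query` to its k nearest training points.

    Larger values suggest the query lies further from the training data,
    so a model's prediction for it deserves less trust.
    """
    if k < 1 or k > len(training_set):
        raise ValueError("k must be between 1 and the training set size")
    distances = sorted(math.dist(query, x) for x in training_set)
    return sum(distances[:k]) / k


train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(knn_distance_score((0.0, 0.0), train, k=2))
print(knn_distance_score((10.0, 10.0), train, k=2))
```

Because the score depends only on feature-space distances, it can be attached to any single trained model without retraining, which is the sense of "universal" and "low-cost" used in the abstract.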
Affiliation(s)
- Vadim Korolev
- Department of Chemistry, Lomonosov Moscow State University, Moscow, 119991, Russia.
| | - Iurii Nevolin
- Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Moscow, 119071, Russia
| | - Pavel Protsenko
- Department of Chemistry, Lomonosov Moscow State University, Moscow, 119991, Russia
| |
|
32
|
Abstract
The problem of human trust is one of the most fundamental problems in applied artificial intelligence in drug discovery. In silico models have been widely used to accelerate the process of drug discovery in recent years. However, most of these models can only give reliable predictions within a limited chemical space that the training set covers (applicability domain). Predictions of samples falling outside the applicability domain are unreliable and sometimes dangerous for the drug-design decision-making process. Uncertainty quantification accordingly has drawn great attention to enable autonomous drug designing. By quantifying the confidence level of model predictions, the reliability of the predictions can be quantitatively represented to assist researchers in their molecular reasoning and experimental design. Here we summarize the state-of-the-art approaches to uncertainty quantification and underline how they can be used for drug design and discovery projects. Furthermore, we also outline four representative application scenarios of uncertainty quantification in drug discovery.
Affiliation(s)
- Jie Yu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Dingyan Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| |
|
33
|
Schmähling F, Martin J, Elster C. A framework for benchmarking uncertainty in deep regression. Appl Intell 2022. [DOI: 10.1007/s10489-022-03908-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
We propose a framework for the assessment of uncertainty quantification in deep regression. The framework is based on regression problems where the regression function is a linear combination of nonlinear functions. Basically, any level of complexity can be realized through the choice of the nonlinear functions and the dimensionality of their domain. Results of an uncertainty quantification for deep regression are compared against those obtained by a statistical reference method. The reference method utilizes knowledge about the underlying nonlinear functions and is based on Bayesian linear regression using a prior reference. The flexibility, together with the availability of a reference solution, makes the framework suitable for defining benchmark sets for uncertainty quantification. Reliability of uncertainty quantification is assessed in terms of coverage probabilities, and accuracy through the size of calculated uncertainties. We illustrate the proposed framework by applying it to current approaches for uncertainty quantification in deep regression. In addition, results for three real-world regression tasks are presented.
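The two assessment criteria named in this abstract, coverage probability and the size of the stated uncertainties, can be estimated empirically from a set of predictions. The sketch below assumes symmetric Gaussian-style intervals of the form mean ± z·std; the interval construction, the default z = 1.96, and the function name are illustrative assumptions rather than the framework's own procedure.

```python
def coverage_and_width(y_true, y_pred, y_std, z=1.96):
    """Empirical coverage and mean width of the intervals y_pred ± z * y_std."""
    n = len(y_true)
    if n == 0 or len(y_pred) != n or len(y_std) != n:
        raise ValueError("inputs must be non-empty and equally sized")
    hits = 0
    total_width = 0.0
    for y, mu, s in zip(y_true, y_pred, y_std):
        half = z * s
        hits += (mu - half) <= y <= (mu + half)  # True counts as 1
        total_width += 2.0 * half
    return hits / n, total_width / n


cov, width = coverage_and_width([0.0, 1.0, 5.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(cov, width)
```

A well-calibrated method should reach roughly the nominal coverage (about 0.95 for z = 1.96) while keeping the mean width as small as possible; comparing both numbers against a reference solution is the kind of check the abstract describes.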
|
34
|
Kolluru A, Shuaibi M, Palizhati A, Shoghi N, Das A, Wood B, Zitnick CL, Kitchin JR, Ulissi ZW. Open Challenges in Developing Generalizable Large-Scale Machine-Learning Models for Catalyst Discovery. ACS Catal 2022. [DOI: 10.1021/acscatal.2c02291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/14/2022]
Affiliation(s)
- Adeesh Kolluru
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Muhammed Shuaibi
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Aini Palizhati
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Nima Shoghi
- Fundamental AI Research at Meta AI, Menlo Park, California 94025, United States
| | - Abhishek Das
- Fundamental AI Research at Meta AI, Menlo Park, California 94025, United States
| | - Brandon Wood
- Fundamental AI Research at Meta AI, Menlo Park, California 94025, United States
| | - C. Lawrence Zitnick
- Fundamental AI Research at Meta AI, Menlo Park, California 94025, United States
| | - John R. Kitchin
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Zachary W. Ulissi
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| |
|
35
|
Zhang W, Xie B. Temporal and Spatial Differences of Urban Ecological Environment and Economic Development Based on Graph Neural Network. Computational Intelligence and Neuroscience 2022; 2022:7078671. [PMID: 35774439 PMCID: PMC9239789 DOI: 10.1155/2022/7078671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Revised: 05/12/2022] [Accepted: 05/23/2022] [Indexed: 11/25/2022]
Abstract
The temporal and spatial difference between the urban ecological environment and economic development refers to the unbalanced and insufficient contradiction between the quality of urban ecological environment and the development of economic strength. Based on the relevant theories of urban ecological environment and economic development, this paper explores the development laws of urban ecology and economic development and uses graph neural network algorithm to model the spatial and temporal dependence of the city's ecological environment in a province. The quality data and economic development strength data are analyzed in detail. The analysis results show that the ecological benefit index and economic benefit index of each city in the province have reached above 0.6 after 5 years of development. The level of coordinated urban development has improved significantly compared with 2017. However, in the process of the development of the market economy, it is necessary to rationally adjust the proportion of the secondary industry and the tertiary industry in the urban production structure and continuously promote the balanced development of the economy and ecology.
Affiliation(s)
- Wenbo Zhang
- College of Resources and Environmental Sciences, Hunan Normal University, Changsha 410081, Hunan, China
- School of Geography and Environmental Engineering, Gannan Normal University, Gannan 341000, Jiangxi, China
| | - Binggeng Xie
- College of Resources and Environmental Sciences, Hunan Normal University, Changsha 410081, Hunan, China
| |
|
36
|
Kim H, Park M, Lee I, Nam H. BayeshERG: a robust, reliable and interpretable deep learning model for predicting hERG channel blockers. Brief Bioinform 2022; 23:6609519. [PMID: 35709752 DOI: 10.1093/bib/bbac211] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/19/2022] [Revised: 04/19/2022] [Accepted: 05/06/2022] [Indexed: 11/13/2022] Open
Abstract
Unintended inhibition of the human ether-à-go-go-related gene (hERG) ion channel by small molecules leads to severe cardiotoxicity. Thus, hERG channel blockage is a significant concern in the development of new drugs. Several computational models have been developed to predict hERG channel blockage, including deep learning models; however, they lack robustness, reliability and interpretability. Here, we developed a graph-based Bayesian deep learning model for hERG channel blocker prediction, named BayeshERG, which has robust predictive power, high reliability and high resolution of interpretability. First, we applied transfer learning with a large set of 300 000 data points in initial pre-training to increase the predictive performance. Second, we implemented a Bayesian neural network with Monte Carlo dropout to calibrate the uncertainty of the prediction. Third, we utilized global multihead attentive pooling to augment the high resolution of structural interpretability for the hERG channel blockers and nonblockers. We conducted both internal and external validations for stringent evaluation; in particular, we benchmarked most of the publicly available hERG channel blocker prediction models. We showed that our proposed model outperformed them in both predictive performance and uncertainty calibration. Furthermore, we found that our model learned to focus on the essential substructures of hERG channel blockers via an attention mechanism. Finally, we validated the prediction results of our model by conducting in vitro experiments and confirmed its high validity. In summary, BayeshERG could serve as a versatile tool for discovering hERG channel blockers and helping maximize the possibility of successful drug discovery. The data and source code are available at our GitHub repository (https://github.com/GIST-CSBL/BayeshERG).
Affiliation(s)
- Hyunho Kim
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-gu, Gwangju, 61005, Republic of Korea
| | - Minsu Park
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-gu, Gwangju, 61005, Republic of Korea
| | - Ingoo Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-gu, Gwangju, 61005, Republic of Korea
| | - Hojung Nam
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-gu, Gwangju, 61005, Republic of Korea
|
37
|
Ismail I, Robertson C, Habershon S. Successes and challenges in using machine-learned activation energies in kinetic simulations. J Chem Phys 2022; 157:014109. [DOI: 10.1063/5.0096027] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022]
Abstract
The prediction of the thermodynamic and kinetic properties of chemical reactions is increasingly being addressed by machine-learning (ML) methods such as artificial neural networks (ANNs). While a number of recent studies have reported success in predicting chemical reaction activation energies, less attention has focused on how the accuracy of ML predictions filters through to predictions of macroscopic observables. Here, we consider the impact of the uncertainty associated with ML prediction of activation energies on observable properties of chemical reaction networks, as given by microkinetics simulations based on ML-predicted reaction rates. After training an ANN to predict activation energies given standard molecular descriptors for reactants and products alone, we performed microkinetics simulations of three different prototypical reaction networks: formamide decomposition, aldol reactions and decomposition of 3-hydroperoxypropanal. We find that the kinetic modelling predictions can be in excellent agreement with corresponding simulations performed with ab initio calculations, but this is dependent on the inherent energetic landscape of the networks. We use these simulations to suggest some guidelines for when ML-based activation energies can be reliable, and when one should take more care in applications to kinetics modelling.
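As a rough illustration of how an activation-energy error filters through to a rate constant, the sketch below assumes a transition-state-theory (Eyring) rate expression with a transmission coefficient of one. The abstract does not state which rate expression the authors used, and the function names `eyring_rate` and `rate_error_factor` are hypothetical.

```python
import math

R = 8.314462618      # gas constant, J mol^-1 K^-1
K_B = 1.380649e-23   # Boltzmann constant, J K^-1
H = 6.62607015e-34   # Planck constant, J s


def eyring_rate(ea_kj_mol, temperature=298.15):
    """Rate constant (s^-1) for a barrier in kJ/mol, assuming an Eyring-type expression."""
    return (K_B * temperature / H) * math.exp(-ea_kj_mol * 1000.0 / (R * temperature))


def rate_error_factor(ea_error_kj_mol, temperature=298.15):
    """Multiplicative error in the rate constant caused by an error in the barrier."""
    return math.exp(abs(ea_error_kj_mol) * 1000.0 / (R * temperature))


# Assumed example: a 100 kJ/mol barrier predicted with a 10 kJ/mol error.
k_reference = eyring_rate(100.0)
k_predicted = eyring_rate(110.0)
factor = rate_error_factor(10.0)
```

Under these assumptions, a 10 kJ/mol error at 298.15 K changes the rate constant by more than an order of magnitude, which is one way to see why agreement with ab initio kinetics can depend on the energetic landscape of the network.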
Affiliation(s)
| | | | - Scott Habershon
- Department of Chemistry, University of Warwick, United Kingdom
|
38
|
|
39
|
Obrezanova O, Martinsson A, Whitehead T, Mahmoud S, Bender A, Miljković F, Grabowski P, Irwin B, Oprisiu I, Conduit G, Segall M, Smith GF, Williamson B, Winiwarter S, Greene N. Prediction of In Vivo Pharmacokinetic Parameters and Time-Exposure Curves in Rats Using Machine Learning from the Chemical Structure. Mol Pharm 2022; 19:1488-1504. [PMID: 35412314 DOI: 10.1021/acs.molpharmaceut.2c00027] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 11/30/2022]
Abstract
Animal pharmacokinetic (PK) data as well as human and animal in vitro systems are utilized in drug discovery to define the rate and route of drug elimination. Accurate prediction and mechanistic understanding of drug clearance and disposition in animals provide a degree of confidence for extrapolation to humans. In addition, prediction of in vivo properties can be used to improve design during drug discovery, help select compounds with better properties, and reduce the number of in vivo experiments. In this study, we generated machine learning models able to predict rat in vivo PK parameters and concentration-time PK profiles based on the molecular chemical structure and either measured or predicted in vitro parameters. The models were trained on internal in vivo rat PK data for over 3000 diverse compounds from multiple projects and therapeutic areas, and the predicted endpoints include clearance and oral bioavailability. We compared the performance of various traditional machine learning algorithms and deep learning approaches, including graph convolutional neural networks. The best models for PK parameters achieved R² = 0.63 [root mean squared error (RMSE) = 0.26] for clearance and R² = 0.55 (RMSE = 0.46) for bioavailability. The models provide a fast and cost-efficient way to guide the design of molecules with optimal PK profiles, to enable the prediction of virtual compounds at the point of design, and to drive prioritization of compounds for in vivo assays.
Affiliation(s)
- Olga Obrezanova
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Cambridge CB4 0FZ, U.K
| | - Anton Martinsson
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Gothenburg SE-43183, Sweden
| | - Tom Whitehead
- Intellegens Ltd., Eagle Labs, Cambridge CB4 3AZ, U.K
| | - Samar Mahmoud
- Optibrium Ltd., Cambridge Innovation Park, Cambridge CB25 9PB, U.K
| | - Andreas Bender
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Cambridge CB4 0FZ, U.K.; Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Cambridge CB2 1EW, U.K
| | - Filip Miljković
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Gothenburg SE-43183, Sweden
| | - Piotr Grabowski
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Cambridge CB4 0FZ, U.K
| | - Ben Irwin
- Optibrium Ltd., Cambridge Innovation Park, Cambridge CB25 9PB, U.K
| | - Ioana Oprisiu
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Gothenburg SE-43183, Sweden
| | | | - Matthew Segall
- Optibrium Ltd., Cambridge Innovation Park, Cambridge CB25 9PB, U.K
| | - Graham F Smith
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Cambridge CB4 0FZ, U.K
| | - Beth Williamson
- Drug Metabolism and Pharmacokinetics, Research and Early Development, Oncology R&D, AstraZeneca, Cambridge CB10 1XL, U.K
| | - Susanne Winiwarter
- Drug Metabolism and Pharmacokinetics, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), Biopharmaceutical R&D, AstraZeneca, Gothenburg SE-43183, Sweden
| | - Nigel Greene
- Imaging and Data Analytics, Clinical Pharmacology & Safety Sciences, R&D, AstraZeneca, Waltham, Massachusetts 02451, United States
|
40
|
Aouichaoui ARN, Mansouri SS, Abildskov J, Sin G. Uncertainty estimation in deep learning-based property models: Graph neural networks applied to the critical properties. AIChE J 2022. [DOI: 10.1002/aic.17696] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Affiliation(s)
- Adem R. N. Aouichaoui
- Department of Chemical and Biochemical Engineering, Technical University of Denmark, Lyngby, Denmark
| | - Seyed Soheil Mansouri
- Department of Chemical and Biochemical Engineering, Technical University of Denmark, Lyngby, Denmark
| | - Jens Abildskov
- Department of Chemical and Biochemical Engineering, Technical University of Denmark, Lyngby, Denmark
| | - Gürkan Sin
- Department of Chemical and Biochemical Engineering, Technical University of Denmark, Lyngby, Denmark
|
41
|
Moon S, Zhung W, Yang S, Lim J, Kim WY. PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions. Chem Sci 2022; 13:3661-3673. [PMID: 35432900 PMCID: PMC8966633 DOI: 10.1039/d1sc06946b] [Citation(s) in RCA: 56] [Impact Index Per Article: 28.0] [Received: 12/13/2021] [Accepted: 02/06/2022] [Indexed: 12/21/2022]
Abstract
Recently, deep neural network (DNN)-based drug-target interaction (DTI) models were highlighted for their high accuracy with affordable computational costs. Yet, the models' insufficient generalization remains a challenging problem in the practice of in silico drug discovery. We propose two key strategies to enhance generalization in the DTI model. The first is to predict the atom-atom pairwise interactions via physics-informed equations parameterized with neural networks and to provide the total binding affinity of a protein-ligand complex as their sum. The second is to augment the training data with a broader range of binding poses and ligands, which further improved model generalization. We validated our model, PIGNet, in the comparative assessment of scoring functions (CASF) 2016, demonstrating docking and screening powers superior to those of previous methods. Our physics-informing strategy also enables the interpretation of predicted affinities by visualizing the contribution of ligand substructures, providing insights for further ligand optimization.
Affiliation(s)
- Seokhyun Moon
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Wonho Zhung
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Soojung Yang
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Jaechang Lim
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
| | - Woo Youn Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
- KI for Artificial Intelligence, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
|
42
|
Combined Anomaly Detection Framework for Digital Twins of Water Treatment Facilities. Water 2022. [DOI: 10.3390/w14071001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/04/2023]
Abstract
Digital twins of cyber-physical systems with automated process control systems using programmable logic controllers (PLCs) are increasingly popular nowadays. At the same time, cyber-physical security is also a growing concern with system connectivity. This study develops a combined anomaly detection framework (CADF) against various types of security attacks on the digital twin of process control in water treatment facilities. CADF utilizes the PLC-based whitelist system to detect anomalies that target the actuators and the deep learning approach of natural gradient boosting (NGBoost) and probabilistic assessment to detect anomalies that target the sensors. The effectiveness of CADF is verified using a physical facility for water treatment with membrane processes called the Secure Water Treatment (SWaT) system in the Singapore University of Technology and Design. Various attack scenarios are tested in SWaT by falsifying the reported values of sensors and actuators in the digital twin process. These scenarios include both trivial attacks, which are commonly studied, as well as non-trivial (i.e., sophisticated) attacks, which are rarely reported. The results show that CADF performs very well with good detection accuracy in all scenarios, and particularly, it is able to detect all sophisticated attacks while ongoing before they can induce damage to the water treatment facility. CADF can be further extended to other cyber-physical systems in the future.
|
43
|
Bilodeau C, Jin W, Jaakkola T, Barzilay R, Jensen KF. Generative models for molecular discovery: Recent advances and challenges. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1608] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Affiliation(s)
- Camille Bilodeau
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Klavs F. Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
44
|
Accurate predictions of drugs aqueous solubility via deep learning tools. J Mol Struct 2022. [DOI: 10.1016/j.molstruc.2021.131562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
45
|
Greenman KP, Green WH, Gómez-Bombarelli R. Multi-fidelity prediction of molecular optical peaks with deep learning. Chem Sci 2022; 13:1152-1162. [PMID: 35211282 PMCID: PMC8790778 DOI: 10.1039/d1sc05677h] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 10/14/2021] [Accepted: 01/04/2022] [Indexed: 01/24/2023]
Abstract
Optical properties are central to molecular design for many applications, including solar cells and biomedical imaging. A variety of ab initio and statistical methods have been developed for their prediction, each with a trade-off between accuracy, generality, and cost. Existing theoretical methods such as time-dependent density functional theory (TD-DFT) are generalizable across chemical space because of their robust physics-based foundations but still exhibit random and systematic errors with respect to experiment despite their high computational cost. Statistical methods can achieve high accuracy at a lower cost, but data sparsity and unoptimized molecule and solvent representations often limit their ability to generalize. Here, we utilize directed message passing neural networks (D-MPNNs) to represent both dye molecules and solvents for predictions of molecular absorption peaks in solution. Additionally, we demonstrate a multi-fidelity approach based on an auxiliary model trained on over 28 000 TD-DFT calculations that further improves accuracy and generalizability, as shown through rigorous splitting strategies. Combining several openly-available experimental datasets, we benchmark these methods against a state-of-the-art regression tree algorithm and compare the D-MPNN solvent representation to several alternatives. Finally, we explore the interpretability of the learned representations using dimensionality reduction and evaluate the use of ensemble variance as an estimator of the epistemic uncertainty in our predictions of molecular peak absorption in solution. The prediction methods proposed herein can be integrated with active learning, generative modeling, and experimental workflows to enable the more rapid design of molecules with targeted optical properties.
Affiliation(s)
- Kevin P Greenman
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - Rafael Gómez-Bombarelli
- Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
|
46
|
Lee K, Jang J, Seo S, Lim J, Kim WY. Drug-likeness scoring based on unsupervised learning. Chem Sci 2022; 13:554-565. [PMID: 35126987 PMCID: PMC8729801 DOI: 10.1039/d1sc05248a] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/22/2021] [Accepted: 12/10/2021] [Indexed: 01/20/2023]
Abstract
Drug-likeness prediction is important for the virtual screening of drug candidates. It is challenging because drug-likeness is presumably associated with the whole set of properties necessary to pass through clinical trials, and thus no definite data for regression are available. Recently, binary classification models based on graph neural networks have been proposed, but their performance depends strongly on the choice of the negative set for training. Here we propose a novel unsupervised learning model that requires only known drugs for training. We adopted a language model based on a recurrent neural network for unsupervised learning. It showed relatively consistent performance across different datasets, unlike such classification models. In addition, the unsupervised learning model provides drug-likeness scores whose distributions are well separated, with mean values increasing in the order of datasets composed of molecules at later steps of the drug development process, whereas the classification model predicted a polarized distribution with two extreme values for all datasets, presumably due to overconfident prediction for unseen data. Thus, this new concept offers a pragmatic tool for drug-likeness scoring and can further be applied to other biochemical applications.
Affiliation(s)
- Kyunghoon Lee
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Jinho Jang
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Seonghwan Seo
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Jaechang Lim
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
| | - Woo Youn Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
- KI for Artificial Intelligence, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
|
47
|
Abstract
Mistrust is a major barrier to implementing deep learning in healthcare settings. Entrustment could be earned by conveying model certainty, or the probability that a given model output is accurate, but the use of uncertainty estimation for deep learning entrustment is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We searched Embase, MEDLINE, and PubMed databases for articles relevant to study objectives, complying with PRISMA guidelines, rated study quality using validated tools, and extracted data according to modified CHARMS criteria. Among 30 included studies, 24 described medical imaging applications. All imaging model architectures used convolutional neural networks or a variation thereof. The predominant method for quantifying uncertainty was Monte Carlo dropout, producing predictions from multiple networks for which different neurons have dropped out and measuring variance across the distribution of resulting predictions. Conformal prediction offered similar strong performance in estimating uncertainty, along with ease of interpretation and application not only to deep learning but also to other machine learning approaches. Among the six articles describing non-imaging applications, model architectures and uncertainty estimation methods were heterogeneous, but predictive performance was generally strong, and uncertainty estimation was effective in comparing modeling methods. Overall, the use of model learning curves to quantify epistemic uncertainty (attributable to model parameters) was sparse. Heterogeneity in reporting methods precluded the performance of a meta-analysis. 
Uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare. Efficient maturation of this field will require standardized guidelines for reporting performance and uncertainty metrics.
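The Monte Carlo dropout procedure described in the abstract above (repeated predictions with different neurons dropped, then the variance across those predictions) can be sketched as follows. This is a minimal NumPy illustration on a toy two-layer network with random weights, not code from any of the reviewed studies; the function name `mc_dropout_predict` and all sizes are assumptions.

```python
import numpy as np


def mc_dropout_predict(x, w1, w2, p_drop=0.5, n_samples=200, seed=0):
    """Mean and variance over stochastic forward passes with dropout left on."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_samples):
        hidden = np.maximum(0.0, x @ w1)            # ReLU hidden layer
        mask = rng.random(hidden.shape) >= p_drop   # drop each unit with probability p_drop
        hidden = hidden * mask / (1.0 - p_drop)     # inverted-dropout rescaling
        preds.append(hidden @ w2)
    preds = np.stack(preds)                         # shape: (n_samples, n_inputs, 1)
    return preds.mean(axis=0), preds.var(axis=0)


# Toy setup (assumed): 3 inputs with 4 features, 16 hidden units, 1 output.
rng = np.random.default_rng(42)
w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 1))
x = rng.normal(size=(3, 4))
mean, var = mc_dropout_predict(x, w1, w2)
```

Here `var` plays the role of the per-input uncertainty estimate: inputs whose predictions change a lot when different units are dropped receive a larger variance. In practice the weights would come from a network trained with dropout rather than from random draws.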
|
48
|
Abstract
Quantitative structure-activity relationship (QSAR) models are routinely applied computational tools in the drug discovery process. QSAR models are regression or classification models that predict the biological activities of molecules based on the features derived from their molecular structures. These models are usually used to prioritize a list of candidate molecules for future laboratory experiments and to help chemists gain better insights into how structural changes affect a molecule's biological activities. Developing accurate and interpretable QSAR models is therefore of the utmost importance in the drug discovery process. Deep neural networks, which are powerful supervised learning algorithms, have shown great promise for addressing regression and classification problems in various research fields, including the pharmaceutical industry. In this chapter, we briefly review the applications of deep neural networks in QSAR modeling and describe commonly used techniques to improve model performance.
|
49
|
Busk J, Bjørn Jørgensen P, Bhowmik A, Schmidt MN, Winther O, Vegge T. Calibrated uncertainty for molecular property prediction using ensembles of message passing neural networks. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/ac3eb3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/13/2022]
Abstract
Data-driven methods based on machine learning have the potential to accelerate computational analysis of atomic structures. In this context, reliable uncertainty estimates are important for assessing confidence in predictions and enabling decision making. However, machine learning models can produce badly calibrated uncertainty estimates and it is therefore crucial to detect and handle uncertainty carefully. In this work we extend a message passing neural network designed specifically for predicting properties of molecules and materials with a calibrated probabilistic predictive distribution. The method presented in this paper differs from previous work by considering both aleatoric and epistemic uncertainty in a unified framework, and by recalibrating the predictive distribution on unseen data. Through computer experiments, we show that our approach results in accurate models for predicting molecular formation energies with well calibrated uncertainty in and out of the training data distribution on two public molecular benchmark datasets, QM9 and PC9. The proposed method provides a general framework for training and evaluating neural network ensemble models that are able to produce accurate predictions of properties of molecules with well calibrated uncertainty estimates.
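One common way to combine aleatoric and epistemic uncertainty in an ensemble, sketched below, is to let each member predict a mean and a variance and then apply the law of total variance to the equally weighted mixture. Whether this matches the authors' exact formulation is an assumption, and the recalibration step shown (a single variance scaling factor fitted on held-out data) is only a simple illustrative choice; the names `combine_ensemble` and `variance_scale` are hypothetical.

```python
import numpy as np


def combine_ensemble(means, variances):
    """Combine per-member predictions of shape (n_members, n_points)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = means.mean(axis=0)             # ensemble mean prediction
    aleatoric = variances.mean(axis=0)    # average predicted noise variance
    epistemic = means.var(axis=0)         # disagreement between members
    return mean, aleatoric, epistemic, aleatoric + epistemic


def variance_scale(y_true, mean, total_var):
    """Scalar factor that rescales the variance so squared z-scores average to one."""
    z_squared = (np.asarray(y_true, dtype=float) - mean) ** 2 / total_var
    return z_squared.mean()


# Assumed toy numbers: two ensemble members, two molecules.
member_means = [[1.0, 2.0], [3.0, 2.0]]
member_vars = [[0.5, 0.1], [0.5, 0.3]]
mean, aleatoric, epistemic, total = combine_ensemble(member_means, member_vars)
```

In this toy case the members disagree on the first molecule and agree on the second, so only the first receives a nonzero epistemic term. A scale factor above one from `variance_scale` on held-out data would indicate that the raw variances are too small (overconfident).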
|
50
|
Murad A, Kraemer FA, Bach K, Taylor G. Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting. Sensors 2021; 21:s21238009. [PMID: 34884011 PMCID: PMC8659533 DOI: 10.3390/s21238009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Revised: 11/24/2021] [Accepted: 11/25/2021] [Indexed: 11/16/2022]
Abstract
Data-driven forecasts of air quality have recently achieved more accurate short-term predictions. However, despite their success, most of the current data-driven solutions lack proper quantifications of model uncertainty that communicate how much to trust the forecasts. Recently, several practical tools to estimate uncertainty have been developed in probabilistic deep learning. However, there have not been empirical applications and extensive comparisons of these tools in the domain of air quality forecasts. Therefore, this work applies state-of-the-art techniques of uncertainty quantification in a real-world setting of air quality forecasts. Through extensive experiments, we describe training probabilistic models and evaluate their predictive uncertainties based on empirical performance, reliability of confidence estimate, and practical applicability. We also propose improving these models using "free" adversarial training and exploiting temporal and spatial correlation inherent in air quality data. Our experiments demonstrate that the proposed models perform better than previous works in quantifying uncertainty in data-driven air quality forecasts. Overall, Bayesian neural networks provide a more reliable uncertainty estimate but can be challenging to implement and scale. Other scalable methods, such as deep ensemble, Monte Carlo (MC) dropout, and stochastic weight averaging-Gaussian (SWAG), can perform well if applied correctly but with different tradeoffs and slight variations in performance metrics. Finally, our results show the practical impact of uncertainty estimation and demonstrate that, indeed, probabilistic models are more suitable for making informed decisions.
Affiliation(s)
- Abdulmajid Murad
- Department of Information Security and Communication Technology, Norwegian University of Science and Technology, 7491 Trondheim, Norway
| | - Frank Alexander Kraemer
- Department of Information Security and Communication Technology, Norwegian University of Science and Technology, 7491 Trondheim, Norway
| | - Kerstin Bach
- Department of Computer Science, Norwegian University of Science and Technology, 7491 Trondheim, Norway
| | - Gavin Taylor
- Department of Computer Science, United States Naval Academy, Annapolis, MD 21402, USA
|