1
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the team's collective experience, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway
- Severin Elvatun
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
  - DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
2
Shi B, Zhang K, Fleet DJ, McLeod RA, Dwayne Miller RJ, Howe JY. Deep generative priors for biomolecular 3D heterogeneous reconstruction from cryo-EM projections. J Struct Biol 2024; 216:108073. [PMID: 38432598 DOI: 10.1016/j.jsb.2024.108073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 02/25/2024] [Accepted: 02/25/2024] [Indexed: 03/05/2024]
Abstract
Cryo-electron microscopy has become a powerful tool to determine three-dimensional (3D) structures of rigid biological macromolecules from noisy micrographs with single-particle reconstruction. Recently, deep neural networks, e.g., CryoDRGN, have demonstrated conformational and compositional heterogeneity of complexes. However, the lack of ground-truth conformations poses a challenge to assess the performance of heterogeneity analysis methods. In this work, variational autoencoders (VAE) with three types of deep generative priors were learned for latent variable inference and heterogeneous 3D reconstruction via Bayesian inference. More specifically, VAEs with "Variational Mixture of Posteriors" priors (VampPrior-SPR), non-parametric exemplar-based priors (ExemplarPrior-SPR) and priors from latent score-based generative models (LSGM-SPR) were quantitatively compared with CryoDRGN. We built four simulated datasets composed of hypothetical continuous conformations or discrete states of the hERG K+ channel. Empirical and quantitative comparisons of inferred latent representations were performed with affine-transformation-based metrics. These models with more informative priors gave better regularized, interpretable factorized latent representations with better conserved pairwise distances, less deformed latent distributions and lower within-cluster variances. They were also tested on experimental datasets to resolve compositional and conformational heterogeneity (50S ribosome assembly, cowpea chlorotic mottle virus, and pre-catalytic spliceosome) with comparable high resolution. Code and data are available at https://github.com/benjamin3344/DGP-SPR.
Affiliation(s)
- Bin Shi
  - Department of Materials Science and Engineering, University of Toronto, ON M5S 3H5, Canada
- Kevin Zhang
  - Department of Materials Science and Engineering, University of Toronto, ON M5S 3H5, Canada
- David J Fleet
  - Department of Computer Science, University of Toronto, ON M5S 3H5, Canada
- Robert A McLeod
  - Hitachi High-Technologies Canada, Inc. Based out of Victoria, BC, Canada, British Columbia, Canada
- R J Dwayne Miller
  - Departments of Chemistry and Physics, University of Toronto, ON M5S 3H6, Canada
- Jane Y Howe
  - Department of Materials Science and Engineering, University of Toronto, ON M5S 3H5, Canada; Department of Chemical Engineering and Applied Chemistry, University of Toronto, ON M5S 3E5, Canada
3
Monachino G, Zanchi B, Fiorillo L, Conte G, Auricchio A, Tzovara A, Faraci FD. Deep Generative Models: The winning key for large and easily accessible ECG datasets? Comput Biol Med 2023; 167:107655. [PMID: 37976830 DOI: 10.1016/j.compbiomed.2023.107655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 10/04/2023] [Accepted: 10/31/2023] [Indexed: 11/19/2023]
Abstract
Large high-quality datasets are essential for building powerful artificial intelligence (AI) algorithms capable of supporting advancement in cardiac clinical research. However, researchers working with electrocardiogram (ECG) signals struggle to access or build such datasets. The aim of the present work is to shed light on a potential solution to the lack of large and easily accessible ECG datasets. First, the main causes of this shortage are identified and examined. Afterward, the potential and limitations of cardiac data generation via deep generative models (DGMs) are analyzed in depth. These very promising algorithms have been found capable not only of generating large quantities of ECG signals but also of supporting data anonymization processes, to simplify data sharing while respecting patients' privacy. Their application could help research progress and cooperation in the name of open science. However, several aspects, such as a standardized synthetic data quality evaluation and algorithm stability, need to be further explored.
Affiliation(s)
- Giuliana Monachino
  - Institute of Digital Technologies for Personalized Healthcare - MeDiTech, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, Via la Santa 1, Lugano 6900, Switzerland; Institute of Informatics, University of Bern, Neubrückstrasse 10, Bern 3012, Switzerland
- Beatrice Zanchi
  - Institute of Digital Technologies for Personalized Healthcare - MeDiTech, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, Via la Santa 1, Lugano 6900, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Schmelzbergstrasse 26, Zurich 8091, Switzerland
- Luigi Fiorillo
  - Institute of Digital Technologies for Personalized Healthcare - MeDiTech, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, Via la Santa 1, Lugano 6900, Switzerland
- Giulio Conte
  - Division of Cardiology, Fondazione Cardiocentro Ticino, Via Tesserete 48, Lugano 6900, Switzerland; Centre for Computational Medicine in Cardiology, Faculty of Informatics, Università della Svizzera Italiana, Via la Santa 1, Lugano 6900, Switzerland
- Angelo Auricchio
  - Division of Cardiology, Fondazione Cardiocentro Ticino, Via Tesserete 48, Lugano 6900, Switzerland; Centre for Computational Medicine in Cardiology, Faculty of Informatics, Università della Svizzera Italiana, Via la Santa 1, Lugano 6900, Switzerland
- Athina Tzovara
  - Institute of Informatics, University of Bern, Neubrückstrasse 10, Bern 3012, Switzerland; Sleep Wake Epilepsy Center | NeuroTec, Department of Neurology, Inselspital, Bern University Hospital, University of Bern, Freiburgstrasse 16, Bern 3010, Switzerland
- Francesca Dalia Faraci
  - Institute of Digital Technologies for Personalized Healthcare - MeDiTech, Department of Innovative Technologies, University of Applied Sciences and Arts of Southern Switzerland, Via la Santa 1, Lugano 6900, Switzerland
4
Prada-Luengo I, Schuster V, Liang Y, Terkelsen T, Sora V, Krogh A. N-of-one differential gene expression without control samples using a deep generative model. Genome Biol 2023; 24:263. [PMID: 37974217 PMCID: PMC10655485 DOI: 10.1186/s13059-023-03104-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023] Open
Abstract
Differential analysis of bulk RNA-seq data often suffers from lack of good controls. Here, we present a generative model that replaces controls, trained solely on healthy tissues. The unsupervised model learns a low-dimensional representation and can identify the closest normal representation for a given disease sample. This enables control-free, single-sample differential expression analysis. In breast cancer, we demonstrate how our approach selects marker genes and outperforms a state-of-the-art method. Furthermore, significant genes identified by the model are enriched in driver genes across cancers. Our results show that the in silico closest normal provides a more favorable comparison than control samples.
Affiliation(s)
- Iñigo Prada-Luengo
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Viktoria Schuster
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Yuhu Liang
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Thilde Terkelsen
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Valentina Sora
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
5
Hwang U, Kim SW, Jung D, Kim S, Lee H, Seo SW, Seong JK, Yoon S. Real-world prediction of preclinical Alzheimer's disease with a deep generative model. Artif Intell Med 2023; 144:102654. [PMID: 37783547 DOI: 10.1016/j.artmed.2023.102654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2023] [Revised: 08/29/2023] [Accepted: 08/29/2023] [Indexed: 10/04/2023]
Abstract
Amyloid positivity is an early indicator of Alzheimer's disease and is necessary to determine the disease. In this study, a deep generative model is utilized to predict the amyloid positivity of cognitively normal individuals using proxy measures, such as structural MRI scans, demographic variables, and cognitive scores, instead of invasive direct measurements. Through its remarkable efficacy in handling imperfect datasets caused by missing data or labels, and imbalanced classes, the model outperforms previous studies and widely used machine learning approaches with an AUROC of 0.8609. Furthermore, this study illuminates the model's adaptability to diverse clinical scenarios, even when feature sets or diagnostic criteria differ from the training data. We identify the brain regions and variables that contribute most to classification, including the lateral occipital lobes, posterior temporal lobe, and APOE ϵ4 allele. Taking advantage of deep generative models, our approach can not only provide inexpensive, non-invasive, and accurate diagnostics for preclinical Alzheimer's disease, but also meet real-world requirements for clinical translation of a deep learning model, including transferability and interpretability.
Affiliation(s)
- Uiwon Hwang
  - Division of Digital Healthcare, Yonsei University, Wonju, 26493, Republic of Korea
- Sung-Woo Kim
  - Department of Bio-convergence Engineering, Korea University, Seoul, 02841, Republic of Korea
- Dahuin Jung
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, Republic of Korea
- SeungWook Kim
  - Department of Bio-convergence Engineering, Korea University, Seoul, 02841, Republic of Korea
- Hyejoo Lee
  - Department of Neurology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Republic of Korea; Neuroscience Center, Samsung Medical Center, Seoul, 06351, Republic of Korea
- Sang Won Seo
  - Department of Neurology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, 06351, Republic of Korea; Neuroscience Center, Samsung Medical Center, Seoul, 06351, Republic of Korea
- Joon-Kyung Seong
  - Department of Artificial Intelligence, Korea University, Seoul, 02841, Republic of Korea; School of Biomedical Engineering, Korea University, Seoul, 02841, Republic of Korea; Interdisciplinary Program in Precision Public Health, College of Health Science, Korea University, Seoul, 02841, Republic of Korea
- Sungroh Yoon
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, Republic of Korea; Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, 08826, Republic of Korea
6
Bae B, Bae H, Nam H. LOGICS: Learning optimal generative distribution for designing de novo chemical structures. J Cheminform 2023; 15:77. [PMID: 37674239 PMCID: PMC10483765 DOI: 10.1186/s13321-023-00747-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 08/23/2023] [Indexed: 09/08/2023] Open
Abstract
In recent years, the field of computational drug design has made significant strides in the development of artificial intelligence (AI) models for the generation of de novo chemical compounds with desired properties and biological activities, such as enhanced binding affinity to target proteins. These high-affinity compounds have the potential to be developed into more potent therapeutics for a broad spectrum of diseases. Due to the lack of data required for the training of deep generative models, however, some of these approaches have fine-tuned their molecular generators using data obtained from a separate predictor. While these studies show that generative models can produce structures with the desired target properties, it remains unclear whether the diversity of the generated structures and the span of their chemical space align with the distribution of the intended target molecules. In this study, we present LOGICS, a novel framework for Learning Optimal Generative distribution Iteratively for designing target-focused Chemical Structures. We address the exploration-exploitation dilemma, which weighs the choice between exploring new options and exploiting current knowledge. To tackle this issue, we incorporate experience memory and employ a layered tournament selection approach to refine the fine-tuning process. The proposed method was applied to the binding affinity optimization of two target proteins of different protein classes, κ-opioid receptors and PIK3CA, and the quality and the distribution of the generated molecules were evaluated. The results showed that LOGICS outperforms competing state-of-the-art models and generates more diverse de novo chemical structures with optimized properties. The source code is available at the GitHub repository (https://github.com/GIST-CSBL/LOGICS).
Affiliation(s)
- Bongsung Bae
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-Gu, Gwangju, 61005, Republic of Korea
- Haelee Bae
  - AI Graduate School, Gwangju Institute of Science and Technology (GIST), Buk-Gu, Gwangju, 61005, Republic of Korea
- Hojung Nam
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Buk-Gu, Gwangju, 61005, Republic of Korea
  - AI Graduate School, Gwangju Institute of Science and Technology (GIST), Buk-Gu, Gwangju, 61005, Republic of Korea
  - Center for AI-Applied High Efficiency Drug Discovery (AHEDD), Gwangju Institute of Science and Technology (GIST), Buk-Gu, Gwangju, 61005, Republic of Korea
7
Luleci F, Catbas FN. A brief introductory review to deep generative models for civil structural health monitoring. AI Civil Eng 2023; 2:9. [PMID: 37621778 PMCID: PMC10444648 DOI: 10.1007/s43503-023-00017-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2023] [Revised: 07/25/2023] [Accepted: 07/27/2023] [Indexed: 08/26/2023]
Abstract
The use of deep generative models (DGMs) such as variational autoencoders, autoregressive models, flow-based models, energy-based models, generative adversarial networks, and diffusion models has been advantageous in various disciplines due to their high data generative skills. Using DGMs has become one of the most trending research topics in Artificial Intelligence in recent years. On the other hand, the research and development endeavors in the civil structural health monitoring (SHM) area have also been very progressive owing to the increasing use of Machine Learning techniques. As such, some of the DGMs have also been used in the civil SHM field lately. This short review communication paper aims to assist researchers in the civil SHM field in understanding the fundamentals of DGMs and, consequently, to help initiate their use for current and possible future engineering applications. On this basis, this study briefly introduces the concept and mechanism of different DGMs in a comparative fashion. While preparing this short review communication, it was observed that some DGMs had not been utilized or exploited fully in the SHM area. Accordingly, some representative studies presented in the civil SHM field that use DGMs are briefly overviewed. The study also presents a short comparative discussion on DGMs, their link to the SHM, and research directions.
Affiliation(s)
- Furkan Luleci
  - Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL 32816 USA
- F. Necati Catbas
  - Department of Civil, Environmental, and Construction Engineering, University of Central Florida, Orlando, FL 32816 USA
8
Chen Z, Yang Z, Zhu L, Gao P, Matsubara T, Kanaya S, Altaf-Ul-Amin M. Learning vector quantized representation for cancer subtypes identification. Comput Methods Programs Biomed 2023; 236:107543. [PMID: 37100024 DOI: 10.1016/j.cmpb.2023.107543] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/27/2022] [Revised: 02/13/2023] [Accepted: 04/07/2023] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Defining and separating cancer subtypes is essential for facilitating personalized therapy modality and prognosis of patients. The definition of subtypes has been constantly recalibrated as a result of our deepened understanding. During this recalibration, researchers often rely on clustering of cancer data to provide an intuitive visual reference that could reveal the intrinsic characteristics of subtypes. The data being clustered are often omics data such as transcriptomics that have strong correlations to the underlying biological mechanism. However, while existing studies have shown promising results, they suffer from issues associated with omics data, namely sample scarcity and high dimensionality, and they impose unrealistic assumptions in order to extract useful features from the data while avoiding overfitting to spurious correlations. METHODS This paper proposes to leverage a recent strong generative model, Vector-Quantized Variational AutoEncoder, to tackle the data issues and extract discrete representations that are crucial to the quality of subsequent clustering by retaining only information relevant to reconstructing the input. RESULTS Extensive experiments and medical analysis on multiple datasets comprising 10 distinct cancers demonstrate that the proposed clustering results can significantly and robustly improve prognosis over prevalent subtyping systems. CONCLUSION Our proposal does not impose strict assumptions on data distribution; moreover, its latent features are better representations of the transcriptomic data in different cancer subtypes, capable of yielding superior clustering performance with any mainstream clustering method.
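The discrete representations mentioned in the METHODS come from a vector-quantization step, in which each continuous encoder output is snapped to its nearest entry in a learned codebook. The following is a minimal sketch of that lookup only, not the authors' implementation: the codebook values, the latent vectors, and the function names `squared_distance` and `quantize` are all hypothetical, and the encoder, decoder, and training losses are omitted.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map every latent vector to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        # Nearest-neighbour search over the codebook (ties resolve to the lowest index).
        best = min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical 2-D codebook with three entries, and three made-up encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
latents = [[0.1, -0.2], [0.9, 1.3], [3.2, 0.4]]

codes, snapped = quantize(latents, codebook)
print(codes)    # discrete code assigned to each sample
print(snapped)  # codebook vectors that replace the continuous latents
```

In a clustering pipeline of the kind the abstract describes, the discrete codes (or the snapped vectors) would then be passed to a standard clustering method; that step is left out here.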
Affiliation(s)
- Zheng Chen
  - Graduate School of Engineering Science, Osaka University, Japan
- Ziwei Yang
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
- Lingwei Zhu
  - Department of Computing Science, University of Alberta, Canada
- Peng Gao
  - Institute for Quantitative Biosciences, University of Tokyo, Japan
- Shigehiko Kanaya
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan; Data Science Center, Nara Institute of Science and Technology, Japan
- Md Altaf-Ul-Amin
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
9
Pombo G, Gray R, Cardoso MJ, Ourselin S, Rees G, Ashburner J, Nachev P. Equitable modelling of brain imaging by counterfactual augmentation with morphologically constrained 3D deep generative models. Med Image Anal 2023; 84:102723. [PMID: 36542907 PMCID: PMC10591114 DOI: 10.1016/j.media.2022.102723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Revised: 11/21/2022] [Accepted: 12/02/2022] [Indexed: 12/12/2022]
Abstract
We describe CounterSynth, a conditional generative model of diffeomorphic deformations that induce label-driven, biologically plausible changes in volumetric brain images. The model is intended to synthesise counterfactual training data augmentations for downstream discriminative modelling tasks where fidelity is limited by data imbalance, distributional instability, confounding, or underspecification, and exhibits inequitable performance across distinct subpopulations. Focusing on demographic attributes, we evaluate the quality of synthesised counterfactuals with voxel-based morphometry, classification and regression of the conditioning attributes, and the Fréchet inception distance. Examining downstream discriminative performance in the context of engineered demographic imbalance and confounding, we use UK Biobank and OASIS magnetic resonance imaging data to benchmark CounterSynth augmentation against current solutions to these problems. We achieve state-of-the-art improvements, both in overall fidelity and equity. The source code for CounterSynth is available at https://github.com/guilherme-pombo/CounterSynth.
Affiliation(s)
- Guilherme Pombo
  - UCL Queen Square Institute of Neurology, University College London, London, UK
- Robert Gray
  - UCL Queen Square Institute of Neurology, University College London, London, UK
- M Jorge Cardoso
  - School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
- Sebastien Ourselin
  - School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
- Geraint Rees
  - UCL Queen Square Institute of Neurology, University College London, London, UK
- John Ashburner
  - UCL Queen Square Institute of Neurology, University College London, London, UK
- Parashkev Nachev
  - UCL Queen Square Institute of Neurology, University College London, London, UK
10
Abstract
Artificial intelligence (AI) tools find increasing application in drug discovery, supporting every stage of the Design-Make-Test-Analyse (DMTA) cycle. The main focus of this chapter is their application in molecular generation with the aid of deep neural networks (DNN). We present a historical overview of the main advances in the field. We analyze the concepts of distribution and goal-directed learning and then highlight some of the recent applications of generative models in drug design, with a focus on research work from the biopharmaceutical industry. We present in more detail REINVENT, an open-source software package developed within our group at AstraZeneca and the main platform for AI molecular design support in a number of the company's medicinal chemistry projects, and we also demonstrate some of our work in library design. Finally, we present some of the main challenges in the application of AI in drug discovery and different approaches to respond to these challenges, which define areas for current and future work.
11
Bhisetti G, Fang C. Artificial Intelligence-Enabled De Novo Design of Novel Compounds that Are Synthesizable. Methods Mol Biol 2022; 2390:409-19. [PMID: 34731479 DOI: 10.1007/978-1-0716-1787-8_17] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/19/2022]
Abstract
Development of computer-aided de novo design methods to discover novel compounds in a speedy manner to treat human diseases has been of interest to drug discovery scientists for the past three decades. In the beginning, the efforts were mostly concentrated to generate molecules that fit the active site of the target protein by sequential building of a molecule atom-by-atom and/or group-by-group while exploring all possible conformations to optimize binding interactions with the target protein. In recent years, deep learning approaches are applied to generate molecules that are iteratively optimized against a binding hypothesis (to optimize potency) and predictive models of drug-likeness (to optimize properties). Synthesizability of molecules generated by these de novo methods remains a challenge. This review will focus on the recent development of synthetic planning methods that are suitable for enhancing synthesizability of molecules designed by de novo methods.
12
Edinburgh T, Smielewski P, Czosnyka M, Cabeleira M, Eglen SJ, Ercole A. DeepClean: Self-Supervised Artefact Rejection for Intensive Care Waveform Data Using Deep Generative Learning. Acta Neurochir Suppl 2021; 131:235-41. [PMID: 33839851 DOI: 10.1007/978-3-030-59436-7_45] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Waveform physiological data are important in the treatment of critically ill patients in the intensive care unit. Such recordings are susceptible to artefacts, which must be removed before the data can be reused for alerting or reprocessed for other clinical or research purposes. Accurate removal of artefacts reduces bias and uncertainty in clinical assessment, as well as the false positive rate of ICU alarms, and is therefore a key component in providing optimal clinical care. In this work, we present DeepClean, a prototype self-supervised artefact detection system using a convolutional variational autoencoder deep neural network that avoids costly and painstaking manual annotation, requiring only easily obtained 'good' data for training. For a test case with invasive arterial blood pressure, we demonstrate that our algorithm can detect the presence of an artefact within a 10s sample of data with sensitivity and specificity around 90%. Furthermore, DeepClean was able to identify regions of artefacts within such samples with high accuracy, and we show that it significantly outperforms a baseline principal component analysis approach in both signal reconstruction and artefact detection. DeepClean learns a generative model and therefore may also be used for imputation of missing data.
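One common way to turn a generative model trained only on 'good' data into an artefact detector is to threshold its reconstruction error: windows the model reconstructs poorly are flagged. The sketch below illustrates that general idea under stated assumptions and is not the DeepClean code. The reconstructions are supplied as plain lists rather than produced by an autoencoder, and the error values, the 0.95 quantile, and the function names are hypothetical.

```python
from typing import List, Sequence


def mse(window: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Mean squared error between a signal window and its reconstruction."""
    return sum((x - r) ** 2 for x, r in zip(window, reconstruction)) / len(window)


def fit_threshold(clean_errors: Sequence[float], quantile: float = 0.95) -> float:
    """Pick a threshold as an empirical quantile of errors measured on clean data."""
    ordered = sorted(clean_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def flag_artefacts(errors: Sequence[float], threshold: float) -> List[bool]:
    """Flag windows whose reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Hypothetical reconstruction errors on artefact-free validation windows.
clean_errors = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.3, 0.95, 1.05, 1.15]
threshold = fit_threshold(clean_errors)

# Two made-up blood pressure windows with stand-in reconstructions;
# the second contains a spike that the reconstruction does not reproduce.
windows = [[80.0, 82.0, 81.0, 79.0], [80.0, 82.0, 81.0, 250.0]]
reconstructions = [[80.0, 81.0, 81.0, 80.0], [80.0, 81.0, 81.0, 82.0]]

errors = [mse(w, r) for w, r in zip(windows, reconstructions)]
flags = flag_artefacts(errors, threshold)
print(flags)
```

The quantile controls the trade-off between sensitivity and specificity; in practice it would be tuned on held-out data rather than fixed in advance.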
13
Arús-Pous J, Blaschke T, Ulander S, Reymond JL, Chen H, Engkvist O. Exploring the GDB-13 chemical space using deep generative models. J Cheminform 2019; 11:20. [PMID: 30868314 PMCID: PMC6419837 DOI: 10.1186/s13321-019-0341-z] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 10/19/2018] [Accepted: 02/26/2019] [Indexed: 11/15/2022] Open
Abstract
Recent applications of recurrent neural networks (RNN) enable training models that sample the chemical space. In this study we train RNN with molecular string representations (SMILES) with a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained with 1 million structures (0.1% of the database) reproduces 68.9% of the entire database after training, when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound and thus we are able to quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space was performed, which shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
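The coupon-collector upper bound can be illustrated with a short calculation. Assume an ideal sampler that draws uniformly, with replacement, from a database of N molecules; after k draws the expected fraction of distinct molecules seen is 1 - (1 - 1/N)^k. The sketch below evaluates this for the figures quoted in the abstract (N = 975 million, k = 2 billion). The uniform sampler and the function name `expected_coverage` are assumptions made for illustration, and the paper's own model may differ in detail.

```python
import math


def expected_coverage(database_size: int, sample_size: int) -> float:
    """Expected fraction of a database seen after uniform sampling with replacement.

    Evaluates 1 - (1 - 1/N)**k using log1p/expm1 for numerical stability.
    """
    return -math.expm1(sample_size * math.log1p(-1.0 / database_size))


N = 975_000_000    # database size quoted in the abstract
k = 2_000_000_000  # number of sampled molecules quoted in the abstract

ideal = expected_coverage(N, k)  # upper bound under the uniform assumption
observed = 0.689                 # coverage reported in the abstract
ratio = observed / ideal         # share of the ideal coverage actually reached

print(f"ideal coverage:    {ideal:.3f}")
print(f"observed / ideal:  {ratio:.3f}")
```

Under these assumptions the ideal sampler covers roughly 87% of the database, so the reported 68.9% corresponds to about four fifths of what a perfectly uniform generator would reach with the same number of samples.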
Affiliation(s)
- Josep Arús-Pous: Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden; Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
- Thomas Blaschke: Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden; Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19C, 53115, Bonn, Germany
- Silas Ulander: Medicinal Chemistry, Cardiovascular, Renal and Metabolism, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden
- Jean-Louis Reymond: Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
- Hongming Chen: Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden
- Ola Engkvist: Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden
|
14
|
Alam M, Vidyaratne L, Iftekharuddin KM. Novel deep generative simultaneous recurrent model for efficient representation learning. Neural Netw 2018; 107:12-22. [PMID: 30143328 DOI: 10.1016/j.neunet.2018.04.020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/06/2018] [Revised: 04/18/2018] [Accepted: 04/19/2018] [Indexed: 11/20/2022]
Abstract
Representation learning plays an important role in building effective deep neural network models. Deep generative probabilistic models have been shown to be efficient in the data representation learning task, which is usually carried out in an unsupervised fashion. Throughout the past decade, there has been almost exclusive focus on the learning algorithms to improve the representation capability of generative models. However, effective data representation requires improvement in both the learning algorithm and the architecture of the generative models. Therefore, improvement to the neural architecture is critical for improved data representation capability of deep generative models. Furthermore, the prevailing classes of deep generative models, such as the deep belief network (DBN), deep Boltzmann machine (DBM) and deep sigmoid belief network (DSBN), are inherently unidirectional and lack the recurrent connections ubiquitous in biological neuronal structures. Introduction of recurrent connections may offer further improvement in the data representation learning performance of deep generative models. Consequently, for the first time in the literature, this work proposes a deep recurrent generative model known as the deep simultaneous recurrent belief network (D-SRBN) to efficiently learn representations from unlabeled data. Experiments on four benchmark datasets (MNIST, Caltech 101 Silhouettes, OCR letters and Omniglot) show that the proposed D-SRBN model achieves superior representation learning performance while utilizing fewer computing resources when compared to four state-of-the-art generative models: DBN, DBM, DSBN and the variational autoencoder (VAE).
|