51
|
Analysis of Application Examples of Differential Privacy in Deep Learning. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:4244040. [PMID: 34745246 PMCID: PMC8564206 DOI: 10.1155/2021/4244040] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/06/2021] [Accepted: 09/08/2021] [Indexed: 11/21/2022]
Abstract
Artificial Intelligence has been widely applied today, and the subsequent privacy leakage problems have also been paid attention to. Attacks such as model inference attacks on deep neural networks can easily extract user information from neural networks. Therefore, it is necessary to protect privacy in deep learning. Differential privacy, as a popular topic in privacy-preserving in recent years, which provides rigorous privacy guarantee, can also be used to preserve privacy in deep learning. Although many articles have proposed different methods to combine differential privacy and deep learning, there are no comprehensive papers to analyze and compare the differences and connections between these technologies. For this purpose, this paper is proposed to compare different differential private methods in deep learning. We comparatively analyze and classify several deep learning models under differential privacy. Meanwhile, we also pay attention to the application of differential privacy in Generative Adversarial Networks (GANs), comparing and analyzing these models. Finally, we summarize the application of differential privacy in deep neural networks.
Collapse
|
52
|
Zuo Z, Watson M, Budgen D, Hall R, Kennelly C, Al Moubayed N. Data Anonymization for Pervasive Health Care: Systematic Literature Mapping Study. JMIR Med Inform 2021; 9:e29871. [PMID: 34652278 PMCID: PMC8556642 DOI: 10.2196/29871] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2021] [Revised: 06/21/2021] [Accepted: 08/02/2021] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Data science offers an unparalleled opportunity to identify new insights into many aspects of human life with recent advances in health care. Using data science in digital health raises significant challenges regarding data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example, the European Union's General Data Protection Regulation (2016) and the United Kingdom's Data Protection Act (2018). For health care providers, legal use of the electronic health record (EHR) is permitted only in clinical care cases. Any other use of the data requires thoughtful considerations of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized to be used for research purposes, with risk assessment for reidentification and utility. Although health care organizations have internal policies defined for information governance, there is a significant lack of practical tools and intuitive guidance about the use of data for research and modeling. Off-the-shelf data anonymization tools are developed frequently, but privacy-related functionalities are often incomparable with regard to use in different problem domains. In addition, tools to support measuring the risk of the anonymized data with regard to reidentification against the usefulness of the data exist, but there are question marks over their efficacy. OBJECTIVE In this systematic literature mapping study, we aim to alleviate the aforementioned issues by reviewing the landscape of data anonymization for digital health care. METHODS We used Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Noteworthy gray literature was also used to initialize the search. We focused on review questions covering 5 bottom-up aspects: basic anonymization operations, privacy models, reidentification risk and usability metrics, off-the-shelf anonymization tools, and the lawful basis for EHR data anonymization. RESULTS We identified 239 eligible studies, of which 60 were chosen for general background information; 16 were selected for 7 basic anonymization operations; 104 covered 72 conventional and machine learning-based privacy models; four and 19 papers included seven and 15 metrics, respectively, for measuring the reidentification risk and degree of usability; and 36 explored 20 data anonymization software tools. In addition, we also evaluated the practical feasibility of performing anonymization on EHR data with reference to their usability in medical decision-making. Furthermore, we summarized the lawful basis for delivering guidance on practical EHR data anonymization. CONCLUSIONS This systematic literature mapping study indicates that anonymization of EHR data is theoretically achievable; yet, it requires more research efforts in practical implementations to balance privacy preservation and usability to ensure more reliable health care applications.
Collapse
Affiliation(s)
- Zheming Zuo
- Department of Computer Science, Durham University, Durham, United Kingdom
| | - Matthew Watson
- Department of Computer Science, Durham University, Durham, United Kingdom
| | - David Budgen
- Department of Computer Science, Durham University, Durham, United Kingdom
| | - Robert Hall
- Cievert Ltd, Newcastle upon Tyne, United Kingdom
| | | | - Noura Al Moubayed
- Department of Computer Science, Durham University, Durham, United Kingdom
| |
Collapse
|
53
|
Caudai C, Galizia A, Geraci F, Le Pera L, Morea V, Salerno E, Via A, Colombo T. AI applications in functional genomics. Comput Struct Biotechnol J 2021; 19:5762-5790. [PMID: 34765093 PMCID: PMC8566780 DOI: 10.1016/j.csbj.2021.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 10/05/2021] [Accepted: 10/05/2021] [Indexed: 12/13/2022] Open
Abstract
We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overthrow astronomy as the paradigmatic representative of big data producer. This has been made possible by huge advancements in the field of high throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist in studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.
Collapse
Affiliation(s)
- Claudia Caudai
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Antonella Galizia
- CNR, Institute of Applied Mathematics and Information Technologies (IMATI), Genoa, Italy
| | - Filippo Geraci
- CNR, Institute for Informatics and Telematics (IIT), Pisa, Italy
| | - Loredana Le Pera
- CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari, Italy
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Veronica Morea
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Emanuele Salerno
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Allegra Via
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Teresa Colombo
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| |
Collapse
|
54
|
Aliberti MJR, Kotwal A, Smith AK, Lee SJ, Banda S, Boscardin WJ. Pre-estimating subsets: A new approach for unavailable predictors in prognostic modeling. J Am Geriatr Soc 2021; 69:2675-2678. [PMID: 34002370 PMCID: PMC8440346 DOI: 10.1111/jgs.17278] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 04/25/2021] [Indexed: 12/30/2022]
Affiliation(s)
- Márlon J. R. Aliberti
- Laboratorio de Investigacao Medica em Envelhecimento (LIM-66), Servico de Geriatria, Hospital das Clinicas HCFMUSP, Faculdade de Medicina, Universidade de Sao Paulo, Brazil
- Research Institute, Hospital Sirio-Libanes, Sao Paulo, Brazil
| | - Ashwin Kotwal
- University of California, San Francisco, Department of Medicine, Division of Geriatrics, San Francisco, CA
- San Francisco Veterans Affairs Medical Center, San Francisco, CA
| | - Alexander K. Smith
- University of California, San Francisco, Department of Medicine, Division of Geriatrics, San Francisco, CA
- San Francisco Veterans Affairs Medical Center, San Francisco, CA
| | - Sei J. Lee
- University of California, San Francisco, Department of Medicine, Division of Geriatrics, San Francisco, CA
- San Francisco Veterans Affairs Medical Center, San Francisco, CA
| | - Snigdha Banda
- University of California, San Francisco, Department of Medicine, Division of Geriatrics, San Francisco, CA
- San Francisco Veterans Affairs Medical Center, San Francisco, CA
| | - W. John Boscardin
- University of California, San Francisco, Department of Medicine, Division of Geriatrics, San Francisco, CA
- San Francisco Veterans Affairs Medical Center, San Francisco, CA
- University of California, San Francisco, Department of Epidemiology and Biostatistics, San Francisco, CA
| |
Collapse
|
55
|
Ficek J, Wang W, Chen H, Dagne G, Daley E. Differential privacy in health research: A scoping review. J Am Med Inform Assoc 2021; 28:2269-2276. [PMID: 34333623 DOI: 10.1093/jamia/ocab135] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2021] [Revised: 06/11/2021] [Accepted: 06/16/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Differential privacy is a relatively new method for data privacy that has seen growing use due its strong protections that rely on added noise. This study assesses the extent of its awareness, development, and usage in health research. MATERIALS AND METHODS A scoping review was conducted by searching for ["differential privacy" AND "health"] in major health science databases, with additional articles obtained via expert consultation. Relevant articles were classified according to subject area and focus. RESULTS A total of 54 articles met the inclusion criteria. Nine articles provided descriptive overviews, 31 focused on algorithm development, 9 presented novel data sharing systems, and 8 discussed appraisals of the privacy-utility tradeoff. The most common areas of health research where differential privacy has been discussed are genomics, neuroimaging studies, and health surveillance with personal devices. Algorithms were most commonly developed for the purposes of data release and predictive modeling. Studies on privacy-utility appraisals have considered economic cost-benefit analysis, low-utility situations, personal attitudes toward sharing health data, and mathematical interpretations of privacy risk. DISCUSSION Differential privacy remains at an early stage of development for applications in health research, and accounts of real-world implementations are scant. There are few algorithms for explanatory modeling and statistical inference, particularly with correlated data. Furthermore, diminished accuracy in small datasets is problematic. Some encouraging work has been done on decision making with regard to epsilon. The dissemination of future case studies can inform successful appraisals of privacy and utility. CONCLUSIONS More development, case studies, and evaluations are needed before differential privacy can see widespread use in health research.
Collapse
Affiliation(s)
- Joseph Ficek
- College of Public Health, University of South Florida, Tampa, Florida, USA
| | - Wei Wang
- Centre for Addiction and Mental Health, Toronto, Ontario, Canada
| | - Henian Chen
- College of Public Health, University of South Florida, Tampa, Florida, USA
| | - Getachew Dagne
- College of Public Health, University of South Florida, Tampa, Florida, USA
| | - Ellen Daley
- College of Public Health, University of South Florida, Tampa, Florida, USA
| |
Collapse
|
56
|
Powell M, Koenecke A, Byrd JB, Nishimura A, Konig MF, Xiong R, Mahmood S, Mucaj V, Bettegowda C, Rose L, Tamang S, Sacarny A, Caffo B, Athey S, Stuart EA, Vogelstein JT. Ten Rules for Conducting Retrospective Pharmacoepidemiological Analyses: Example COVID-19 Study. Front Pharmacol 2021; 12:700776. [PMID: 34393782 PMCID: PMC8357144 DOI: 10.3389/fphar.2021.700776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2021] [Accepted: 06/30/2021] [Indexed: 11/13/2022] Open
Abstract
Since the beginning of the COVID-19 pandemic, pharmaceutical treatment hypotheses have abounded, each requiring careful evaluation. A randomized controlled trial generally provides the most credible evaluation of a treatment, but the efficiency and effectiveness of the trial depend on the existing evidence supporting the treatment. The researcher must therefore compile a body of evidence justifying the use of time and resources to further investigate a treatment hypothesis in a trial. An observational study can provide this evidence, but the lack of randomized exposure and the researcher's inability to control treatment administration and data collection introduce significant challenges. A proper analysis of observational health care data thus requires contributions from experts in a diverse set of topics ranging from epidemiology and causal analysis to relevant medical specialties and data sources. Here we summarize these contributions as 10 rules that serve as an end-to-end introduction to retrospective pharmacoepidemiological analyses of observational health care data using a running example of a hypothetical COVID-19 study. A detailed supplement presents a practical how-to guide for following each rule. When carefully designed and properly executed, a retrospective pharmacoepidemiological analysis framed around these rules will inform the decisions of whether and how to investigate a treatment hypothesis in a randomized controlled trial. This work has important implications for any future pandemic by prescribing what we can and should do while the world waits for global vaccine distribution.
Collapse
Affiliation(s)
- Michael Powell
- Department of Biomedical Engineering, Institute for Computational Medicine, The Johns Hopkins University, Baltimore, MD, United States
| | - Allison Koenecke
- Institute for Computational & Mathematical Engineering, Stanford University, Stanford, CA, United States
| | - James Brian Byrd
- Department of Internal Medicine, Division of Cardiovascular Medicine, University of Michigan Medical School, Ann Arbor, MI, United States
| | - Akihiko Nishimura
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health at Johns Hopkins University, Baltimore, MD, United States
| | - Maximilian F. Konig
- Ludwig Center, Lustgarten Laboratory, Howard Hughes Medical Institute, The Johns Hopkins University School of Medicine, Baltimore, MD, United States
- Division of Rheumatology, Department of Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD, United States
| | - Ruoxuan Xiong
- Graduate School of Business, Stanford University, Stanford, CA, United States
| | | | - Vera Mucaj
- Datavant Inc., San Francisco, CA, United States
| | - Chetan Bettegowda
- Ludwig Center, Lustgarten Laboratory, Howard Hughes Medical Institute, The Johns Hopkins University School of Medicine, Baltimore, MD, United States
- Department of Neurosurgery, The Johns Hopkins University School of Medicine, Baltimore, MD, United States
| | - Liam Rose
- VA Health Economics Resource Center, Palo Alto VA, Menlo Park, CA, United States
| | - Suzanne Tamang
- Department of Biomedical Data Science, Stanford University, Stanford, CA, United States
| | - Adam Sacarny
- Department of Health Policy and Management, Columbia University Mailman School of Public Health, New York, NY, United States
| | - Brian Caffo
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health at Johns Hopkins University, Baltimore, MD, United States
| | - Susan Athey
- Graduate School of Business, Stanford University, Stanford, CA, United States
| | - Elizabeth A. Stuart
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health at Johns Hopkins University, Baltimore, MD, United States
| | - Joshua T. Vogelstein
- Department of Biomedical Engineering, Institute for Computational Medicine, The Johns Hopkins University, Baltimore, MD, United States
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health at Johns Hopkins University, Baltimore, MD, United States
| |
Collapse
|
57
|
Li J, Zhou Y, Jiang X, Natarajan K, Pakhomov SV, Liu H, Xu H. Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J Am Med Inform Assoc 2021; 28:2193-2201. [PMID: 34272955 DOI: 10.1093/jamia/ocab112] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2020] [Revised: 05/09/2021] [Accepted: 06/07/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE : Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. MATERIALS AND METHODS : We implemented 4 state-of-the-art text generation models, namely CharRNN, SegGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities for randomly selected 500 History and Present Illness notes generated from the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. RESULTS : Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora: strict F1 scores of 0.709 and 0.748, respectively, when compared with the NER models trained on natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we also demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than that uses the natural corpus only. CONCLUSIONS : Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering the privacy concern and increasing data availability. Further investigation is needed to apply this technology to practice.
Collapse
Affiliation(s)
- Jianfu Li
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yujia Zhou
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Karthik Natarajan
- Department of Biomedical Informatics, Columbia University, New York, USA
| | | | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
Collapse
|
58
|
Oestreich M, Chen D, Schultze JL, Fritz M, Becker M. Privacy considerations for sharing genomics data. EXCLI JOURNAL 2021; 20:1243-1260. [PMID: 34345236 PMCID: PMC8326502 DOI: 10.17179/excli2021-4002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/19/2021] [Accepted: 07/07/2021] [Indexed: 01/23/2023]
Abstract
An increasing amount of attention has been geared towards understanding the privacy risks that arise from sharing genomic data of human origin. Most of these efforts have focused on issues in the context of genomic sequence data, but the popularity of techniques for collecting other types of genome-related data has prompted researchers to investigate privacy concerns in a broader genomic context. In this review, we give an overview of different types of genome-associated data, their individual ways of revealing sensitive information, the motivation to share them as well as established and upcoming methods to minimize information leakage. We further discuss the concise threats that are being posed, who is at risk, and how the risk level compares to potential benefits, all while addressing the topic in the context of modern technology, methodology, and information sharing culture. Additionally, we will discuss the current legal situation regarding the sharing of genomic data in a selection of countries, evaluating the scope of their applicability as well as their limitations. We will finalize this review by evaluating the development that is required in the scientific field in the near future in order to improve and develop privacy-preserving data sharing techniques for the genomic context.
Collapse
Affiliation(s)
- Marie Oestreich
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
| | - Dingfan Chen
- CISPA Helmholtz Center for Information Security, Saarbrücken, Germany, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
| | - Joachim L. Schultze
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
- Genomics and Immunoregulation, Life & Medical Sciences (LIMES) Institute, University of Bonn, Bonn, Germany, Carl-Troll-Straße 31, 53115 Bonn, Germany
- PRECISE Platform for Single Cell Genomics and Epigenomics at Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE) and the University of Bonn, Germany, Venusberg-Campus 1/99, 53127 Bonn, Germany
| | - Mario Fritz
- CISPA Helmholtz Center for Information Security, Saarbrücken, Germany, Stuhlsatzenhaus 5, 66123 Saarbrücken, Germany
| | - Matthias Becker
- Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Venusberg-Campus 1/99, 53127 Bonn, Germany
| |
Collapse
|
59
|
Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
OBJECTIVE To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n=171) and for all unsuppressed zip codes (n=5,819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION Analyses on the population-level and of densely-tested zip codes (which contained most of the data) were similar between original and synthetically-derived data sets. Analyses of sparsely-tested populations were less similar and had more data suppression. CONCLUSION In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression -an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Collapse
Affiliation(s)
- Jason A. Thomas
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
| | - Randi E. Foraker
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | | | - Philip R.O. Payne
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | - Adam B. Wilcox
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
- UW Medicine, Seattle, WA, USA
| | | |
Collapse
|
60
|
Platzer M, Reutterer T. Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Front Big Data 2021; 4:679939. [PMID: 34268491 PMCID: PMC8276128 DOI: 10.3389/fdata.2021.679939] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2021] [Accepted: 06/04/2021] [Indexed: 11/13/2022] Open
Abstract
AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-sourced software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to closest record with respect to the training data. By showing that the synthetic samples are just as close to the training as to the holdout data, we yield strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these then to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup is made available open-source. The results highlight the need to systematically assess the fidelity just as well as the privacy of these emerging class of synthetic data generators.
Collapse
Affiliation(s)
| | - Thomas Reutterer
- Department of Marketing, WU Vienna University of Economics and Business, Vienna, Austria
| |
Collapse
|
61
|
Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021; 5:493-497. [PMID: 34131324 PMCID: PMC9353344 DOI: 10.1038/s41551-021-00751-8] [Citation(s) in RCA: 172] [Impact Index Per Article: 57.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
The proliferation of synthetic data in artificial intelligence for medicine and healthcare raises concerns about the vulnerabilities of the software and the challenges of current policy.
Collapse
Affiliation(s)
- Richard J Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Ming Y Lu
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Tiffany Y Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Drew F K Williamson
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Faisal Mahmood
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA.
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA.
| |
Collapse
|
62
|
Kasthurirathne SN, Dexter G, Grannis SJ. Generative Adversarial Networks for Creating Synthetic Free-Text Medical Data: A Proposal for Collaborative Research and Re-use of Machine Learning Models. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2021; 2021:335-344. [PMID: 34457148 PMCID: PMC8378601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Restrictions in sharing Patient Health Identifiers (PHI) limit cross-organizational re-use of free-text medical data. We leverage Generative Adversarial Networks (GAN) to produce synthetic unstructured free-text medical data with low re-identification risk, and assess the suitability of these datasets to replicate machine learning models. We trained GAN models using unstructured free-text laboratory messages pertaining to salmonella, and identified the most accurate models for creating synthetic datasets that reflect the informational characteristics of the original dataset. Natural Language Generation metrics comparing the real and synthetic datasets demonstrated high similarity. Decision models generated using these datasets reported high performance metrics. There was no statistically significant difference in performance measures reported by models trained using real and synthetic datasets. Our results inform the use of GAN models to generate synthetic unstructured free-text data with limited re-identification risk, and use of this data to enable collaborative research and re-use of machine learning models.
Collapse
Affiliation(s)
- Suranga N Kasthurirathne
- Regenstrief Institute, Indianapolis, IN, USA
- Indiana University School of Medicine, Indianapolis, IN, USA
| | | | - Shaun J Grannis
- Regenstrief Institute, Indianapolis, IN, USA
- Indiana University School of Medicine, Indianapolis, IN, USA
| |
Collapse
|
63
|
Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 2021; 11:e043497. [PMID: 33863713 PMCID: PMC8055130 DOI: 10.1136/bmjopen-2020-043497] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/06/2020] [Revised: 01/14/2021] [Accepted: 03/18/2021] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVES There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. SETTING Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. PARTICIPANTS There were 1543 patients in the control arm that were included in our analysis. PRIMARY AND SECONDARY OUTCOME MEASURES Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets. RESULTS Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1). CONCLUSIONS The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets. TRIAL REGISTRATION NUMBER NCT00079274.
Collapse
Affiliation(s)
- Zahra Azizi
- Center for Outcomes Research and Evaluation, Faculty of Medicine, McGill University, Montreal, Québec, Canada
| | - Chaoyi Zheng
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Lucy Mosquera
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Louise Pilote
- Medicine, McGill University, Montreal, Québec, Canada
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, Québec, Canada
| | - Khaled El Emam
- Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
| |
Collapse
|
64
|
Morrison D, Harris-Birtill D, Caie PD. Generative Deep Learning in Digital Pathology Workflows. THE AMERICAN JOURNAL OF PATHOLOGY 2021; 191:1717-1723. [PMID: 33838127 DOI: 10.1016/j.ajpath.2021.02.024] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Revised: 02/16/2021] [Accepted: 02/24/2021] [Indexed: 02/07/2023]
Abstract
Many modern histopathology laboratories are in the process of digitizing their workflows. Once images of the tissue exist as digital data, it becomes feasible to research the augmentation or automation of clinical reporting and diagnosis. The application of modern computer vision techniques, based on deep learning, promises systems that can identify pathologies in slide images with a high degree of accuracy. Generative modeling is an approach to machine learning and deep learning that can be used to transform and generate data. It can be applied to a broad range of tasks within digital pathology, including the removal of color and intensity artifacts, the adaption of images in one domain into those of another, and the generation of synthetic digital tissue samples. This review provides an introduction to the topic, considers these applications, and discusses some future directions for generative models within histopathology.
Collapse
Affiliation(s)
- David Morrison
- School of Medicine, University of St Andrews, St Andrews, Scotland; School of Computer Science, University of St Andrews, St Andrews, Scotland; Sir James Mackenzie Institute for Early Diagnosis, University of St Andrews, St Andrews, Scotland.
| | | | - Peter D Caie
- School of Medicine, University of St Andrews, St Andrews, Scotland; Sir James Mackenzie Institute for Early Diagnosis, University of St Andrews, St Andrews, Scotland
| |
Collapse
|
65
|
Gootjes-Dreesbach L, Sood M, Sahay A, Hofmann-Apitius M, Fröhlich H. Variational Autoencoder Modular Bayesian Networks for Simulation of Heterogeneous Clinical Study Data. Front Big Data 2021; 3:16. [PMID: 33693390 PMCID: PMC7931863 DOI: 10.3389/fdata.2020.00016] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2019] [Accepted: 04/16/2020] [Indexed: 11/13/2022] Open
Abstract
In the area of Big Data, one of the major obstacles for the progress of biomedical research is the existence of data “silos” because legal and ethical constraints often do not allow for sharing sensitive patient data from clinical studies across institutions. While federated machine learning now allows for building models from scattered data of the same format, there is still the need to investigate, mine, and understand data of separate and very differently designed clinical studies that can only be accessed within each of the data-hosting organizations. Simulation of sufficiently realistic virtual patients based on the data within each individual organization could be a way to fill this gap. In this work, we propose a new machine learning approach [Variational Autoencoder Modular Bayesian Network (VAMBN)] to learn a generative model of longitudinal clinical study data. VAMBN considers typical key aspects of such data, namely limited sample size coupled with comparable many variables of different numerical scales and statistical properties, and many missing values. We show that with VAMBN, we can simulate virtual patients in a sufficiently realistic manner while making theoretical guarantees on data privacy. In addition, VAMBN allows for simulating counterfactual scenarios. Hence, VAMBN could facilitate data sharing as well as design of clinical trials.
Collapse
Affiliation(s)
| | - Meemansa Sood
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, University of Bonn, Bonn, Germany
| | - Akrishta Sahay
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, University of Bonn, Bonn, Germany
| | - Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, University of Bonn, Bonn, Germany.,UCB Pharma (UCB Biosciences GmbH), Monheim am Rhein, Germany
| |
Collapse
|
66
|
Quer G, Arnaout R, Henne M, Arnaout R. Machine Learning and the Future of Cardiovascular Care: JACC State-of-the-Art Review. J Am Coll Cardiol 2021; 77:300-313. [PMID: 33478654 PMCID: PMC7839163 DOI: 10.1016/j.jacc.2020.11.030] [Citation(s) in RCA: 165] [Impact Index Per Article: 55.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/23/2020] [Revised: 11/12/2020] [Accepted: 11/13/2020] [Indexed: 12/14/2022]
Abstract
The role of physicians has always been to synthesize the data available to them to identify diagnostic patterns that guide treatment and follow response. Today, increasingly sophisticated machine learning algorithms may grow to support clinical experts in some of these tasks. Machine learning has the potential to benefit patients and cardiologists, but only if clinicians take an active role in bringing these new algorithms into practice. The aim of this review is to introduce clinicians who are not data science experts to key concepts in machine learning that will allow them to better understand the field and evaluate new literature and developments. The current published data in machine learning for cardiovascular disease is then summarized, using both a bibliometric survey, with code publicly available to enable similar analysis for any research topic of interest, and select case studies. Finally, several ways that clinicians can and must be involved in this emerging field are presented.
Collapse
Affiliation(s)
- Giorgio Quer
- Scripps Research Translational Institute, La Jolla, California, USA. https://twitter.com/giorgioquer
| | - Ramy Arnaout
- Division of Clinical Pathology, Department of Pathology, Beth Israel Deaconess Medical Center, Beth Israel Lahey Health, Boston, Massachusetts, USA
| | - Michael Henne
- Department of Medicine, Division of Cardiology, University of California, San Francisco, California, USA
| | - Rima Arnaout
- Department of Medicine, Division of Cardiology, Bakar Computational Health Sciences Institute, Center for Intelligent Imaging, University of California, San Francisco, California, USA.
| |
Collapse
|
67
|
Emam KE, Mosquera L, Zheng C. Optimizing the synthesis of clinical trial data using sequential trees. J Am Med Inform Assoc 2021; 28:3-13. [PMID: 33186440 PMCID: PMC7810457 DOI: 10.1093/jamia/ocaa249] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2020] [Accepted: 09/22/2020] [Indexed: 12/13/2022] Open
Abstract
OBJECTIVE With the growing demand for sharing clinical trial data, scalable methods to enable privacy protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data is dependent on the variable order. No assessments of the impact of variable order on synthesized clinical trial data have been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled and implement an optimization algorithm to find a good order if variability is too high. MATERIALS AND METHODS Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm was implemented to optimize variable order, and was compared with a curriculum learning approach to ordering variables. RESULTS As the number of variables in a clinical trial dataset increases, there is a pattern of a marked increase in variability of data utility with order. Particle swarm with a distinguishability hinge loss ensured adequate utility across all 6 datasets. The hinge threshold was selected to avoid overfitting which can create a privacy problem. This was superior to curriculum learning in terms of utility. CONCLUSIONS The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
Collapse
Affiliation(s)
- Khaled El Emam
- School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Electronic Health Information Laboratory, Childrens Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- Replica Analytics Ltd, Ottawa, Ontario, Canada
| | | | | |
Collapse
|
68
|
Chang K, Singh P, Vepakomma P, Poirot MG, Raskar R, Rubin DL, Kalpathy-Cramer J. Privacy-preserving collaborative deep learning methods for multiinstitutional training without sharing patient data. Artif Intell Med 2021. [DOI: 10.1016/b978-0-12-821259-2.00006-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
69
|
Artificial Intelligence and Hypertension Management. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_263-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
70
|
Cronin NJ, Finni T, Seynnes O. Using deep learning to generate synthetic B-mode musculoskeletal ultrasound images. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2020; 196:105583. [PMID: 32544777 DOI: 10.1016/j.cmpb.2020.105583] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/03/2020] [Accepted: 05/29/2020] [Indexed: 06/11/2023]
Abstract
BACKGROUND AND OBJECTIVE Deep learning approaches are common in image processing, but often rely on supervised learning, which requires a large volume of training images, usually accompanied by hand-crafted labels. As labelled data are often not available, it would be desirable to develop methods that allow such data to be compiled automatically. In this study, we used a Generative Adversarial Network (GAN) to generate realistic B-mode musculoskeletal ultrasound images, and tested the suitability of two automated labelling approaches. METHODS We used a model including two GANs each trained to transfer an image from one domain to another. The two inputs were a set of 100 longitudinal images of the gastrocnemius medialis muscle, and a set of 100 synthetic segmented masks that featured two aponeuroses and a random number of 'fascicles'. The model output a set of synthetic ultrasound images and an automated segmentation of each real input image. This automated segmentation process was one of the two approaches we assessed. The second approach involved synthesising ultrasound images and then feeding these images into an ImageJ/Fiji-based automated algorithm, to determine whether it could detect the aponeuroses and muscle fascicles. RESULTS Histogram distributions were similar between real and synthetic images, but synthetic images displayed less variation between samples and a narrower range. Mean entropy values were statistically similar (real: 6.97, synthetic: 7.03; p = 0.218), but the range was much narrower for synthetic images (6.91 - 7.11 versus 6.30 - 7.62). When comparing GAN-derived and manually labelled segmentations, intersection-over-union values- denoting the degree of overlap between aponeurosis labels- varied between 0.0280 - 0.612 (mean ± SD: 0.312 ± 0.159), and pennation angles were higher for the GAN-derived segmentations (25.1° vs. 19.3°; p < 0.001). For the second segmentation approach, the algorithm generally performed equally well on synthetic and real images, yielding pennation angles within the physiological range (13.8-20°). CONCLUSIONS We used a GAN to generate realistic B-mode ultrasound images, and extracted muscle architectural parameters from these images automatically. This approach could enable generation of large labelled datasets for image segmentation tasks, and may also be useful for data sharing. Automatic generation and labelling of ultrasound images minimises user input and overcomes several limitations associated with manual analysis.
Collapse
Affiliation(s)
- Neil J Cronin
- Neuromuscular Research Centre, Faculty of Sport and Health Sciences, University of Jyvaskyla, Finland; Department for Health, Bath University, UK; School of Sport & Exercise, University of Gloucestershire, Gloucestershire, UK.
| | - Taija Finni
- Neuromuscular Research Centre, Faculty of Sport and Health Sciences, University of Jyvaskyla, Finland
| | | |
Collapse
|
71
|
Nußberger J, Boesel F, Lenz S, Binder H, Hess M. Synthetic observations from deep generative models and binary omics data with limited sample size. Brief Bioinform 2020; 22:5917048. [PMID: 33003196 DOI: 10.1093/bib/bbaa226] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2020] [Revised: 08/06/2020] [Accepted: 08/21/2020] [Indexed: 01/19/2023] Open
Abstract
Deep generative models can be trained to represent the joint distribution of data, such as measurements of single nucleotide polymorphisms (SNPs) from several individuals. Subsequently, synthetic observations are obtained by drawing from this distribution. This has been shown to be useful for several tasks, such as removal of noise, imputation, for better understanding underlying patterns, or even exchanging data under privacy constraints. Yet, it is still unclear how well these approaches work with limited sample size. We investigate such settings specifically for binary data, e.g. as relevant when considering SNP measurements, and evaluate three frequently employed generative modeling approaches, variational autoencoders (VAEs), deep Boltzmann machines (DBMs) and generative adversarial networks (GANs). This includes conditional approaches, such as when considering gene expression conditional on SNPs. Recovery of pair-wise odds ratios (ORs) is considered as a primary performance criterion. For simulated as well as real SNP data, we observe that DBMs generally can recover structure for up to 300 variables, with a tendency of over-estimating ORs when not carefully tuned. VAEs generally get the direction and relative strength of pairwise relations right, yet with considerable under-estimation of ORs. GANs provide stable results only with larger sample sizes and strong pair-wise relations in the data. Taken together, DBMs and VAEs (in contrast to GANs) appear to be well suited for binary omics data, even at rather small sample sizes. This opens the way for many potential applications where synthetic observations from omics data might be useful.
Collapse
Affiliation(s)
- Jens Nußberger
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
| | - Frederic Boesel
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
| | - Stefan Lenz
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
| | - Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
| | - Moritz Hess
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Germany
| |
Collapse
|
72
|
Lopez R, Gayoso A, Yosef N. Enhancing scientific discoveries in molecular biology with deep generative models. Mol Syst Biol 2020; 16:e9198. [PMID: 32975352 PMCID: PMC7517326 DOI: 10.15252/msb.20199198] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2019] [Revised: 04/10/2020] [Accepted: 07/09/2020] [Indexed: 12/15/2022] Open
Abstract
Generative models provide a well-established statistical framework for evaluating uncertainty and deriving conclusions from large data sets especially in the presence of noise, sparsity, and bias. Initially developed for computer vision and natural language processing, these models have been shown to effectively summarize the complexity that underlies many types of data and enable a range of applications including supervised learning tasks, such as assigning labels to images; unsupervised learning tasks, such as dimensionality reduction; and out-of-sample generation, such as de novo image synthesis. With this early success, the power of generative models is now being increasingly leveraged in molecular biology, with applications ranging from designing new molecules with properties of interest to identifying deleterious mutations in our genomes and to dissecting transcriptional variability between single cells. In this review, we provide a brief overview of the technical notions behind generative models and their implementation with deep learning techniques. We then describe several different ways in which these models can be utilized in practice, using several recent applications in molecular biology as examples.
Collapse
Affiliation(s)
- Romain Lopez
- Department of Electrical Engineering and Computer SciencesUniversity of CaliforniaBerkeleyCAUSA
| | - Adam Gayoso
- Center for Computational BiologyUniversity of CaliforniaBerkeleyCAUSA
| | - Nir Yosef
- Department of Electrical Engineering and Computer SciencesUniversity of CaliforniaBerkeleyCAUSA
- Center for Computational BiologyUniversity of CaliforniaBerkeleyCAUSA
- Chan‐Zuckerberg BiohubSan FranciscoCAUSA
- Ragon Institute of MGH, MIT, and HarvardCambridgeMAUSA
| |
Collapse
|
73
|
Lee D, Yu H, Jiang X, Rogith D, Gudala M, Tejani M, Zhang Q, Xiong L. Generating sequential electronic health records using dual adversarial autoencoder. J Am Med Inform Assoc 2020; 27:1411-1419. [PMID: 32989459 PMCID: PMC7647348 DOI: 10.1093/jamia/ocaa119] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2019] [Revised: 05/18/2020] [Accepted: 06/16/2020] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE Recent studies on electronic health records (EHRs) started to learn deep generative models and synthesize a huge amount of realistic records, in order to address significant privacy issues surrounding the EHR. However, most of them only focus on structured records about patients' independent visits, rather than on chronological clinical records. In this article, we aim to learn and synthesize realistic sequences of EHRs based on the generative autoencoder. MATERIALS AND METHODS We propose a dual adversarial autoencoder (DAAE), which learns set-valued sequences of medical entities, by combining a recurrent autoencoder with 2 generative adversarial networks (GANs). DAAE improves the mode coverage and quality of generated sequences by adversarially learning both the continuous latent distribution and the discrete data distribution. Using the MIMIC-III (Medical Information Mart for Intensive Care-III) and UT Physicians clinical databases, we evaluated the performances of DAAE in terms of predictive modeling, plausibility, and privacy preservation. RESULTS Our generated sequences of EHRs showed the comparable performances to real data for a predictive modeling task, and achieved the best score in plausibility evaluation conducted by medical experts among all baseline models. In addition, differentially private optimization of our model enables to generate synthetic sequences without increasing the privacy leakage of patients' data. CONCLUSIONS DAAE can effectively synthesize sequential EHRs by addressing its main challenges: the synthetic records should be realistic enough not to be distinguished from the real records, and they should cover all the training patients to reproduce the performance of specific downstream tasks.
Collapse
Affiliation(s)
- Dongha Lee
- Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, South Korea
| | - Hwanjo Yu
- Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, South Korea
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Deevakar Rogith
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Meghana Gudala
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Mubeen Tejani
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Qiuchen Zhang
- Department of Computer Science, Emory University, Atlanta, Georgia, USA
| | - Li Xiong
- Department of Computer Science, Emory University, Atlanta, Georgia, USA
| |
Collapse
|
74
|
Song X, Waitman LR, Hu Y, Luo B, Li F, Liu M. The Impact of Medical Big Data Anonymization on Early Acute Kidney Injury Risk Prediction. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:617-625. [PMID: 32477684 PMCID: PMC7233037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Artificial intelligence enabled medical big data analysis has the potential to revolutionize medical practice from diagnosis and prediction of complex diseases to making recommendations and resource allocation decisions in an evidence-based manner. However, big data comes with big disclosure risks. To preserve privacy, excessive data anonymization is often necessary, leading to significant loss of data utility. In this paper, we develop a systematic data scrubbing procedure for large datasets when key variables are uncertain for re-identification risk assessment and assess the trade-off between anonymization of electronic health record data for sharing in support of open science and performance of machine learning models for early acute kidney injury risk prediction using the data. Results demonstrate that our proposed data scrubbing procedure can maintain good feature diversity and moderate data utility but raises concerns regarding its impact on knowledge discovery capability.
Collapse
Affiliation(s)
- Xing Song
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| | - Lemuel R Waitman
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| | - Yong Hu
- Jinan University, Big Data Decision Institute, Guangzhou, PRC
| | - Bo Luo
- University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, KS, USA
| | - Fengjun Li
- University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, KS, USA
| | - Mei Liu
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| |
Collapse
|
75
|
Abstract
There is a strong positive correlation between the development of deep learning and the amount of public data available. Not all data can be released in their raw form because of the risk to the privacy of the related individuals. The main objective of privacy-preserving data publication is to anonymize the data while maintaining their utility. In this paper, we propose a privacy-preserving semi-generative adversarial network (PPSGAN) that selectively adds noise to class-independent features of each image to enable the processed image to maintain its original class label. Our experiments on training classifiers with synthetic datasets anonymized with various methods confirm that PPSGAN shows better utility than other conventional methods, including blurring, noise-adding, filtering, and generation using GANs.
Collapse
|
76
|
Glicksberg BS, Burns S, Currie R, Griffin A, Wang ZJ, Haussler D, Goldstein T, Collisson E. Blockchain-Authenticated Sharing of Genomic and Clinical Outcomes Data of Patients With Cancer: A Prospective Cohort Study. J Med Internet Res 2020; 22:e16810. [PMID: 32196460 PMCID: PMC7125440 DOI: 10.2196/16810] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2019] [Revised: 12/09/2019] [Accepted: 12/15/2019] [Indexed: 12/21/2022] Open
Abstract
Background Efficiently sharing health data produced during standard care could dramatically accelerate progress in cancer treatments, but various barriers make this difficult. Not sharing these data to ensure patient privacy is at the cost of little to no learning from real-world data produced during cancer care. Furthermore, recent research has demonstrated a willingness of patients with cancer to share their treatment experiences to fuel research, despite potential risks to privacy. Objective The objective of this study was to design, pilot, and release a decentralized, scalable, efficient, economical, and secure strategy for the dissemination of deidentified clinical and genomic data with a focus on late-stage cancer. Methods We created and piloted a blockchain-authenticated system to enable secure sharing of deidentified patient data derived from standard of care imaging, genomic testing, and electronic health records (EHRs), called the Cancer Gene Trust (CGT). We prospectively consented and collected data for a pilot cohort (N=18), which we uploaded to the CGT. EHR data were extracted from both a hospital cancer registry and a common data model (CDM) format to identify optimal data extraction and dissemination practices. Specifically, we scored and compared the level of completeness between two EHR data extraction formats against the gold standard source documentation for patients with available data (n=17). Results Although the total completeness scores were greater for the registry reports than those for the CDM, this difference was not statistically significant. We did find that some specific data fields, such as histology site, were better captured using the registry reports, which can be used to improve the continually adapting CDM. In terms of the overall pilot study, we found that CGT enables rapid integration of real-world data of patients with cancer in a more clinically useful time frame. We also developed an open-source Web application to allow users to seamlessly search, browse, explore, and download CGT data. Conclusions Our pilot demonstrates the willingness of patients with cancer to participate in data sharing and how blockchain-enabled structures can maintain relationships between individual data elements while preserving patient privacy, empowering findings by third-party researchers and clinicians. We demonstrate the feasibility of CGT as a framework to share health data trapped in silos to further cancer research. Further studies to optimize data representation, stream, and integrity are required.
Collapse
Affiliation(s)
- Benjamin Scott Glicksberg
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, United States.,Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, United States.,Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Shohei Burns
- Division of Hematology and Oncology, Department of Medicine, University of California, San Francisco, CA, United States
| | - Rob Currie
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States
| | - Ann Griffin
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, United States
| | - Zhen Jane Wang
- Department of Radiology and Biomedical Imaging, University of California, San Francisco, CA, United States
| | - David Haussler
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States.,Howard Hughes Medical Institute, Santa Cruz, CA, United States
| | - Theodore Goldstein
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, United States.,UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, United States
| | - Eric Collisson
- Division of Hematology and Oncology, Department of Medicine, University of California, San Francisco, CA, United States
| |
Collapse
|
77
|
Into the latent space. NAT MACH INTELL 2020. [DOI: 10.1038/s42256-020-0164-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
78
|
A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though they are commonly used and achieve good performances for prognostic prediction, usually suffer worse performance under multicenter privacy-preserving data mining scenarios compared to a centrally trained version. METHODS AND MATERIALS In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performances on par with-or even better than-centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information. RESULT The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach. CONCLUSION The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
Collapse
|
79
|
Fisher CK, Smith AM, Walsh JR. Machine learning for comprehensive forecasting of Alzheimer's Disease progression. Sci Rep 2019; 9:13622. [PMID: 31541187 PMCID: PMC6754403 DOI: 10.1038/s41598-019-49656-2] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2018] [Accepted: 08/29/2019] [Indexed: 01/09/2023] Open
Abstract
Most approaches to machine learning from electronic health data can only predict a single endpoint. The ability to simultaneously simulate dozens of patient characteristics is a crucial step towards personalized medicine for Alzheimer's Disease. Here, we use an unsupervised machine learning model called a Conditional Restricted Boltzmann Machine (CRBM) to simulate detailed patient trajectories. We use data comprising 18-month trajectories of 44 clinical variables from 1909 patients with Mild Cognitive Impairment or Alzheimer's Disease to train a model for personalized forecasting of disease progression. We simulate synthetic patient data including the evolution of each sub-component of cognitive exams, laboratory tests, and their associations with baseline clinical characteristics. Synthetic patient data generated by the CRBM accurately reflect the means, standard deviations, and correlations of each variable over time to the extent that synthetic data cannot be distinguished from actual data by a logistic regression. Moreover, our unsupervised model predicts changes in total ADAS-Cog scores with the same accuracy as specifically trained supervised models, additionally capturing the correlation structure in the components of ADAS-Cog, and identifies sub-components associated with word recall as predictive of progression.
Collapse
Affiliation(s)
- Charles K Fisher
- Unlearn.AI, Inc., 450 Geary St, San Francisco, CA, 94102, San Francisco, USA.
| | - Aaron M Smith
- Unlearn.AI, Inc., 450 Geary St, San Francisco, CA, 94102, San Francisco, USA
| | - Jonathan R Walsh
- Unlearn.AI, Inc., 450 Geary St, San Francisco, CA, 94102, San Francisco, USA
| |
Collapse
|
80
|
Beaulieu-Jones BK, Wu ZS, Williams C, Lee R, Bhavnani SP, Byrd JB, Greene CS. Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing. Circ Cardiovasc Qual Outcomes 2019; 12:e005122. [PMID: 31284738 PMCID: PMC7041894 DOI: 10.1161/circoutcomes.118.005122] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Supplemental Digital Content is available in the text. Background: Data sharing accelerates scientific progress but sharing individual-level data while preserving patient privacy presents a barrier. Methods and Results: Using pairs of deep neural networks, we generated simulated, synthetic participants that closely resemble participants of the SPRINT trial (Systolic Blood Pressure Trial). We showed that such paired networks can be trained with differential privacy, a formal privacy framework that limits the likelihood that queries of the synthetic participants’ data could identify a real a participant in the trial. Machine learning predictors built on the synthetic population generalize to the original data set. This finding suggests that the synthetic data can be shared with others, enabling them to perform hypothesis-generating analyses as though they had the original trial data. Conclusions: Deep neural networks that generate synthetic participants facilitate secondary analyses and reproducible investigation of clinical data sets by enhancing data sharing while preserving participant privacy.
Collapse
Affiliation(s)
- Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia. (B.K.B.-J.)
| | - Zhiwei Steven Wu
- Computer Science and Electrical Engineering Department, University of Minnesota, Minneapolis (Z.S.W.)
| | - Chris Williams
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia. (C.W., C.S.G.)
| | - Ran Lee
- Division of Cardiovascular Medicine, Department of Medicine, University of Michigan Medical School, Ann Arbor (R.L., J.B.B.)
| | | | - James Brian Byrd
- Division of Cardiovascular Medicine, Department of Medicine, University of Michigan Medical School, Ann Arbor (R.L., J.B.B.)
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia. (C.W., C.S.G.)
| |
Collapse
|