1
Yan C, Zhang Z, Nyemba S, Li Z. Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial. JMIR AI 2024; 3:e52615. [PMID: 38875595 PMCID: PMC11074891 DOI: 10.2196/52615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 01/24/2024] [Accepted: 03/07/2024] [Indexed: 06/16/2024]
Abstract
Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Zhuohang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
2
El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024; 14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] [Open Access]
Abstract
Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results depend on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
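The abstract above refers to combining rules for fully synthetic datasets without stating them. As a rough illustration only, the sketch below pools one regression coefficient across m synthetic datasets using the commonly cited form of those rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance is (1 + 1/m) times the between-dataset variance minus the average within-dataset variance. The function name `combine_fully_synthetic` is hypothetical, and the fallback used when that variance estimate is not positive is an assumption of this sketch, not something taken from the study.

```python
import math
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Pool one coefficient across m fully synthetic datasets.

    estimates: per-dataset point estimates (e.g. a log odds ratio)
    variances: per-dataset squared standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two synthetic datasets")

    q_bar = mean(estimates)      # pooled point estimate
    b = variance(estimates)      # between-dataset variance (sample variance)
    u_bar = mean(variances)      # average within-dataset variance
    t = (1 + 1 / m) * b - u_bar  # total variance for fully synthetic data

    if t <= 0:
        # Assumption of this sketch: the estimator can go negative, so fall
        # back to the average within-dataset variance in that case.
        t = u_bar
    return q_bar, math.sqrt(t)


if __name__ == "__main__":
    est, se = combine_fully_synthetic(
        [0.50, 0.55, 0.45, 0.52, 0.48], [0.001] * 5
    )
    print(f"pooled estimate={est:.3f}, pooled SE={se:.4f}")
```

Analysing a single synthetic dataset corresponds to skipping this pooling step altogether, which is the situation the abstract warns can be misleading.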
Affiliation(s)
- Khaled El Emam
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.
- Replica Analytics, Ottawa, ON, Canada.
- Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada.
- Lucy Mosquera
- Replica Analytics, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Xi Fang
- Replica Analytics, Ottawa, ON, Canada
3
Zhang T, Qu Y, Wang D, Zhong M, Cheng Y, Zhang M. Optimizing sepsis treatment strategies via a reinforcement learning model. Biomed Eng Lett 2024; 14:279-289. [PMID: 38374908 PMCID: PMC10874349 DOI: 10.1007/s13534-023-00343-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 10/28/2023] [Accepted: 11/13/2023] [Indexed: 02/21/2024] [Open Access]
Abstract
Purpose: Existing sepsis treatment lacks an effective reference and relies too heavily on the experience of clinicians. We therefore used a reinforcement learning model to build an assistive model for sepsis medication treatment. Methods: Using the latest Sepsis 3.0 diagnostic criteria, 19,582 sepsis patients were screened from the Medical Information Mart for Intensive Care III (MIMIC-III) database for treatment strategy research, and forty-six features were used in modeling. The medication strategy under study is the dosage of vasopressor drugs and intravenous infusion. Dueling DDQN is proposed to predict the patient's medication strategy (vasopressor and intravenous infusion dosage) from the relationship between the patient's state, the reward function, and the medication action. We also constructed protection against possible high-risk behaviors of Dueling DDQN, in particular sudden dose changes of vasopressors, which can lead to harmful clinical effects. To strengthen the guidance that clinically effective medication strategies give the model, we proposed a hybrid model (safe-dueling DDQN + expert strategies) to optimize medication strategies. Results: The Dueling DDQN medication model for sepsis patients is superior to clinical strategies and other models in terms of off-policy evaluation values and mortality, and reduced the mortality of clinical strategies from 16.8% to 13.8%. Compared with Dueling DDQN, the proposed Safe-Dueling DDQN has an overall reduction in actions involving vasopressors and reduces large dose fluctuations. The proposed hybrid model can switch between expert strategies and safe-dueling DDQN strategies based on the current state of the patient. Conclusions: The proposed reinforcement learning model for sepsis medication treatment has practical clinical value and can improve patient survival to a certain extent while ensuring the balance and safety of medication.
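For readers unfamiliar with the model named in this abstract, the following sketch shows the two generic ingredients that "Dueling DDQN" usually denotes: the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), and the double-DQN target in which the online network selects the next action and the target network evaluates it. The abstract does not give the authors' network, reward, or action discretization, so every name and number here is illustrative.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Subtracting the mean advantage keeps the value/advantage split identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double-DQN bootstrap target for one transition.

    The online network's Q-values choose the next action; the target
    network's Q-value for that action is used as the bootstrap estimate.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


if __name__ == "__main__":
    # Hypothetical numbers for a state with three discrete dose actions.
    print(dueling_q(1.0, [1.0, 2.0, 3.0]))
    print(double_dqn_target(1.0, [0.2, 0.9], [0.5, 0.3], gamma=0.5))
```

The safety constraint on sudden vasopressor dose changes and the switch to expert strategies described in the abstract would sit on top of these Q-values and are not reproduced here.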
Affiliation(s)
- Tianyi Zhang
- School of Health Sciences and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093 China
- Shanghai Interventional Medical Device Engineering Technology Research Center, Shanghai, 200093 China
- Yimeng Qu
- Suzhou Medical College, Suzhou University, Suzhou, 215031 China
- Deyong Wang
- School of Health Sciences and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093 China
- Shanghai Interventional Medical Device Engineering Technology Research Center, Shanghai, 200093 China
- Ming Zhong
- Department of Critical Care Medicine, Zhongshan Hospital Affiliated to Fudan University, Shanghai, 200032 China
- Yunzhang Cheng
- School of Health Sciences and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093 China
- Shanghai Interventional Medical Device Engineering Technology Research Center, Shanghai, 200093 China
- Mingwei Zhang
- School of Health Sciences and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093 China
- Shanghai Interventional Medical Device Engineering Technology Research Center, Shanghai, 200093 China
4
Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Choi H, Cho HN, Kim M, Han J, Kee G, Park S, Lee KH, Jun TJ, Kim YH. LDP-GAN: Generative adversarial networks with local differential privacy for patient medical records synthesis. Comput Biol Med 2024; 168:107738. [PMID: 37995536 DOI: 10.1016/j.compbiomed.2023.107738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 10/31/2023] [Accepted: 11/16/2023] [Indexed: 11/25/2023]
Abstract
Electronic medical records (EMR) have considerable potential to advance healthcare technologies, including medical AI. Nevertheless, because of the privacy issues associated with sharing patients' personal information, it is difficult to utilize them fully. Generative models based on deep learning can address this problem by creating synthetic data similar to real patient data. However, the data used to train these deep learning models are at risk of being leaked through malicious attacks, so traditional deep learning-based generative models cannot completely solve the privacy issues. We therefore propose a method that prevents the leakage of training data by protecting the model from malicious attacks using local differential privacy (LDP). Our method was evaluated in terms of utility and privacy. Experimental results demonstrated that the proposed method can generate medical data with reasonable performance while protecting training data from malicious attacks.
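The abstract does not say which LDP mechanism the authors apply, so the sketch below is not their method. It only illustrates the general idea of local differential privacy with the classic randomized-response mechanism for a single binary attribute: each record is perturbed before it leaves its owner, and an aggregate rate can still be estimated afterwards. The names `randomized_response` and `estimate_rate` are hypothetical.

```python
import math
import random


def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it.

    This satisfies epsilon-local differential privacy for one binary value.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if rng.random() < p_truth else 1 - bit


def estimate_rate(reports, epsilon):
    """Debias the observed rate of 1s to estimate the true rate (epsilon > 0)."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical binary attribute (e.g. presence of a diagnosis code), 30% prevalence.
    truth = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
    noisy = [randomized_response(b, epsilon=1.0, rng=rng) for b in truth]
    print(f"true rate={sum(truth) / len(truth):.3f}, "
          f"estimated from noisy reports={estimate_rate(noisy, 1.0):.3f}")
```

Smaller values of epsilon flip more bits, which strengthens privacy and widens the error of the debiased estimate; this is the utility versus privacy trade-off the abstract evaluates.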
Affiliation(s)
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Imjin Ahn
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Hee Jun Kang
- Division of Cardiology, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Hyeram Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Ha Na Cho
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- JiYe Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Gaeun Kee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Seohyun Park
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Kye Hwa Lee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea.
- Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
5
Lim B, Seth I, Kah S, Sofiadellis F, Ross RJ, Rozen WM, Cuomo R. Using Generative Artificial Intelligence Tools in Cosmetic Surgery: A Study on Rhinoplasty, Facelifts, and Blepharoplasty Procedures. J Clin Med 2023; 12:6524. [PMID: 37892665 PMCID: PMC10607912 DOI: 10.3390/jcm12206524] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/26/2023] [Revised: 10/03/2023] [Accepted: 10/13/2023] [Indexed: 10/29/2023] [Open Access]
Abstract
Artificial intelligence (AI), notably Generative Adversarial Networks, has the potential to transform medical and patient education. Leveraging GANs in medical fields, especially cosmetic surgery, provides a plethora of benefits, including upholding patient confidentiality, ensuring broad exposure to diverse patient scenarios, and democratizing medical education. This study investigated the capacity of AI models, DALL-E 2, Midjourney, and Blue Willow, to generate realistic images pertinent to cosmetic surgery. We combined the generative powers of ChatGPT-4 and Google's BARD with these GANs to produce images of various noses, faces, and eyelids. Four board-certified plastic surgeons evaluated the generated images, eliminating the need for real patient photographs. Notably, generated images predominantly showcased female faces with lighter skin tones, lacking representation of males, older women, and those with a body mass index above 20. The integration of AI in cosmetic surgery offers enhanced patient education and training but demands careful and ethical incorporation to ensure comprehensive representation and uphold medical standards.
Affiliation(s)
- Bryan Lim
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Central Clinical School, Faculty of Medicine, Monash University, Melbourne, VIC 3004, Australia
- Ishith Seth
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Central Clinical School, Faculty of Medicine, Monash University, Melbourne, VIC 3004, Australia
- Skyler Kah
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Foti Sofiadellis
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Richard J. Ross
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Warren M. Rozen
- Department of Plastic and Reconstructive Surgery, Peninsula Health, Frankston, VIC 3199, Australia
- Central Clinical School, Faculty of Medicine, Monash University, Melbourne, VIC 3004, Australia
- Roberto Cuomo
- Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, 53100 Siena, Italy
6
El Kababji S, Mitsakakis N, Fang X, Beltran-Bless AA, Pond G, Vandermeer L, Radhakrishnan D, Mosquera L, Paterson A, Shepherd L, Chen B, Barlow WE, Gralow J, Savard MF, Clemons M, El Emam K. Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets. JCO Clin Cancer Inform 2023; 7:e2300116. [PMID: 38011617 PMCID: PMC10703127 DOI: 10.1200/cci.23.00116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 08/24/2023] [Accepted: 09/19/2023] [Indexed: 11/29/2023] [Open Access]
Abstract
PURPOSE: There is strong interest from patients, researchers, the pharmaceutical industry, medical journal editors, funders of research, and regulators in sharing clinical trial data for secondary analysis. However, data access remains a challenge because of concerns about patient privacy. It has been argued that synthetic data generation (SDG) is an effective way to address these privacy concerns. There is a dearth of evidence supporting this on oncology clinical trial data sets, and on the utility of privacy-preserving synthetic data. The objective of the proposed study is to validate the utility and privacy risks of synthetic clinical trial data sets across multiple SDG techniques. METHODS: We synthesized data sets from eight breast cancer clinical trial data sets using three types of generative models: sequential synthesis, conditional generative adversarial network, and variational autoencoder. Synthetic data utility was evaluated by replicating the published analyses on the synthetic data and assessing concordance of effect estimates and CIs between real and synthetic data. Privacy was evaluated by measuring attribution disclosure risk and membership disclosure risk. RESULTS: Utility was highest using the sequential synthesis method where all results were replicable and the CI overlap most similar or higher for seven of eight data sets. Both types of privacy risks were low across all three types of generative models. DISCUSSION: Synthetic data using sequential synthesis methods can act as a proxy for real clinical trial data sets, and simultaneously have low privacy risks. This type of generative model can be one way to enable broader sharing of clinical trial data.
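The abstract uses CI overlap as a utility measure but does not define it. One widely used definition averages the length of the intersection of the real and synthetic intervals relative to each interval's own length; the sketch below implements that definition as an assumption about what is meant, with a hypothetical function name and made-up intervals.

```python
def ci_overlap(real_ci, synth_ci):
    """Confidence interval overlap between a real and a synthetic estimate.

    Each argument is a (lower, upper) tuple. The result is 1.0 for identical
    intervals, shrinks as they diverge, and is negative when they are disjoint.
    """
    (l_r, u_r), (l_s, u_s) = real_ci, synth_ci
    lower = max(l_r, l_s)   # start of the intersection
    upper = min(u_r, u_s)   # end of the intersection
    shared = upper - lower  # negative if the intervals do not intersect
    return 0.5 * (shared / (u_r - l_r) + shared / (u_s - l_s))


if __name__ == "__main__":
    # Hypothetical 95% CIs for a hazard ratio on real vs. synthetic trial data.
    print(ci_overlap((0.70, 1.10), (0.75, 1.20)))
```

Averaging over both interval lengths penalizes a synthetic interval that covers the real one only by being much wider.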
Affiliation(s)
- Xi Fang
- Replica Analytics Ltd, Ottawa, ON, Canada
- Ana-Alicia Beltran-Bless
- Ottawa Hospital Research Institute, Ottawa, ON, Canada
- Division of Medical Oncology, Department of Medicine, University of Ottawa, ON, Canada
- Greg Pond
- McMaster University, Hamilton, ON, Canada
- Dhenuka Radhakrishnan
- CHEO Research Institute, Ottawa, ON, Canada
- Department of Paediatrics, University of Ottawa, Ottawa, ON, Canada
- Lucy Mosquera
- CHEO Research Institute, Ottawa, ON, Canada
- Replica Analytics Ltd, Ottawa, ON, Canada
- Marie-France Savard
- Ottawa Hospital Research Institute, Ottawa, ON, Canada
- Division of Medical Oncology, Department of Medicine, University of Ottawa, ON, Canada
- Mark Clemons
- Ottawa Hospital Research Institute, Ottawa, ON, Canada
- Division of Medical Oncology, Department of Medicine, University of Ottawa, ON, Canada
- Khaled El Emam
- CHEO Research Institute, Ottawa, ON, Canada
- Replica Analytics Ltd, Ottawa, ON, Canada
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
7
Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun 2023; 14:5305. [PMID: 37652934 PMCID: PMC10471716 DOI: 10.1038/s41467-023-41093-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 08/23/2023] [Indexed: 09/02/2023] [Open Access]
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data, which preserves the statistical properties of real EHRs and can train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring (above 0.9 R² correlation) real EHR data. HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
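The fidelity figure quoted above compares per-code probabilities in real and synthetic data. A minimal sketch of such a comparison follows, under two assumptions that the abstract does not confirm: each patient record is reduced to a set of codes, and "R² correlation" is read as the squared Pearson correlation between the two prevalence vectors. The helper names and the toy records are hypothetical.

```python
def code_prevalence(records, codes):
    """Fraction of records containing each code, in the order given by `codes`."""
    n = len(records)
    return [sum(code in record for record in records) / n for code in codes]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy ** 2 / (sxx * syy)


if __name__ == "__main__":
    codes = ["I10", "E11", "J45"]  # example diagnosis codes
    real = [{"I10", "E11"}, {"I10"}, {"J45"}, {"I10", "J45"}]
    synthetic = [{"I10"}, {"I10", "E11"}, {"I10", "J45"}, {"E11"}]
    print(r_squared(code_prevalence(real, codes), code_prevalence(synthetic, codes)))
```

The same comparison extends to co-occurrence or visit-to-visit conditional probabilities by changing what is counted per record.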
Affiliation(s)
- Brandon Theodorou
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
- Medisyn Inc., Las Vegas, NV, USA
- Cao Xiao
- Medisyn Inc., Las Vegas, NV, USA
- Jimeng Sun
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA.
- Medisyn Inc., Las Vegas, NV, USA.
8
Yin Y. Prediction and analysis of time series data based on granular computing. Front Comput Neurosci 2023; 17:1192876. [PMID: 37576071 PMCID: PMC10413556 DOI: 10.3389/fncom.2023.1192876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 07/06/2023] [Indexed: 08/15/2023] [Open Access]
Abstract
The advent of the Big Data era and the rapid development of the Internet of Things have led to a dramatic increase in the amount of data from various time series. Classification, correlation rule mining, and prediction for these large-sample time series data play a crucial role. However, because sensor data are high-dimensional, large in volume, and subject to transmission lag, large-sample time series data are affected by multiple factors and have complex characteristics such as multi-scale structure, non-linearity, and burstiness. Traditional time series prediction methods are no longer applicable to the study of large-sample time series data. Granular computing has unique advantages in dealing with continuous and complex data, and can compensate for the limitations of traditional support vector machines in dealing with large-sample data. Therefore, this paper proposes to combine granular computing theory with support vector machines to achieve large-sample time series data prediction. First, the definition of time series is analyzed, and the basic principles of traditional time series forecasting methods and granular computing are investigated. Second, to predict the trend of data changes, the fuzzy granulation algorithm is applied to convert the sample data into coarser granules, which are then combined with a support vector machine to predict the range of change of continuous time series data over a period of time. The results of the simulation experiments show that the proposed model is able to make accurate predictions of the range of data changes in future time periods. Compared with other prediction models, the proposed model reduces the complexity of the samples and improves the prediction accuracy.
Affiliation(s)
- Yushan Yin
- School of Electro-Mechanical Engineering, Xidian University, Xi’an, China
9
Azizi Z, Lindner S, Shiba Y, Raparelli V, Norris CM, Kublickiene K, Herrero MT, Kautzky-Willer A, Klimek P, Gisinger T, Pilote L, El Emam K. A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health. Sci Rep 2023; 13:11540. [PMID: 37460705 DOI: 10.1038/s41598-023-38457-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2022] [Accepted: 07/08/2023] [Indexed: 07/20/2023] [Open Access]
Abstract
Sharing health data for research purposes across international jurisdictions has been a challenge due to privacy concerns. Two privacy enhancing technologies that can enable such sharing are synthetic data generation (SDG) and federated analysis, but their relative strengths and weaknesses have not been evaluated thus far. In this study we compared SDG with federated analysis to enable such international comparative studies. The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals. The Canadian data was synthesized and sent to the Austrian team for analysis. The utility of the pooled (synthetic Canadian + real Austrian) dataset was evaluated by comparing the regression results from the two approaches. The privacy of the Canadian synthetic data was assessed using a membership disclosure test which showed an F1 score of 0.001, indicating low privacy risk. The outcome variable of interest was CVH, calculated through a modified CANHEART index. The main and interaction effect parameter estimates of the federated and pooled analyses were consistent and directionally the same. It took approximately one month to set up the synthetic data generation platform and generate the synthetic data, whereas it took over 1.5 years to set up the federated analysis system. Synthetic data generation can be an efficient and effective tool for enabling multi-jurisdictional studies while addressing privacy concerns.
Affiliation(s)
- Zahra Azizi
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
- Simon Lindner
- Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
- Yumika Shiba
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
- Faculty of Medicine, McGill University, Montreal, Canada
- Valeria Raparelli
- Department of Translational Medicine, University of Ferrara, Ferrara, Italy
- Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
- Colleen M Norris
- Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
- Heart and Stroke Strategic Clinical Networks, Alberta Health Services, Alberta, Canada
- Maria Trinidad Herrero
- Clinical & Experimental Neuroscience (NiCE-IMIB-IUIE), School of Medicine, University of Murcia, Murcia, Spain
- Alexandra Kautzky-Willer
- Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
- Peter Klimek
- Section for Science of Complex Systems, CeMSIIS, Medical University of Vienna, Vienna, Austria
- Complexity Science Hub Vienna, Vienna, Austria
- Teresa Gisinger
- Division of Endocrinology and Metabolism, Medical University of Vienna, Vienna, Austria
- Louise Pilote
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada.
- Divisions of Clinical Epidemiology and General Internal Medicine, McGill University Health Centre Research Institute, Montreal, QC, Canada.
- Khaled El Emam
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada.
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.
- Replica Analytics Ltd, Ottawa, ON, Canada.
10
Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT. A method for generating synthetic longitudinal health data. BMC Med Res Methodol 2023; 23:67. [PMID: 36959532 PMCID: PMC10034254 DOI: 10.1186/s12874-023-01869-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 02/19/2023] [Indexed: 03/25/2023] [Open Access]
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), and Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724). The Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
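Most of the utility figures in this abstract are Hellinger distances. For two discrete distributions P and Q over the same categories, the distance is sqrt(0.5 * sum over categories of (sqrt(p) - sqrt(q))^2), which ranges from 0 (identical) to 1 (no shared support). The sketch below computes it for empirical distributions of a categorical variable; the helper names and the toy event sequences are hypothetical and unrelated to the Alberta data.

```python
import math
from collections import Counter


def empirical_distribution(values):
    """Relative frequency of each category in a list of observations."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2)


if __name__ == "__main__":
    # Toy event-type sequences standing in for real and synthetic records.
    real = ["visit", "visit", "lab", "rx", "visit", "lab"]
    synthetic = ["visit", "lab", "lab", "rx", "visit", "visit"]
    print(hellinger(empirical_distribution(real), empirical_distribution(synthetic)))
```

Because the distance is bounded, values for different variables, such as event types and attributes, can be reported on the same scale, as the abstract does.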
Affiliation(s)
- Lucy Mosquera
- Replica Analytics Ltd, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- Khaled El Emam
- Replica Analytics Ltd, Ottawa, ON, Canada.
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada.
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada.
- Lei Ding
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Vishal Sharma
- School of Public Health, University of Alberta, Edmonton, AB, Canada
| | | | - Samer El Kababji
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
| | | | | | - Dan Palfrey
- Institute of Health Economics, Edmonton, Alberta, Canada
| | - Linglong Kong
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Bei Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
| | - Dean T Eurich
- School of Public Health, University of Alberta, Edmonton, AB, Canada
| |
Collapse
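The Hellinger distance that the abstract above reports for event types, attributes, and joint distributions can be illustrated for a single categorical variable. The following is a minimal sketch, not the authors' implementation; the function names `to_distribution` and `hellinger` and the toy event lists are hypothetical.

```python
import math


def to_distribution(values):
    """Turn a list of categorical values into a dict of category -> relative frequency."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    return {k: c / n for k, c in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (dicts of probabilities).

    H(p, q) = sqrt(0.5 * sum_k (sqrt(p_k) - sqrt(q_k))^2), which lies in [0, 1].
    Categories missing from one side are treated as probability 0.
    """
    keys = set(p) | set(q)
    total = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(0.5 * total)


# Toy event-type columns (illustrative values only, not from the study).
real_events = ["visit", "visit", "lab", "prescription", "visit", "lab"]
synthetic_events = ["visit", "lab", "lab", "prescription", "visit", "visit"]

distance = hellinger(to_distribution(real_events), to_distribution(synthetic_events))
print(f"Hellinger distance for event types: {distance:.4f}")
```

A distance of 0 means the two empirical distributions are identical and 1 means they share no categories, which is why small values such as the 0.027 reported for event types are read as close agreement.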
|
11
|
Theodorou B, Xiao C, Sun J. Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model. Research Square 2023:rs.3.rs-2644725. [PMID: 36945542 PMCID: PMC10029081 DOI: 10.21203/rs.3.rs-2644725/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity and granular EHR data in its original, high-dimensional form poses challenges for existing methods due to the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHRs, which preserves the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. Our HALO method, designed as a hierarchical autoregressive model, generates a probability density function of medical codes, clinical visits, and patient records, allowing for the generation of realistic EHR data in its original, unaggregated form without the need for variable selection or aggregation. Our model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conducted extensive experiments and demonstrate that HALO can generate high-fidelity EHR data with high-dimensional disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000), achieving above 0.9 R² correlation in comparison to real EHR data. In comparison to the leading baseline, HALO improves predictive modeling by over 17% in its predictive accuracy and perplexity on a held-out test set of real EHR data. This performance then enables downstream ML models trained on its synthetic data to achieve comparable accuracy to models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, using a combination of real and synthetic data enhances the accuracy of ML models beyond that achieved by using only real EHR data.
|
12
|
Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A Multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022; 13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022]
Abstract
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Affiliation(s)
- Chao Yan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Yao Yan
  - Sage Bionetworks, Seattle, WA, USA
- Zhiyu Wan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ziqi Zhang
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Larsson Omberg
  - Sage Bionetworks, Seattle, WA, USA
- Justin Guinney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
  - Tempus Labs, Chicago, IL, USA
- Sean D. Mooney
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Bradley A. Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
|
13
|
Tuan Soh TY, Nik Mohd Rosdy NMM, Mohd Yusof MYP, Azhar Hilmy SH, Md Sabri BA. Adoption of a Digital Patient Health Passport as Part of a Primary Healthcare Service Delivery: Systematic Review. J Pers Med 2022; 12:jpm12111814. [PMID: 36579540 PMCID: PMC9694834 DOI: 10.3390/jpm12111814] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2022] [Revised: 10/21/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
The utilization of digital personal health records is considered to be appropriate for present-time usage; it is expected to further enhance the quality of primary care service delivery. Despite numerous studies conducted on digital personal health records, efforts in a systematic evaluation of the topic have failed to establish the specific benefits gained by patients, health providers, and healthcare systems. This study aimed to conduct a systematic review regarding the impact of digital personal health records in relation to the delivery of primary care. The review methods included five methodological elements that were directed by the PRISMA 2020 review protocol. Over a time period of 10 years (2011-2021), 2492 articles were retrieved from various established databases, including Scopus, Web of Science, PubMed, EBSCO-Medline, and Google Scholar, and based on reference mining. The Mixed Method Appraisal Tool (MMAT) was used for quality appraisal. A thematic analysis was performed to develop the themes in this study. The thematic analysis performed on 13 articles resulted in seven main themes, which were empowering the patient, helping with communication, improving relationships, improving the quality of care, maintaining health records, sharing records, and saving time. We concluded the study by expanding the seven themes into 26 sub-themes, each of which served as an answer to the main research question that prompted this systematic review.
Affiliation(s)
- Tuan Yuswana Tuan Soh
  - Centre of Population Oral Health and Clinical Prevention, Faculty of Dentistry, Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
- Nik Mohd Mazuan Nik Mohd Rosdy
  - Centre of Oral and Maxillofacial Diagnostics & Medicine Studies, Faculty of Dentistry, Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
- Mohd Yusmiaidil Putera Mohd Yusof
  - Centre of Oral and Maxillofacial Diagnostics & Medicine Studies, Faculty of Dentistry, Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
  - Institute of Pathology, Laboratory and Forensic Medicine (I-PPerForM), Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
- Syathirah Hanim Azhar Hilmy
  - Centre of Population Oral Health and Clinical Prevention, Faculty of Dentistry, Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
- Budi Aslinie Md Sabri
  - Centre of Population Oral Health and Clinical Prevention, Faculty of Dentistry, Universiti Teknologi MARA, Sungai Buloh 47000, Selangor, Malaysia
  - Correspondence: Tel.: +60-3-61266586 or +60-1-23939692; Fax: +60-3-61266103
|
14
|
El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022; 5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022]
Abstract
BACKGROUND One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy with which an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and it is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. RESULTS The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS Our proposed parameterization, as well as interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
Affiliation(s)
- Khaled El Emam
  - Corresponding Author: Khaled El Emam, PhD, Research Institute, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada
- Lucy Mosquera
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
  - Research Institute, Children’s Hospital of Eastern Ontario, Ottawa, Ontario, Canada
- Xi Fang
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
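The abstract above states that the membership disclosure F1 score depends on the proportion of training records placed in the attack dataset. One way to see why is to look at a trivial adversary that labels every attack record as a member. This is a hedged illustration only, not the attack or the partitioning method used in the study; the helper names `f1_score` and `always_member_f1` and the parameter `t` (the fraction of attack records that are true members) are assumptions made for this sketch.

```python
def f1_score(true_member, predicted_member):
    """F1 score of membership predictions against ground-truth membership flags."""
    pairs = list(zip(true_member, predicted_member))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def always_member_f1(t):
    """Closed-form F1 of an adversary that claims membership for every record.

    With a fraction t of true members, precision = t and recall = 1,
    so F1 = 2t / (1 + t).
    """
    return 2 * t / (1 + t)


# Attack dataset with 100 records; vary the fraction drawn from the training data.
for t in (0.5, 0.1):
    n_members = int(100 * t)
    truth = [True] * n_members + [False] * (100 - n_members)
    guesses = [True] * 100  # trivial adversary: everyone is a member
    print(t, round(f1_score(truth, guesses), 4), round(always_member_f1(t), 4))
```

Even this uninformed adversary obtains a very different F1 at a 0.5 partition than at a 0.1 partition, so an F1 value is only interpretable relative to the member fraction used to build the attack dataset.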
|
15
|
Zhang Z, Yan C, Malin BA. Keeping synthetic patients on track: feedback mechanisms to mitigate performance drift in longitudinal health data simulation. J Am Med Inform Assoc 2022; 29:1890-1898. [PMID: 35927974 PMCID: PMC9552284 DOI: 10.1093/jamia/ocac131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Revised: 06/25/2022] [Accepted: 07/22/2022] [Indexed: 11/13/2022]
Abstract
OBJECTIVE Synthetic data are increasingly relied upon to share electronic health record (EHR) data while maintaining patient privacy. Current simulation methods can generate longitudinal data, but the results are unreliable for several reasons. First, the synthetic data drifts from the real data distribution over time. Second, the typical approach to quality assessment, which is based on the extent to which real records can be distinguished from synthetic records using a critic model, often fails to recognize poor simulation results. In this article, we introduce a longitudinal simulation framework, called LS-EHR, which addresses these issues. MATERIALS AND METHODS LS-EHR enhances simulation through conditional fuzzing and regularization, rejection sampling, and prior knowledge embedding. We compare LS-EHR to the state-of-the-art using data from 60 000 EHRs from Vanderbilt University Medical Center (VUMC) and the All of Us Research Program. We assess discrimination between real and synthetic data over time. We evaluate the generation process and critic model using the area under the receiver operating characteristic curve (AUROC). For the critic, a higher value indicates a more robust model for quality assessment. For the generation process, a lower value indicates better synthetic data quality. RESULTS The LS-EHR critic improves discrimination AUROC from 0.655 to 0.909 and 0.692 to 0.918 for VUMC and All of Us data, respectively. By using the new critic, the LS-EHR generation model reduces the AUROC from 0.909 to 0.758 and 0.918 to 0.806. CONCLUSION LS-EHR can substantially improve the usability of simulated longitudinal EHR data.
Affiliation(s)
- Ziqi Zhang
  - Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Chao Yan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Bradley A Malin
  - Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
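The abstract above evaluates both the critic and the generator with the AUROC of a real-versus-synthetic discrimination task. A minimal sketch of that quantity, assuming a critic that outputs a "looks real" score per record, is given below. The function name `auroc` and the toy scores are hypothetical, and the pairwise formula is a standard rank-based definition rather than the authors' code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a real record, 0 for a synthetic record.
    scores: critic output, higher meaning "more likely real".
    Returns the probability that a random real record outscores a random
    synthetic record, counting ties as one half.
    """
    real = [s for l, s in zip(labels, scores) if l == 1]
    synthetic = [s for l, s in zip(labels, scores) if l == 0]
    if not real or not synthetic:
        raise ValueError("need at least one real and one synthetic record")
    wins = 0.0
    for r in real:
        for s in synthetic:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(real) * len(synthetic))


# Toy critic scores (illustrative only): two real and two synthetic records.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(f"discrimination AUROC: {auroc(labels, scores):.2f}")
```

Under this reading, a value near 0.5 means the critic cannot tell synthetic records from real ones. This matches the abstract's convention that a higher AUROC is better when judging the critic and a lower AUROC is better when judging the generator against a fixed critic.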
|
16
|
Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022; 29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022]
Abstract
OBJECTIVE This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION Analyses on the population-level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A Thomas
  - Corresponding Author: Jason A. Thomas, PhD, Philips North America, LLC, 22100 Bothell Everett Hwy, Bothell, WA 98021, USA
- Randi E Foraker
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Jon D Morrow
  - MDClone Ltd., Be’er Sheva, Israel
  - Department of Obstetrics and Gynecology, New York University Grossman School of Medicine, New York, New York, USA
- Philip R O Payne
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Adam B Wilcox
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
|
17
|
El Emam K, Mosquera L, Fang X, El-Hussuna A. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022; 10:e35734. [PMID: 35389366 PMCID: PMC9030990 DOI: 10.2196/35734] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/15/2021] [Revised: 01/27/2022] [Accepted: 02/13/2022] [Indexed: 01/06/2023]
Abstract
Background A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. Results The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. Conclusions This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
- Lucy Mosquera
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
- Xi Fang
  - Replica Analytics Ltd, Ottawa, ON, Canada
|
18
|
Zhang Z, Yan C, Malin BA. Membership inference attacks against synthetic health data. J Biomed Inform 2022; 125:103977. [PMID: 34920126 PMCID: PMC8766950 DOI: 10.1016/j.jbi.2021.103977] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/15/2021] [Revised: 11/17/2021] [Accepted: 12/08/2021] [Indexed: 01/03/2023]
Abstract
Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
Affiliation(s)
- Ziqi Zhang
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240 (corresponding author)
- Chao Yan
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240
- Bradley A. Malin
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240
  - Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN 37240
|
19
|
Foomani FH, Anisuzzaman DM, Niezgoda J, Niezgoda J, Guns W, Gopalakrishnan S, Yu Z. Synthesizing time-series wound prognosis factors from electronic medical records using generative adversarial networks. J Biomed Inform 2021; 125:103972. [PMID: 34920125 DOI: 10.1016/j.jbi.2021.103972] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 09/20/2021] [Accepted: 12/03/2021] [Indexed: 11/26/2022]
Abstract
Wound prognostic models not only provide an estimate of wound healing time to motivate patients to follow up their treatments but also can help clinicians to decide whether to use a standard care or adjuvant therapies and to assist them with designing clinical trials. However, collecting prognosis factors from Electronic Medical Records (EMR) of patients is challenging due to privacy, sensitivity, and confidentiality. In this study, we developed time series medical generative adversarial networks (GANs) to generate synthetic wound prognosis factors using very limited information collected during routine care in a specialized wound care facility. The generated prognosis variables are used in developing a predictive model for chronic wound healing trajectory. Our novel medical GAN can produce both continuous and categorical features from EMR. Moreover, we applied temporal information to our model by considering data collected from the weekly follow-ups of patients. Conditional training strategies were utilized to enhance training and generate classified data in terms of healing or non-healing. The ability of the proposed model to generate realistic EMR data was evaluated by TSTR (test on the synthetic, train on the real), discriminative accuracy, and visualization. We utilized samples generated by our proposed GAN in training a prognosis model to demonstrate its real-life application. Using the generated samples in training predictive models improved the classification accuracy by 6.66-10.01% compared to the previous EMR-GAN. Additionally, the suggested prognosis classifier has achieved the area under the curve (AUC) of 0.875, 0.810, and 0.647 when training the network using data from the first three visits, first two visits, and first visit, respectively. These results indicate a significant improvement in wound healing prediction compared to the previous prognosis models.
Affiliation(s)
- Farnaz H Foomani
  - Department of Electrical Engineering, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
- D M Anisuzzaman
  - Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
- William Guns
  - AZH Wound and Vascular Center, Milwaukee, WI, United States
- Zeyun Yu
  - Department of Electrical Engineering, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
  - Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
|
20
|
Engr YS, Lalande A, Afilalo J, Jodoin PM. Generative Adversarial Networks in Cardiology. Can J Cardiol 2021; 38:196-203. [PMID: 34780990 DOI: 10.1016/j.cjca.2021.11.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/17/2021] [Revised: 11/04/2021] [Accepted: 11/08/2021] [Indexed: 01/18/2023]
Abstract
Generative Adversarial Networks (GANs) are state-of-the-art neural network models used to synthesize images and other data. GANs brought a considerable improvement to the quality of synthetic data, quickly becoming the standard for data generation tasks. In this work, we summarize the applications of GANs in the field of cardiology, including generation of realistic cardiac images, electrocardiography signals, and synthetic electronic health records. The utility of GAN-generated data is discussed with respect to research, clinical care, and academia. Moreover, we present illustrative examples of our GAN-generated cardiac magnetic resonance and echocardiography images, showing the evolution in image quality across six different models, which has become almost indistinguishable from real images. Finally, we discuss future applications, such as modality translation or patient trajectory modeling. Moreover, we discuss the pending challenges that GANs need to overcome, namely their training dynamics, the medical fidelity or the data regulations and ethics questions, to become integrated in cardiology workflows.
Affiliation(s)
- Alain Lalande
  - Laboratoire ImVIA, Université de Bourgogne, 64 rue Sully, 21000 Dijon, France
  - Medical Imaging Department, University Hospital of Dijon, 1 Bld Jeanne d'Arc, 21079, Dijon, France
- Jonathan Afilalo
  - Jewish General Hospital, McGill University, 3755 Côte Ste-Catherine Road, Montreal, Qc, Canada, H3T 1E2
- Pierre-Marc Jodoin
  - Université de Sherbrooke, 2500 Boul. de l'Universite, Sherbrooke, Qc, Canada, J1K 2R1
|
21
|
Daniel C, Bellamine A, Kalra D. Key Contributions in Clinical Research Informatics. Yearb Med Inform 2021; 30:233-238. [PMID: 34479395 PMCID: PMC8416193 DOI: 10.1055/s-0041-1726514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Objectives:
To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2020.
Method:
A bibliographic search using a combination of Medical Subject Headings (MeSH) descriptors and free-text terms on CRI was performed using PubMed, followed by a double-blind review in order to select a list of candidate best papers to be then peer-reviewed by external reviewers. After peer-review ranking, a consensus meeting between two section editors and the editorial team was organized to finally conclude on the selected four best papers.
Results:
Among the 877 papers published in 2020 and returned by the search, there were four best papers selected. The first best paper describes a method for mining temporal sequences from clinical documents to infer disease trajectories and enhancing high-throughput phenotyping. The authors of the second best paper demonstrate that the generation of synthetic Electronic Health Record (EHR) data through Generative Adversarial Networks (GANs) could be substantially improved by more appropriate training and evaluation criteria. The third best paper offers an efficient advance on methods to detect adverse drug events by computer-assisting expert reviewers with annotated candidate mentions in clinical documents. The large-scale data quality assessment study reported by the fourth best paper has clinical research informatics implications, in terms of the trustworthiness of inferences made from analysing electronic health records.
Conclusions:
The most significant research efforts in the CRI field are currently focusing on data science with active research in the development and evaluation of Artificial Intelligence/Machine Learning (AI/ML) algorithms based on ever more intensive use of real-world data and especially EHR real or synthetic data. A major lesson that the coronavirus disease 2019 (COVID-19) pandemic has already taught the scientific CRI community is that timely international high-quality data-sharing and collaborative data analysis is absolutely vital to inform policy decisions.
Affiliation(s)
- Christel Daniel
  - Information Technology Department, AP-HP, F-75012 Paris, France
  - Sorbonne University, University Paris 13, Sorbonne Paris Cité, INSERM UMR_S 1142, LIMICS, F-75006 Paris, France
- Ali Bellamine
  - Information Technology Department, AP-HP, F-75012 Paris, France
|
22
|
Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). medRxiv 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
OBJECTIVE: To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n=171) and for all unsuppressed zip codes (n=5,819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A. Thomas
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
| | - Randi E. Foraker
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | | | - Philip R.O. Payne
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | - Adam B. Wilcox
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
- UW Medicine, Seattle, WA, USA
|
23
|
Zhang Z, Yan C, Lasko TA, Sun J, Malin BA. SynTEG: a framework for temporal structured electronic health data simulation. J Am Med Inform Assoc 2021; 28:596-604. [PMID: 33277896 PMCID: PMC7936402 DOI: 10.1093/jamia/ocaa262] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/19/2020] [Accepted: 10/06/2020] [Indexed: 12/16/2022]
Abstract
OBJECTIVE: Simulating electronic health record data offers an opportunity to resolve the tension between data sharing and patient privacy. Recent techniques based on generative adversarial networks have shown promise but neglect the temporal aspect of healthcare. We introduce a generative framework for simulating the trajectory of patients' diagnoses and measures to evaluate utility and privacy. MATERIALS AND METHODS: The framework simulates date-stamped diagnosis sequences based on a 2-stage process that 1) sequentially extracts temporal patterns from clinical visits and 2) generates synthetic data conditioned on the learned patterns. We designed 3 utility measures to characterize the extent to which the framework maintains feature correlations and temporal patterns in clinical events. We evaluated the framework with billing codes, represented as phenome-wide association study codes (phecodes), from over 500 000 Vanderbilt University Medical Center electronic health records. We further assessed the privacy risks based on membership inference and attribute disclosure attacks. RESULTS: The simulated temporal sequences exhibited similar characteristics to real sequences on the utility measures. Notably, diagnosis prediction models based on real versus synthetic temporal data exhibited an average relative difference in area under the ROC curve of 1.6% with standard deviation of 3.8% for 1276 phecodes. Additionally, the relative differences in the mean occurrence age and time between visits were 4.9% and 4.2%, respectively. The privacy risks in synthetic data, with respect to membership and attribute inference, were negligible. CONCLUSION: This investigation indicates that temporal diagnosis code sequences can be simulated in a manner that provides utility and respects privacy.
Affiliation(s)
- Ziqi Zhang
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
| | - Chao Yan
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
| | - Thomas A Lasko
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Jimeng Sun
- Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, Illinois, USA
| | - Bradley A Malin
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
|
24
|
Chu J, Chen J, Chen X, Dong W, Shi J, Huang Z. Knowledge-aware multi-center clinical dataset adaptation: Problem, method, and application. J Biomed Inform 2021; 115:103710. [PMID: 33581323 DOI: 10.1016/j.jbi.2021.103710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2020] [Revised: 02/05/2021] [Accepted: 02/06/2021] [Indexed: 11/30/2022]
Abstract
Adaptable utilization of clinical data collected from multiple centers, prompted by the need to overcome the shifts between the dataset distributions, and exploit these different datasets for potential clinical applications, has received significant attention in recent years. In this study, we propose a novel approach to this task by infusing an external knowledge graph (KG) into multi-center clinical data mining. Specifically, we propose an adversarial learning model to capture shared patient feature representations from multi-center heterogeneous clinical datasets, and employ an external KG to enrich the semantics of the patient sample by providing both clinical center-specific and center-general knowledge features, which are trained with a graph convolutional autoencoder. We evaluate the proposed model on a real clinical dataset extracted from the general cardiology wards of a Chinese hospital and a well-known public clinical dataset (MIMIC III, pertaining to ICU clinical settings) for the task of predicting acute kidney injury in patients with heart failure. The achieved experimental results demonstrate the efficacy of our proposed model.
Affiliation(s)
- Jiebin Chu
- College of Biomedical Engineering and Instrument Science, Zhejiang University, China
| | - Jinbiao Chen
- College of Biomedical Engineering and Instrument Science, Zhejiang University, China
| | - Xiaofang Chen
- College of Biomedical Engineering and Instrument Science, Zhejiang University, China
| | - Wei Dong
- Department of Cardiology, Chinese PLA General Hospital, China
| | - Jinlong Shi
- Department of Medical Innovation Research, Medical Big Data Center, Chinese PLA General Hospital, China
| | - Zhengxing Huang
- College of Biomedical Engineering and Instrument Science, Zhejiang University, China
|
25
|
Yan C, Zhang Z, Nyemba S, Malin BA. Generating Electronic Health Records with Multiple Data Types and Constraints. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2021; 2020:1335-1344. [PMID: 33936510 PMCID: PMC8075510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures or vital signs), and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints and associated patterns from real data, without sacrificing privacy.
Affiliation(s)
- Chao Yan
- Vanderbilt University, Nashville, TN
| | | | - Steve Nyemba
- Vanderbilt University Medical Center, Nashville, TN
| | - Bradley A Malin
- Vanderbilt University, Nashville, TN
- Vanderbilt University Medical Center, Nashville, TN
|
26
|
El Emam K, Mosquera L, Jonker E, Sood H. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open 2021; 4:ooab012. [PMID: 33709065 PMCID: PMC7936723 DOI: 10.1093/jamiaopen/ooab012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/13/2020] [Revised: 02/01/2021] [Accepted: 02/10/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. METHODS: A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data. RESULTS: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941-0.948] and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936-0.944) and 0.313 (95% CI, 0.286-0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low. CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
Affiliation(s)
- Khaled El Emam
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
- Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Lucy Mosquera
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Elizabeth Jonker
- Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
| | - Harpreet Sood
- London School of Economics, London, UK
- National Health Service, London, UK
|
27
|
El Emam K, Mosquera L, Bass J. Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation. J Med Internet Res 2020; 22:e23139. [PMID: 33196453 PMCID: PMC7704280 DOI: 10.2196/23139] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/02/2020] [Revised: 09/02/2020] [Accepted: 10/10/2020] [Indexed: 01/13/2023]
Abstract
BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
Affiliation(s)
- Khaled El Emam
- School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
- Replica Analytics Ltd, Ottawa, ON, Canada
| | | | - Jason Bass
- Replica Analytics Ltd, Ottawa, ON, Canada
|
28
|
Lee D, Yu H, Jiang X, Rogith D, Gudala M, Tejani M, Zhang Q, Xiong L. Generating sequential electronic health records using dual adversarial autoencoder. J Am Med Inform Assoc 2020; 27:1411-1419. [PMID: 32989459 PMCID: PMC7647348 DOI: 10.1093/jamia/ocaa119] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/05/2019] [Revised: 05/18/2020] [Accepted: 06/16/2020] [Indexed: 11/12/2022]
Abstract
OBJECTIVE: Recent studies on electronic health records (EHRs) have started to learn deep generative models and synthesize a huge amount of realistic records, in order to address significant privacy issues surrounding the EHR. However, most of them only focus on structured records about patients' independent visits, rather than on chronological clinical records. In this article, we aim to learn and synthesize realistic sequences of EHRs based on the generative autoencoder. MATERIALS AND METHODS: We propose a dual adversarial autoencoder (DAAE), which learns set-valued sequences of medical entities, by combining a recurrent autoencoder with 2 generative adversarial networks (GANs). DAAE improves the mode coverage and quality of generated sequences by adversarially learning both the continuous latent distribution and the discrete data distribution. Using the MIMIC-III (Medical Information Mart for Intensive Care-III) and UT Physicians clinical databases, we evaluated the performance of DAAE in terms of predictive modeling, plausibility, and privacy preservation. RESULTS: Our generated sequences of EHRs showed performance comparable to real data for a predictive modeling task, and achieved the best score among all baseline models in a plausibility evaluation conducted by medical experts. In addition, differentially private optimization of our model enables the generation of synthetic sequences without increasing the privacy leakage of patients' data. CONCLUSIONS: DAAE can effectively synthesize sequential EHRs by addressing the main challenges of the task: the synthetic records should be realistic enough not to be distinguished from the real records, and they should cover all the training patients to reproduce the performance of specific downstream tasks.
Affiliation(s)
- Dongha Lee
- Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, South Korea
| | - Hwanjo Yu
- Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, South Korea
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Deevakar Rogith
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Meghana Gudala
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Mubeen Tejani
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Qiuchen Zhang
- Department of Computer Science, Emory University, Atlanta, Georgia, USA
| | - Li Xiong
- Department of Computer Science, Emory University, Atlanta, Georgia, USA
|
29
|
Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol 2020; 20:108. [PMID: 32381039 PMCID: PMC7204018 DOI: 10.1186/s12874-020-00977-1] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Received: 07/29/2019] [Accepted: 04/13/2020] [Indexed: 01/12/2023]
Abstract
BACKGROUND: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. METHODS: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. RESULTS: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. CONCLUSIONS: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.
Affiliation(s)
- Andre Goncalves
- Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA, USA
| | - Priyadip Ray
- Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA, USA
| | - Braden Soper
- Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA, USA
| | - Jennifer Stevens
- Information Management Systems, 1455 Research Blvd, Suite 315, Rockville, MD, USA
| | - Linda Coyle
- Information Management Systems, 1455 Research Blvd, Suite 315, Rockville, MD, USA
| | - Ana Paula Sales
- Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, CA, USA
|
30
|
Reiner Benaim A, Almog R, Gorelik Y, Hochberg I, Nassar L, Mashiach T, Khamaisi M, Lurie Y, Azzam ZS, Khoury J, Kurnik D, Beyar R. Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies. JMIR Med Inform 2020; 8:e16492. [PMID: 32130148 PMCID: PMC7059086 DOI: 10.2196/16492] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 10/03/2019] [Revised: 12/01/2019] [Accepted: 12/27/2019] [Indexed: 12/16/2022]
Abstract
Background: Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers data access for initial analysis before granting institutional review board approval. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed. Objective: This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data. Methods: A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data. Results: This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed. Conclusions: The use of synthetic structured data provides a close estimate to real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy. Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.
Affiliation(s)
| | - Ronit Almog
- Clinical Epidemiology Unit, Rambam Health Care Campus, Haifa, Israel; School of Public Health, University of Haifa, Haifa, Israel
| | - Yuri Gorelik
- Department of Internal Medicine D, Rambam Health Care Campus, Haifa, Israel
| | - Irit Hochberg
- Institute of Endocrinology, Diabetes and Metabolism, Rambam Health Care Campus, Haifa, Israel; The Ruth & Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel
| | - Laila Nassar
- Clinical Pharmacology and Toxicology Section, Rambam Health Care Campus, Haifa, Israel
| | - Tanya Mashiach
- Clinical Epidemiology Unit, Rambam Health Care Campus, Haifa, Israel
| | - Mogher Khamaisi
- Department of Internal Medicine D, Rambam Health Care Campus, Haifa, Israel; Institute of Endocrinology, Diabetes and Metabolism, Rambam Health Care Campus, Haifa, Israel; Diabetes Stem Cell Laboratory, Rambam Health Care Campus, Haifa, Israel
| | - Yael Lurie
- The Ruth & Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel; Clinical Pharmacology and Toxicology Section, Rambam Health Care Campus, Haifa, Israel
| | - Zaher S Azzam
- Department of Internal Medicine B, Rambam Health Care Campus, Haifa, Israel; The Ruth & Bruce Rappaport Faculty of Medicine and Rappaport Research Institute, Technion-Israel Institute of Technology, Haifa, Israel
| | - Johad Khoury
- Department of Internal Medicine B, Rambam Health Care Campus, Haifa, Israel
| | - Daniel Kurnik
- The Ruth & Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel; Clinical Pharmacology Unit, Rambam Health Care Campus, Haifa, Israel
| | - Rafael Beyar
- The Ruth & Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel; Rambam Health Care Campus, Haifa, Israel
|
31
|
A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND: The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though commonly used and able to achieve good performance for prognostic prediction, usually perform worse under multicenter privacy-preserving data mining scenarios than a centrally trained version. METHODS AND MATERIALS: In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performance on par with, or even better than, a centrally trained RF, but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information. RESULTS: The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach. CONCLUSION: The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
|