Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Zhang Z, Yan C, Mesa DA, Sun J, Malin BA. Ensuring electronic medical record simulation through better training, modeling, and evaluation. J Am Med Inform Assoc 2021;27:99-108. [PMID: 31592533 DOI: 10.1093/jamia/ocz161] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 06/14/2019] [Revised: 07/29/2019] [Accepted: 08/15/2019] [Indexed: 12/15/2022] Open

Cited by Other Article(s)

Yan C, Zhang Z, Nyemba S, Li Z. Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial. JMIR AI 2024;3:e52615. [PMID: 38875595 PMCID: PMC11074891 DOI: 10.2196/52615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 01/24/2024] [Accepted: 03/07/2024] [Indexed: 06/16/2024]

El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024;14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] Open

Abstract

Synthetic data generation is increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
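
As an illustration of one of the replicability metrics named in this abstract, the following is a minimal sketch of confidence interval overlap between an estimate from real data and the same estimate from synthetic data. The abstract does not give a formula, so this assumes one commonly used definition (the average of the overlap length expressed as a fraction of each interval's length); the function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(real_ci, synth_ci):
    """Confidence interval overlap between two (lower, upper) intervals.

    Assumed definition: the length of the intersection, expressed as a
    fraction of each interval's own length, averaged over both intervals.
    Identical intervals give 1.0; disjoint intervals give a negative value.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    # Signed length of the intersection (negative when the intervals are disjoint).
    intersection = min(hi_r, hi_s) - max(lo_r, lo_s)
    return 0.5 * (intersection / (hi_r - lo_r) + intersection / (hi_s - lo_s))


if __name__ == "__main__":
    # Hypothetical intervals for a single regression coefficient.
    print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

Some authors truncate negative values at zero; that variant is a one-line change (`max(0.0, ...)`) and is not assumed here.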

Zhang T, Qu Y, Wang D, Zhong M, Cheng Y, Zhang M. Optimizing sepsis treatment strategies via a reinforcement learning model. Biomed Eng Lett 2024;14:279-289. [PMID: 38374908 PMCID: PMC10874349 DOI: 10.1007/s13534-023-00343-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 10/28/2023] [Accepted: 11/13/2023] [Indexed: 02/21/2024] Open

Abstract

Purpose

Existing sepsis treatment lacks an effective reference standard and relies heavily on the experience of clinicians. We therefore used a reinforcement learning model to build a decision-support model for sepsis medication treatment.

Methods

Using the latest Sepsis 3.0 diagnostic criteria, 19,582 sepsis patients were screened from the Medical Information Mart for Intensive Care III (MIMIC-III) database for treatment strategy research, and forty-six features were used in modeling. The medication strategy under study is the dosage of vasopressor drugs and intravenous infusion. Dueling DDQN is proposed to predict the patient's medication strategy (vasopressor and intravenous infusion dosage) from the relationship between the patient's state, reward function, and medication action. We also constructed protection against possible high-risk behaviors of Dueling DDQN, especially sudden dose changes of vasopressors, which can lead to harmful clinical effects. To strengthen the guidance that clinically effective medication strategies provide to the model, we proposed a hybrid model (safe-dueling DDQN + expert strategies) to optimize medication strategies.

Results

The Dueling DDQN medication model for sepsis patients is superior to clinical strategies and other models in terms of off-policy evaluation values and mortality, reducing mortality from 16.8% under clinical strategies to 13.8%. Compared with Dueling DDQN, the proposed Safe-Dueling DDQN has an overall reduction in actions involving vasopressors and reduces large dose fluctuations. The proposed hybrid model can switch between expert strategies and safe-dueling DDQN strategies based on the current state of patients.

Conclusions

The reinforcement learning model we proposed for sepsis medication treatment has practical clinical value and can improve the survival rate of patients to a certain extent while ensuring the balance and safety of medication.

Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Choi H, Cho HN, Kim M, Han J, Kee G, Park S, Lee KH, Jun TJ, Kim YH. LDP-GAN: Generative adversarial networks with local differential privacy for patient medical records synthesis. Comput Biol Med 2024;168:107738. [PMID: 37995536 DOI: 10.1016/j.compbiomed.2023.107738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 10/31/2023] [Accepted: 11/16/2023] [Indexed: 11/25/2023]

Affiliation(s)

Hansle Gwon: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Imjin Ahn: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Yunha Kim: Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Hee Jun Kang: Division of Cardiology, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Hyeram Seo: Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Heejung Choi: Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Ha Na Cho: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Minkyoung Kim: Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
JiYe Han: Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Gaeun Kee: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Seohyun Park: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Kye Hwa Lee: Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Tae Joon Jun: Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
Young-Hak Kim: Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea

Lim B, Seth I, Kah S, Sofiadellis F, Ross RJ, Rozen WM, Cuomo R. Using Generative Artificial Intelligence Tools in Cosmetic Surgery: A Study on Rhinoplasty, Facelifts, and Blepharoplasty Procedures. J Clin Med 2023;12:6524. [PMID: 37892665 PMCID: PMC10607912 DOI: 10.3390/jcm12206524] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/26/2023] [Revised: 10/03/2023] [Accepted: 10/13/2023] [Indexed: 10/29/2023] Open

El Kababji S, Mitsakakis N, Fang X, Beltran-Bless AA, Pond G, Vandermeer L, Radhakrishnan D, Mosquera L, Paterson A, Shepherd L, Chen B, Barlow WE, Gralow J, Savard MF, Clemons M, El Emam K. Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets. JCO Clin Cancer Inform 2023;7:e2300116. [PMID: 38011617 PMCID: PMC10703127 DOI: 10.1200/cci.23.00116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 08/24/2023] [Accepted: 09/19/2023] [Indexed: 11/29/2023] Open

Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun 2023;14:5305. [PMID: 37652934 PMCID: PMC10471716 DOI: 10.1038/s41467-023-41093-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 08/23/2023] [Indexed: 09/02/2023] Open

Yin Y. Prediction and analysis of time series data based on granular computing. Front Comput Neurosci 2023;17:1192876. [PMID: 37576071 PMCID: PMC10413556 DOI: 10.3389/fncom.2023.1192876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 07/06/2023] [Indexed: 08/15/2023] Open

Azizi Z, Lindner S, Shiba Y, Raparelli V, Norris CM, Kublickiene K, Herrero MT, Kautzky-Willer A, Klimek P, Gisinger T, Pilote L, El Emam K. A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health. Sci Rep 2023;13:11540. [PMID: 37460705 DOI: 10.1038/s41598-023-38457-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2022] [Accepted: 07/08/2023] [Indexed: 07/20/2023] Open

Affiliation(s)

Zahra Azizi: Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
Simon Lindner: Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
Yumika Shiba: Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada; Faculty of Medicine, McGill University, Montreal, Canada
Valeria Raparelli: Department of Translational Medicine, University of Ferrara, Ferrara, Italy; Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
Colleen M Norris: Faculty of Nursing, University of Alberta, Edmonton, AB, Canada; Heart and Stroke Strategic Clinical Networks, Alberta Health Services, Alberta, Canada
Karolina Kublickiene: Karolinska Institute, Stockholm, Sweden
Maria Trinidad Herrero: Clinical & Experimental Neuroscience (NiCE-IMIB-IUIE), School of Medicine, University of Murcia, Murcia, Spain
Alexandra Kautzky-Willer: Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
Peter Klimek: Section for Science of Complex Systems, CeMSIIS, Medical University of Vienna, Vienna, Austria; Complexity Science Hub Vienna, Vienna, Austria
Teresa Gisinger: Division of Endocrinology and Metabolism, Medical University of Vienna, Vienna, Austria
Louise Pilote: Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada; Divisions of Clinical Epidemiology and General Internal Medicine, McGill University Health Centre Research Institute, Montreal, QC, Canada
Khaled El Emam: Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada; School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada; Replica Analytics Ltd, Ottawa, ON, Canada

Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT. A method for generating synthetic longitudinal health data. BMC Med Res Methodol 2023;23:67. [PMID: 36959532 PMCID: PMC10034254 DOI: 10.1186/s12874-023-01869-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 02/19/2023] [Indexed: 03/25/2023] Open

Abstract

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), and Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724). The Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
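
For readers unfamiliar with the Hellinger distance used in this abstract, the following is a minimal sketch of the standard definition for two discrete distributions, which ranges from 0 (identical) to 1 (no shared support). It is not the authors' code; the function name `hellinger_distance` and the event-type counts in the usage example are hypothetical.

```python
import math


def hellinger_distance(p_counts, q_counts):
    """Hellinger distance between two discrete distributions.

    Each argument is a dict mapping a category to a non-negative count or
    probability. Both are normalized to sum to 1 before comparison.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    categories = set(p_counts) | set(q_counts)
    squared = 0.0
    for category in categories:
        p = p_counts.get(category, 0.0) / p_total
        q = q_counts.get(category, 0.0) / q_total
        squared += (math.sqrt(p) - math.sqrt(q)) ** 2
    # H(P, Q) = sqrt( 1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2 )
    return math.sqrt(squared / 2.0)


if __name__ == "__main__":
    # Hypothetical event-type counts in a real and a synthetic dataset.
    real = {"visit": 500, "lab": 300, "drug": 200}
    synthetic = {"visit": 480, "lab": 310, "drug": 210}
    print(round(hellinger_distance(real, synthetic), 4))
```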

Theodorou B, Xiao C, Sun J. Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model. Research Square 2023:rs.3.rs-2644725. [PMID: 36945542 PMCID: PMC10029081 DOI: 10.21203/rs.3.rs-2644725/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]

Abstract

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity and granular EHR data in its original, high-dimensional form poses challenges for existing methods due to the complexities inherent in high-dimensional data. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR data that preserves the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. Our HALO method, designed as a hierarchical autoregressive model, generates a probability density function of medical codes, clinical visits, and patient records, allowing for the generation of realistic EHR data in its original, unaggregated form without the need for variable selection or aggregation. Additionally, our model produces high-quality continuous variables in a longitudinal and probabilistic manner. We conducted extensive experiments and demonstrate that HALO can generate high-fidelity EHR data with high-dimensional disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000), and achieve an R² correlation above 0.9 in comparison to real EHR data. In comparison to the leading baseline, HALO improves predictive modeling by over 17% in its predictive accuracy and perplexity on a held-out test set of real EHR data. This performance then enables downstream ML models trained on its synthetic data to achieve comparable accuracy to models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, using a combination of real and synthetic data enhances the accuracy of ML models beyond that achieved by using only real EHR data.

Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A Multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022;13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] Open

Tuan Soh TY, Nik Mohd Rosdy NMM, Mohd Yusof MYP, Azhar Hilmy SH, Md Sabri BA. Adoption of a Digital Patient Health Passport as Part of a Primary Healthcare Service Delivery: Systematic Review. J Pers Med 2022;12:1814. [PMID: 36579540 PMCID: PMC9694834 DOI: 10.3390/jpm12111814] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/26/2022] [Revised: 10/21/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open

El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022;5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022] Open

Zhang Z, Yan C, Malin BA. Keeping synthetic patients on track: feedback mechanisms to mitigate performance drift in longitudinal health data simulation. J Am Med Inform Assoc 2022;29:1890-1898. [PMID: 35927974 PMCID: PMC9552284 DOI: 10.1093/jamia/ocac131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Revised: 06/25/2022] [Accepted: 07/22/2022] [Indexed: 11/13/2022] Open

Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022;29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022] Open

El Emam K, Mosquera L, Fang X, El-Hussuna A. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022;10:e35734. [PMID: 35389366 PMCID: PMC9030990 DOI: 10.2196/35734] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/15/2021] [Revised: 01/27/2022] [Accepted: 02/13/2022] [Indexed: 01/06/2023] Open

Zhang Z, Yan C, Malin BA. Membership inference attacks against synthetic health data. J Biomed Inform 2022;125:103977. [PMID: 34920126 PMCID: PMC8766950 DOI: 10.1016/j.jbi.2021.103977] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/15/2021] [Revised: 11/17/2021] [Accepted: 12/08/2021] [Indexed: 01/03/2023]

Abstract

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.

Foomani FH, Anisuzzaman DM, Niezgoda J, Niezgoda J, Guns W, Gopalakrishnan S, Yu Z. Synthesizing time-series wound prognosis factors from electronic medical records using generative adversarial networks. J Biomed Inform 2021;125:103972. [PMID: 34920125 DOI: 10.1016/j.jbi.2021.103972] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 09/20/2021] [Accepted: 12/03/2021] [Indexed: 11/26/2022]

Abstract

Wound prognostic models not only provide an estimate of wound healing time to motivate patients to follow up their treatments but can also help clinicians decide whether to use standard care or adjuvant therapies and assist them with designing clinical trials. However, collecting prognosis factors from Electronic Medical Records (EMR) of patients is challenging due to privacy, sensitivity, and confidentiality. In this study, we developed time series medical generative adversarial networks (GANs) to generate synthetic wound prognosis factors using very limited information collected during routine care in a specialized wound care facility. The generated prognosis variables are used in developing a predictive model for chronic wound healing trajectory. Our novel medical GAN can produce both continuous and categorical features from EMR. Moreover, we applied temporal information to our model by considering data collected from the weekly follow-ups of patients. Conditional training strategies were utilized to enhance training and generate classified data in terms of healing or non-healing. The ability of the proposed model to generate realistic EMR data was evaluated by TSTR (test on the synthetic, train on the real), discriminative accuracy, and visualization. We utilized samples generated by our proposed GAN in training a prognosis model to demonstrate its real-life application. Using the generated samples in training predictive models improved the classification accuracy by 6.66-10.01% compared to the previous EMR-GAN. Additionally, the suggested prognosis classifier achieved an area under the curve (AUC) of 0.875, 0.810, and 0.647 when training the network using data from the first three visits, first two visits, and first visit, respectively. These results indicate a significant improvement in wound healing prediction compared to the previous prognosis models.

Engr YS, Lalande A, Afilalo J, Jodoin PM. Generative Adversarial Networks in Cardiology. Can J Cardiol 2021;38:196-203. [PMID: 34780990 DOI: 10.1016/j.cjca.2021.11.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/17/2021] [Revised: 11/04/2021] [Accepted: 11/08/2021] [Indexed: 01/18/2023] Open

Daniel C, Bellamine A, Kalra D. Key Contributions in Clinical Research Informatics. Yearb Med Inform 2021;30:233-238. [PMID: 34479395 PMCID: PMC8416193 DOI: 10.1055/s-0041-1726514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open

Abstract

Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2020.

Method: A bibliographic search using a combination of Medical Subject Headings (MeSH) descriptors and free-text terms on CRI was performed using PubMed, followed by a double-blind review to select a list of candidate best papers, which were then peer-reviewed by external reviewers. After peer-review ranking, a consensus meeting between two section editors and the editorial team was organized to reach a final decision on the four best papers.

Results: Among the 877 papers published in 2020 and returned by the search, four best papers were selected. The first best paper describes a method for mining temporal sequences from clinical documents to infer disease trajectories and enhance high-throughput phenotyping. The authors of the second best paper demonstrate that the generation of synthetic Electronic Health Record (EHR) data through Generative Adversarial Networks (GANs) could be substantially improved by more appropriate training and evaluation criteria. The third best paper offers an efficient advance on methods to detect adverse drug events by computer-assisting expert reviewers with annotated candidate mentions in clinical documents. The large-scale data quality assessment study reported by the fourth best paper has clinical research informatics implications, in terms of the trustworthiness of inferences made from analysing electronic health records.

Conclusions: The most significant research efforts in the CRI field currently focus on data science, with active research in the development and evaluation of Artificial Intelligence/Machine Learning (AI/ML) algorithms based on ever more intensive use of real-world data, especially real or synthetic EHR data. A major lesson that the coronavirus disease 2019 (COVID-19) pandemic has already taught the scientific CRI community is that timely, international, high-quality data sharing and collaborative data analysis are vital to inform policy decisions.

Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). medRxiv: The Preprint Server for Health Sciences 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]

Zhang Z, Yan C, Lasko TA, Sun J, Malin BA. SynTEG: a framework for temporal structured electronic health data simulation. J Am Med Inform Assoc 2021;28:596-604. [PMID: 33277896 PMCID: PMC7936402 DOI: 10.1093/jamia/ocaa262] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/19/2020] [Accepted: 10/06/2020] [Indexed: 12/16/2022] Open

Chu J, Chen J, Chen X, Dong W, Shi J, Huang Z. Knowledge-aware multi-center clinical dataset adaptation: Problem, method, and application. J Biomed Inform 2021;115:103710. [PMID: 33581323 DOI: 10.1016/j.jbi.2021.103710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2020] [Revised: 02/05/2021] [Accepted: 02/06/2021] [Indexed: 11/30/2022]

Yan C, Zhang Z, Nyemba S, Malin BA. Generating Electronic Health Records with Multiple Data Types and Constraints. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021;2020:1335-1344. [PMID: 33936510 PMCID: PMC8075510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]

El Emam K, Mosquera L, Jonker E, Sood H. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open 2021;4:ooab012. [PMID: 33709065 PMCID: PMC7936723 DOI: 10.1093/jamiaopen/ooab012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/13/2020] [Revised: 02/01/2021] [Accepted: 02/10/2021] [Indexed: 01/22/2023] Open

El Emam K, Mosquera L, Bass J. Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation. J Med Internet Res 2020;22:e23139. [PMID: 33196453 PMCID: PMC7704280 DOI: 10.2196/23139] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/02/2020] [Revised: 09/02/2020] [Accepted: 10/10/2020] [Indexed: 01/13/2023] Open

Lee D, Yu H, Jiang X, Rogith D, Gudala M, Tejani M, Zhang Q, Xiong L. Generating sequential electronic health records using dual adversarial autoencoder. J Am Med Inform Assoc 2020;27:1411-1419. [PMID: 32989459 PMCID: PMC7647348 DOI: 10.1093/jamia/ocaa119] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/05/2019] [Revised: 05/18/2020] [Accepted: 06/16/2020] [Indexed: 11/12/2022] Open

Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol 2020;20:108. [PMID: 32381039 PMCID: PMC7204018 DOI: 10.1186/s12874-020-00977-1] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Received: 07/29/2019] [Accepted: 04/13/2020] [Indexed: 01/12/2023] Open

Reiner Benaim A, Almog R, Gorelik Y, Hochberg I, Nassar L, Mashiach T, Khamaisi M, Lurie Y, Azzam ZS, Khoury J, Kurnik D, Beyar R. Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies. JMIR Med Inform 2020;8:e16492. [PMID: 32130148 PMCID: PMC7059086 DOI: 10.2196/16492] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 10/03/2019] [Revised: 12/01/2019] [Accepted: 12/27/2019] [Indexed: 12/16/2022] Open

Abstract

Background

Privacy restrictions limit access to protected patient-derived health information for research purposes. Consequently, data anonymization is required to allow researchers access to data for initial analysis before institutional review board approval is granted. A system installed and activated at our institution enables synthetic data generation that mimics data from real electronic medical records, wherein only fictitious patients are listed.

Objective

This paper aimed to validate the results obtained when analyzing synthetic structured data for medical research. A comprehensive validation process concerning meaningful clinical questions and various types of data was conducted to assess the accuracy and precision of statistical estimates derived from synthetic patient data.

Methods

A cross-hospital project was conducted to validate results obtained from synthetic data produced for five contemporary studies on various topics. For each study, results derived from synthetic data were compared with those based on real data. In addition, repeatedly generated synthetic datasets were used to estimate the bias and stability of results obtained from synthetic data.
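The abstract does not state how bias and stability were computed. As a minimal sketch, assuming bias is taken as the mean difference between synthetic and real estimates and stability as the standard deviation across repeated synthetic datasets, the comparison could look like the following. The function name `bias_and_stability` and all numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def bias_and_stability(real_estimate, synthetic_estimates):
    """Compare estimates from repeated synthetic datasets with one real-data estimate.

    Assumed definitions (not confirmed by the source):
      bias         = mean of synthetic estimates minus the real estimate
      stability_sd = sample standard deviation of the synthetic estimates
    """
    if len(synthetic_estimates) < 2:
        raise ValueError("need at least two synthetic estimates")
    average = mean(synthetic_estimates)
    return {
        "bias": average - real_estimate,
        "stability_sd": stdev(synthetic_estimates),
    }


# Hypothetical values for illustration only: one estimate from real data and
# the same estimate recomputed on four independently generated synthetic datasets.
real_estimate = 1.50
synthetic_estimates = [1.40, 1.50, 1.60, 1.70]

summary = bias_and_stability(real_estimate, synthetic_estimates)
print(summary)
```

Under these assumptions, a bias near zero suggests the synthetic results are centered on the real result, and a small standard deviation suggests that regenerating the synthetic data changes the conclusion very little.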

Results

This study demonstrated that results derived from synthetic data were predictive of results from real data. When the number of patients was large relative to the number of variables used, highly accurate and strongly consistent results were observed between synthetic and real data. For studies based on smaller populations that accounted for confounders and modifiers by multivariate models, predictions were of moderate accuracy, yet clear trends were correctly observed.

Conclusions

The use of synthetic structured data provides a close estimate of real data results and is thus a powerful tool in shaping research hypotheses and accessing estimated analyses, without risking patient privacy. Synthetic data enable broad access to data (eg, for out-of-organization researchers), and rapid, safe, and repeatable analysis of data in hospitals or other health organizations where patient privacy is a primary value.

A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020;103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]

Abstract

BACKGROUND

The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples to give models adequate statistical power. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though commonly used and able to achieve good performance for prognostic prediction, usually perform worse under multicenter privacy-preserving data mining scenarios than a centrally trained version.

METHODS AND MATERIALS

In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performance on par with, or even better than, a centrally trained RF, but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information.

RESULT

The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach.

CONCLUSION

The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
