1
Adam D. Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted? Proc Natl Acad Sci U S A 2024; 121:e2414310121. [PMID: 39083423 DOI: 10.1073/pnas.2414310121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/02/2024]
2
Murcia VM, Aggarwal V, Pesaladinne N, Thammineni R, Do N, Alterovitz G, Fricks RB. Automating Clinical Trial Matches Via Natural Language Processing of Synthetic Electronic Health Records and Clinical Trial Eligibility Criteria. AMIA Joint Summits on Translational Science Proceedings 2024; 2024:125-134. [PMID: 38827083 PMCID: PMC11141802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Clinical trials are critical to many medical advances; however, recruiting patients remains a persistent obstacle. Automated clinical trial matching could expedite recruitment across all trial phases. We detail our initial efforts towards automating the matching process by linking realistic synthetic electronic health records to clinical trial eligibility criteria using natural language processing methods. We also demonstrate how the Sørensen-Dice Index can be adapted to quantify match quality between a patient and a clinical trial.
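For orientation, the standard Sørensen-Dice index of two sets A and B is 2|A ∩ B| / (|A| + |B|). The abstract does not say how the authors adapted it, so the following is only a minimal sketch that assumes a patient record and a trial's eligibility criteria have each been reduced to a set of concept strings. The function name and the example terms are hypothetical.

```python
def dice_index(patient_terms: set[str], trial_terms: set[str]) -> float:
    """Standard Sørensen-Dice index: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    total = len(patient_terms) + len(trial_terms)
    if total == 0:
        return 0.0  # convention chosen here for two empty sets (an assumption)
    return 2 * len(patient_terms & trial_terms) / total


# Hypothetical concept sets, for illustration only.
patient = {"type 2 diabetes", "hypertension", "age>=18"}
trial = {"type 2 diabetes", "age>=18", "hba1c>=7"}
score = dice_index(patient, trial)  # two shared terms out of 3 + 3
```

A score of 1 means the two sets are identical and 0 means they share nothing.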
Affiliation(s)
- Victor M Murcia
  - VA Massachusetts Veterans Epidemiology Research and Information Center, Boston, MA
  - VA National Artificial Intelligence Institute, Washington, D.C.
- Vinod Aggarwal
  - VHA Office of Healthcare Innovation and Learning, VA Central Office, Washington, D.C.
  - MDClone, Be'er Sheva, Israel
- Ram Thammineni
  - CTS Group, Girls Computing League, Nonprofit Organization, Herndon, VA
- Nhan Do
  - VA Massachusetts Veterans Epidemiology Research and Information Center, Boston, MA
- Gil Alterovitz
  - VA National Artificial Intelligence Institute, Washington, D.C.
- Rafael B Fricks
  - VA Massachusetts Veterans Epidemiology Research and Information Center, Boston, MA
  - VA National Artificial Intelligence Institute, Washington, D.C.
3
El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024; 14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024]
Abstract
Synthetic data generation is being increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets.
Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original, whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
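To illustrate what "combining rules" do, here is a minimal sketch of one commonly cited set of rules for fully synthetic data (attributed to Raghunathan, Reiter and Rubin). The abstract does not state which exact rules or adjustments the authors applied, so this choice is an assumption, and the function name and example numbers are hypothetical. With m synthetic datasets, the point estimate is the mean of the m estimates, and the variance estimate is (1 + 1/m) times the between-dataset variance minus the mean within-dataset variance.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine one coefficient fitted on m synthetic datasets.

    estimates: point estimates q_1..q_m (one per synthetic dataset)
    variances: their estimated variances v_1..v_m (squared standard errors)
    Returns (q_bar, t_f). t_f can come out negative in small samples;
    handling that case is outside this sketch.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two datasets with matching variances")
    q_bar = mean(estimates)        # combined point estimate
    b = variance(estimates)        # between-dataset variance (m - 1 denominator)
    v_bar = mean(variances)        # mean within-dataset variance
    t_f = (1 + 1 / m) * b - v_bar  # assumed fully synthetic variance rule
    return q_bar, t_f


# Hypothetical log-odds estimates from three synthetic datasets.
q_bar, t_f = combine_fully_synthetic([0.5, 0.7, 0.6], [0.001, 0.001, 0.001])
```

The point of the sketch is that the standard error reported by a single synthetic dataset ignores the between-dataset variance entirely, which is consistent with the abstract's warning about uncombined results.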
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Lucy Mosquera
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Xi Fang
  - Replica Analytics, Ottawa, ON, Canada
4
Lun R, Siegal D, Ramsay T, Stotts G, Dowlatshahi D. Synthetic data in cancer and cerebrovascular disease research: A novel approach to big data. PLoS One 2024; 19:e0295921. [PMID: 38324588 PMCID: PMC10849264 DOI: 10.1371/journal.pone.0295921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Accepted: 12/01/2023] [Indexed: 02/09/2024]
Abstract
OBJECTIVES Synthetic datasets are artificially manufactured based on real health systems data but do not contain real patient information. We sought to validate the use of synthetic data in stroke and cancer research by conducting a comparison study of cancer patients with ischemic stroke to non-cancer patients with ischemic stroke. DESIGN Retrospective cohort study. SETTING We used synthetic data generated by MDClone and compared it to its original source data (i.e. real patient data from the Ottawa Hospital Data Warehouse). OUTCOME MEASURES We compared key differences in demographics, treatment characteristics, length of stay, and costs between cancer patients with ischemic stroke and non-cancer patients with ischemic stroke. We used a binary, multivariable logistic regression model to identify risk factors for recurrent stroke in the cancer population. RESULTS Using synthetic data, we found cancer patients with ischemic stroke had a lower prevalence of hypertension (52.0% in the cancer cohort vs 57.7% in the non-cancer cohort, p<0.0001), and a higher prevalence of chronic obstructive pulmonary disease (COPD: 8.5% vs 4.7%, p<0.0001), prior ischemic stroke (1.7% vs 0.1%, p<0.001), and prior venous thromboembolism (VTE: 8.2% vs 1.5%, p<0.0001). They also had a longer length of stay (8 days [IQR 3-16] vs 6 days [IQR 3-13], p = 0.011), and higher costs associated with their stroke encounters: $11,498 (IQR $4,440-$20,668) in the cancer cohort vs $8,084 (IQR $3,947-$16,706) in the non-cancer cohort (p = 0.0061). A multivariable logistic regression model identified 5 predictors for recurrent ischemic stroke in the cancer cohort using synthetic data; 3 of the same predictors were identified using real patient data with similar effect measures. Summary statistics between synthetic and original datasets did not significantly differ, other than slight differences in the distributions of frequencies for numeric data.
CONCLUSION We demonstrated the utility of synthetic data in stroke and cancer research and provided key differences between cancer and non-cancer patients with ischemic stroke. Synthetic data is a powerful tool that can allow researchers to easily explore hypothesis generation, enable data sharing without privacy breaches, and ensure broad access to big data in a rapid, safe, and reliable fashion.
Affiliation(s)
- Ronda Lun
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, Canada
  - Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, Canada
  - Division of Neurology, Department of Medicine, The Ottawa Hospital, Ottawa, Canada
- Deborah Siegal
  - School of Epidemiology, University of Ottawa, Ottawa, Canada
  - Division of Hematology, Department of Medicine, The Ottawa Hospital, Ottawa, Canada
- Tim Ramsay
  - School of Epidemiology, University of Ottawa, Ottawa, Canada
- Grant Stotts
  - Division of Neurology, Department of Medicine, The Ottawa Hospital, Ottawa, Canada
- Dar Dowlatshahi
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, Canada
  - Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, Canada
  - Division of Neurology, Department of Medicine, The Ottawa Hospital, Ottawa, Canada
5
Moore JH, Li X, Chang JH, Tatonetti NP, Theodorescu D, Chen Y, Asselbergs FW, Venkatesan M, Wang ZP. SynTwin: A graph-based approach for predicting clinical outcomes using digital twins derived from synthetic patients. Pacific Symposium on Biocomputing 2024; 29:96-107. [PMID: 38160272 PMCID: PMC10827004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The concept of a digital twin came from the engineering, industrial, and manufacturing domains to create virtual objects or machines that could inform the design and development of real objects. This idea is appealing for precision medicine where digital twins of patients could help inform healthcare decisions. We have developed a methodology for generating and using digital twins for clinical outcome prediction. We introduce a new approach that combines synthetic data and network science to create digital twins (i.e. SynTwin) for precision medicine. First, our approach starts by estimating the distance between all subjects based on their available features. Second, the distances are used to construct a network with subjects as nodes and edges defining distance less than the percolation threshold. Third, communities or cliques of subjects are defined. Fourth, a large population of synthetic patients are generated using a synthetic data generation algorithm that models the correlation structure of the data to generate new patients. Fifth, digital twins are selected from the synthetic patient population that are within a given distance defining a subject community in the network. Finally, we compare and contrast community-based prediction of clinical endpoints using real subjects, digital twins, or both within and outside of the community. Key to this approach are the digital twins defined using patient similarity that represent hypothetical unobserved patients with patterns similar to nearby real patients as defined by network distance and community structure. We apply our SynTwin approach to predicting mortality in a population-based cancer registry (n=87,674) from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). 
Our results demonstrate that nearest network neighbor prediction of mortality in this study is significantly improved with digital twins (AUROC=0.864, 95% CI=0.857-0.872) over just using real data alone (AUROC=0.791, 95% CI=0.781-0.800). These results suggest a network-based digital twin strategy using synthetic patients may add value to precision medicine efforts.
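The first steps of the pipeline described in this abstract (pairwise distances, a thresholded network, communities, and twin selection) can be sketched in a few lines. This is a toy illustration only: it assumes Euclidean distance on numeric features, uses connected components as a stand-in for the authors' community detection, and takes the threshold as a given number rather than deriving a percolation threshold. All function names and data points are hypothetical.

```python
from math import dist


def communities(points, threshold):
    """Connected components of the graph that links points closer than threshold."""
    n = len(points)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and dist(points[i], points[j]) < threshold:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups


def select_twins(points, community, synthetic, threshold):
    """Synthetic patients within threshold of any real member of the community."""
    return [s for s in synthetic
            if any(dist(s, points[i]) < threshold for i in community)]


# Hypothetical two-feature subjects and synthetic patients.
real = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1)]
synthetic = [(0.2, 0.1), (5.1, 4.9), (10.0, 10.0)]
groups = communities(real, threshold=1.0)
twins = select_twins(real, groups[0], synthetic, threshold=1.0)
```

In this toy data the synthetic point far from every real subject is never selected as a twin, which mirrors the abstract's idea that twins represent unobserved patients close to a real community.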
Affiliation(s)
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
  - Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, United States
6
Alloza C, Knox B, Raad H, Aguilà M, Coakley C, Mohrova Z, Boin É, Bénard M, Davies J, Jacquot E, Lecomte C, Fabre A, Batech M. A Case for Synthetic Data in Regulatory Decision-Making in Europe. Clin Pharmacol Ther 2023; 114:795-801. [PMID: 37441734 DOI: 10.1002/cpt.3001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2022] [Accepted: 07/05/2023] [Indexed: 07/15/2023]
Abstract
Regulators are faced with many challenges surrounding health data usage, including privacy, fragmentation, validity, and generalizability, especially in the European Union, for which synthetic data may provide innovative solutions. Synthetic data, defined as data artificially generated rather than captured in the real world, are increasingly being used for healthcare research purposes as a proxy for real-world data (RWD). Currently, barriers are particularly challenging in Europe, where sharing patients' data is strictly regulated, costly, and time-consuming, causing delays in evidence generation and regulatory approvals. Recent initiatives are encouraging the use of synthetic data in regulatory decision making and health technology assessment to overcome these challenges, but synthetic data have still to overcome realistic obstacles before their adoption by researchers and regulators in Europe. Thus, the emerging use of RWD and synthetic data by pharmaceutical and medical device industries calls on regulatory bodies to provide a framework for proper evidence generation and informed regulatory decision making. As the provision of data becomes more ubiquitous in scientific research, so will innovations in artificial intelligence, machine learning, and generation of synthetic data, making the exploration and intricacies of this topic all the more important and timely. In this review, we discuss the potential merits and challenges of synthetic data in the context of decision making in the European regulatory environment. We explore the current uses of synthetic data and ongoing initiatives, the value of synthetic data for regulatory purposes, and realistic barriers to the adoption of synthetic data in healthcare.
7
Ang CYS, Chiew YS, Wang X, Ooi EH, Nor MBM, Cove ME, Chase JG. Virtual patient with temporal evolution for mechanical ventilation trial studies: A stochastic model approach. Computer Methods and Programs in Biomedicine 2023; 240:107728. [PMID: 37531693 DOI: 10.1016/j.cmpb.2023.107728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 06/27/2023] [Accepted: 07/19/2023] [Indexed: 08/04/2023]
Abstract
BACKGROUND AND OBJECTIVE Healthcare datasets are plagued by issues of data scarcity and class imbalance. Clinically validated virtual patient (VP) models can provide accurate in-silico representations of real patients and thus a means for synthetic data generation in hospital critical care settings. This research presents a realistic, time-varying mechanically ventilated respiratory failure VP profile synthesised using a stochastic model. METHODS A stochastic model was developed using respiratory elastance (Ers) data from two clinical cohorts and averaged over 30-minute time intervals. The stochastic model was used to generate future Ers data based on current Ers values with added normally distributed random noise. Self-validation of the VPs was performed via Monte Carlo simulation and retrospective Ers profile fitting. A stochastic VP cohort of temporal Ers evolution was synthesised and then compared to data from an independent retrospective patient cohort in a virtual trial across several measured patient responses, where similarity of profiles validates the realism of stochastic model generated VP profiles. RESULTS A total of 120,000 3-hour VPs for pressure control (PC) and volume control (VC) ventilation modes are generated using stochastic simulation. Optimisation of the stochastic simulation process yields an ideal noise percentage of 5-10% and 200,000 simulation iterations, allowing the simulation of a realistic and diverse set of Ers profiles. Results of self-validation show the retrospective Ers profiles were able to be recreated accurately with a mean squared error of only 0.099 [0.009-0.790]% for the PC cohort and 0.051 [0.030-0.126]% for the VC cohort. A virtual trial demonstrates the ability of the stochastic VP cohort to capture Ers trends within and beyond the retrospective patient cohort, providing cohort-level validation.
CONCLUSION VPs capable of temporal evolution demonstrate feasibility for use in designing, developing, and optimising bedside MV guidance protocols through in-silico simulation and validation. Overall, the temporal VPs developed using stochastic simulation alleviate the need for lengthy, resource intensive, high cost clinical trials, while facilitating statistically robust virtual trials, ultimately leading to improved patient care and outcomes in mechanical ventilation.
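The idea of generating the next Ers value from the current one with added normally distributed noise can be illustrated with a deliberately simplified sketch. It is not the authors' stochastic model, which is fitted to clinical cohort data. It assumes a plain random walk in which the noise standard deviation is a percentage of the current value, six 30-minute steps for a 3-hour profile, and a hypothetical starting elastance. The function name and parameters are invented for illustration.

```python
import random


def simulate_ers_profile(ers_start, n_steps=6, noise_pct=0.05, seed=None):
    """Toy random-walk Ers profile (assumed units: cmH2O/L).

    Each step adds Gaussian noise whose standard deviation is noise_pct
    of the current value. Six 30-minute steps span three hours.
    """
    rng = random.Random(seed)
    profile = [ers_start]
    for _ in range(n_steps):
        current = profile[-1]
        nxt = current + rng.gauss(0.0, noise_pct * current)
        profile.append(max(nxt, 0.0))  # elastance cannot be negative
    return profile


# Hypothetical starting elastance; the seed makes the run repeatable.
profile = simulate_ers_profile(25.0, noise_pct=0.05, seed=42)
```

Repeating the call with different seeds gives a cohort of virtual profiles, which is the Monte Carlo element the abstract refers to.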
Affiliation(s)
- Xin Wang
  - School of Engineering, Monash University Malaysia, Selangor, Malaysia
- Ean Hin Ooi
  - School of Engineering, Monash University Malaysia, Selangor, Malaysia
- Mohd Basri Mat Nor
  - Kulliyah of Medicine, International Islamic University Malaysia, Pahang, Malaysia
- Matthew E Cove
  - Division of Respiratory & Critical Care Medicine, Department of Medicine, National University Hospital, Singapore
- J Geoffrey Chase
  - Center of Bioengineering, University of Canterbury, Christchurch, New Zealand
8
Greenberg JK, Landman JM, Kelly MP, Pennicooke BH, Molina CA, Foraker RE, Ray WZ. Leveraging Artificial Intelligence and Synthetic Data Derivatives for Spine Surgery Research. Global Spine J 2023; 13:2409-2421. [PMID: 35373623 PMCID: PMC10538345 DOI: 10.1177/21925682221085535] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
Abstract
STUDY DESIGN Retrospective cohort study. OBJECTIVES Leveraging electronic health records (EHRs) for spine surgery research is impeded by concerns regarding patient privacy and data ownership. Synthetic data derivatives may help overcome these limitations. This study's objective was to validate the use of synthetic data for spine surgery research. METHODS Data came from the EHR from 15 hospitals. Patients that underwent anterior cervical or posterior lumbar fusion (2010-2020) were included. Real data were obtained from the EHR. Synthetic data was generated to simulate the properties of the real data, without maintaining a one-to-one correspondence with real patients. Within each cohort, ability to predict 30-day readmissions and 30-day complications was evaluated using logistic regression and extreme gradient boosting machines (XGBoost). RESULTS We identified 9,072 real and 9,088 synthetic cervical fusion patients. Descriptive characteristics were nearly identical between the 2 datasets. When predicting readmission, models built using real and synthetic data both had c-statistics of .69-.71 using logistic regression and XGBoost. Among 12,111 real and 12,126 synthetic lumbar fusion patients, descriptive characteristics were nearly the same for most variables. Using logistic regression and XGBoost to predict readmission, discrimination was similar with models built using real and synthetic data (c-statistics .66-.69). When predicting complications, models derived using real and synthetic data showed similar discrimination in both cohorts. Despite some differences, the most influential predictors were similar in the real and synthetic datasets. CONCLUSION Synthetic data replicate most descriptive and predictive properties of real data, and therefore may expand EHR research in spine surgery.
Affiliation(s)
- Jacob K. Greenberg
  - Departments of Neurological Surgery, Medicine and Orthopaedic Surgery, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Joshua M. Landman
  - Departments of Neurological Surgery, Medicine and Orthopaedic Surgery, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Brenton H. Pennicooke
  - Departments of Neurological Surgery, Medicine and Orthopaedic Surgery, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Camilo A. Molina
  - Departments of Neurological Surgery, Medicine and Orthopaedic Surgery, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Wilson Z. Ray
  - Departments of Neurological Surgery, Medicine and Orthopaedic Surgery, Washington University School of Medicine in St Louis, St Louis, MO, USA
9
Feigin E, Feigin L, Ingbir M, Ben-Bassat OK, Shepshelovich D. Rate of Correction and All-Cause Mortality in Patients With Severe Hypernatremia. JAMA Netw Open 2023; 6:e2335415. [PMID: 37768662 PMCID: PMC10539989 DOI: 10.1001/jamanetworkopen.2023.35415] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/15/2023] [Accepted: 08/18/2023] [Indexed: 09/29/2023]
Abstract
Importance Hypernatremia is common among hospitalized patients and is associated with high mortality rates. Current guidelines suggest avoiding fast correction rates but are not supported by robust data. Objective To investigate whether there is an association between hypernatremia correction rate and patient survival. Design, Setting, and Participants This retrospective cohort study examined data from all patients admitted to the Tel Aviv Medical Center between 2007 and 2021 who were diagnosed with severe hypernatremia (serum sodium ≥155 mmol/L) at admission or during hospitalization. Statistical analysis was performed from April 2022 to August 2023. Exposure Patients were grouped as having fast correction rates (>0.5 mmol/L/h) and slow correction rates (≤0.5 mmol/L/h) in accordance with current guidelines. Main Outcomes and Measures All-cause 30-day mortality. Results A total of 4265 patients were included in this cohort, of which 2621 (61.5%) were men and 343 (8.0%) had fast correction rates; the median (IQR) age at diagnosis was 78 (64-87) years. Slow correction was associated with higher 30-day mortality compared with fast correction (50.7% [1990 of 3922] vs 31.8% [109 of 343]; P < .001). These results remained significant after adjusting for demographics (age, gender), Charlson comorbidity index, initial sodium, potassium, and creatinine levels, hospitalization in an ICU, and severe hyperglycemia (adjusted odds ratio [aOR], 2.02 [95% CI, 1.55-2.62]), regardless of whether hypernatremia was hospital acquired (aOR, 2.19 [95% CI, 1.57-3.05]) or documented on admission (aOR, 1.64 [95% CI, 1.06-2.55]). There was a strong negative correlation between absolute sodium correction during the first 24 hours following the initial documentation of severe hypernatremia and 30-day mortality (Pearson correlation coefficient, -0.80 [95% CI, -0.93 to -0.50]; P < .001). 
Median (IQR) hospitalization length was shorter for fast correction vs slow correction rates (5.0 [2.1-14.9] days vs 7.2 [3.5-16.1] days; P < .001). Prevalence of neurological complications was comparable for both groups, and none were attributed to fast correction rates of hypernatremia. Conclusions and Relevance This cohort study of patients with severe hypernatremia found that rapid correction of hypernatremia was associated with shorter hospitalizations and significantly lower patient mortality without any signs of neurologic complications. These results suggest that physicians should consider the totality of evidence when considering the optimal rates of correction for patients with severe hypernatremia.
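The crude mortality figures quoted in this abstract can be checked directly from the reported counts. The sketch below recomputes the two 30-day mortality rates and the unadjusted odds ratio for slow versus fast correction. The unadjusted value is not reported in the abstract and is not expected to equal the adjusted odds ratio of 2.02, which comes from the authors' multivariable model. Variable names are illustrative.

```python
# Counts taken from the abstract: deaths and group sizes at 30 days.
slow_deaths, slow_total = 1990, 3922
fast_deaths, fast_total = 109, 343

slow_rate = slow_deaths / slow_total  # reported as 50.7%
fast_rate = fast_deaths / fast_total  # reported as 31.8%

# Unadjusted odds ratio: odds of death with slow correction
# divided by odds of death with fast correction.
slow_odds = slow_deaths / (slow_total - slow_deaths)
fast_odds = fast_deaths / (fast_total - fast_deaths)
crude_or = slow_odds / fast_odds

print(f"slow: {slow_rate:.1%}, fast: {fast_rate:.1%}, crude OR: {crude_or:.2f}")
```

The two group sizes also sum to the 4265 patients reported for the cohort.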
Affiliation(s)
- Eugene Feigin
  - Internal Medicine Division, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
  - Institute of Endocrinology, Metabolism and Hypertension, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
  - Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Libi Feigin
  - Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Merav Ingbir
  - Internal Medicine Division, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
  - Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
  - Nephrology Department, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Orit Kliuk Ben-Bassat
  - Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
  - Nephrology Department, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
- Daniel Shepshelovich
  - Internal Medicine Division, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel
  - Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
10
Mavrogenis AF, Scarlat MM. Artificial intelligence publications: synthetic data, patients, and papers. International Orthopaedics 2023; 47:1395-1396. [PMID: 37162553 DOI: 10.1007/s00264-023-05830-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 05/11/2023]
Affiliation(s)
- Andreas F Mavrogenis
  - First Department of Orthopaedics, National and Kapodistrian University of Athens, School of Medicine, Athens, Greece
11
Wilcox A. Understanding the opportunity and application of synthetic data in healthcare. Paediatr Perinat Epidemiol 2023; 37:301-302. [PMID: 36970808 DOI: 10.1111/ppe.12970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Accepted: 03/05/2023] [Indexed: 05/10/2023]
Affiliation(s)
- Adam Wilcox
  - Center for Applied Clinical Informatics, Institute for Informatics, School of Medicine, Washington University, St Louis, Missouri, USA
12
Davis SE, Ssemaganda H, Koola JD, Mao J, Westerman D, Speroff T, Govindarajulu US, Ramsay CR, Sedrakyan A, Ohno-Machado L, Resnic FS, Matheny ME. Simulating complex patient populations with hierarchical learning effects to support methods development for post-market surveillance. BMC Med Res Methodol 2023; 23:89. [PMID: 37041457 PMCID: PMC10088292 DOI: 10.1186/s12874-023-01913-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 04/04/2023] [Indexed: 04/13/2023]
Abstract
BACKGROUND Validating new algorithms, such as methods to disentangle intrinsic treatment risk from risk associated with experiential learning of novel treatments, often requires knowing the ground truth for data characteristics under investigation. Since the ground truth is inaccessible in real world data, simulation studies using synthetic datasets that mimic complex clinical environments are essential. We describe and evaluate a generalizable framework for injecting hierarchical learning effects within a robust data generation process that incorporates the magnitude of intrinsic risk and accounts for known critical elements in clinical data relationships. METHODS We present a multi-step data generating process with customizable options and flexible modules to support a variety of simulation requirements. Synthetic patients with nonlinear and correlated features are assigned to provider and institution case series. The probabilities of treatment and outcome assignment are associated with patient features based on user definitions. Risk due to experiential learning by providers and/or institutions when novel treatments are introduced is injected at various speeds and magnitudes. To further reflect real-world complexity, users can request missing values and omitted variables. We illustrate an implementation of our method in a case study using MIMIC-III data for reference patient feature distributions. RESULTS Realized data characteristics in the simulated data reflected specified values. Apparent deviations in treatment effects and feature distributions, though not statistically significant, were most common in small datasets (n < 3000) and attributable to random noise and variability in estimating realized values in small samples.
When learning effects were specified, synthetic datasets exhibited changes in the probability of an adverse outcome as cases accrued for the treatment group impacted by learning and stable probabilities as cases accrued for the treatment group not affected by learning. CONCLUSIONS Our framework extends clinical data simulation techniques beyond generation of patient features to incorporate hierarchical learning effects. This enables the complex simulation studies required to develop and rigorously test algorithms developed to disentangle treatment safety signals from the effects of experiential learning. By supporting such efforts, this work can help identify training opportunities, avoid unwarranted restriction of access to medical advances, and hasten treatment improvements.
Collapse
Affiliation(s)
- Sharon E Davis
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA.
| | - Henry Ssemaganda
- Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, 41 Mall Road, Burlington, MA, 01803, USA
| | - Jejo D Koola
- UC Health Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Dr. MC 0728, La Jolla, San Diego, CA, 92093-0728, USA
| | - Jialin Mao
- Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
| | - Dax Westerman
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
| | - Theodore Speroff
- Departments of Medicine and Biostatistics, Vanderbilt University Medical Center, 1313 21St Avenue South, Oxford House, Room 209, Nashville, TN, 37232, USA
| | - Usha S Govindarajulu
- Center for Biostatistics, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA
| | - Craig R Ramsay
- Health Services Research Unit, University of Aberdeen, Health Sciences Building, Foresterhill, 3rd Floor, Aberdeen, AB25 2ZD, UK
| | - Art Sedrakyan
- Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
| | - Lucila Ohno-Machado
- Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, New Haven, CT, 06510, USA
| | - Frederic S Resnic
- Division of Cardiovascular Medicine and Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, Tufts University School of Medicine, 41 Burlington Mall Road, Burlington, MA, 01805, USA
| | - Michael E Matheny
- Departments of Biomedical Informatics, Biostatistics, and Medicine, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
- Geriatric Research Education and Clinical Care Center, Tennessee Valley Healthcare System VA, 1310 24th Avenue South, Nashville, TN, 37212, USA
| |
|
13
|
Kepper MM, Walsh‐Bailey C, Prusaczyk B, Zhao M, Herrick C, Foraker R. The adoption of social determinants of health documentation in clinical settings. Health Serv Res 2023; 58:67-77. [PMID: 35862115 PMCID: PMC9836948 DOI: 10.1111/1475-6773.14039] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Indexed: 01/19/2023] Open
Abstract
OBJECTIVE To understand the frequency of social determinants of health (SDOH) diagnosis codes (Z-codes) within the electronic health record (EHR) for patients with prediabetes and diabetes and examine factors influencing the adoption of SDOH documentation in clinical care. DATA SOURCES EHR data and qualitative interviews with health care providers and stakeholders. STUDY DESIGN An explanatory sequential mixed methods design first examined the use of Z-codes within the EHR and qualitatively examined barriers to documenting SDOH. Data were integrated and interpreted using a joint display. This research was informed by the Framework for Dissemination and Utilization of Research for Health Care Policy and Practice. DATA COLLECTION/EXTRACTION METHODS We queried EHR data for patients with a hemoglobin A1c > 5.7 between October 1, 2015 and September 1, 2020 (n = 118,215) to examine the use of Z-codes and demographics and outcomes for patients with and without social needs. Semi-structured interviews were conducted with 23 participants (n = 15 health care providers; n = 7 billing and compliance stakeholders). The interview questions sought to understand how factors at the innovation-, individual-, organizational-, and environmental-level influence SDOH documentation. We used thematic analysis to analyze interview data. PRINCIPAL FINDINGS Patients with social needs were disproportionately older, female, Black, uninsured, living in low-income and high unemployment neighborhoods, and had a higher number of hospitalizations, obesity, prediabetes, and type 2 diabetes than those without a Z-code. Z-codes were not frequently used in the EHR (<1% of patients), and there was an overall lack of congruence between quantitative and qualitative results related to the prevalence of social needs. 
Providers faced barriers at multiple levels (e.g., individual-level: discomfort discussing social needs; organizational-level: limited time, competing priorities) for documenting SDOH and identified strategies to improve documentation. CONCLUSIONS Providers recognized the impact of SDOH on patient health and had positive perceptions of screening for and documenting social needs. Implementation strategies are needed to improve systematic documentation.
Affiliation(s)
- Maura M. Kepper
- Prevention Research Center, Brown School, Washington University in St. Louis, St. Louis, Missouri, USA
- Institute for Public Health, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Callie Walsh‐Bailey
- Prevention Research Center, Brown School, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Beth Prusaczyk
- Institute for Public Health, Washington University in St. Louis, St. Louis, Missouri, USA
- Institute for Informatics, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Min Zhao
- Institute for Informatics, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Cynthia Herrick
- Institute for Public Health, Washington University in St. Louis, St. Louis, Missouri, USA
- Division of Endocrinology, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Randi Foraker
- Institute for Public Health, Washington University in St. Louis, St. Louis, Missouri, USA
- Institute for Informatics, Washington University School of Medicine, St. Louis, Missouri, USA
- Division of General Medical Sciences, Department of Medicine, Washington University School of Medicine, St. Louis, Missouri, USA
| |
|
14
|
Benzakour A, Altsitzioglou P, Lemée JM, Ahmad A, Mavrogenis AF, Benzakour T. Artificial intelligence in spine surgery. International Orthopaedics 2023; 47:457-465. [PMID: 35902390 DOI: 10.1007/s00264-022-05517-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 06/29/2022] [Accepted: 07/11/2022] [Indexed: 01/28/2023]
Abstract
The continuous progress of research and clinical trials has offered a wide variety of information concerning the spine and the treatment of the different spinal pathologies that may occur. Planning the best therapy for each patient could be a very difficult and challenging task as it often requires thorough processing of the patient's history and individual characteristics by the clinician. Clinicians and researchers also face problems when it comes to data availability due to patients' personal information protection policies. Artificial intelligence refers to the reproduction of human intelligence via special programs and computers that are trained in a way that simulates human cognitive functions. Artificial intelligence implementations to daily clinical practice such as surgical robots that facilitate spine surgery and reduce radiation dosage to medical staff, special algorithms that can predict the possible outcomes of conservative versus surgical treatment in patients with low back pain and disk herniations, and systems that create artificial populations with great resemblance and similar characteristics to real patients are considered to be a novel breakthrough in modern medicine. To enhance the body of the related literature and inform the readers on the clinical applications of artificial intelligence, we performed this review to discuss the contribution of artificial intelligence in spine surgery and pathology.
Affiliation(s)
- Ahmed Benzakour
- Centre Orléanais du Dos - Pôle Santé Oréliance, Saran, France
| | - Pavlos Altsitzioglou
- First Department of Orthopaedics, National and Kapodistrian University of Athens, School of Medicine, Athens, Greece
| | - Jean Michel Lemée
- Department of Neurosurgery, University Hospital of Angers, Angers, France
| | | | - Andreas F Mavrogenis
- First Department of Orthopaedics, National and Kapodistrian University of Athens, School of Medicine, Athens, Greece.
| | | |
|
15
|
Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022; 13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] Open
Abstract
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Yao Yan
- Sage Bionetworks, Seattle, WA, USA
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Larsson Omberg
- Sage Bionetworks, Seattle, WA, USA
| | - Justin Guinney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA; Tempus Labs, Chicago, IL, USA
| | - Sean D. Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Bradley A. Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
| |
|
16
|
El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022; 5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND One of the increasingly accepted methods to evaluate the privacy of synthetic data is to measure the risk of membership disclosure. This risk is expressed as the F1 accuracy with which an adversary can correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and it is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. RESULTS The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS Our proposed parameterization, as well as interpretation and generative model training guidance, provide a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
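As a rough illustration of the F1 score this abstract refers to, the sketch below scores a simulated adversary's membership guesses against ground truth. It is a minimal sketch under stated assumptions, not the authors' attack or partitioning procedure: the function name `membership_f1` and the boolean-list inputs are hypothetical, and the step that produces the adversary's guesses is left out.

```python
def membership_f1(predicted_member, actual_member):
    """F1 score of an adversary's membership guesses.

    predicted_member[i] is True when the adversary claims record i was in
    the training data; actual_member[i] is the ground truth. Both names are
    hypothetical and used only for this illustration.
    """
    if len(predicted_member) != len(actual_member):
        raise ValueError("inputs must have the same length")

    pairs = list(zip(predicted_member, actual_member))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)

    # With no true positives, precision and recall are zero or undefined;
    # report 0.0 by convention.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    guesses = [True, True, False, False]
    truth = [True, False, True, False]
    print(membership_f1(guesses, truth))
```

The abstract's main point concerns how the attack dataset is assembled (the share of training records in it should match the sampling fraction), which this scoring step does not address.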
Affiliation(s)
- Khaled El Emam
- Corresponding Author: Khaled El Emam, PhD, Research Institute, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada
| | - Lucy Mosquera
- Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada; Research Institute, Children’s Hospital of Eastern Ontario, Ottawa, Ontario, Canada
| | - Xi Fang
- Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
| |
|
17
|
Meeker D, Kallem C, Heras Y, Garcia S, Thompson C. Case report: evaluation of an open-source synthetic data platform for simulation studies. JAMIA Open 2022; 5:ooac067. [PMID: 35958672 PMCID: PMC9360775 DOI: 10.1093/jamiaopen/ooac067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 05/10/2022] [Accepted: 07/28/2022] [Indexed: 11/12/2022] Open
Abstract
Simulation is a mainstay of comparative- and cost-effectiveness research when empirical data are not available. The Synthea platform, originally designed for generating realistically coded longitudinal health records for software testing, implements data generation models specified in publicly contributed modules representing patients’ life cycle and disease and treatment progression. We test the hypothesis that Synthea can be used for simulation studies that draw parameters from observational studies and randomized trials. We benchmarked the results and assessed the effort required to create a Synthea module that replicates a recently published cost-effectiveness simulation comparing levofloxacin prophylaxis to usual care for leukemia. A module was iteratively developed using published parameters from the original study; we replicated the initial conditions and simulation endpoints of demographics, health events, costs, and mortality. We compare Synthea’s Generic Module Framework to platforms designed for simulation and show that Synthea can be used, with modifications, for some types of simulation studies.
Affiliation(s)
- Daniella Meeker
- Keck School of Medicine, University of Southern California, Los Angeles, California, USA
| | - Crystal Kallem
- Clinovations Government+Health, Washington, District of Columbia, USA
| | - Yan Heras
- Optimum eHealth, LLC, Irvine, California, USA
| | - Stephanie Garcia
- Office of the National Coordinator for Health Information Technology, Washington, District of Columbia, USA
| | - Casey Thompson
- Corresponding Author: Casey Thompson, MSN, RN-BC, Clinovations Government+Health, 1325 G Street, NW, Suite 500, Washington, DC 20005, USA
| |
|
18
|
Be'er M, Amirav I, Cahal M, Rochman M, Lior Y, Rimon A, Lavy RG, Lavie M. Unforeseen changes in seasonality of pediatric respiratory illnesses during the first COVID-19 pandemic year. Pediatr Pulmonol 2022; 57:1425-1431. [PMID: 35307986 PMCID: PMC9088630 DOI: 10.1002/ppul.25896] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 03/01/2022] [Accepted: 03/10/2022] [Indexed: 11/29/2022]
Abstract
OBJECTIVES To investigate whether the three nationwide coronavirus disease 2019 (COVID-19) lockdowns imposed in Israel during the full first pandemic year altered the traditional seasonality of pediatric respiratory healthcare utilization. METHODS Month-by-month pediatric emergency department (ED) visits and hospitalizations for respiratory diagnoses during the first full COVID-19 year were compared to those recorded for the six consecutive years preceding the pandemic. Data were collected from the patients' electronic files by utilizing a data extraction platform (MDClone©). RESULTS A significant decline of 40% in respiratory ED visits and 54%-73% in respiratory hospitalizations during the first COVID-19 year compared with the pre-COVID-19 years was observed (p < 0.001 and p < 0.001, respectively). The rate of respiratory ED visits out of the total monthly visits, mostly for asthma, peaked during June 2020, compared with preceding years (109 [5.9%] versus 88 [3.9%] visits; p < 0.001). This peak occurred 2 weeks after the lifting of the first lockdown, resembling the "back-to-school asthma" phenomenon of September. CONCLUSIONS This study demonstrates important changes in the seasonality of pediatric respiratory illnesses during the first COVID-19 year, including a new "back-from-lockdown" asthma peak. These dramatic changes along with the recent resurgence of respiratory diseases may indicate the beginnings of altered seasonality in pediatric pulmonary pathologies as collateral damage of the pandemic.
Affiliation(s)
- Moria Be'er
- Pediatric Pulmonology Unit, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Israel Amirav
- Pediatric Pulmonology Unit, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Michal Cahal
- Pediatric Pulmonology Unit, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Mika Rochman
- Pediatric Pulmonology Unit, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Yotam Lior
- Division of Anesthesia, Intensive Care, and Pain Medicine, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Ayelet Rimon
- Department of Pediatric Emergency, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Roni G Lavy
- Department of Pediatric, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Moran Lavie
- Pediatric Pulmonology Unit, Tel-Aviv Sourasky Medical Center, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| |
|
19
|
Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022; 29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022] Open
Abstract
OBJECTIVE This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A Thomas
- Corresponding Author: Jason A. Thomas, PhD, Philips North America, LLC, 22100 Bothell Everett Hwy, Bothell, WA 98021, USA
| | - Randi E Foraker
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA; School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
| | | | - Jon D Morrow
- MDClone Ltd., Be’er Sheva, Israel; Department of Obstetrics and Gynecology, New York University Grossman School of Medicine, New York, New York, USA
| | - Philip R O Payne
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA; School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Adam B Wilcox
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA; School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
| | | |
|
20
|
El Emam K, Mosquera L, Fang X, El-Hussuna A. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022; 10:e35734. [PMID: 35389366 PMCID: PMC9030990 DOI: 10.2196/35734] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/15/2021] [Revised: 01/27/2022] [Accepted: 02/13/2022] [Indexed: 01/06/2023] Open
Abstract
Background A regular task by developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. Results The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. Conclusions This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
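To make the Hellinger distance named in this abstract concrete, here is a minimal sketch of its simplest form: the distance between the empirical distributions of one categorical variable in a real and a synthetic sample. This is an assumption-laden simplification, not the study's metric, which is multivariate and built on a Gaussian copula representation of the joint distributions. The function name `hellinger_distance` is hypothetical.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Univariate illustration only (hypothetical helper); it does not
    implement the copula-based multivariate metric from the study.
    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")

    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Sum squared differences of square-root probabilities over all
    # categories seen in either sample (missing categories count as 0).
    total = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        total += (math.sqrt(p) - math.sqrt(q)) ** 2

    return math.sqrt(total) / math.sqrt(2)


if __name__ == "__main__":
    real = ["A", "A", "B", "B"]
    synthetic = ["A", "B", "B", "B"]
    print(hellinger_distance(real, synthetic))
```

A smaller value means the synthetic column tracks the real one more closely. Ranking generation methods as the abstract describes would require the multivariate version computed over whole data sets.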
Affiliation(s)
- Khaled El Emam
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada; Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada; Replica Analytics Ltd, Ottawa, ON, Canada
| | - Lucy Mosquera
- Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada; Replica Analytics Ltd, Ottawa, ON, Canada
| | - Xi Fang
- Replica Analytics Ltd, Ottawa, ON, Canada
| | | |
|
21
|
Guo A, Foraker RE, MacGregor RM, Masood FM, Cupps BP, Pasque MK. The Use of Synthetic Electronic Health Record Data and Deep Learning to Improve Timing of High-Risk Heart Failure Surgical Intervention by Predicting Proximity to Catastrophic Decompensation. Front Digit Health 2021; 2:576945. [PMID: 34713050 PMCID: PMC8521851 DOI: 10.3389/fdgth.2020.576945] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/27/2020] [Accepted: 11/13/2020] [Indexed: 12/24/2022] Open
Abstract
Objective: Although many clinical metrics are associated with proximity to decompensation in heart failure (HF), none are individually accurate enough to risk-stratify HF patients on a patient-by-patient basis. The dire consequences of this inaccuracy in risk stratification have profoundly lowered the clinical threshold for application of high-risk surgical intervention, such as ventricular assist device placement. Machine learning can detect non-intuitive classifier patterns that allow for innovative combination of patient feature predictive capability. A machine learning-based clinical tool to identify proximity to catastrophic HF deterioration on a patient-specific basis would enable more efficient direction of high-risk surgical intervention to those patients who have the most to gain from it, while sparing others. Synthetic electronic health record (EHR) data are statistically indistinguishable from the original protected health information, and can be analyzed as if they were original data but without any privacy concerns. We demonstrate that synthetic EHR data can be easily accessed and analyzed and are amenable to machine learning analyses. Methods: We developed synthetic data from EHR data of 26,575 HF patients admitted to a single institution during the decade ending on 12/31/2018. Twenty-seven clinically-relevant features were synthesized and utilized in supervised deep learning and machine learning algorithms (i.e., deep neural networks [DNN], random forest [RF], and logistic regression [LR]) to explore their ability to predict 1-year mortality by five-fold cross validation methods. We conducted analyses leveraging features from prior to/at and after/at the time of HF diagnosis. Results: The area under the receiver operating curve (AUC) was used to evaluate the performance of the three models: the mean AUC was 0.80 for DNN, 0.72 for RF, and 0.74 for LR. 
Age, creatinine, body mass index, and blood pressure levels were especially important features in predicting death within 1-year among HF patients. Conclusions: Machine learning models have considerable potential to improve accuracy in mortality prediction, such that high-risk surgical intervention can be applied only in those patients who stand to benefit from it. Access to EHR-based synthetic data derivatives eliminates risk of exposure of EHR data, speeds time-to-insight, and facilitates data sharing. As more clinical, imaging, and contractile features with proven predictive capability are added to these models, the development of a clinical tool to assist in timing of intervention in surgical candidates may be possible.
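The AUC values reported in this abstract can be read as the probability that a model scores a randomly chosen patient who died higher than a randomly chosen survivor. The sketch below computes AUC directly from that pairwise definition. It is a minimal illustration, not the authors' evaluation pipeline: the function name `auc_from_scores` and the toy inputs are hypothetical, and the cross-validation loop and models are omitted.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted risk for each patient (higher means higher risk).
    labels: 1 if the outcome occurred, 0 otherwise.
    Hypothetical helper for illustration; quadratic in sample size.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5

    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    risk = [0.9, 0.8, 0.3, 0.1]
    died = [1, 0, 1, 0]
    print(auc_from_scores(risk, died))
```

In a five-fold cross-validation such as the one described, this quantity would be computed on each held-out fold and then averaged.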
Affiliation(s)
- Aixia Guo
- Institute for Informatics (I2), Washington University School of Medicine, St. Louis, MO, United States
| | - Randi E Foraker
- Institute for Informatics (I2), Washington University School of Medicine, St. Louis, MO, United States; Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, United States
| | - Robert M MacGregor
- Department of Surgery, Washington University School of Medicine, St. Louis, MO, United States
| | - Faraz M Masood
- Department of Surgery, Washington University School of Medicine, St. Louis, MO, United States
| | - Brian P Cupps
- Department of Surgery, Washington University School of Medicine, St. Louis, MO, United States
| | - Michael K Pasque
- Department of Surgery, Washington University School of Medicine, St. Louis, MO, United States
| |
|
22
|
Foraker R, Guo A, Thomas J, Zamstein N, Payne PR, Wilcox A. The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data. J Med Internet Res 2021; 23:e30697. [PMID: 34559671 PMCID: PMC8491642 DOI: 10.2196/30697] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/03/2021] [Revised: 08/24/2021] [Accepted: 09/12/2021] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Computationally derived ("synthetic") data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic. OBJECTIVE We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes. METHODS We used the National COVID Cohort Collaborative's instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19-positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19-related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data. RESULTS For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts. 
CONCLUSIONS This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.
Affiliation(s)
- Randi Foraker
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, United States
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, United States
| | - Aixia Guo
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, United States
| | - Jason Thomas
- Department of Biomedical and Medical Education, School of Medicine, University of Washington, Seattle, WA, United States
| | | | - Philip Ro Payne
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, United States
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, United States
| | - Adam Wilcox
- Department of Biomedical and Medical Education, School of Medicine, University of Washington, Seattle, WA, United States
| |
|
23
|
Guo A, Mazumder NR, Ladner DP, Foraker RE. Predicting mortality among patients with liver cirrhosis in electronic health records with machine learning. PLoS One 2021; 16:e0256428. [PMID: 34464403 PMCID: PMC8407576 DOI: 10.1371/journal.pone.0256428] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/01/2020] [Accepted: 08/08/2021] [Indexed: 02/06/2023] Open
Abstract
OBJECTIVE Liver cirrhosis is a leading cause of death and affects millions of people in the United States. Early mortality prediction among patients with cirrhosis might give healthcare providers more opportunity to effectively treat the condition. We hypothesized that laboratory test results and other related diagnoses would be associated with mortality in this population. Our other assumption was that a deep learning model could outperform the current Model for End-Stage Liver Disease (MELD) score in predicting mortality. MATERIALS AND METHODS We utilized electronic health record data from 34,575 patients with a diagnosis of cirrhosis from a large medical center to study associations with mortality. Three time windows of mortality (365 days, 180 days, and 90 days) and two cases with different numbers of variables (all 41 available variables and the 4 variables in MELD-Na) were studied. Missing values were imputed using multiple imputation for continuous variables and the mode for categorical variables. Deep learning and machine learning algorithms, i.e., deep neural networks (DNN), random forest (RF), and logistic regression (LR), were employed to study the associations between baseline features such as laboratory measurements and diagnoses for each time window using 5-fold cross-validation. Metrics such as area under the receiver operating characteristic curve (AUC), overall accuracy, sensitivity, and specificity were used to evaluate models. RESULTS Models comprising all variables outperformed those with the 4 MELD-Na variables for all prediction cases, and the DNN model outperformed the LR and RF models. For example, the DNN model achieved AUCs of 0.88, 0.86, and 0.85 for 90-, 180-, and 365-day mortality, respectively, compared with the MELD score, which yielded corresponding AUCs of 0.81, 0.79, and 0.76. The DNN and LR models had a significantly better F1 score compared to MELD at all time points examined.
CONCLUSION Alkaline phosphatase, alanine aminotransferase, and hemoglobin were also among the most informative features, in addition to the 4 MELD-Na variables. Machine learning and deep learning models outperformed the current standard of risk prediction among patients with cirrhosis. Advanced informatics techniques showed promise for risk prediction in patients with cirrhosis.
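The AUC used above to compare the models with the MELD score can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 outcomes (1 = event, e.g. death within the window).
    scores: iterable of risk scores; higher means higher predicted risk.
    Tied positive/negative pairs count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three events, three non-events.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 7 of 9 pairs ranked correctly
```

The quadratic pairwise loop is chosen for readability; for cohorts of the size described above, a rank-based or library implementation would be more practical.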
Affiliation(s)
- Aixia Guo
- Institute for Informatics (I2), Washington University School of Medicine, St. Louis, MO, United States of America
| | - Nikhilesh R. Mazumder
- Division of Gastroenterology, Northwestern Memorial Hospital, Chicago, IL, United States of America
- Northwestern University Transplant Outcomes Research Collaborative (NUTORC), Comprehensive Transplant Center, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States of America
| | - Daniela P. Ladner
- Northwestern University Transplant Outcomes Research Collaborative (NUTORC), Comprehensive Transplant Center, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States of America
- Division of Transplant, Department of Surgery, Northwestern Medicine, Chicago, IL, United States of America
| | - Randi E. Foraker
- Institute for Informatics (I2), Washington University School of Medicine, St. Louis, MO, United States of America
- Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, United States of America
| |
|
24
|
Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). medRxiv 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
OBJECTIVE To evaluate whether synthetic data derived from a national COVID-19 data set could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS Using an original data set (n=1,854,968 SARS-CoV-2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip-code level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data sets was statistically and qualitatively evaluated. RESULTS In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean=2.9±2.4; max=16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n=171) and for all unsuppressed zip codes (n=5,819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived data sets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
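The label suppression described above can be illustrated with a simple count threshold. This is a sketch under stated assumptions, not the platform's actual rule: the abstract does not specify how suppression is decided, so the function name `suppress_rare_zip_codes`, the `min_count` threshold, the `SUPPRESSED` placeholder, and the toy records are all hypothetical.

```python
from collections import Counter


def suppress_rare_zip_codes(zip_codes, min_count, placeholder="SUPPRESSED"):
    """Replace zip-code labels that occur fewer than min_count times.

    The records themselves are kept; only the label is masked, so
    aggregate counts stay intact while sparsely populated areas can
    no longer be identified by zip code.
    """
    counts = Counter(zip_codes)
    return [z if counts[z] >= min_count else placeholder for z in zip_codes]


# Hypothetical toy data: one densely tested and two sparsely tested zip codes.
toy_records = ["63110"] * 5 + ["98195"] * 2 + ["60611"]
masked = suppress_rare_zip_codes(toy_records, min_count=3)

# Unique real labels before and after masking (placeholder excluded).
unique_before = len(set(toy_records))
unique_after = len(set(masked) - {"SUPPRESSED"})
```

Masking labels rather than dropping rows is one way to see why population-level analyses can stay close to the original while zip-code-level analyses of sparsely tested areas lose utility.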
Affiliation(s)
- Jason A. Thomas
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
| | - Randi E. Foraker
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | | | - Philip R.O. Payne
- Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
- Institute for Informatics, School of Medicine, Washington University in St. Louis, St. Louis, MO, USA
| | - Adam B. Wilcox
- Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA
- UW Medicine, Seattle, WA, USA
| | | |
|
25
|
Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 2021; 11:e043497. [PMID: 33863713 PMCID: PMC8055130 DOI: 10.1136/bmjopen-2020-043497] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 08/06/2020] [Revised: 01/14/2021] [Accepted: 03/18/2021] [Indexed: 11/03/2022]
Abstract
OBJECTIVES There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data. SETTING Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method. PARTICIPANTS The analysis included 1543 patients in the control arm. PRIMARY AND SECONDARY OUTCOME MEASURES Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets. RESULTS Analysis results were similar between the real and synthetic datasets. The univariate distributions differed by less than 1% on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally, and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1). CONCLUSIONS The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets. TRIAL REGISTRATION NUMBER NCT00079274.
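The abstract does not state how percentage CI overlap is computed. A commonly used definition averages the shared length of the two intervals relative to each interval's own length; the sketch below assumes that definition, applied directly on the hazard ratio scale. The function name `ci_overlap` is hypothetical, and clipping non-overlapping intervals to zero is a choice made here for simplicity.

```python
def ci_overlap(ci_a, ci_b):
    """Average fractional overlap of two confidence intervals.

    ci_a, ci_b: (lower, upper) tuples.
    Returns a value in [0, 1]; 0 means the intervals do not overlap,
    1 means they are identical.
    """
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    shared = min(hi_a, hi_b) - max(lo_a, lo_b)
    if shared <= 0:
        return 0.0  # assumption: disjoint intervals are clipped to zero
    return 0.5 * (shared / (hi_a - lo_a) + shared / (hi_b - lo_b))


# Intervals quoted in the abstract (real data vs synthetic data).
overall_survival = ci_overlap((1.11, 2.2), (1.44, 2.87))
disease_free_survival = ci_overlap((1.18, 1.95), (1.26, 2.1))
```

Under this assumed definition the two quoted interval pairs give roughly 0.61 and 0.86, in line with the 61% and 86% reported above, although that agreement does not by itself confirm the authors' exact procedure.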
Affiliation(s)
- Zahra Azizi
- Center for Outcomes Research and Evaluation, Faculty of Medicine, McGill University, Montreal, Québec, Canada
| | - Chaoyi Zheng
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Lucy Mosquera
- Data Science, Replica Analytics Ltd, Ottawa, Ontario, Canada
| | - Louise Pilote
- Medicine, McGill University, Montreal, Québec, Canada
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, Québec, Canada
| | - Khaled El Emam
- Electronic Health Information Laboratory, Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
| |
|