Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Foraker RE, Yu SC, Gupta A, Michelson AP, Pineda Soto JA, Colvin R, Loh F, Kollef MH, Maddox T, Evanoff B, Dror H, Zamstein N, Lai AM, Payne PRO. Spot the difference: comparing results of analyses from real patient data and synthetic derivatives. JAMIA Open 2020;3:557-566. [PMID: 33623891 PMCID: PMC7886551 DOI: 10.1093/jamiaopen/ooaa060] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 08/14/2020] [Revised: 10/14/2020] [Accepted: 10/20/2020] [Indexed: 12/19/2022] Open

Cited by Other Article(s)

Adam D. Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted? Proc Natl Acad Sci U S A 2024;121:e2414310121. [PMID: 39083423 DOI: 10.1073/pnas.2414310121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/02/2024] Open

Murcia VM, Aggarwal V, Pesaladinne N, Thammineni R, Do N, Alterovitz G, Fricks RB. Automating Clinical Trial Matches Via Natural Language Processing of Synthetic Electronic Health Records and Clinical Trial Eligibility Criteria. AMIA Jt Summits Transl Sci Proc 2024;2024:125-134. [PMID: 38827083 PMCID: PMC11141802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]

El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024;14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] Open

Abstract

Synthetic data generation is being increasingly used as a privacy-preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
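
The combining step mentioned in this abstract can be illustrated with a short sketch. The abstract does not say which formulas were applied, so the code below assumes the commonly cited combining rules for fully synthetic data (Raghunathan, Reiter, and Rubin), where the pooled estimate is the mean of the per-dataset estimates and the total variance is (1 + 1/m) * b - v_bar. The function name and the example numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Pool one model coefficient across m synthetic datasets.

    Assumed rule (not confirmed by the abstract):
        q_bar = mean of the m point estimates
        b     = between-dataset variance of the estimates (divisor m - 1)
        v_bar = mean of the m within-dataset variances
        total = (1 + 1/m) * b - v_bar
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and one variance per estimate")
    q_bar = mean(estimates)
    b = variance(estimates)   # sample variance, divisor m - 1
    v_bar = mean(variances)
    total = (1 + 1 / m) * b - v_bar
    # This estimator can come out negative for small m; handling that case
    # is left to the caller in this sketch.
    return q_bar, b, v_bar, total


# Hypothetical log-odds estimates and squared standard errors from m = 5 datasets.
q_bar, b, v_bar, total = combine_fully_synthetic(
    [0.9, 1.1, 1.0, 1.2, 0.8], [0.01, 0.01, 0.01, 0.01, 0.01]
)
```

Because the total variance subtracts the average within-dataset variance, a single synthetic dataset (m = 1) cannot be pooled this way, which is consistent with the abstract's warning about analysing one dataset without combining rules.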

Lun R, Siegal D, Ramsay T, Stotts G, Dowlatshahi D. Synthetic data in cancer and cerebrovascular disease research: A novel approach to big data. PLoS One 2024;19:e0295921. [PMID: 38324588 PMCID: PMC10849264 DOI: 10.1371/journal.pone.0295921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Accepted: 12/01/2023] [Indexed: 02/09/2024] Open

Abstract

OBJECTIVES

Synthetic datasets are artificially manufactured based on real health systems data but do not contain real patient information. We sought to validate the use of synthetic data in stroke and cancer research by conducting a comparison study of cancer patients with ischemic stroke to non-cancer patients with ischemic stroke.

DESIGN

Retrospective cohort study.

SETTING

We used synthetic data generated by MDClone and compared it to its original source data (i.e. real patient data from the Ottawa Hospital Data Warehouse).

OUTCOME MEASURES

We compared key differences in demographics, treatment characteristics, length of stay, and costs between cancer patients with ischemic stroke and non-cancer patients with ischemic stroke. We used a binary, multivariable logistic regression model to identify risk factors for recurrent stroke in the cancer population.

RESULTS

Using synthetic data, we found cancer patients with ischemic stroke had a lower prevalence of hypertension (52.0% in the cancer cohort vs 57.7% in the non-cancer cohort, p<0.0001), and a higher prevalence of chronic obstructive pulmonary disease (COPD: 8.5% vs 4.7%, p<0.0001), prior ischemic stroke (1.7% vs 0.1%, p<0.001), and prior venous thromboembolism (VTE: 8.2% vs 1.5%, p<0.0001). They also had a longer length of stay (8 days [IQR 3-16] vs 6 days [IQR 3-13], p = 0.011), and higher costs associated with their stroke encounters: $11,498 (IQR $4,440-$20,668) in the cancer cohort vs $8,084 (IQR $3,947-$16,706) in the non-cancer cohort (p = 0.0061). A multivariable logistic regression model identified 5 predictors for recurrent ischemic stroke in the cancer cohort using synthetic data; 3 of the same predictors were identified using real patient data, with similar effect measures. Summary statistics between synthetic and original datasets did not significantly differ, other than slight differences in the distributions of frequencies for numeric data.

CONCLUSION

We demonstrated the utility of synthetic data in stroke and cancer research and provided key differences between cancer and non-cancer patients with ischemic stroke. Synthetic data is a powerful tool that can allow researchers to easily explore hypothesis generation, enable data sharing without privacy breaches, and ensure broad access to big data in a rapid, safe, and reliable fashion.

Moore JH, Li X, Chang JH, Tatonetti NP, Theodorescu D, Chen Y, Asselbergs FW, Venkatesan M, Wang ZP. SynTwin: A graph-based approach for predicting clinical outcomes using digital twins derived from synthetic patients. Pac Symp Biocomput 2024;29:96-107. [PMID: 38160272 PMCID: PMC10827004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]

Abstract

The concept of a digital twin came from the engineering, industrial, and manufacturing domains to create virtual objects or machines that could inform the design and development of real objects. This idea is appealing for precision medicine where digital twins of patients could help inform healthcare decisions. We have developed a methodology for generating and using digital twins for clinical outcome prediction. We introduce a new approach that combines synthetic data and network science to create digital twins (i.e. SynTwin) for precision medicine. First, our approach starts by estimating the distance between all subjects based on their available features. Second, the distances are used to construct a network with subjects as nodes and edges defining distance less than the percolation threshold. Third, communities or cliques of subjects are defined. Fourth, a large population of synthetic patients is generated using a synthetic data generation algorithm that models the correlation structure of the data to generate new patients. Fifth, digital twins are selected from the synthetic patient population that are within a given distance defining a subject community in the network. Finally, we compare and contrast community-based prediction of clinical endpoints using real subjects, digital twins, or both within and outside of the community. Key to this approach are the digital twins defined using patient similarity that represent hypothetical unobserved patients with patterns similar to nearby real patients as defined by network distance and community structure. We apply our SynTwin approach to predicting mortality in a population-based cancer registry (n=87,674) from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). Our results demonstrate that nearest network neighbor prediction of mortality in this study is significantly improved with digital twins (AUROC=0.864, 95% CI=0.857-0.872) over just using real data alone (AUROC=0.791, 95% CI=0.781-0.800). These results suggest a network-based digital twin strategy using synthetic patients may add value to precision medicine efforts.

Alloza C, Knox B, Raad H, Aguilà M, Coakley C, Mohrova Z, Boin É, Bénard M, Davies J, Jacquot E, Lecomte C, Fabre A, Batech M. A Case for Synthetic Data in Regulatory Decision-Making in Europe. Clin Pharmacol Ther 2023;114:795-801. [PMID: 37441734 DOI: 10.1002/cpt.3001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2022] [Accepted: 07/05/2023] [Indexed: 07/15/2023]

Ang CYS, Chiew YS, Wang X, Ooi EH, Nor MBM, Cove ME, Chase JG. Virtual patient with temporal evolution for mechanical ventilation trial studies: A stochastic model approach. Comput Methods Programs Biomed 2023;240:107728. [PMID: 37531693 DOI: 10.1016/j.cmpb.2023.107728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 06/27/2023] [Accepted: 07/19/2023] [Indexed: 08/04/2023]

Abstract

BACKGROUND AND OBJECTIVE

Healthcare datasets are plagued by issues of data scarcity and class imbalance. Clinically validated virtual patient (VP) models can provide accurate in-silico representations of real patients and thus a means for synthetic data generation in hospital critical care settings. This research presents a realistic, time-varying mechanically ventilated respiratory failure VP profile synthesised using a stochastic model.

METHODS

A stochastic model was developed using respiratory elastance (Ers) data from two clinical cohorts and averaged over 30-minute time intervals. The stochastic model was used to generate future Ers data based on current Ers values with added normally distributed random noise. Self-validation of the VPs was performed via Monte Carlo simulation and retrospective Ers profile fitting. A stochastic VP cohort of temporal Ers evolution was synthesised and then compared to independent retrospective patient cohort data in a virtual trial across several measured patient responses, where similarity of profiles validates the realism of stochastic model generated VP profiles.

RESULTS

A total of 120,000 3-hour VPs for pressure control (PC) and volume control (VC) ventilation modes are generated using stochastic simulation. Optimisation of the stochastic simulation process yields an ideal noise percentage of 5-10% and a simulation count of 200,000 iterations, allowing the simulation of a realistic and diverse set of Ers profiles. Results of self-validation show the retrospective Ers profiles were able to be recreated accurately with a mean squared error of only 0.099 [0.009-0.790]% for the PC cohort and 0.051 [0.030-0.126]% for the VC cohort. A virtual trial demonstrates the ability of the stochastic VP cohort to capture Ers trends within and beyond the retrospective patient cohort, providing cohort-level validation.

CONCLUSION

VPs capable of temporal evolution demonstrate feasibility for use in designing, developing, and optimising bedside MV guidance protocols through in-silico simulation and validation. Overall, the temporal VPs developed using stochastic simulation alleviate the need for lengthy, resource intensive, high cost clinical trials, while facilitating statistically robust virtual trials, ultimately leading to improved patient care and outcomes in mechanical ventilation.
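
The idea of generating future Ers values from the current value plus normally distributed noise, as described in the Methods of this abstract, can be sketched as a simple random walk. This is only an illustration under stated assumptions: the noise is taken to be multiplicative with a standard deviation equal to the noise percentage, each step is 30 minutes, and the function name, starting value, and default parameters are hypothetical. The published model may condition on the data in a different way.

```python
import random


def simulate_ers_profile(ers_start, hours=3.0, step_minutes=30,
                         noise_fraction=0.05, seed=None):
    """Return a list of Ers values, one per time step, starting at ers_start.

    Assumption: next value = current value * (1 + noise), where noise is
    drawn from a normal distribution with mean 0 and standard deviation
    noise_fraction (0.05 corresponds to 5% noise).
    """
    rng = random.Random(seed)
    n_steps = int(hours * 60 / step_minutes)
    profile = [ers_start]
    for _ in range(n_steps):
        noise = rng.gauss(0.0, noise_fraction)
        next_value = profile[-1] * (1.0 + noise)
        # Elastance cannot be negative, so the value is floored at zero.
        profile.append(max(next_value, 0.0))
    return profile


# One hypothetical 3-hour virtual patient with 30-minute steps.
profile = simulate_ers_profile(25.0, seed=42)
```

A 3-hour profile with 30-minute steps has six transitions, so the returned list holds seven values including the starting one. Fixing the seed makes a profile reproducible, which helps when comparing noise settings.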

Greenberg JK, Landman JM, Kelly MP, Pennicooke BH, Molina CA, Foraker RE, Ray WZ. Leveraging Artificial Intelligence and Synthetic Data Derivatives for Spine Surgery Research. Global Spine J 2023;13:2409-2421. [PMID: 35373623 PMCID: PMC10538345 DOI: 10.1177/21925682221085535] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022] Open

Feigin E, Feigin L, Ingbir M, Ben-Bassat OK, Shepshelovich D. Rate of Correction and All-Cause Mortality in Patients With Severe Hypernatremia. JAMA Netw Open 2023;6:e2335415. [PMID: 37768662 PMCID: PMC10539989 DOI: 10.1001/jamanetworkopen.2023.35415] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/15/2023] [Accepted: 08/18/2023] [Indexed: 09/29/2023] Open

Abstract

Importance

Hypernatremia is common among hospitalized patients and is associated with high mortality rates. Current guidelines suggest avoiding fast correction rates but are not supported by robust data.

Objective

To investigate whether there is an association between hypernatremia correction rate and patient survival.

Design, Setting, and Participants

This retrospective cohort study examined data from all patients admitted to the Tel Aviv Medical Center between 2007 and 2021 who were diagnosed with severe hypernatremia (serum sodium ≥155 mmol/L) at admission or during hospitalization. Statistical analysis was performed from April 2022 to August 2023.

Exposure

Patients were grouped as having fast correction rates (>0.5 mmol/L/h) and slow correction rates (≤0.5 mmol/L/h) in accordance with current guidelines.

Main Outcomes and Measures

All-cause 30-day mortality.

Results

A total of 4265 patients were included in this cohort, of which 2621 (61.5%) were men and 343 (8.0%) had fast correction rates; the median (IQR) age at diagnosis was 78 (64-87) years. Slow correction was associated with higher 30-day mortality compared with fast correction (50.7% [1990 of 3922] vs 31.8% [109 of 343]; P < .001). These results remained significant after adjusting for demographics (age, gender), Charlson comorbidity index, initial sodium, potassium, and creatinine levels, hospitalization in an ICU, and severe hyperglycemia (adjusted odds ratio [aOR], 2.02 [95% CI, 1.55-2.62]), regardless of whether hypernatremia was hospital acquired (aOR, 2.19 [95% CI, 1.57-3.05]) or documented on admission (aOR, 1.64 [95% CI, 1.06-2.55]). There was a strong negative correlation between absolute sodium correction during the first 24 hours following the initial documentation of severe hypernatremia and 30-day mortality (Pearson correlation coefficient, -0.80 [95% CI, -0.93 to -0.50]; P < .001). Median (IQR) hospitalization length was shorter for fast correction vs slow correction rates (5.0 [2.1-14.9] days vs 7.2 [3.5-16.1] days; P < .001). Prevalence of neurological complications was comparable for both groups, and none were attributed to fast correction rates of hypernatremia.

Conclusions and Relevance

This cohort study of patients with severe hypernatremia found that rapid correction of hypernatremia was associated with shorter hospitalizations and significantly lower patient mortality without any signs of neurologic complications. These results suggest that physicians should consider the totality of evidence when considering the optimal rates of correction for patients with severe hypernatremia.
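
The mortality counts reported in the Results of this abstract (1990 of 3922 with slow correction vs 109 of 343 with fast correction) are enough to compute a crude, unadjusted odds ratio as a sanity check. The sketch below uses the standard 2x2 odds ratio with a Woolf (log-based) 95% confidence interval; the function name is hypothetical. The crude value comes out at about 2.2, which differs from the reported adjusted odds ratio of 2.02 because no covariates are controlled for here.

```python
import math


def crude_odds_ratio(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of group A vs group B with a Woolf confidence interval."""
    a = events_a               # group A, event
    b = total_a - events_a     # group A, no event
    c = events_b               # group B, event
    d = total_b - events_b     # group B, no event
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# 30-day mortality: slow correction (group A) vs fast correction (group B).
odds_ratio, lower, upper = crude_odds_ratio(1990, 3922, 109, 343)
```

The two group totals also add up to the 4265 patients stated in the abstract, so the counts are internally consistent.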

Mavrogenis AF, Scarlat MM. Artificial intelligence publications: synthetic data, patients, and papers. Int Orthop 2023;47:1395-1396. [PMID: 37162553 DOI: 10.1007/s00264-023-05830-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 05/11/2023]

Wilcox A. Understanding the opportunity and application of synthetic data in healthcare. Paediatr Perinat Epidemiol 2023;37:301-302. [PMID: 36970808 DOI: 10.1111/ppe.12970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Accepted: 03/05/2023] [Indexed: 05/10/2023]

Davis SE, Ssemaganda H, Koola JD, Mao J, Westerman D, Speroff T, Govindarajulu US, Ramsay CR, Sedrakyan A, Ohno-Machado L, Resnic FS, Matheny ME. Simulating complex patient populations with hierarchical learning effects to support methods development for post-market surveillance. BMC Med Res Methodol 2023;23:89. [PMID: 37041457 PMCID: PMC10088292 DOI: 10.1186/s12874-023-01913-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 04/04/2023] [Indexed: 04/13/2023] Open

Abstract

BACKGROUND

Validating new algorithms, such as methods to disentangle intrinsic treatment risk from risk associated with experiential learning of novel treatments, often requires knowing the ground truth for data characteristics under investigation. Since the ground truth is inaccessible in real world data, simulation studies using synthetic datasets that mimic complex clinical environments are essential. We describe and evaluate a generalizable framework for injecting hierarchical learning effects within a robust data generation process that incorporates the magnitude of intrinsic risk and accounts for known critical elements in clinical data relationships.

METHODS

We present a multi-step data generating process with customizable options and flexible modules to support a variety of simulation requirements. Synthetic patients with nonlinear and correlated features are assigned to provider and institution case series. The probabilities of treatment and outcome assignment are associated with patient features based on user definitions. Risk due to experiential learning by providers and/or institutions when novel treatments are introduced is injected at various speeds and magnitudes. To further reflect real-world complexity, users can request missing values and omitted variables. We illustrate an implementation of our method in a case study using MIMIC-III data for reference patient feature distributions.

RESULTS

Realized data characteristics in the simulated data reflected specified values. Apparent deviations in treatment effects and feature distributions, though not statistically significant, were most common in small datasets (n < 3000) and attributable to random noise and variability in estimating realized values in small samples. When learning effects were specified, synthetic datasets exhibited changes in the probability of an adverse outcome as cases accrued for the treatment group impacted by learning and stable probabilities as cases accrued for the treatment group not affected by learning.

CONCLUSIONS

Our framework extends clinical data simulation techniques beyond generation of patient features to incorporate hierarchical learning effects. This enables the complex simulation studies required to develop and rigorously test algorithms developed to disentangle treatment safety signals from the effects of experiential learning. By supporting such efforts, this work can help identify training opportunities, avoid unwarranted restriction of access to medical advances, and hasten treatment improvements.
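
The notion of a learning effect injected "at various speeds and magnitudes" can be pictured with a toy case series in which the probability of an adverse outcome starts high and decays toward a stable intrinsic risk as a provider accrues cases. The exponential form, the parameter names, and the default values below are assumptions made purely for illustration; the abstract does not describe the functional form used in the framework.

```python
import math
import random


def adverse_outcome_probability(case_number, base_risk=0.05,
                                learning_excess=0.10, cases_to_learn=50.0):
    """Risk for the k-th case (k starts at 0) under an assumed exponential curve.

    base_risk       : intrinsic risk once learning is complete (magnitude floor)
    learning_excess : extra risk on the very first case (magnitude of learning)
    cases_to_learn  : decay scale in cases (speed of learning)
    """
    return base_risk + learning_excess * math.exp(-case_number / cases_to_learn)


def simulate_case_series(n_cases, seed=None, **risk_kwargs):
    """Draw one True/False adverse outcome per case for a single provider."""
    rng = random.Random(seed)
    return [
        rng.random() < adverse_outcome_probability(k, **risk_kwargs)
        for k in range(n_cases)
    ]


# A hypothetical provider performing 200 cases of a novel treatment.
outcomes = simulate_case_series(200, seed=7)
```

Setting `learning_excess` to zero gives the stable probabilities the abstract describes for a treatment group not affected by learning.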

Affiliation(s)

Sharon E Davis Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA.
Henry Ssemaganda Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, 41 Mall Road, Burlington, MA, 01803, USA
Jejo D Koola UC Health Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Dr. MC 0728, La Jolla, San Diego, CA, 92093-0728, USA
Jialin Mao Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
Dax Westerman Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
Theodore Speroff Departments of Medicine and Biostatistics, Vanderbilt University Medical Center, 1313 21St Avenue South, Oxford House, Room 209, Nashville, TN, 37232, USA
Usha S Govindarajulu Center for Biostatistics, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA
Craig R Ramsay Health Services Research Unit, University of Aberdeen, Health Sciences Building, Foresterhill, 3rd Floor, Aberdeen, AB25 2ZD, UK
Art Sedrakyan Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
Lucila Ohno-Machado Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, New Haven, CT, 06510, USA
Frederic S Resnic Division of Cardiovascular Medicine and Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, Tufts University School of Medicine, 41 Burlington Mall Road, Burlington, MA, 01805, USA
Michael E Matheny Departments of Biomedical Informatics, Biostatistics, and Medicine, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA Geriatric Research Education and Clinical Care Center, Tennessee Valley Healthcare System VA, 1310 24th Avenue South, Nashville, TN, 37212, USA

Kepper MM, Walsh-Bailey C, Prusaczyk B, Zhao M, Herrick C, Foraker R. The adoption of social determinants of health documentation in clinical settings. Health Serv Res 2023;58:67-77. [PMID: 35862115 PMCID: PMC9836948 DOI: 10.1111/1475-6773.14039] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Indexed: 01/19/2023] Open

Abstract

OBJECTIVE

To understand the frequency of social determinants of health (SDOH) diagnosis codes (Z-codes) within the electronic health record (EHR) for patients with prediabetes and diabetes and examine factors influencing the adoption of SDOH documentation in clinical care.

DATA SOURCES

EHR data and qualitative interviews with health care providers and stakeholders.

STUDY DESIGN

An explanatory sequential mixed methods design first examined the use of Z-codes within the EHR and qualitatively examined barriers to documenting SDOH. Data were integrated and interpreted using a joint display. This research was informed by the Framework for Dissemination and Utilization of Research for Health Care Policy and Practice.

DATA COLLECTION/EXTRACTION METHODS

We queried EHR data for patients with a hemoglobin A1c > 5.7 between October 1, 2015 and September 1, 2020 (n = 118,215) to examine the use of Z-codes and demographics and outcomes for patients with and without social needs. Semi-structured interviews were conducted with 23 participants (n = 15 health care providers; n = 7 billing and compliance stakeholders). The interview questions sought to understand how factors at the innovation-, individual-, organizational-, and environmental-level influence SDOH documentation. We used thematic analysis to analyze interview data.

PRINCIPAL FINDINGS

Patients with social needs were disproportionately older, female, Black, uninsured, living in low-income and high unemployment neighborhoods, and had a higher number of hospitalizations, obesity, prediabetes, and type 2 diabetes than those without a Z-code. Z-codes were not frequently used in the EHR (<1% of patients), and there was an overall lack of congruence between quantitative and qualitative results related to the prevalence of social needs. Providers faced barriers at multiple levels (e.g., individual-level: discomfort discussing social needs; organizational-level: limited time, competing priorities) for documenting SDOH and identified strategies to improve documentation.

CONCLUSIONS

Providers recognized the impact of SDOH on patient health and had positive perceptions of screening for and documenting social needs. Implementation strategies are needed to improve systematic documentation.

Benzakour A, Altsitzioglou P, Lemée JM, Ahmad A, Mavrogenis AF, Benzakour T. Artificial intelligence in spine surgery. Int Orthop 2023;47:457-465. [PMID: 35902390 DOI: 10.1007/s00264-022-05517-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 06/29/2022] [Accepted: 07/11/2022] [Indexed: 01/28/2023]

Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022;13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] Open

El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022;5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022] Open

Meeker D, Kallem C, Heras Y, Garcia S, Thompson C. Case report: evaluation of an open-source synthetic data platform for simulation studies. JAMIA Open 2022;5:ooac067. [PMID: 35958672 PMCID: PMC9360775 DOI: 10.1093/jamiaopen/ooac067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 05/10/2022] [Accepted: 07/28/2022] [Indexed: 11/12/2022] Open

Be'er M, Amirav I, Cahal M, Rochman M, Lior Y, Rimon A, Lavy RG, Lavie M. Unforeseen changes in seasonality of pediatric respiratory illnesses during the first COVID-19 pandemic year. Pediatr Pulmonol 2022;57:1425-1431. [PMID: 35307986 PMCID: PMC9088630 DOI: 10.1002/ppul.25896] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 03/01/2022] [Accepted: 03/10/2022] [Indexed: 11/29/2022]

Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022;29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022] Open

El Emam K, Mosquera L, Fang X, El-Hussuna A. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022;10:e35734. [PMID: 35389366 PMCID: PMC9030990 DOI: 10.2196/35734] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/15/2021] [Revised: 01/27/2022] [Accepted: 02/13/2022] [Indexed: 01/06/2023] Open

Guo A, Foraker RE, MacGregor RM, Masood FM, Cupps BP, Pasque MK. The Use of Synthetic Electronic Health Record Data and Deep Learning to Improve Timing of High-Risk Heart Failure Surgical Intervention by Predicting Proximity to Catastrophic Decompensation. Front Digit Health 2021;2:576945. [PMID: 34713050 PMCID: PMC8521851 DOI: 10.3389/fdgth.2020.576945] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/27/2020] [Accepted: 11/13/2020] [Indexed: 12/24/2022] Open

Abstract

Objective: Although many clinical metrics are associated with proximity to decompensation in heart failure (HF), none are individually accurate enough to risk-stratify HF patients on a patient-by-patient basis. The dire consequences of this inaccuracy in risk stratification have profoundly lowered the clinical threshold for application of high-risk surgical intervention, such as ventricular assist device placement. Machine learning can detect non-intuitive classifier patterns that allow for innovative combination of patient feature predictive capability. A machine learning-based clinical tool to identify proximity to catastrophic HF deterioration on a patient-specific basis would enable more efficient direction of high-risk surgical intervention to those patients who have the most to gain from it, while sparing others. Synthetic electronic health record (EHR) data are statistically indistinguishable from the original protected health information, and can be analyzed as if they were original data but without any privacy concerns. We demonstrate that synthetic EHR data can be easily accessed and analyzed and are amenable to machine learning analyses.

Methods: We developed synthetic data from EHR data of 26,575 HF patients admitted to a single institution during the decade ending on 12/31/2018. Twenty-seven clinically relevant features were synthesized and utilized in supervised deep learning and machine learning algorithms (i.e., deep neural networks [DNN], random forest [RF], and logistic regression [LR]) to explore their ability to predict 1-year mortality by five-fold cross validation methods. We conducted analyses leveraging features from prior to/at and after/at the time of HF diagnosis.

Results: The area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the three models: the mean AUC was 0.80 for DNN, 0.72 for RF, and 0.74 for LR. Age, creatinine, body mass index, and blood pressure levels were especially important features in predicting death within 1 year among HF patients.

Conclusions: Machine learning models have considerable potential to improve accuracy in mortality prediction, such that high-risk surgical intervention can be applied only in those patients who stand to benefit from it. Access to EHR-based synthetic data derivatives eliminates risk of exposure of EHR data, speeds time-to-insight, and facilitates data sharing. As more clinical, imaging, and contractile features with proven predictive capability are added to these models, the development of a clinical tool to assist in timing of intervention in surgical candidates may be possible.

Foraker R, Guo A, Thomas J, Zamstein N, Payne PR, Wilcox A. The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data. J Med Internet Res 2021;23:e30697. [PMID: 34559671 PMCID: PMC8491642 DOI: 10.2196/30697] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/03/2021] [Revised: 08/24/2021] [Accepted: 09/12/2021] [Indexed: 01/22/2023]

Abstract

BACKGROUND

Computationally derived ("synthetic") data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic.

OBJECTIVE

We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes.

METHODS

We used the National COVID Cohort Collaborative's instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19-positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19-related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data.

RESULTS

For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts.

CONCLUSIONS

This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.

Guo A, Mazumder NR, Ladner DP, Foraker RE. Predicting mortality among patients with liver cirrhosis in electronic health records with machine learning. PLoS One 2021;16:e0256428. [PMID: 34464403 PMCID: PMC8407576 DOI: 10.1371/journal.pone.0256428] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/01/2020] [Accepted: 08/08/2021] [Indexed: 02/06/2023]

Abstract

OBJECTIVE

Liver cirrhosis is a leading cause of death and affects millions of people in the United States. Early mortality prediction among patients with cirrhosis might give healthcare providers more opportunity to effectively treat the condition. We hypothesized that laboratory test results and other related diagnoses would be associated with mortality in this population. Our other hypothesis was that a deep learning model could outperform the current Model for End-Stage Liver Disease (MELD) score in predicting mortality.

MATERIALS AND METHODS

We utilized electronic health record data from 34,575 patients with a diagnosis of cirrhosis from a large medical center to study associations with mortality. Three mortality time windows (365, 180, and 90 days) and two variable sets (all 41 available variables and the 4 MELD-Na variables) were studied. Missing values were imputed using multiple imputation for continuous variables and the mode for categorical variables. Deep learning and machine learning algorithms, i.e., deep neural networks (DNN), random forest (RF), and logistic regression (LR), were employed to study the associations between baseline features, such as laboratory measurements and diagnoses, and mortality for each time window using 5-fold cross-validation. Metrics such as area under the receiver operating characteristic curve (AUC), overall accuracy, sensitivity, and specificity were used to evaluate models.

RESULTS

Models comprising all variables outperformed those with the 4 MELD-Na variables for all prediction cases, and the DNN model outperformed the LR and RF models. For example, the DNN model achieved AUCs of 0.88, 0.86, and 0.85 for 90-, 180-, and 365-day mortality, respectively, compared with the MELD score, which yielded corresponding AUCs of 0.81, 0.79, and 0.76. The DNN and LR models had a significantly better F1 score than MELD at all time points examined.

CONCLUSION

Beyond the 4 MELD-Na variables, alkaline phosphatase, alanine aminotransferase, and hemoglobin were also among the most informative features. Machine learning and deep learning models outperformed the current standard of risk prediction among patients with cirrhosis. Advanced informatics techniques showed promise for risk prediction in patients with cirrhosis.
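
For readers unfamiliar with the evaluation metrics named in the methods above (AUC, sensitivity, specificity), the following is a minimal Python sketch of how they can be computed from predicted risk scores. It is not the authors' code: the labels, scores, function names, and the 0.5 threshold are hypothetical, chosen only for illustration. AUC is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins; a tie contributes half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when predicting 'positive' for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 1 = died within the time window, 0 = survived.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

toy_auc = auc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
print(toy_auc, toy_sens, toy_spec)
```

In a 5-fold cross-validation setting, these metrics would typically be computed on each held-out fold and then averaged.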

Thomas JA, Foraker RE, Zamstein N, Payne PR, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). medRxiv 2021:2021.07.06.21259051. [PMID: 34268525 PMCID: PMC8282114 DOI: 10.1101/2021.07.06.21259051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]

Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 2021;11:e043497. [PMID: 33863713 PMCID: PMC8055130 DOI: 10.1136/bmjopen-2020-043497] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 08/06/2020] [Revised: 01/14/2021] [Accepted: 03/18/2021] [Indexed: 11/03/2022]

Abstract

OBJECTIVES

There are increasing requirements to make research data, especially clinical trial data, more broadly available for secondary analyses. However, data availability remains a challenge due to complex privacy requirements. This challenge can potentially be addressed using synthetic data.

SETTING

Replication of a published stage III colon cancer trial secondary analysis using synthetic data generated by a machine learning method.

PARTICIPANTS

There were 1543 patients in the control arm who were included in our analysis.

PRIMARY AND SECONDARY OUTCOME MEASURES

Analyses from a study published on the real dataset were replicated on synthetic data to investigate the relationship between bowel obstruction and event-free survival. Information theoretic metrics were used to compare the univariate distributions between real and synthetic data. Percentage CI overlap was used to assess the similarity in the size of the bivariate relationships, and similarly for the multivariate Cox models derived from the two datasets.

RESULTS

Analysis results were similar between the real and synthetic datasets. The univariate distributions were within 1% of difference on an information theoretic metric. All of the bivariate relationships had CI overlap on the tau statistic above 50%. The main conclusion from the published study, that lack of bowel obstruction has a strong impact on survival, was replicated directionally and the HR CI overlap between the real and synthetic data was 61% for overall survival (real data: HR 1.56, 95% CI 1.11 to 2.2; synthetic data: HR 2.03, 95% CI 1.44 to 2.87) and 86% for disease-free survival (real data: HR 1.51, 95% CI 1.18 to 1.95; synthetic data: HR 1.63, 95% CI 1.26 to 2.1).

CONCLUSIONS

The high concordance between the analytical results and conclusions from synthetic and real data suggests that synthetic data can be used as a reasonable proxy for real clinical trial datasets.

TRIAL REGISTRATION NUMBER

NCT00079274.
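
The percentage CI overlap reported in the results above can be illustrated with a short Python sketch. The abstract does not state the formula, so this assumes a common definition: the length of the intersection of the two intervals, divided by each interval's width in turn, then averaged. The function name is hypothetical. Applied to the hazard ratio intervals quoted in the abstract, this assumed definition gives values that round to the reported 61% and 86%.

```python
def ci_overlap(real_lo, real_hi, synth_lo, synth_hi):
    """Average fraction of each confidence interval covered by their intersection.

    Assumed definition (not stated in the abstract): intersection length
    divided by each interval's width, averaged over the two intervals.
    Returns 0.0 when the intervals do not intersect.
    """
    inter = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if inter <= 0:
        return 0.0
    return 0.5 * (inter / (real_hi - real_lo) + inter / (synth_hi - synth_lo))


# 95% CIs for the hazard ratios quoted in the abstract.
overall_survival = ci_overlap(1.11, 2.2, 1.44, 2.87)
disease_free_survival = ci_overlap(1.18, 1.95, 1.26, 2.1)

print(round(overall_survival * 100), round(disease_free_survival * 100))
```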
