1
Thangaraj PM, Benson SH, Oikonomou EK, Asselbergs FW, Khera R. Cardiovascular care with digital twin technology in the era of generative artificial intelligence. Eur Heart J 2024:ehae619. [PMID: 39322420 DOI: 10.1093/eurheartj/ehae619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 07/16/2024] [Accepted: 09/01/2024] [Indexed: 09/27/2024] Open
Abstract
Digital twins, which are in silico replications of an individual and its environment, have advanced clinical decision-making and prognostication in cardiovascular medicine. The technology enables personalized simulations of clinical scenarios, prediction of disease risk, and strategies for clinical trial augmentation. Current applications of cardiovascular digital twins have integrated multi-modal data into mechanistic and statistical models to build physiologically accurate cardiac replicas to enhance disease phenotyping, enrich diagnostic workflows, and optimize procedural planning. Digital twin technology is rapidly evolving in the setting of newly available data modalities and advances in generative artificial intelligence, enabling dynamic and comprehensive simulations unique to an individual. These twins fuse physiologic, environmental, and healthcare data into machine learning and generative models to build real-time patient predictions that can model interactions with the clinical environment to accelerate personalized patient care. This review summarizes digital twins in cardiovascular medicine and their potential future applications by incorporating new personalized data modalities. It examines the technical advances in deep learning and generative artificial intelligence that broaden the scope and predictive power of digital twins. Finally, it highlights the individual and societal challenges as well as ethical considerations that are essential to realizing the future vision of incorporating cardiology digital twins into personalized cardiovascular care.
Affiliation(s)
- Phyllis M Thangaraj
  - Section of Cardiology, Department of Internal Medicine, Yale School of Medicine, 789 Howard Ave., New Haven, CT, USA
- Sean H Benson
  - Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, Netherlands
- Evangelos K Oikonomou
  - Section of Cardiology, Department of Internal Medicine, Yale School of Medicine, 789 Howard Ave., New Haven, CT, USA
- Folkert W Asselbergs
  - Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Center, University of Amsterdam, Amsterdam, Netherlands
  - Institute of Health Informatics, University College London, London, UK
  - The National Institute for Health Research University College London Hospitals Biomedical Research Center, University College London, London, UK
- Rohan Khera
  - Section of Cardiology, Department of Internal Medicine, Yale School of Medicine, 789 Howard Ave., New Haven, CT, USA
  - Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, 47 College St., New Haven, CT, USA
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, 100 College St. Fl 9, New Haven, CT, USA
  - Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, 195 Church St. Fl 6, New Haven, CT 06510, USA
2
Cho H, Froelicher D, Dokmai N, Nandi A, Sadhuka S, Hong MM, Berger B. Privacy-Enhancing Technologies in Biomedical Data Science. Annu Rev Biomed Data Sci 2024; 7:317-343. [PMID: 39178425 PMCID: PMC11346580 DOI: 10.1146/annurev-biodatasci-120423-120107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/25/2024]
Abstract
The rapidly growing scale and variety of biomedical data repositories raise important privacy concerns. Conventional frameworks for collecting and sharing human subject data offer limited privacy protection, often necessitating the creation of data silos. Privacy-enhancing technologies (PETs) promise to safeguard these data and broaden their usage by providing means to share and analyze sensitive data while protecting privacy. Here, we review prominent PETs and illustrate their role in advancing biomedicine. We describe key use cases of PETs and their latest technical advances and highlight recent applications of PETs in a range of biomedical domains. We conclude by discussing outstanding challenges and social considerations that need to be addressed to facilitate a broader adoption of PETs in biomedical data science.
Affiliation(s)
- Hyunghoon Cho
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- David Froelicher
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Natnatee Dokmai
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Anupama Nandi
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Shuvom Sadhuka
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Matthew M Hong
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
3
Prediger L, Jälkö J, Honkela A, Kaski S. Collaborative learning from distributed data with differentially private synthetic data. BMC Med Inform Decis Mak 2024; 24:167. [PMID: 38877563 PMCID: PMC11179391 DOI: 10.1186/s12911-024-02563-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2023] [Accepted: 06/03/2024] [Indexed: 06/16/2024] Open
Abstract
BACKGROUND Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank. METHODS We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores. RESULTS We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups. CONCLUSIONS Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. 
Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.
Affiliation(s)
- Joonas Jälkö
  - Aalto University, Espoo, 00076, Finland
  - University of Helsinki, Helsinki, 00014, Finland
- Samuel Kaski
  - Aalto University, Espoo, 00076, Finland
  - University of Manchester, Manchester, M13 9PL, UK
4
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. 
As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway
- Severin Elvatun
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
  - DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
5
Carey EG, Adeyemi FO, Neelakantan L, Fernandes B, Fazel M, Ford T, Burn AM. Preferences on Governance Models for Mental Health Data: Qualitative Study With Young People. JMIR Form Res 2024; 8:e50368. [PMID: 38652525 PMCID: PMC11077411 DOI: 10.2196/50368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 11/08/2023] [Accepted: 03/22/2024] [Indexed: 04/25/2024] Open
Abstract
BACKGROUND Improving access to mental health data to accelerate research and improve mental health outcomes is a potentially achievable goal given the substantial data that can now be collected from mobile devices. Smartphones can provide a useful mechanism for collecting mental health data from young people, especially as their use is relatively ubiquitous in high-resource settings such as the United Kingdom and they have a high capacity to collect active and passive data. This raises the interesting opportunity to establish a large bank of mental health data from young people that could be accessed by researchers worldwide, but it is important to clarify how to ensure that this is done in an appropriate manner aligned with the values of young people. OBJECTIVE In this study, we discussed the preferences of young people in the United Kingdom regarding the governance, sharing, and use of their mental health data with the establishment of a global data bank in mind. We aimed to determine whether young people want and feel safe to share their mental health data; if so, with whom; and their preferences in doing so. METHODS Young people (N=46) were provided with 2 modules of educational material about data governance models and background in scientific research. We then conducted 2-hour web-based group sessions using a deliberative democracy methodology to reach a consensus where possible. Findings were analyzed using the framework method. RESULTS Young people were generally enthusiastic about contributing data to mental health research. They believed that broader availability of mental health data could be used to discover what improves or worsens mental health and develop new services to support young people. 
However, this enthusiasm came with many concerns and caveats, including distributed control of access to ensure appropriate use, distributed power, and data management that included diverse representation and sufficient ethical training for applicants and data managers. CONCLUSIONS Although it is feasible to use smartphones to collect mental health data from young people in the United Kingdom, it is essential to carefully consider the parameters of such a data bank. Addressing and embedding young people's preferences, including the need for robust procedures regarding how their data are managed, stored, and accessed, will set a solid foundation for establishing any global data bank.
Affiliation(s)
- Emma Grace Carey
  - Department of Psychiatry, University of Cambridge, Cambridge, United Kingdom
- Lakshmi Neelakantan
  - School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Blossom Fernandes
  - Department of Psychiatry, University of Oxford, Oxford, United Kingdom
- Mina Fazel
  - Department of Psychiatry, University of Oxford, Oxford, United Kingdom
- Tamsin Ford
  - Department of Psychiatry, University of Cambridge, Cambridge, United Kingdom
- Anne-Marie Burn
  - Department of Psychiatry, University of Cambridge, Cambridge, United Kingdom
6
El Emam K, Mosquera L, Fang X, El-Hussuna A. An evaluation of the replicability of analyses using synthetic health data. Sci Rep 2024; 14:6978. [PMID: 38521806 PMCID: PMC10960851 DOI: 10.1038/s41598-024-57207-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Accepted: 03/15/2024] [Indexed: 03/25/2024] Open
Abstract
Synthetic data generation is being increasingly used as a privacy preserving approach for sharing health data. In addition to protecting privacy, it is important to ensure that generated data has high utility. A common way to assess utility is the ability of synthetic data to replicate results from the real data. Replicability has been defined using two criteria: (a) replicate the results of the analyses on real data, and (b) ensure valid population inferences from the synthetic data. A simulation study using three heterogeneous real-world datasets evaluated the replicability of logistic regression workloads. Eight replicability metrics were evaluated: decision agreement, estimate agreement, standardized difference, confidence interval overlap, bias, confidence interval coverage, statistical power, and precision (empirical SE). The analysis of synthetic data used a multiple imputation approach whereby up to 20 datasets were generated and the fitted logistic regression models were combined using combining rules for fully synthetic datasets. The effects of synthetic data amplification were evaluated, and two types of generative models were used: sequential synthesis using boosted decision trees and a generative adversarial network (GAN). Privacy risk was evaluated using a membership disclosure metric. For sequential synthesis, adjusted model parameters after combining at least ten synthetic datasets gave high decision and estimate agreement, low standardized difference, high confidence interval overlap, low bias, nominal confidence interval coverage, and power close to the nominal level. Amplification had only a marginal benefit. Confidence interval coverage from a single synthetic dataset without applying combining rules was erroneous, and statistical power, as expected, was artificially inflated when amplification was used. Sequential synthesis performed considerably better than the GAN across multiple datasets. 
Membership disclosure risk was low for all datasets and models. For replicable results, the statistical analysis of fully synthetic data should be based on at least ten generated datasets of the same size as the original whose analysis results are combined. Analysis results from synthetic data without applying combining rules can be misleading. Replicability results are dependent on the type of generative model used, with our study suggesting that sequential synthesis has good replicability characteristics for common health research workloads.
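As a rough illustration of the combining step this abstract describes, the sketch below pools one regression coefficient across several fully synthetic datasets. It assumes the commonly used rule for fully synthetic data, in which the total variance is (1 + 1/m) times the between-dataset variance minus the mean within-dataset variance. The abstract does not spell out the exact formula, and the function name and example numbers are hypothetical.

```python
import math
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Pool one coefficient over m fully synthetic datasets.

    estimates: the coefficient fitted on each synthetic dataset.
    variances: the squared standard error of that coefficient per dataset.
    Returns (pooled estimate, total variance, standard error or None).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two datasets and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    b = variance(estimates)        # between-dataset variance (divisor m - 1)
    u_bar = mean(variances)        # mean within-dataset variance
    # Assumed rule for fully synthetic data; it can go negative when m is small.
    total = (1 + 1 / m) * b - u_bar
    # No adjustment is applied here: a non-positive variance yields no SE.
    se = math.sqrt(total) if total > 0 else None
    return q_bar, total, se


# Hypothetical coefficients from five synthetic datasets.
q, t, se = combine_fully_synthetic(
    [0.50, 0.54, 0.46, 0.52, 0.48], [0.0004] * 5
)
print(q, t, se)
```

The subtraction of the within-dataset variance is what separates this rule from ordinary multiple imputation for missing data, and it is one reason several synthetic datasets are needed before the pooled variance is stable.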
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Lucy Mosquera
  - Replica Analytics, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario (CHEO) Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
- Xi Fang
  - Replica Analytics, Ottawa, ON, Canada
7
Yuan J, Tang R, Jiang X, Hu X. Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:1324-1333. [PMID: 38222339 PMCID: PMC10785941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
The process of matching patients with suitable clinical trials is essential for advancing medical research and providing optimal care. However, current approaches face challenges such as data standardization, ethical considerations, and a lack of interoperability between Electronic Health Records (EHRs) and clinical trial criteria. In this paper, we explore the potential of large language models (LLMs) to address these challenges by leveraging their advanced natural language generation capabilities to improve compatibility between EHRs and clinical trial descriptions. We propose an innovative privacy-aware data augmentation approach for LLM-based patient-trial matching (LLM-PTM), which balances the benefits of LLMs while ensuring the security and confidentiality of sensitive patient data. Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%. Additionally, we present case studies to further illustrate the effectiveness of our approach and provide a deeper understanding of its underlying principles.
Affiliation(s)
- Xia Hu
  - Rice University, Houston, TX
8
Bordukova M, Makarov N, Rodriguez-Esteban R, Schmich F, Menden MP. Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opin Drug Discov 2024; 19:33-42. [PMID: 37887266 DOI: 10.1080/17460441.2023.2273839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Accepted: 10/18/2023] [Indexed: 10/28/2023]
Abstract
INTRODUCTION The concept of Digital Twins (DTs) translated to drug development and clinical trials describes virtual representations of systems of various complexities, ranging from individual cells to entire humans, and enables in silico simulations and experiments. DTs increase the efficiency of drug discovery and development by digitalizing processes associated with high economic, ethical, or social burden. The impact is multifaceted: DT models sharpen disease understanding, support biomarker discovery and accelerate drug development, thus advancing precision medicine. One way to realize DTs is by generative artificial intelligence (AI), a cutting-edge technology that enables the creation of novel, realistic and complex data with desired properties. AREAS COVERED The authors provide a brief introduction to generative AI and describe how it facilitates the modeling of DTs. In addition, they compare existing implementations of generative AI for DTs in drug discovery and clinical trials. Finally, they discuss technical and regulatory challenges that should be addressed before DTs can transform drug discovery and clinical trials. EXPERT OPINION The current state of DTs in drug discovery and clinical trials does not exploit the entire power of generative AI yet and is limited to simulation of a small number of characteristics. Nonetheless, generative AI has the potential to transform the field by leveraging recent developments in deep learning and customizing models for the needs of scientists, physicians and patients.
Affiliation(s)
- Maria Bordukova
  - Data & Analytics, Pharmaceutical Research and Early Development, Roche Innovation Center Munich (RICM), Penzberg, Germany
  - Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany
  - Department of Biology, Ludwig-Maximilians University Munich, Munich, Germany
- Nikita Makarov
  - Data & Analytics, Pharmaceutical Research and Early Development, Roche Innovation Center Munich (RICM), Penzberg, Germany
  - Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany
  - Department of Biology, Ludwig-Maximilians University Munich, Munich, Germany
- Raul Rodriguez-Esteban
  - Data & Analytics, Pharmaceutical Research and Early Development, Roche Innovation Center Basel (RICB), Basel, Switzerland
- Fabian Schmich
  - Data & Analytics, Pharmaceutical Research and Early Development, Roche Innovation Center Munich (RICM), Penzberg, Germany
- Michael P Menden
  - Institute of Computational Biology, Computational Health Center, Helmholtz Munich, Munich, Germany
  - Department of Biology, Ludwig-Maximilians University Munich, Munich, Germany
  - Department of Biochemistry and Pharmacology, University of Melbourne, Melbourne, Australia
  - German Center for Diabetes Research (DZD e.V.), Munich, Germany
9
Gouda MA, Hong W, Jiang D, Feng N, Zhou B, Li Z. Synthesis of sEMG Signals for Hand Gestures Using a 1DDCGAN. Bioengineering (Basel) 2023; 10:1353. [PMID: 38135944 PMCID: PMC10740493 DOI: 10.3390/bioengineering10121353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 11/18/2023] [Accepted: 11/20/2023] [Indexed: 12/24/2023] Open
Abstract
The emergence of modern prosthetics controlled by bio-signals has been facilitated by AI and microchip technology innovations. AI algorithms are trained using sEMG produced by muscles during contractions. The data acquisition procedure may result in discomfort and fatigue, particularly for amputees. Furthermore, prosthetic companies restrict sEMG signal exchange, limiting data-driven research and reproducibility. GANs present a viable solution to the aforementioned concerns. GANs can generate high-quality sEMG, which can be utilised for data augmentation, decrease the training time required by prosthetic users, enhance classification accuracy and ensure research reproducibility. This research proposes the utilisation of a one-dimensional deep convolutional GAN (1DDCGAN) to generate the sEMG of hand gestures. This approach involves the incorporation of dynamic time warping, fast Fourier transform and wavelets as discriminator inputs. Two datasets were utilised to validate the methodology, where five windows and increments were utilised to extract features to evaluate the synthesised sEMG quality. In addition to the traditional classification and augmentation metrics, two novel metrics, the Mantel test and the classifier two-sample test, were used for evaluation. The 1DDCGAN preserved the inter-feature correlations and generated high-quality signals, which resembled the original data. Additionally, the classification accuracy improved by an average of 1.21-5%.
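Dynamic time warping is one of the discriminator inputs named in this abstract. The sketch below is the textbook dynamic-programming distance with an absolute-difference local cost, given only to show what the measure computes. It is not the authors' implementation, the abstract does not say how the value is fed to the discriminator, and the function name and sample signals are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: step in a only, step in b only, step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical signals: the second repeats one sample, so warping absorbs it.
real = [1.0, 2.0, 3.0]
synthetic = [1.0, 2.0, 2.0, 3.0]
print(dtw_distance(real, synthetic))
```

Because the alignment may stretch either sequence, two signals with the same shape but different timing score as close, which a point-by-point distance would not capture.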
Affiliation(s)
- Wang Hong
  - Department of Mechanical Engineering and Automation, Northeastern University, Shenyang 110819, China; (M.A.G.); (D.J.); (N.F.); (B.Z.); (Z.L.)
10
Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko M, Ryu KS. Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy. JMIR Med Inform 2023; 11:e47859. [PMID: 37999942 DOI: 10.2196/47859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 08/02/2023] [Accepted: 10/28/2023] [Indexed: 11/25/2023] Open
Abstract
BACKGROUND Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. OBJECTIVE This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. METHODS The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. RESULTS The synthetic data of the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). 
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. CONCLUSIONS This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
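The abstract above mentions a Cramér's V criterion that identifies the most strongly associated pair of columns before partitioning. The sketch below shows one plain way to compute that statistic and pick the top pair. It uses the uncorrected form of Cramér's V, and the helper names and toy table are hypothetical; the paper's exact partitioning procedure is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length categorical columns."""
    if len(x) != len(y) or not x:
        raise ValueError("columns must be non-empty and the same length")
    n = len(x)
    rows, cols, joint = Counter(x), Counter(y), Counter(zip(x, y))
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def most_associated_pair(columns):
    """Return the pair of column names with the highest Cramér's V."""
    return max(
        combinations(columns, 2),
        key=lambda pair: cramers_v(columns[pair[0]], columns[pair[1]]),
    )


# Toy table (hypothetical values, not registry data).
table = {
    "stage": ["I", "I", "II", "II"],
    "status": ["alive", "alive", "dead", "dead"],
    "sex": ["F", "M", "F", "M"],
}
print(most_associated_pair(table))
```

On small tables the uncorrected statistic is biased upward, so a bias-corrected variant may be preferable in practice.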
Affiliation(s)
- Ha Ye Jin Kang
  - Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
  - Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- Erdenebileg Batbaatar
  - National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Dong-Woo Choi
  - National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Kui Son Choi
  - National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
  - Department of Cancer Control and Policy, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- Minsam Ko
  - Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
  - Department of Human-Computer Interaction, Hanyang University, Ansan, Republic of Korea
- Kwang Sun Ryu
  - Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
  - National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
11
Xing X, Ser JD, Wu Y, Li Y, Xia J, Xu L, Firmin D, Gatehouse P, Yang G. HDL: Hybrid Deep Learning for the Synthesis of Myocardial Velocity Maps in Digital Twins for Cardiac Analysis. IEEE J Biomed Health Inform 2023; 27:5134-5142. [PMID: 35290192 DOI: 10.1109/jbhi.2022.3158897] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 11/10/2022]
Abstract
Synthetic digital twins based on medical data accelerate the acquisition, labelling and decision making procedure in digital healthcare. A core part of digital healthcare twins is model-based data synthesis, which permits the generation of realistic medical signals without having to cope with the modelling complexity of the anatomical and biochemical phenomena that produce them in reality. Unfortunately, algorithms for cardiac data synthesis have so far been scarcely studied in the literature. An important imaging modality in the cardiac examination is three-directional CINE multi-slice myocardial velocity mapping (3Dir MVM), which provides a quantitative assessment of cardiac motion in three orthogonal directions of the left ventricle. The long acquisition time and complex acquisition procedure make it more urgent to produce synthetic digital twins of this imaging modality. In this study, we propose a hybrid deep learning (HDL) network, especially for synthetic 3Dir MVM data. Our algorithm features a hybrid UNet and a Generative Adversarial Network with a foreground-background generation scheme. The experimental results show that from temporally down-sampled magnitude CINE images (six times), our proposed algorithm can still successfully synthesise high temporal resolution 3Dir MVM CMR data (PSNR=42.32) with precise left ventricle segmentation (DICE=0.92). These performance scores indicate that our proposed HDL algorithm can be implemented in real-world digital twins for myocardial velocity mapping data simulation. To the best of our knowledge, this work is the first to investigate digital twins of the 3Dir MVM CMR, which has shown great potential for improving the efficiency of clinical studies via synthesised cardiac data.
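The two scores quoted in this abstract, PSNR for image fidelity and the Dice coefficient for segmentation overlap, have standard definitions. The following is a minimal sketch of those textbook formulas, not the authors' evaluation code; the function names, the flat-list image representation, and the `max_value` default are assumptions made for illustration.

```python
import math

def psnr(reference, synthetic, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    max_value is the largest possible pixel intensity (assumed 1.0 here).
    """
    if len(reference) != len(synthetic):
        raise ValueError("inputs must have the same length")
    # Mean squared error between the reference and the synthesised image.
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A higher PSNR means the synthesised image is closer to the reference, and a Dice value of 1.0 means the two masks coincide exactly.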
12
Bonomi L, Gousheh S, Fan L. Enabling Health Data Sharing with Fine-Grained Privacy. PROCEEDINGS OF THE ... ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT. ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT 2023; 2023:131-141. [PMID: 37906633 PMCID: PMC10601092 DOI: 10.1145/3583780.3614864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Sharing health data is vital in advancing medical research and transforming knowledge into clinical practice. Meanwhile, protecting the privacy of data contributors is of paramount importance. To that end, several privacy approaches have been proposed to protect individual data contributors in data sharing, including data anonymization and data synthesis techniques. These approaches have shown promising results in providing privacy protection at the dataset level. In this work, we study the privacy challenges in enabling fine-grained privacy in health data sharing. Our work is motivated by recent research findings, in which patients and healthcare providers may have different privacy preferences and policies that need to be addressed. Specifically, we propose a novel and effective privacy solution that enables data curators (e.g., healthcare providers) to protect sensitive data elements while preserving data usefulness. Our solution builds on randomized techniques to provide rigorous privacy protection for sensitive elements and leverages graphical models to mitigate privacy leakage due to dependent elements. To enhance the usefulness of the shared data, our randomized mechanism incorporates domain knowledge to preserve semantic similarity and adopts a block-structured design to minimize utility loss. Evaluations with real-world health data demonstrate the effectiveness of our approach and the usefulness of the shared data for health applications.
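The abstract does not describe the randomized mechanism itself. As a generic illustration of how randomization can protect a single sensitive element, the sketch below implements k-ary randomized response, a standard technique that satisfies epsilon-local differential privacy. It is not the authors' mechanism, which additionally uses graphical models and domain knowledge; the function names and the `domain` argument are hypothetical.

```python
import math
import random

def keep_probability(epsilon, k):
    """Probability of reporting the true value under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)

def randomized_response(value, domain, epsilon, rng=random):
    """Report `value` truthfully with probability p, else a uniform other value.

    `domain` lists every admissible value of the sensitive element (assumed
    known to the data curator); a smaller epsilon means stronger privacy.
    """
    p = keep_probability(epsilon, len(domain))
    if rng.random() < p:
        return value
    # Otherwise pick uniformly among the remaining domain values.
    others = [v for v in domain if v != value]
    return rng.choice(others)
```

For example, with a binary element and epsilon = ln 3, the true value is reported with probability 0.75.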
Affiliation(s)
- Luca Bonomi
  - Vanderbilt University Medical Center, Nashville, TN, USA
- Sepand Gousheh
  - University of North Carolina at Charlotte, Charlotte, NC, USA
- Liyue Fan
  - University of North Carolina at Charlotte, Charlotte, NC, USA
13
García-Domínguez A, Galván-Tejada CE, Magallanes-Quintanar R, Cruz M, Gonzalez-Curiel I, Delgado-Contreras JR, Soto-Murillo MA, Celaya-Padilla JM, Galván-Tejada JI. Optimizing Clinical Diabetes Diagnosis through Generative Adversarial Networks: Evaluation and Validation. Diseases 2023; 11:134. [PMID: 37873778 PMCID: PMC10594466 DOI: 10.3390/diseases11040134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 09/24/2023] [Accepted: 09/28/2023] [Indexed: 10/25/2023] Open
Abstract
The escalating prevalence of Type 2 Diabetes (T2D) represents a substantial burden on global healthcare systems, especially in regions such as Mexico. Existing diagnostic techniques, although effective, often require invasive procedures and labor-intensive efforts. The promise of artificial intelligence and data science for streamlining and enhancing T2D diagnosis is well-recognized; however, these advancements are frequently constrained by the limited availability of comprehensive patient datasets. To mitigate this challenge, the present study investigated the efficacy of Generative Adversarial Networks (GANs) for augmenting existing T2D patient data, with a focus on a Mexican cohort. The researchers utilized a dataset of 1019 Mexican nationals, divided into 499 non-diabetic controls and 520 diabetic cases. GANs were applied to create synthetic patient profiles, which were subsequently used to train a Random Forest (RF) classification model. The study's findings revealed a notable improvement in the model's diagnostic accuracy, validating the utility of GAN-based data augmentation in a clinical context. The results bear significant implications for enhancing the robustness and reliability of Machine Learning tools in T2D diagnosis and management, offering a pathway toward more timely and effective patient care.
Affiliation(s)
- Antonio García-Domínguez
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Carlos E. Galván-Tejada
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Rafael Magallanes-Quintanar
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Miguel Cruz
  - Medical Research Unit in Biochemistry, National Medical Center Siglo XXI, IMSS, Mexico City 06720, Mexico
- Irma Gonzalez-Curiel
  - Unidad Académica de Ciencias Químicas, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- J. Rubén Delgado-Contreras
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Manuel A. Soto-Murillo
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- José M. Celaya-Padilla
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Jorge I. Galván-Tejada
  - Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
14
Peppes N, Tsakanikas P, Daskalakis E, Alexakis T, Adamopoulou E, Demestichas K. FoGGAN: Generating Realistic Parkinson's Disease Freezing of Gait Data Using GANs. SENSORS (BASEL, SWITZERLAND) 2023; 23:8158. [PMID: 37836988 PMCID: PMC10574838 DOI: 10.3390/s23198158] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/11/2023] [Revised: 09/23/2023] [Accepted: 09/27/2023] [Indexed: 10/15/2023]
Abstract
Data scarcity in the healthcare domain is a major drawback for most state-of-the-art technologies engaging artificial intelligence. The unavailability of quality data, due both to the difficulty of gathering and labelling them and to their sensitive nature, creates a breeding ground for data augmentation solutions. Parkinson's Disease (PD), which can have a wide range of symptoms including motor impairments, constitutes a very challenging case for quality data acquisition. Generative Adversarial Networks (GANs) can help alleviate such data availability issues. In this light, this study focuses on a data augmentation solution engaging GANs using a freezing of gait (FoG) symptom dataset as input. The data generated by the so-called FoGGAN architecture presented in this study are almost identical to the original, as concluded by a variety of similarity metrics. This highlights the significance of such solutions, as they can provide credible synthetically generated data which can be utilized as training dataset inputs to AI applications. Additionally, a DNN classifier's performance is evaluated using three different evaluation datasets, and the accuracy results were quite encouraging, highlighting that the FoGGAN solution could help alleviate the data shortage.
Affiliation(s)
- Nikolaos Peppes
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Panagiotis Tsakanikas
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Emmanouil Daskalakis
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Theodoros Alexakis
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Evgenia Adamopoulou
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Konstantinos Demestichas
  - Department of Agricultural Economics and Rural Development, Agricultural University of Athens, 11855 Athens, Greece
15
Pun FW, Ozerov IV, Zhavoronkov A. AI-powered therapeutic target discovery. Trends Pharmacol Sci 2023; 44:561-572. [PMID: 37479540 DOI: 10.1016/j.tips.2023.06.010] [Citation(s) in RCA: 39] [Impact Index Per Article: 39.0] [Received: 05/25/2023] [Revised: 06/20/2023] [Accepted: 06/23/2023] [Indexed: 07/23/2023]
Abstract
Disease modeling and target identification are the most crucial initial steps in drug discovery, and influence the probability of success at every step of drug development. Traditional target identification is a time-consuming process that takes years to decades and usually starts in an academic setting. Given its advantages of analyzing large datasets and intricate biological networks, artificial intelligence (AI) is playing a growing role in modern drug target identification. We review recent advances in target discovery, focusing on breakthroughs in AI-driven therapeutic target exploration. We also discuss the importance of striking a balance between novelty and confidence in target selection. An increasing number of AI-identified targets are being validated through experiments and several AI-derived drugs are entering clinical trials; we highlight current limitations and potential pathways for moving forward.
Affiliation(s)
- Frank W Pun
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong
- Ivan V Ozerov
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong
- Alex Zhavoronkov
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong; Insilico Medicine MENA, 6F IRENA Building, Abu Dhabi, United Arab Emirates; Buck Institute for Research on Aging, Novato, CA, USA
16
Jacobs F, D'Amico S, Benvenuti C, Gaudio M, Saltalamacchia G, Miggiano C, De Sanctis R, Della Porta MG, Santoro A, Zambelli A. Opportunities and Challenges of Synthetic Data Generation in Oncology. JCO Clin Cancer Inform 2023; 7:e2300045. [PMID: 37535875 DOI: 10.1200/cci.23.00045] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/23/2023] [Revised: 05/05/2023] [Accepted: 05/25/2023] [Indexed: 08/05/2023] Open
Abstract
Widespread interest in artificial intelligence (AI) in health care has focused mainly on deductive systems that analyze available real-world data to discover patterns not otherwise visible. Generative adversarial network, a new type of inductive AI, has recently evolved to generate high-fidelity virtual synthetic data (SD) trained on relatively limited real-world information. The AI system is fed with a collection of real data, and it learns to generate new augmented data while maintaining the general characteristics of the original data set. The use of SD to enhance clinical research and protect patient privacy has drawn a lot of interest in medicine and in the complex field of oncology. This article summarizes the main characteristics of this innovative technology and critically discusses how it can be used to accelerate data access for secondary purposes, providing an overview of the opportunities and challenges of SD generation for clinical cancer research and health care.
Affiliation(s)
- Flavia Jacobs
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Benvenuti
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Mariangela Gaudio
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Miggiano
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Rita De Sanctis
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Matteo Giovanni Della Porta
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Armando Santoro
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
- Alberto Zambelli
  - Department of Biomedical Sciences, Humanitas University, Milan, Italy
  - IRCCS Istituto Clinico Humanitas, Milan, Italy
17
Zuber S, Bechtiger L, Bodelet JS, Golin M, Heumann J, Kim JH, Klee M, Mur J, Noll J, Voll S, O’Keefe P, Steinhoff A, Zölitz U, Muniz-Terrera G, Shanahan L, Shanahan MJ, Hofer SM. An integrative approach for the analysis of risk and health across the life course: challenges, innovations, and opportunities for life course research. DISCOVER SOCIAL SCIENCE AND HEALTH 2023; 3:14. [PMID: 37469576 PMCID: PMC10352429 DOI: 10.1007/s44155-023-00044-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/26/2023] [Accepted: 06/26/2023] [Indexed: 07/21/2023]
Abstract
Life course epidemiology seeks to understand the intricate relationships between risk factors and health outcomes across different stages of life to inform prevention and intervention strategies to optimize health throughout the lifespan. However, extant evidence has predominantly been based on separate analyses of data from individual birth cohorts or panel studies, which may not be sufficient to unravel the complex interplay of risk and health across different contexts. We highlight the importance of a multi-study perspective that enables researchers to: (a) Compare and contrast findings from different contexts and populations, which can help identify generalizable patterns and context-specific factors; (b) Examine the robustness of associations and the potential for effect modification by factors such as age, sex, and socioeconomic status; and (c) Improve statistical power and precision by pooling data from multiple studies, thereby allowing for the investigation of rare exposures and outcomes. This integrative framework combines the advantages of multi-study data with a life course perspective to guide research in understanding life course risk and resilience on adult health outcomes by: (a) Encouraging the use of harmonized measures across studies to facilitate comparisons and synthesis of findings; (b) Promoting the adoption of advanced analytical techniques that can accommodate the complexities of multi-study, longitudinal data; and (c) Fostering collaboration between researchers, data repositories, and funding agencies to support the integration of longitudinal data from diverse sources. An integrative approach can help inform the development of individualized risk scores and personalized interventions to promote health and well-being at various life stages.
Affiliation(s)
- Sascha Zuber
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC Canada
  - Center for the Interdisciplinary Study of Gerontology and Vulnerability, University of Geneva, Geneva, Switzerland
- Laura Bechtiger
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Marta Golin
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Jens Heumann
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Jung Hyun Kim
  - University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Matthias Klee
  - University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Jure Mur
  - University of Edinburgh, Edinburgh, Scotland
- Jennie Noll
  - Pennsylvania State University, State College, PA USA
- Stacey Voll
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC Canada
- Patrick O’Keefe
  - Department of Neurology, Oregon Health & Science University, Portland, OR USA
- Annekatrin Steinhoff
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - University Hospital of Child and Adolescent Psychiatry and Psychotherapy, University of Bern, Bern, Switzerland
- Ulf Zölitz
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
- Lilly Shanahan
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - Department of Psychology, University of Zürich, Zürich, Switzerland
- Michael J. Shanahan
  - Jacobs Center for Productive Youth Development, University of Zürich, Zürich, Switzerland
  - Department of Sociology, University of Zürich, Zürich, Switzerland
- Scott M. Hofer
  - Institute On Aging & Lifelong Health, University of Victoria, Victoria, BC Canada
  - Department of Neurology, Oregon Health & Science University, Portland, OR USA
18
Azizi Z, Lindner S, Shiba Y, Raparelli V, Norris CM, Kublickiene K, Herrero MT, Kautzky-Willer A, Klimek P, Gisinger T, Pilote L, El Emam K. A comparison of synthetic data generation and federated analysis for enabling international evaluations of cardiovascular health. Sci Rep 2023; 13:11540. [PMID: 37460705 DOI: 10.1038/s41598-023-38457-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2022] [Accepted: 07/08/2023] [Indexed: 07/20/2023] Open
Abstract
Sharing health data for research purposes across international jurisdictions has been a challenge due to privacy concerns. Two privacy enhancing technologies that can enable such sharing are synthetic data generation (SDG) and federated analysis, but their relative strengths and weaknesses have not been evaluated thus far. In this study we compared SDG with federated analysis to enable such international comparative studies. The objective of the analysis was to assess country-level differences in the role of sex on cardiovascular health (CVH) using a pooled dataset of Canadian and Austrian individuals. The Canadian data was synthesized and sent to the Austrian team for analysis. The utility of the pooled (synthetic Canadian + real Austrian) dataset was evaluated by comparing the regression results from the two approaches. The privacy of the Canadian synthetic data was assessed using a membership disclosure test which showed an F1 score of 0.001, indicating low privacy risk. The outcome variable of interest was CVH, calculated through a modified CANHEART index. The main and interaction effect parameter estimates of the federated and pooled analyses were consistent and directionally the same. It took approximately one month to set up the synthetic data generation platform and generate the synthetic data, whereas it took over 1.5 years to set up the federated analysis system. Synthetic data generation can be an efficient and effective tool for enabling multi-jurisdictional studies while addressing privacy concerns.
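The abstract reports a membership disclosure F1 score without saying how it was computed. One common reading is that an attacker guesses which individuals were in the training data, and the guesses are scored by precision and recall against the true membership. The sketch below shows only that generic F1 calculation; the function name and the set-of-identifiers representation are hypothetical, and the study's actual test may differ (for example, it may use an adjusted or relative F1).

```python
def membership_f1(claimed_members, true_members):
    """F1 score of an attacker's membership guesses.

    claimed_members: identifiers the attacker believes were in the training data.
    true_members: identifiers that really were in the training data.
    """
    claimed, actual = set(claimed_members), set(true_members)
    true_positives = len(claimed & actual)
    if true_positives == 0:
        return 0.0  # no correct guesses: precision and recall are both zero
    precision = true_positives / len(claimed)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

A score near zero, as reported above, would mean the attacker's guesses rarely identify real members.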
Affiliation(s)
- Zahra Azizi
  - Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
- Simon Lindner
  - Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
- Yumika Shiba
  - Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
  - Faculty of Medicine, McGill University, Montreal, Canada
- Valeria Raparelli
  - Department of Translational Medicine, University of Ferrara, Ferrara, Italy
  - Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
- Colleen M Norris
  - Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
  - Heart and Stroke Strategic Clinical Networks, Alberta Health Services, Alberta, Canada
- Maria Trinidad Herrero
  - Clinical & Experimental Neuroscience (NiCE-IMIB-IUIE), School of Medicine, University of Murcia, Murcia, Spain
- Alexandra Kautzky-Willer
  - Department of Internal Medicine III, Division of Endocrinology and Metabolism, Gender Medicine Unit, Medical University of Vienna, Vienna, Austria
- Peter Klimek
  - Section for Science of Complex Systems, CeMSIIS, Medical University of Vienna, Vienna, Austria
  - Complexity Science Hub Vienna, Vienna, Austria
- Teresa Gisinger
  - Division of Endocrinology and Metabolism, Medical University of Vienna, Vienna, Austria
- Louise Pilote
  - Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, 5252 De Maisonneuve Blvd, Office 2B.39, Montréal, QC, H4A 3S5, Canada
  - Divisions of Clinical Epidemiology and General Internal Medicine, McGill University Health Centre Research Institute, Montreal, QC, Canada
- Khaled El Emam
  - Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
19
Scendoni R, Tomassini L, Cingolani M, Perali A, Pilati S, Fedeli P. Artificial Intelligence in Evaluation of Permanent Impairment: New Operational Frontiers. Healthcare (Basel) 2023; 11:1979. [PMID: 37510420 PMCID: PMC10378994 DOI: 10.3390/healthcare11141979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 07/01/2023] [Accepted: 07/07/2023] [Indexed: 07/30/2023] Open
Abstract
Artificial intelligence (AI) and machine learning (ML) span multiple disciplines, including the medico-legal sciences, also with reference to the concept of disease and disability. In this context, the International Classification of Diseases, Injuries, and Causes of Death (ICD) is a standard for the classification of diseases and related problems developed by the World Health Organization (WHO), and it represents a valid tool for statistical and epidemiological studies. Indeed, the International Classification of Functioning, Disability, and Health (ICF) is outlined as a classification that aims to describe the state of health of people in relation to their existential spheres (social, family, work). This paper lays the foundations for proposing an operating model for the use of AI in the assessment of impairments with the aim of making the information system as homogeneous as possible, starting from the main coding systems of the reference pathologies and functional damages. Providing a scientific basis for the understanding and study of health, as well as establishing a common language for the assessment of disability in its various meanings through AI systems, will allow for the improvement and standardization of communication between the various expert users.
Affiliation(s)
- Roberto Scendoni
  - Department of Law, Institute of Legal Medicine, University of Macerata, 62100 Macerata, Italy
- Luca Tomassini
  - International School of Advanced Studies, University of Camerino, 62032 Camerino, Italy
- Mariano Cingolani
  - Department of Law, Institute of Legal Medicine, University of Macerata, 62100 Macerata, Italy
- Andrea Perali
  - Physics Unit, School of Pharmacy, University of Camerino, 62032 Camerino, Italy
- Sebastiano Pilati
  - Physics Division, School of Science and Technology, University of Camerino, 62032 Camerino, Italy
- Piergiorgio Fedeli
  - School of Law, Legal Medicine, University of Camerino, 62032 Camerino, Italy
20
Sun H, Plawinski J, Subramaniam S, Jamaludin A, Kadir T, Readie A, Ligozio G, Ohlssen D, Baillie M, Coroller T. A deep learning approach to private data sharing of medical images using conditional generative adversarial networks (GANs). PLoS One 2023; 18:e0280316. [PMID: 37410795 PMCID: PMC10325103 DOI: 10.1371/journal.pone.0280316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 12/27/2022] [Indexed: 07/08/2023] Open
Abstract
Clinical data sharing can facilitate data-driven scientific research, allowing a broader range of questions to be addressed and thereby leading to greater understanding and innovation. However, sharing biomedical data can put sensitive personal information at risk. This is usually addressed by data anonymization, which is a slow and expensive process. An alternative to anonymization is the construction of a synthetic dataset that behaves similarly to the real clinical data but preserves patient privacy. As part of a collaboration between Novartis and the Oxford Big Data Institute, a synthetic dataset was generated based on images from COSENTYX® (secukinumab) ankylosing spondylitis (AS) clinical studies. An auxiliary classifier Generative Adversarial Network (ac-GAN) was trained to generate synthetic magnetic resonance images (MRIs) of vertebral units (VUs), conditioned on the VU location (cervical, thoracic and lumbar). Here, we present a method for generating a synthetic dataset and conduct an in-depth analysis of its properties along three key metrics: image fidelity, sample diversity and dataset privacy.
Affiliation(s)
- Hanxi Sun
  - Department of Statistics, Purdue University, West Lafayette, IN, United States of America
- Jason Plawinski
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- Sajanth Subramaniam
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- Aimee Readie
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- Gregory Ligozio
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- David Ohlssen
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- Mark Baillie
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
- Thibaud Coroller
  - Novartis Pharmaceutical Corporation, East Hanover, New Jersey, United States of America
21
Wang X, Dervishi L, Li W, Jiang X, Ayday E, Vaidya J. Efficient Federated Kinship Relationship Identification. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2023; 2023:534-543. [PMID: 37351796 PMCID: PMC10283133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/24/2023]
Abstract
Kinship relationship estimation plays a significant role in today's genome studies. Since genetic data are mostly stored and protected in different silos, retrieving the desirable kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.
Affiliation(s)
- Erman Ayday
  - Case Western Reserve University, Cleveland, OH
22
Fritzsche MC, Akyüz K, Cano Abadía M, McLennan S, Marttinen P, Mayrhofer MT, Buyx AM. Ethical layering in AI-driven polygenic risk scores-New complexities, new challenges. Front Genet 2023; 14:1098439. [PMID: 36816027 PMCID: PMC9933509 DOI: 10.3389/fgene.2023.1098439] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/14/2022] [Accepted: 01/04/2023] [Indexed: 01/27/2023] Open
Abstract
Researchers aim to develop polygenic risk scores as a tool to prevent and more effectively treat serious diseases, disorders and conditions such as breast cancer, type 2 diabetes mellitus and coronary heart disease. Recently, machine learning techniques, in particular deep neural networks, have been increasingly developed to create polygenic risk scores using electronic health records as well as genomic and other health data. While the use of artificial intelligence for polygenic risk scores may enable greater accuracy, performance and prediction, it also presents a range of increasingly complex ethical challenges. The ethical and social issues of many polygenic risk score applications in medicine have been widely discussed. However, in the literature and in practice, the ethical implications of their confluence with the use of artificial intelligence have not yet been sufficiently considered. Based on a comprehensive review of the existing literature, we argue that this stands in need of urgent consideration for research and subsequent translation into the clinical setting. Considering the many ethical layers involved, we will first give a brief overview of the development of artificial intelligence-driven polygenic risk scores, associated ethical and social implications, challenges in artificial intelligence ethics, and finally, explore potential complexities of polygenic risk scores driven by artificial intelligence. We point out emerging complexity regarding fairness, challenges in building trust, explaining and understanding artificial intelligence and polygenic risk scores as well as regulatory uncertainties and further challenges. We strongly advocate taking a proactive approach to embedding ethics in research and implementation processes for polygenic risk scores driven by artificial intelligence.
Affiliation(s)
- Marie-Christine Fritzsche
  - Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany
  - Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany
- Kaya Akyüz
  - Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria
  - Department of Science and Technology Studies, University of Vienna, Vienna, Austria
- Mónica Cano Abadía
  - Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria
- Stuart McLennan
  - Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany
  - Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany
- Pekka Marttinen
  - Helsinki Institute for Information Technology HIIT, Aalto University, Helsinki, Finland
- Michaela Th. Mayrhofer
  - Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria
- Alena M. Buyx
  - Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany
  - Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany
|
23
|
Hernadez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods Inf Med 2023. [PMID: 36623830 DOI: 10.1055/s-0042-1760247] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/11/2023]
Abstract
BACKGROUND Synthetic tabular data generation is a potentially valuable technology with great promise for data augmentation and privacy preservation. However, prior to adoption, an empirical assessment of generated synthetic tabular data is required across dimensions relevant to the target application to determine its efficacy. A lack of standardized and objective evaluation and benchmarking strategy for synthetic tabular data in the health domain has been found in the literature. OBJECTIVE The aim of this paper is to identify key dimensions, per dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development and to provide a strategy to orchestrate them. METHODS Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation are orchestrated into a complete evaluation pipeline. This way, a guided and comparative assessment of generated synthetic tabular data can be done, categorizing its quality into three categories ("Excellent," "Good," and "Poor"). Six health care-related datasets and four synthetic tabular data generation approaches have been chosen to conduct an analysis and evaluation to verify the utility of the proposed evaluation pipeline. RESULTS The synthetic tabular data generated with the four selected approaches has maintained resemblance, utility, and privacy for most datasets and synthetic tabular data generation approach combination. In several datasets, some approaches have outperformed others, while in other datasets, more than one approach has yielded the same performance. CONCLUSION The results have shown that the proposed pipeline can effectively be used to evaluate and benchmark the synthetic tabular data generated by various synthetic tabular data generation approaches. 
Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
Affiliation(s)
- Mikel Hernadez
  - Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
- Gorka Epelde
  - Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
  - eHealth Group, Biodonostia Health Research Institute, Donostia-San Sebastian, Spain
- Ane Alberdi
  - Biomedical Engineering Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Spain
- Rodrigo Cilla
  - Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
- Debbie Rankin
  - School of Computing, Engineering and Intelligent Systems, Ulster University, Derry-Londonderry, United Kingdom
|
24
|
Ge S, Liu B, Wang P, Li Y, Zeng D. Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation. IEEE Trans Image Process 2022; PP:116-127. [PMID: 37015525 DOI: 10.1109/tip.2022.3226416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is taking models as bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few of them are fed into the teacher ensemble to query labels via differentially private aggregation, while most of them are embedded to the trained VAE for reconstructing synthetic data. Finally, a semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.
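To make the "differentially private aggregation" step in this abstract more concrete, the sketch below shows one common way such label queries are answered: a noisy-max vote over the teacher ensemble with Laplace noise added to each class count. The abstract does not state which mechanism the authors use, so the Laplace noisy-max rule, the function name `noisy_max_label`, and the `noise_scale` parameter are all assumptions made for illustration only.

```python
import random
from collections import Counter


def noisy_max_label(teacher_votes, num_classes, noise_scale, rng=None):
    """Return the class with the highest noise-perturbed vote count.

    Hypothetical helper: assumes each teacher casts one vote (an int in
    range(num_classes)) and that Laplace noise with scale `noise_scale`
    is added to every class count before taking the arg-max.
    """
    rng = rng or random.Random()
    counts = Counter(teacher_votes)
    noisy_counts = []
    for label in range(num_classes):
        # The difference of two exponentials with mean `noise_scale`
        # is a Laplace(0, noise_scale) sample.
        noise = rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        noisy_counts.append(counts.get(label, 0) + noise)
    return max(range(num_classes), key=noisy_counts.__getitem__)


# Illustrative votes from seven teachers over three classes.
votes = [2, 2, 2, 1, 2, 0, 2]
label = noisy_max_label(votes, num_classes=3, noise_scale=1.0, rng=random.Random(0))
```

A larger `noise_scale` gives stronger privacy for each query at the cost of more label errors, which is the utility-privacy trade-off the abstract refers to.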
|
25
|
Rajotte JF, Bergen R, Buckeridge DL, El Emam K, Ng R, Strome E. Synthetic data as an enabler for machine learning applications in medicine. iScience 2022; 25:105331. [PMID: 36325058 PMCID: PMC9619172 DOI: 10.1016/j.isci.2022.105331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2022]
Abstract
Synthetic data generation (SDG) is the process of using machine learning methods to train a model that captures the patterns in a real dataset. Then new or synthetic data can be generated from that trained model. The synthetic data does not have a one-to-one mapping to the original data or to real patients, and therefore has the potential of privacy preserving properties. There is a growing interest in the application of synthetic data across health and life sciences, but to fully realize the benefits, further education, research, and policy innovation are required. This article summarizes the opportunities and challenges of SDG for health data, and provides directions for how this technology can be leveraged to accelerate data access for secondary purposes.
Affiliation(s)
- Robert Bergen
  - Data Science Institute, University of British Columbia, Vancouver, BC, Canada
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa and Replica Analytics, Ottawa, ON, Canada
- Raymond Ng
  - Data Science Institute, University of British Columbia, Vancouver, BC, Canada
|
26
|
Sakthivel RK, Nagasubramanian G, Sankayya M, Al-Turjman F. Multilingual News Feed Analysis Using Intelligent Linguistic Particle Filtering Techniques. ACM Trans Asian Low-Resour Lang Inf Process 2022. [DOI: 10.1145/3569899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Analyzing real-time news feeds and their impacts in the real world is a complex task in the social networking arena. Particularly, countries with a multilingual environment have various patterns and perceptions of news reports considering the diversity of the people. Multilingual and multimodal news analysis is an emerging trend for evaluating news source neutralities. Therefore, in this work, four new deep news particle filtering techniques were developed, including generic news analysis, sequential importance re-sampling (SIR)-based news particle filtering analysis, reinforcement learning (RL)-based multimodal news analysis, and deep Convolution neural network (DCNN)-based multi-news filtering approach, for news classification. Results indicate that these techniques, which primarily employ particle filtering with multilevel sampling strategies, produce 15% to 20% better performance than conventional news analysis techniques.
Affiliation(s)
- Fadi Al-Turjman
  - Artificial Intelligence Engineering Dept., AI and Robotics Institute, Near East University, Mersin 10, Turkey
  - Research Center for AI and IoT, Faculty of Engineering, University of Kyrenia, Mersin 10, Turkey
|
27
|
El Emam K, Mosquera L, Fang X. Validating a membership disclosure metric for synthetic health data. JAMIA Open 2022; 5:ooac083. [PMID: 36238080 PMCID: PMC9553223 DOI: 10.1093/jamiaopen/ooac083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/13/2022] [Accepted: 09/22/2022] [Indexed: 11/24/2022]
Abstract
BACKGROUND One of the increasingly accepted methods to evaluate the privacy of synthetic data is by measuring the risk of membership disclosure. This is a measure of the F1 accuracy that an adversary would correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and is commonly estimated using a data partitioning methodology with a 0.5 partitioning parameter. OBJECTIVE Validate the membership disclosure F1 score, evaluate and improve the parametrization of the partitioning method, and provide a benchmark for its interpretation. MATERIALS AND METHODS We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack. RESULTS The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. CONCLUSIONS Our proposed parameterization, as well as interpretation and generative model training guidance provide a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
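The membership disclosure F1 score described in this abstract can be illustrated with a short Python sketch. It assumes an attack dataset that mixes records from the training data (members) with records from the same population that were not used for training (non-members), and an adversary who claims membership whenever an attack record matches some synthetic record on all but a few attributes. The matching rule, the function name `membership_f1`, and the `max_mismatches` parameter are hypothetical; the abstract does not specify how the adversary decides.

```python
def membership_f1(attack_records, is_member, synthetic_records, max_mismatches=0):
    """F1 score of a simple match-based membership attack.

    attack_records    : list of tuples (one tuple of attribute values per person)
    is_member         : list of bools, True if the record was in the training data
    synthetic_records : list of tuples produced by the generative model
    max_mismatches    : assumed tolerance, in attributes, for declaring a match
    """
    synthetic = list(synthetic_records)

    def claimed(record):
        # The adversary claims membership if any synthetic record is close enough.
        return any(
            sum(a != b for a, b in zip(record, s)) <= max_mismatches
            for s in synthetic
        )

    tp = fp = fn = 0
    for record, member in zip(attack_records, is_member):
        guess = claimed(record)
        if guess and member:
            tp += 1
        elif guess and not member:
            fp += 1
        elif member:
            fn += 1

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy attack dataset: two members and two non-members, so the member
# proportion is 0.5 (the default partitioning parameter mentioned above).
attack = [(1, 0), (2, 1), (3, 1), (4, 0)]
members = [True, True, False, False]
synthetic = [(1, 0), (3, 1)]
score = membership_f1(attack, members, synthetic)
```

The share of members in `attack` plays the role of the partitioning parameter; the abstract's main result is that this share should equal the sampling fraction of the real dataset from the population rather than a fixed 0.5.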
Affiliation(s)
- Khaled El Emam
  - Corresponding Author: Khaled El Emam, PhD, Research Institute, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada
- Lucy Mosquera
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
  - Research Institute, Children’s Hospital of Eastern Ontario, Ottawa, Ontario, Canada
- Xi Fang
  - Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
|
28
|
Shi J, Wang D, Tesei G, Norgeot B. Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments. Front Artif Intell 2022; 5:918813. [PMID: 36187323 PMCID: PMC9515575 DOI: 10.3389/frai.2022.918813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Accepted: 08/15/2022] [Indexed: 12/03/2022]
Abstract
In the past decade, there has been exponentially growing interest in the use of observational data collected as a part of routine healthcare practice to determine the effect of a treatment with causal inference models. Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed. There have been multiple efforts to fill this void using synthetic data where the ground truth can be generated. However, to date, these datasets have been severely limited in their utility either by being modeled after small non-representative patient populations, being dissimilar to real target populations, or only providing known effects for two cohorts (treated vs. control). In this work, we produced a large-scale and realistic synthetic dataset that provides ground truth effects for over 10 hypertension treatments on blood pressure outcomes. The synthetic dataset was created by modeling a nationwide cohort of more than 580,000 hypertension patient data including each person's multi-year history of diagnoses, medications, and laboratory values. We designed a data generation process by combining an adapted ADS-GAN model for fictitious patient information generation and a neural network for treatment outcome generation. Wasserstein distance of 0.35 demonstrates that our synthetic data follows a nearly identical joint distribution to the patient cohort used to generate the data. Patient privacy was a primary concern for this study; the ϵ-identifiability metric, which estimates the probability of actual patients being identified, is 0.008%, ensuring that our synthetic data cannot be used to identify any actual patients. To demonstrate its usage, we tested the bias in causal effect estimation of four well-established models using this dataset. The approach we used can be readily extended to other types of diseases in the clinical domain, and to datasets in other domains as well.
|
29
|
Couckuyt A, Seurinck R, Emmaneel A, Quintelier K, Novak D, Van Gassen S, Saeys Y. Challenges in translational machine learning. Hum Genet 2022; 141:1451-1466. [PMID: 35246744 PMCID: PMC8896412 DOI: 10.1007/s00439-022-02439-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/09/2021] [Accepted: 02/08/2022] [Indexed: 11/25/2022]
Abstract
Machine learning (ML) algorithms are increasingly being used to help implement clinical decision support systems. In this new field, we define as "translational machine learning", joint efforts and strong communication between data scientists and clinicians help to span the gap between ML and its adoption in the clinic. These collaborations also improve interpretability and trust in translational ML methods and ultimately aim to result in generalizable and reproducible models. To help clinicians and bioinformaticians refine their translational ML pipelines, we review the steps from model building to the use of ML in the clinic. We discuss experimental setup, computational analysis, interpretability and reproducibility, and emphasize the challenges involved. We highly advise collaboration and data sharing between consortia and institutes to build multi-centric cohorts that facilitate ML methodologies that generalize across centers. In the end, we hope that this review provides a way to streamline translational ML and helps to tackle the challenges that come with it.
Affiliation(s)
- Artuur Couckuyt
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Ruth Seurinck
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Annelies Emmaneel
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Katrien Quintelier
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
  - Department of Pulmonary Diseases, Erasmus MC, Rotterdam, The Netherlands
- David Novak
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Sofie Van Gassen
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Yvan Saeys
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
|
30
|
Generation of realistic synthetic data using Multimodal Neural Ordinary Differential Equations. NPJ Digit Med 2022; 5:122. [PMID: 35986075 PMCID: PMC9391444 DOI: 10.1038/s41746-022-00666-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Accepted: 07/25/2022] [Indexed: 11/11/2022]
Abstract
Individual organizations, such as hospitals, pharmaceutical companies, and health insurance providers, are currently limited in their ability to collect data that are fully representative of a disease population. This can, in turn, negatively impact the generalization ability of statistical models and scientific insights. However, sharing data across different organizations is highly restricted by legal regulations. While federated data access concepts exist, they are technically and organizationally difficult to realize. An alternative approach would be to exchange synthetic patient data instead. In this work, we introduce the Multimodal Neural Ordinary Differential Equations (MultiNODEs), a hybrid, multimodal AI approach, which allows for generating highly realistic synthetic patient trajectories on a continuous time scale, hence enabling smooth interpolation and extrapolation of clinical studies. Our proposed method can integrate both static and longitudinal data, and implicitly handles missing values. We demonstrate the capabilities of MultiNODEs by applying them to real patient-level data from two independent clinical studies and simulated epidemiological data of an infectious disease.
|
31
|
Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.053] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
|
32
|
Abstract
We consider the problem of enhancing user privacy in common data analysis and machine learning development tasks, such as data annotation and inspection, by substituting the real data with samples from a generative adversarial network. We propose employing Bayesian differential privacy as the means to achieve a rigorous theoretical guarantee while providing a better privacy-utility trade-off. We demonstrate experimentally that our approach produces higher-fidelity samples compared to prior work, allowing to (1) detect more subtle data errors and biases, and (2) reduce the need for real data labelling by achieving high accuracy when training directly on artificial samples.
|
33
|
ZenoPS: A Distributed Learning System Integrating Communication Efficiency and Security. Algorithms 2022. [DOI: 10.3390/a15070233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Distributed machine learning is primarily motivated by the promise of increased computation power for accelerating training and mitigating privacy concerns. Unlike machine learning on a single device, distributed machine learning requires collaboration and communication among the devices. This creates several new challenges: (1) the heavy communication overhead can be a bottleneck that slows down the training, and (2) the unreliable communication and weaker control over the remote entities make the distributed system vulnerable to systematic failures and malicious attacks. This paper presents a variant of stochastic gradient descent (SGD) with improved communication efficiency and security in distributed environments. Our contributions include (1) a new technique called error reset to adapt both infrequent synchronization and message compression for communication reduction in both synchronous and asynchronous training, (2) new score-based approaches for validating the updates, and (3) integration with both error reset and score-based validation. The proposed system provides communication reduction, both synchronous and asynchronous training, Byzantine tolerance, and local privacy preservation. We evaluate our techniques both theoretically and empirically.
|
34
|
Coyner AS, Chen JS, Chang K, Singh P, Ostmo S, Chan RVP, Chiang MF, Kalpathy-Cramer J, Campbell JP. Synthetic Medical Images for Robust, Privacy-Preserving Training of Artificial Intelligence: Application to Retinopathy of Prematurity Diagnosis. Ophthalmol Sci 2022; 2:100126. [PMID: 36249693 PMCID: PMC9560638 DOI: 10.1016/j.xops.2022.100126] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/24/2021] [Revised: 02/01/2022] [Accepted: 02/07/2022] [Indexed: 02/06/2023]
Abstract
Purpose Developing robust artificial intelligence (AI) models for medical image analysis requires large quantities of diverse, well-chosen data that can prove challenging to collect because of privacy concerns, disease rarity, or diagnostic label quality. Collecting image-based datasets for retinopathy of prematurity (ROP), a potentially blinding disease, suffers from these challenges. Progressively growing generative adversarial networks (PGANs) may help, because they can synthesize highly realistic images that may increase both the size and diversity of medical datasets. Design Diagnostic validation study of convolutional neural networks (CNNs) for plus disease detection, a component of severe ROP, using synthetic data. Participants Five thousand eight hundred forty-two retinal fundus images (RFIs) collected from 963 preterm infants. Methods Retinal vessel maps (RVMs) were segmented from RFIs. PGANs were trained to synthesize RVMs with normal, pre-plus, or plus disease vasculature. Convolutional neural networks were trained, using real or synthetic RVMs, to detect plus disease from 2 real RVM test datasets. Main Outcome Measures Features of real and synthetic RVMs were evaluated using uniform manifold approximation and projection (UMAP). Similarities were evaluated at the dataset and feature level using Fréchet inception distance and Euclidean distance, respectively. CNN performance was assessed via area under the receiver operating characteristic curve (AUC); AUCs were compared via bootstrapping and Delong's test for correlated receiver operating characteristic curves. Confusion matrices were compared using McNemar's chi-square test and Cohen's κ value. Results The CNN trained on synthetic RVMs showed a significantly higher AUC (0.971; P = 0.006 and P = 0.004) and classified plus disease more similarly to a set of 8 international experts (κ = 0.922) than the CNN trained on real RVMs (AUC = 0.934; κ = 0.701). 
Real and synthetic RVMs overlapped, by plus disease diagnosis, on the UMAP manifold, showing that synthetic images spanned the disease severity spectrum. Fréchet inception distance and Euclidean distances suggested that real and synthetic RVMs were more dissimilar to one another than real RVMs were to one another, further suggesting that synthetic RVMs were distinct from the training data with respect to privacy considerations. Conclusions Synthetic datasets may be useful for training robust medical AI models. Furthermore, PGANs may be able to synthesize realistic data for use without protected health information concerns.
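The agreement statistic reported in this abstract, Cohen's κ, is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from their marginal label frequencies. The Python sketch below computes it for two label sequences; the function name `cohens_kappa` and the toy labels are illustrative and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the raters' marginal frequencies, per class.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: a model's calls versus a reference panel on four images.
model_calls = ["plus", "plus", "normal", "normal"]
panel_calls = ["plus", "normal", "normal", "normal"]
kappa = cohens_kappa(model_calls, panel_calls)
```

In this toy case the raters agree on three of four images (p_o = 0.75) while chance agreement is 0.5, so κ is 0.5; values near 1, such as the 0.922 quoted above, indicate agreement far beyond chance.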
Key Words
- AI, artificial intelligence
- Artificial intelligence
- CNN, convolutional neural network
- DL, deep learning
- Deep learning
- FID, Fréchet inception distance
- GAN, generative adversarial network
- Generative adversarial network
- PGAN, progressively growing generative adversarial network
- RFI, retinal fundus image
- ROP, retinopathy of prematurity
- RVM, retinal vessel map
- Retinopathy of prematurity
- UMAP, uniform manifold approximation and projection
Affiliation(s)
- Aaron S. Coyner
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon
- Jimmy S. Chen
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon
  - Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, San Diego, California
- Ken Chang
  - Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts
  - Center for Clinical Data Science, Massachusetts General Hospital and Boston Women’s Hospital, Boston, Massachusetts
- Praveer Singh
  - Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts
  - Center for Clinical Data Science, Massachusetts General Hospital and Boston Women’s Hospital, Boston, Massachusetts
- Susan Ostmo
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon
- R. V. Paul Chan
  - Department of Ophthalmology and Visual Sciences, Eye and Ear Infirmary, University of Illinois, Chicago, Illinois
- Michael F. Chiang
  - National Eye Institute, National Institutes of Health, Bethesda, Maryland
- Jayashree Kalpathy-Cramer
  - Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts
  - Center for Clinical Data Science, Massachusetts General Hospital and Boston Women’s Hospital, Boston, Massachusetts
- J. Peter Campbell
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon
- Imaging and Informatics in Retinopathy of Prematurity Consortium†
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, Oregon
  - Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, San Diego, California
  - Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, Massachusetts
  - Center for Clinical Data Science, Massachusetts General Hospital and Boston Women’s Hospital, Boston, Massachusetts
  - Department of Ophthalmology and Visual Sciences, Eye and Ear Infirmary, University of Illinois, Chicago, Illinois
  - National Eye Institute, National Institutes of Health, Bethesda, Maryland
|
35
|
Bonomi L, Fan L. Sharing Time-to-Event Data with Privacy Protection. Proc IEEE Int Conf Healthc Inform 2022; 2022:10.1109/ichi54592.2022.00014. [PMID: 36120417 PMCID: PMC9473343 DOI: 10.1109/ichi54592.2022.00014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/15/2023]
Abstract
Sharing time-to-event data is beneficial for enabling collaborative research efforts (e.g., survival studies), facilitating the design of effective interventions, and advancing patient care (e.g., early diagnosis). Despite numerous privacy solutions for sharing time-to-event data, recent research studies have shown that external information may become available (e.g., self-disclosure of study participation on social media) to an adversary, posing new privacy concerns. In this work, we formulate a cohort inference attack for time-to-event data sharing, in which an informed adversary aims at inferring the membership of a target individual in a specific cohort. Our study investigates the privacy risks associated with time-to-event data and evaluates the empirical privacy protection offered by popular privacy-protecting solutions (e.g., binning, differential privacy). Furthermore, we propose a novel approach to privately release individual level time-to-event data with high utility, while providing indistinguishability guarantees for the input value. Our method TE-Sanitizer is shown to provide effective mitigation against the inference attacks and high usefulness in survival analysis. The results and discussion provide domain experts with insights on the privacy and the usefulness of the studied methods.
Affiliation(s)
- Luca Bonomi
  - Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Liyue Fan
  - Dept. of Computer Science, University of North Carolina at Charlotte, Charlotte, NC
|
36
|
Torkzadehmahani R, Nasirigerdeh R, Blumenthal DB, Kacprowski T, List M, Matschinske J, Spaeth J, Wenke NK, Baumbach J. Privacy-Preserving Artificial Intelligence Techniques in Biomedicine. Methods Inf Med 2022; 61:e12-e27. [PMID: 35062032 PMCID: PMC9246509 DOI: 10.1055/s-0041-1740630] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/22/2021] [Accepted: 09/18/2021] [Indexed: 12/15/2022]
Abstract
BACKGROUND: Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g., in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. OBJECTIVES: However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions on access to genomic and other biomedical data, which is detrimental to collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. METHOD: This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. CONCLUSION: As the most promising direction, we suggest combining federated machine learning, as a more scalable approach, with additional privacy-preserving techniques. This would merge their advantages and provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary, as hybrid approaches pose new challenges such as additional network or computation overhead.
Affiliation(s)
- Reihaneh Torkzadehmahani
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
- Reza Nasirigerdeh
  - Institute for Artificial Intelligence in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  - Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany
- David B. Blumenthal
  - Department of Artificial Intelligence in Biomedical Engineering (AIBE), Friedrich-Alexander University Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tim Kacprowski
  - Division of Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Medical School Hannover, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Markus List
  - Chair of Experimental Bioinformatics, Technical University of Munich, Munich, Germany
- Julian Matschinske
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Julian Spaeth
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Nina Kerstin Wenke
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Jan Baumbach
  - E.U. Horizon2020 FeatureCloud Project Consortium
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark

37
Kokosi T, De Stavola B, Mitra R, Frayling L, Doherty A, Dove I, Sonnenberg P, Harron K. An overview of synthetic administrative data for research. Int J Popul Data Sci 2022; 7:1727. [PMID: 37650026 PMCID: PMC10464868 DOI: 10.23889/ijpds.v7i1.1727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Use of administrative data for research and for planning services has increased over recent decades due to the value of the large, rich information available. However, concerns about the release of sensitive or personal data and the associated disclosure risk can lead to lengthy approval processes and restricted data access. This can delay or prevent the production of timely evidence. A promising solution to facilitate more efficient data access is to create synthetic versions of the original datasets which are less likely to hold confidential information and can minimise disclosure risk. Such data may be used as an interim solution, allowing researchers to develop their analysis plans on non-disclosive data, whilst waiting for access to the real data. We aim to provide an overview of the background and uses of synthetic data and describe common methods used to generate synthetic data in the context of UK administrative research. We propose a simplified terminology for categories of synthetic data (univariate, multivariate, and complex modality synthetic data) as well as a more comprehensive description of the terminology used in the existing literature and illustrate challenges and future directions for research.
Affiliation(s)
- Theodora Kokosi
  - Department of Population, Policy and Practice, UCL Great Ormond Street Institute of Child Health, University College London, London, UK
- Bianca De Stavola
  - Department of Population, Policy and Practice, UCL Great Ormond Street Institute of Child Health, University College London, London, UK
- Robin Mitra
  - School of Mathematics, Cardiff University, Cardiff, UK
- Aiden Doherty
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
  - Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Iain Dove
  - Office for National Statistics, Titchfield, UK
- Pam Sonnenberg
  - Department of Infection & Population Health, Institute for Global Health, University College London, London, UK
- Katie Harron
  - Department of Population, Policy and Practice, UCL Great Ormond Street Institute of Child Health, University College London, London, UK

38
Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2022; 29:1350-1365. [PMID: 35357487 PMCID: PMC8992357 DOI: 10.1093/jamia/ocac045] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 03/11/2022] [Accepted: 03/28/2022] [Indexed: 11/16/2022]
Abstract
OBJECTIVE: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses. MATERIALS AND METHODS: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated. RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased. DISCUSSION: Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression. CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
Affiliation(s)
- Jason A Thomas
  - Corresponding Author: Jason A. Thomas, PhD, Philips North America, LLC, 22100 Bothell Everett Hwy, Bothell, WA 98021, USA
- Randi E Foraker
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Jon D Morrow
  - MDClone Ltd., Be’er Sheva, Israel
  - Department of Obstetrics and Gynecology, New York University Grossman School of Medicine, New York, New York, USA
- Philip R O Payne
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA
- Adam B Wilcox
  - Division of General Medical Sciences, School of Medicine, Washington University in St. Louis, St. Louis, Missouri, USA
  - School of Medicine, Institute for Informatics, Washington University in St. Louis, St. Louis, Missouri, USA

39
Hartebrodt A, Röttger R. Federated horizontally partitioned principal component analysis for biomedical applications. Bioinformatics Advances 2022; 2:vbac026. [PMID: 36699354 PMCID: PMC9710634 DOI: 10.1093/bioadv/vbac026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 04/07/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Federated learning enables privacy-preserving machine learning in the medical domain because the sensitive patient data remain with the owner and only parameters are exchanged between the data holders. The federated scenario introduces specific challenges related to the decentralized nature of the data, such as batch effects and differences in study population between the sites. Here, we investigate the challenges of moving classical analysis methods to the federated domain, specifically principal component analysis (PCA), a versatile and widely used tool, often serving as an initial step in machine learning and visualization workflows. We provide implementations of different federated PCA algorithms and evaluate them regarding their accuracy for high-dimensional biological data using realistic sample distributions over multiple data sites, and their ability to preserve downstream analyses. Results: Federated subspace iteration converges to the centralized solution even for unfavorable data distributions, while approximate methods introduce error. Larger sample sizes at the study sites lead to better accuracy of the approximate methods. Approximate methods may be sufficient for coarse data visualization, but are vulnerable to outliers and batch effects. Before the analysis, the PCA algorithm, as well as the number of eigenvectors, should be considered carefully to avoid unnecessary communication overhead. Availability and implementation: Simulation code and notebooks for federated PCA can be found at https://gitlab.com/roettgerlab/federatedPCA; the code for the federated app is available at https://github.com/AnneHartebrodt/fc-federated-pca. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Anne Hartebrodt
  - Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark (to whom correspondence should be addressed)
- Richard Röttger
  - Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark
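
The abstract for this entry reports that federated subspace iteration converges to the centralized PCA solution. The sketch below is a minimal illustration of that idea for a single component (power iteration, the one-dimensional case of subspace iteration). It is not the authors' implementation: the function names, the assumption that every site holds already-centered rows over the same features, and the plain summation at the aggregator are all illustrative assumptions.

```python
import math
import random


def local_update(site_rows, direction):
    """One site's contribution X^T (X v); only this d-vector leaves the site."""
    dim = len(direction)
    out = [0.0] * dim
    for row in site_rows:
        projection = sum(r * v for r, v in zip(row, direction))
        for j in range(dim):
            out[j] += projection * row[j]
    return out


def normalize(vector):
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def federated_power_iteration(sites, dim, iters=100, seed=0):
    """Estimate the first principal direction of the pooled (centered) data.

    `sites` is a list of per-site row lists (hypothetical layout). The
    aggregator sums the local updates and renormalizes each round.
    """
    rng = random.Random(seed)
    direction = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(iters):
        aggregate = [0.0] * dim
        for site_rows in sites:
            update = local_update(site_rows, direction)
            aggregate = [a + u for a, u in zip(aggregate, update)]
        direction = normalize(aggregate)
    return direction
```

Because the sum of the per-site products equals the product with the pooled covariance matrix, running the same routine on the concatenated rows gives the same direction up to floating-point rounding, which is the sense in which the federated result matches the centralized one.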

40
El Emam K, Mosquera L, Fang X, El-Hussuna A. Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study. JMIR Med Inform 2022; 10:e35734. [PMID: 35389366 PMCID: PMC9030990 DOI: 10.2196/35734] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/15/2021] [Revised: 01/27/2022] [Accepted: 02/13/2022] [Indexed: 01/06/2023]
Abstract
Background: A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research. Methods: We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a Generative Adversarial Network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference between each of the area under the receiver operating characteristic curve and area under the precision-recall curve values on synthetic data logistic regression prediction models versus real data models. Results: The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions. Conclusions: This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternate SDG methods.
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
- Lucy Mosquera
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
- Xi Fang
  - Replica Analytics Ltd, Ottawa, ON, Canada
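
The abstract for this entry singles out the Hellinger distance as a utility metric. As a rough illustration of the underlying quantity, the sketch below computes the Hellinger distance between the empirical distributions of one categorical variable in a real and a synthetic sample. This is the simple univariate discrete form, not the multivariate Gaussian-copula version validated in the study; the function name and inputs are hypothetical.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Returns a value in [0, 1]: 0 when the category frequencies match
    exactly, 1 when the two samples share no category.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    # Bhattacharyya coefficient: sum over categories of sqrt(p * q).
    # Counter returns 0 for categories missing from one sample.
    coefficient = sum(
        math.sqrt((real_counts[c] / n_real) * (synth_counts[c] / n_synth))
        for c in categories
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))
```

A smaller distance means the synthetic marginal is closer to the real one; comparing several generators on the same data set by this value mirrors, in a much reduced form, the ranking use described in the abstract.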

41
Bonomi L, Wu Z, Fan L. Sharing personal ECG time-series data privately. J Am Med Inform Assoc 2022; 29:1152-1160. [PMID: 35380666 PMCID: PMC9196703 DOI: 10.1093/jamia/ocac047] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/09/2021] [Revised: 03/16/2022] [Accepted: 03/31/2022] [Indexed: 11/13/2022]
Abstract
Objective
Emerging technologies (eg, wearable devices) have made it possible to collect data directly from individuals (eg, time-series), providing new insights on the health and well-being of individual patients. Broadening the access to these data would facilitate the integration with existing data sources (eg, clinical and genomic data) and advance medical research. Compared to traditional health data, these data are collected directly from individuals, are highly unique and provide fine-grained information, posing new privacy challenges. In this work, we study the applicability of a novel privacy model to enable individual-level time-series data sharing while maintaining the usability for data analytics.
Methods and materials
We propose a privacy-protecting method for sharing individual-level electrocardiography (ECG) time-series data, which leverages a dimensionality reduction technique and random sampling to achieve provable privacy protection. We show that our solution provides strong privacy protection against an informed adversarial model while enabling useful aggregate-level analysis.
Results
We conduct our evaluations on 2 real-world ECG datasets. Our empirical results show that the privacy risk is significantly reduced after sanitization while the data usability is retained for a variety of clinical tasks (eg, predictive modeling and clustering).
Discussion
Our study investigates the privacy risk in sharing individual-level ECG time-series data. We demonstrate that individual-level data can be highly unique, requiring new privacy solutions to protect data contributors.
Conclusion
The results suggest our proposed privacy-protection method provides strong privacy protections while preserving the usefulness of the data.
Affiliation(s)
- Luca Bonomi
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Zeyun Wu
  - Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, California, USA
- Liyue Fan
  - Department of Computer Science, University of North Carolina at Charlotte, Charlotte, North Carolina, USA

42
Liu H, Peng C, Tian Y, Long S, Tian F, Wu Z. GDP vs. LDP: A Survey from the Perspective of Information-Theoretic Channel. Entropy (Basel, Switzerland) 2022; 24:430. [PMID: 35327940 PMCID: PMC8953244 DOI: 10.3390/e24030430] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/06/2022] [Revised: 03/02/2022] [Accepted: 03/17/2022] [Indexed: 11/30/2022]
Abstract
Existing work has studied global differential privacy (GDP) and local differential privacy (LDP) in depth using information theory. However, the data privacy community has not systematically reviewed and analyzed GDP and LDP under the information-theoretic channel model. To this end, this survey systematically reviews GDP and LDP from the perspective of the information-theoretic channel. First, we present the privacy threat model under the information-theoretic channel. Second, we describe and compare the information-theoretic channel models of GDP and LDP. Third, we summarize and analyze the definitions, privacy-utility metrics, properties, and mechanisms of GDP and LDP under their channel models. Finally, we discuss the open problems of GDP and LDP for different types of information-theoretic channel models in light of this review. Our main contribution is a systematic survey of channel models, definitions, privacy-utility metrics, properties, and mechanisms for GDP and LDP from the perspective of the information-theoretic channel, together with a survey of differentially private synthetic data generation using generative adversarial networks and federated learning, respectively. This work supports a systematic understanding of the privacy threat model, definitions, privacy-utility metrics, properties, and mechanisms of GDP and LDP, and promotes further research on GDP and LDP based on different types of information-theoretic channel models.
Affiliation(s)
- Hai Liu
  - Guizhou Big Data Academy, Guizhou University, Guiyang 550025, China
  - College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Changgen Peng
  - Guizhou Big Data Academy, Guizhou University, Guiyang 550025, China
  - College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Youliang Tian
  - College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Shigong Long
  - College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Feng Tian
  - School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
- Zhenqiang Wu
  - School of Computer Science, Shaanxi Normal University, Xi’an 710119, China

43
Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain. Electronics 2022. [DOI: 10.3390/electronics11050812] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 01/27/2023]
Abstract
To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how the adoption of state-of-the-art synthetic data generation techniques can be applied for real-world applications.

44
Dong J, Roth A, Su WJ. Authors’ reply to the Discussion of ‘Gaussian Differential Privacy’ by Dong et al. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

45
Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:life12020279. [PMID: 35207566 PMCID: PMC8875522 DOI: 10.3390/life12020279] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/30/2021] [Revised: 01/26/2022] [Accepted: 02/09/2022] [Indexed: 12/13/2022]
Abstract
Polygenic diseases, which are genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, more personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well as the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics; provides a brief overview of AI; and identifies the current applications, limitations, and future directions of AI in genomics.

46
Artificial Intelligence and Hypertension Management. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]

47
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 556] [Impact Index Per Article: 278.0] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Affiliation(s)
- Joe G Greener
  - Department of Computer Science, University College London, London, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, London, UK
- Lewis Moffat
  - Department of Computer Science, University College London, London, UK
- David T Jones
  - Department of Computer Science, University College London, London, UK

48
Zhang Z, Yan C, Malin BA. Membership inference attacks against synthetic health data. J Biomed Inform 2022; 125:103977. [PMID: 34920126 PMCID: PMC8766950 DOI: 10.1016/j.jbi.2021.103977] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/15/2021] [Revised: 11/17/2021] [Accepted: 12/08/2021] [Indexed: 01/03/2023]
Abstract
Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
Affiliation(s)
- Ziqi Zhang
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240 (corresponding author)
- Chao Yan
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240
- Bradley A. Malin
  - Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240
  - Vanderbilt University Medical Center, 2525 West End Avenue, Nashville, TN 37240

49
Wan Z, Vorobeychik Y, Xia W, Liu Y, Wooders M, Guo J, Yin Z, Clayton EW, Kantarcioglu M, Malin BA. Using game theory to thwart multistage privacy intrusions when sharing data. Science Advances 2021; 7:eabe9986. [PMID: 34890225 PMCID: PMC8664254 DOI: 10.1126/sciadv.abe9986] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 12/14/2020] [Accepted: 10/25/2021] [Indexed: 06/13/2023]
Abstract
Person-specific biomedical data are now widely collected, but sharing them raises privacy concerns, specifically about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform decisions about whether and how to share data; current techniques, however, focus on scenarios where the data recipients use only one resource for re-identification purposes. This is a concern because recent attacks show that adversaries can access multiple resources, combining them in a stage-wise manner, to enhance the chance of an attack’s success. In this work, we represent a re-identification game using a two-player Stackelberg game of perfect information, which can be applied to assess risk, and suggest an optimal data sharing strategy based on a privacy-utility tradeoff. We report on experiments with large-scale genomic datasets to show that, using game theoretic models accounting for adversarial capabilities to launch multistage attacks, most data can be effectively shared with low re-identification risk.
Collapse
Affiliation(s)
- Zhiyu Wan
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yevgeniy Vorobeychik
  - Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
- Weiyi Xia
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yongtai Liu
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Myrna Wooders
  - Department of Economics, Vanderbilt University, Nashville, TN 37235, USA
- Jia Guo
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Zhijun Yin
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Ellen Wright Clayton
  - Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - School of Law, Vanderbilt University, Nashville, TN 37203, USA
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Murat Kantarcioglu
  - Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA
  - Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Bradley A. Malin
  - Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
50
|
Chen D, Cheung SCS, Chuah CN, Ozonoff S. Differentially Private Generative Adversarial Networks with Model Inversion. PROCEEDINGS OF THE ... IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY. IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY 2021; 2021:10.1109/wifs53200.2021.9648378. [PMID: 35517057 PMCID: PMC9070036 DOI: 10.1109/wifs53200.2021.9648378] [Indexed: 06/14/2023]
Abstract
To protect sensitive data in training a Generative Adversarial Network (GAN), the standard approach is to use the differentially private (DP) stochastic gradient descent method, in which controlled noise is added to the gradients. The noise can degrade the quality of the synthetic output samples, and training may not even converge. We propose the Differentially Private Model Inversion (DPMI) method, in which the private data are first mapped to the latent space via a public generator, followed by a lower-dimensional DP-GAN with better convergence properties. Experimental results on the standard datasets CIFAR10 and SVHN, as well as on a facial landmark dataset for autism screening, show that our approach outperforms the standard DP-GAN method on Inception Score, Fréchet Inception Distance, and classification accuracy under the same privacy guarantee.
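The "controlled noise added to the gradients" in the standard approach can be sketched as a single DP-SGD gradient step. This illustrates the baseline the abstract refers to, not the DPMI method itself. It follows the usual recipe of per-example clipping followed by Gaussian noise. The function names, the clipping bound `max_norm`, and the `noise_multiplier` parameter are assumptions made for the example, and no privacy accounting is included.

```python
import math
import random


def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Return a noisy average gradient for one batch (illustrative only).

    Each per-example gradient is clipped, the clipped gradients are summed,
    Gaussian noise with standard deviation noise_multiplier * max_norm is
    added to every coordinate, and the result is divided by the batch size.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Hypothetical batch of two 2-dimensional per-example gradients.
batch = [[3.0, 4.0], [0.3, 0.4]]
update = dp_gradient(batch, max_norm=1.0, noise_multiplier=1.0,
                     rng=random.Random(0))
```

Clipping bounds how much any single record can influence the update, which is what lets the added noise mask each record's contribution. The noise scale grows with the clipping bound and does not shrink as the parameter dimension increases, which is consistent with the abstract's motivation for running the DP-GAN in a lower-dimensional latent space.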
|