1
Sheng B, Pushpanathan K, Guan Z, Lim QH, Lim ZW, Yew SME, Goh JHL, Bee YM, Sabanayagam C, Sevdalis N, Lim CC, Lim CT, Shaw J, Jia W, Ekinci EI, Simó R, Lim LL, Li H, Tham YC. Artificial intelligence for diabetes care: current and future prospects. Lancet Diabetes Endocrinol 2024; 12:569-595. [PMID: 39054035 DOI: 10.1016/s2213-8587(24)00154-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 03/28/2024] [Accepted: 05/16/2024] [Indexed: 07/27/2024]
Abstract
Artificial intelligence (AI) use in diabetes care is increasingly being explored to personalise care for people with diabetes and adapt treatments for complex presentations. However, the rapid advancement of AI also introduces challenges such as potential biases, ethical considerations, and implementation challenges in ensuring that its deployment is equitable. Ensuring inclusive and ethical developments of AI technology can empower both health-care providers and people with diabetes in managing the condition. In this Review, we explore and summarise the current and future prospects of AI across the diabetes care continuum, from enhancing screening and diagnosis to optimising treatment and predicting and managing complications.
Affiliation(s)
- Bin Sheng
- Shanghai Belt and Road International Joint Laboratory for Intelligent Prevention and Treatment of Metabolic Disorders, Department of Computer Science and Engineering, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China; Key Laboratory of Artificial Intelligence, Ministry of Education, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Krithi Pushpanathan
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, National University of Singapore, Singapore; Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Zhouyu Guan
- Shanghai Belt and Road International Joint Laboratory for Intelligent Prevention and Treatment of Metabolic Disorders, Department of Computer Science and Engineering, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China
- Quan Hziung Lim
- Department of Medicine, Faculty of Medicine, University of Malaya, Kuala Lumpur, Malaysia
- Zhi Wei Lim
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Samantha Min Er Yew
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, National University of Singapore, Singapore; Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Yong Mong Bee
- Department of Endocrinology, Singapore General Hospital, Singapore; SingHealth Duke-National University of Singapore Diabetes Centre, Singapore Health Services, Singapore
- Charumathi Sabanayagam
- Ophthalmology and Visual Sciences Academic Clinical Program, Duke-National University of Singapore Medical School, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
- Nick Sevdalis
- Centre for Behavioural and Implementation Science Interventions, National University of Singapore, Singapore
- Chwee Teck Lim
- Department of Biomedical Engineering, National University of Singapore, Singapore; Institute for Health Innovation and Technology, National University of Singapore, Singapore; Mechanobiology Institute, National University of Singapore, Singapore
- Jonathan Shaw
- Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
- Weiping Jia
- Shanghai Belt and Road International Joint Laboratory for Intelligent Prevention and Treatment of Metabolic Disorders, Department of Computer Science and Engineering, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China
- Elif Ilhan Ekinci
- Australian Centre for Accelerating Diabetes Innovations, Melbourne Medical School and Department of Medicine, University of Melbourne, Melbourne, VIC, Australia; Department of Endocrinology, Austin Health, Melbourne, VIC, Australia
- Rafael Simó
- Diabetes and Metabolism Research Unit, Vall d'Hebron University Hospital and Vall d'Hebron Research Institute, Barcelona, Spain; Centro de Investigación Biomédica en Red de Diabetes y Enfermedades Metabólicas Asociadas, Instituto de Salud Carlos III, Madrid, Spain
- Lee-Ling Lim
- Department of Medicine, Faculty of Medicine, University of Malaya, Kuala Lumpur, Malaysia; Department of Medicine and Therapeutics, Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Asia Diabetes Foundation, Hong Kong Special Administrative Region, China
- Huating Li
- Shanghai Belt and Road International Joint Laboratory for Intelligent Prevention and Treatment of Metabolic Disorders, Department of Computer Science and Engineering, School of Electronic, Information, and Electrical Engineering, Shanghai Jiao Tong University, Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China.
- Yih-Chung Tham
- Centre of Innovation and Precision Eye Health, Department of Ophthalmology, National University of Singapore, Singapore; Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Ophthalmology and Visual Sciences Academic Clinical Program, Duke-National University of Singapore Medical School, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore.
2
Vasdev N, Gupta T, Pawar B, Bain A, Tekade RK. Navigating the future of health care with AI-driven digital therapeutics. Drug Discov Today 2024:104110. [PMID: 39034025 DOI: 10.1016/j.drudis.2024.104110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 07/01/2024] [Accepted: 07/16/2024] [Indexed: 07/23/2024]
Abstract
Digital therapeutics (DTx) is a recently conceived idea in health care that aims to cure ailments and modify patient behavior by employing a range of digital technologies. Notably, when traditional medication is not entirely efficacious, DTx offers an innovative avenue for treatments linked to dysfunctional behaviors and lifestyle management. DTx involves extremely adaptable therapeutic devices that empower greater patient engagement in treating illness, using algorithms to collect, transfer and analyze the patient's data. Efficient clinical monitoring and supervision at the individual level by remote access and algorithms for a range of diseases is made possible by integrating machine learning and artificial intelligence with DTx. There is a potentially large worldwide market for DTx owing to its convenient, personalized therapies.
Affiliation(s)
- Nupur Vasdev
- National Institute of Pharmaceutical Education and Research (NIPER) Ahmedabad, An Institute of National Importance, Government of India, Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, Palaj, Opp. Air Force Station, Gandhinagar 382355, Gujarat, India
- Tanisha Gupta
- National Institute of Pharmaceutical Education and Research (NIPER) Ahmedabad, An Institute of National Importance, Government of India, Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, Palaj, Opp. Air Force Station, Gandhinagar 382355, Gujarat, India
- Bhakti Pawar
- National Institute of Pharmaceutical Education and Research (NIPER) Ahmedabad, An Institute of National Importance, Government of India, Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, Palaj, Opp. Air Force Station, Gandhinagar 382355, Gujarat, India
- Anoothi Bain
- National Institute of Pharmaceutical Education and Research (NIPER) Ahmedabad, An Institute of National Importance, Government of India, Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, Palaj, Opp. Air Force Station, Gandhinagar 382355, Gujarat, India
- Rakesh Kumar Tekade
- National Institute of Pharmaceutical Education and Research (NIPER) Ahmedabad, An Institute of National Importance, Government of India, Department of Pharmaceuticals, Ministry of Chemicals and Fertilizers, Palaj, Opp. Air Force Station, Gandhinagar 382355, Gujarat, India.
3
Borisov V, Leemann T, Sebler K, Haug J, Pawelczyk M, Kasneci G. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7499-7519. [PMID: 37015381 DOI: 10.1109/tnnls.2022.3229161] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Indexed: 06/19/2023]
Abstract
Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous datasets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains highly challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data and also provide an overview over strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with 11 deep learning approaches across five popular real-world tabular datasets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
4
Chandra S, Prakash PKS, Samanta S, Chilukuri S. ClinicalGAN: powering patient monitoring in clinical trials with patient digital twins. Sci Rep 2024; 14:12236. [PMID: 38806536 PMCID: PMC11133486 DOI: 10.1038/s41598-024-62567-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 05/19/2024] [Indexed: 05/30/2024] Open
Abstract
Conducting clinical trials is becoming increasingly challenging lately due to spiraling costs, increased time to market, and high failure rates. Patient recruitment and retention is one of the key challenges that impact 90% of the trials directly. While a lot of attention has been given to optimizing patient recruitment, limited progress has been made towards developing comprehensive clinical trial monitoring systems to determine patients at risk and potentially improve patient retention through the right intervention at the right time. Earlier research in patient retention primarily focused on using deterministic frameworks to model the inherently stochastic patient journey process. Existing generative approaches to model temporal data such as TimeGAN or CRBM, face challenges and fail to address key requirements such as personalized generation, variable patient journey, and multi-variate time-series needed to model patient digital twin. In response to these challenges, current research proposes ClinicalGAN to enable patient level generation, effectively creating a patient's digital twin. ClinicalGAN provides capabilities for: (a) patient-level personalized generation by utilizing patient meta-data for conditional generation; (b) dynamic termination prediction to enable pro-active patient monitoring for improved patient retention; (c) multi-variate time-series training to incorporate relationship and dependencies among different tests measures captured during patient journey. The proposed solution is validated on two Alzheimer's clinical trial datasets and the results are benchmarked across multiple dimensions of generation quality. Empirical results demonstrate that the proposed ClinicalGAN outperforms the SOTA approach by 3-4× on average across all the generation quality metrics. Furthermore, the proposed architecture is shown to outperform predictive methods at the task of drop-off prediction significantly (5-10% MAPE scores).
Affiliation(s)
- Shantanu Chandra
- ZS, 2nd Floor, MFAR Manyta Tech park, Phase IV, Manayata Tech Park, Nagavara, Bengaluru, Karnataka, India.
- P K S Prakash
- ZS, 2nd Floor, MFAR Manyta Tech park, Phase IV, Manayata Tech Park, Nagavara, Bengaluru, Karnataka, India
- Subhrajit Samanta
- ZS, 2nd Floor, MFAR Manyta Tech park, Phase IV, Manayata Tech Park, Nagavara, Bengaluru, Karnataka, India
5
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics are highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
- Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
6
Carini C, Seyhan AA. Tribulations and future opportunities for artificial intelligence in precision medicine. J Transl Med 2024; 22:411. [PMID: 38702711 PMCID: PMC11069149 DOI: 10.1186/s12967-024-05067-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 03/05/2024] [Indexed: 05/06/2024] Open
Abstract
Upon a diagnosis, the clinical team faces two main questions: what treatment, and at what dose? Clinical trials' results provide the basis for guidance and support for official protocols that clinicians use to base their decisions. However, individuals do not consistently demonstrate the reported response from relevant clinical trials. The decision complexity increases with combination treatments where drugs administered together can interact with each other, which is often the case. Additionally, the individual's response to the treatment varies with the changes in their condition. In practice, the drug and the dose selection depend significantly on the medical protocol and the medical team's experience. As such, the results are inherently varied and often suboptimal. Big data and Artificial Intelligence (AI) approaches have emerged as excellent decision-making tools, but multiple challenges limit their application. AI is a rapidly evolving and dynamic field with the potential to revolutionize various aspects of human life. AI has become increasingly crucial in drug discovery and development. AI enhances decision-making across different disciplines, such as medicinal chemistry, molecular and cell biology, pharmacology, pathology, and clinical practice. In addition to these, AI contributes to patient population selection and stratification. The need for AI in healthcare is evident as it aids in enhancing data accuracy and ensuring the quality care necessary for effective patient treatment. AI is pivotal in improving success rates in clinical practice. The increasing significance of AI in drug discovery, development, and clinical trials is underscored by many scientific publications. Despite the numerous advantages of AI, such as enhancing and advancing Precision Medicine (PM) and remote patient monitoring, unlocking its full potential in healthcare requires addressing fundamental concerns. 
These concerns include data quality, the lack of well-annotated large datasets, data privacy and safety issues, biases in AI algorithms, legal and ethical challenges, and obstacles related to cost and implementation. Nevertheless, integrating AI in clinical medicine will improve diagnostic accuracy and treatment outcomes, contribute to more efficient healthcare delivery, reduce costs, and facilitate better patient experiences, making healthcare more sustainable. This article reviews AI applications in drug development and clinical practice, making healthcare more sustainable, and highlights concerns and limitations in applying AI.
Affiliation(s)
- Claudio Carini
- School of Cancer and Pharmaceutical Sciences, Faculty of Life Sciences and Medicine, New Hunt's House, King's College London, Guy's Campus, London, UK.
- Biomarkers Consortium, Foundation of the National Institute of Health, Bethesda, MD, USA.
- Attila A Seyhan
- Laboratory of Translational Oncology and Experimental Cancer Therapeutics, Warren Alpert Medical School, Brown University, Providence, RI, USA.
- Department of Pathology and Laboratory Medicine, Warren Alpert Medical School, Brown University, Providence, RI, USA.
- Joint Program in Cancer Biology, Lifespan Health System and Brown University, Providence, RI, USA.
- Legorreta Cancer Center at Brown University, Providence, RI, USA.
7
Rahman MA, Victoros E, Ernest J, Davis R, Shanjana Y, Islam MR. Impact of Artificial Intelligence (AI) Technology in Healthcare Sector: A Critical Evaluation of Both Sides of the Coin. Clinical Pathology (Thousand Oaks, Ventura County, Calif.) 2024; 17:2632010X241226887. [PMID: 38264676 PMCID: PMC10804900 DOI: 10.1177/2632010x241226887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 12/27/2023] [Indexed: 01/25/2024]
Abstract
The influence of artificial intelligence (AI) has drastically risen in recent years, especially in the field of medicine. Its influence has spread so greatly that it is determined to become a pillar in the future medical world. A comprehensive literature search related to AI in healthcare was performed in the PubMed database and retrieved the relevant information from suitable ones. AI excels in aspects such as rapid adaptation, high diagnostic accuracy, and data management that can help improve workforce productivity. With this potential in sight, the FDA has continuously approved more machine learning (ML) software to be used by medical workers and scientists. However, there are few controversies such as increased chances of data breaches, concern for clinical implementation, and potential healthcare dilemmas. In this article, the positive and negative aspects of AI implementation in healthcare are discussed, as well as recommended some potential solutions to the potential issues at hand.
Affiliation(s)
- Julianne Ernest
- Nesbitt School of Pharmacy Wilkes University, Wilkes-Barre, PA, USA
- Rob Davis
- Nesbitt School of Pharmacy Wilkes University, Wilkes-Barre, PA, USA
- Yeasna Shanjana
- Department of Environmental Sciences, North South University, Bashundhara, Dhaka, Bangladesh
8
Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Choi H, Cho HN, Kim M, Han J, Kee G, Park S, Lee KH, Jun TJ, Kim YH. LDP-GAN: Generative adversarial networks with local differential privacy for patient medical records synthesis. Comput Biol Med 2024; 168:107738. [PMID: 37995536 DOI: 10.1016/j.compbiomed.2023.107738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Revised: 10/31/2023] [Accepted: 11/16/2023] [Indexed: 11/25/2023]
Abstract
Electronic medical records (EMR) have considerable potential to advance healthcare technologies, including medical AI. Nevertheless, due to the privacy issues associated with the sharing of patient's personal information, it is difficult to sufficiently utilize them. Generative models based on deep learning can solve this problem by creating synthetic data similar to real patient data. However, the data used for training these deep learning models run into the risk of getting leaked because of malicious attacks. This means that traditional deep learning-based generative models cannot completely solve the privacy issues. Therefore, we suggested a method to prevent the leakage of training data by protecting the model from malicious attacks using local differential privacy (LDP). Our method was evaluated in terms of utility and privacy. Experimental results demonstrated that the proposed method can generate medical data with reasonable performance while protecting training data from malicious attacks.
Affiliation(s)
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Imjin Ahn
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Hee Jun Kang
- Division of Cardiology, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Hyeram Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Ha Na Cho
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- JiYe Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Gaeun Kee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Seohyun Park
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Kye Hwa Lee
- Department of Information Medicine, Asan Medical Center, 8, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
- Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea.
- Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43gil, Songpagu, Seoul, 05505, Republic of Korea
9
Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko M, Ryu KS. Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy. JMIR Med Inform 2023; 11:e47859. [PMID: 37999942 DOI: 10.2196/47859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 08/02/2023] [Accepted: 10/28/2023] [Indexed: 11/25/2023] Open
Abstract
BACKGROUND Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. OBJECTIVE This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. METHODS The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. RESULTS The synthetic data of the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). 
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. CONCLUSIONS This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
Affiliation(s)
- Ha Ye Jin Kang
- Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
- Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- Erdenebileg Batbaatar
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Dong-Woo Choi
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
| | - Kui Son Choi
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Department of Cancer Control and Policy, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
| | - Minsam Ko
- Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
- Department of Human-Computer Interaction, Hanyang University, Ansan, Republic of Korea
| | - Kwang Sun Ryu
- Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
| |
Collapse
|
10
|
Bonomi L, Gousheh S, Fan L. Enabling Health Data Sharing with Fine-Grained Privacy. PROCEEDINGS OF THE ... ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT. ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT 2023; 2023:131-141. [PMID: 37906633 PMCID: PMC10601092 DOI: 10.1145/3583780.3614864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Sharing health data is vital in advancing medical research and transforming knowledge into clinical practice. Meanwhile, protecting the privacy of data contributors is of paramount importance. To that end, several privacy approaches have been proposed to protect individual data contributors in data sharing, including data anonymization and data synthesis techniques. These approaches have shown promising results in providing privacy protection at the dataset level. In this work, we study the privacy challenges in enabling fine-grained privacy in health data sharing. Our work is motivated by recent research findings, in which patients and healthcare providers may have different privacy preferences and policies that need to be addressed. Specifically, we propose a novel and effective privacy solution that enables data curators (e.g., healthcare providers) to protect sensitive data elements while preserving data usefulness. Our solution builds on randomized techniques to provide rigorous privacy protection for sensitive elements and leverages graphical models to mitigate privacy leakage due to dependent elements. To enhance the usefulness of the shared data, our randomized mechanism incorporates domain knowledge to preserve semantic similarity and adopts a block-structured design to minimize utility loss. Evaluations with real-world health data demonstrate the effectiveness of our approach and the usefulness of the shared data for health applications.
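To make the idea of a randomized mechanism for sensitive elements concrete, here is a minimal sketch of generic k-ary randomized response, a textbook local differential privacy mechanism. It is an assumption chosen for illustration and not the block-structured, domain-knowledge-aware mechanism the authors propose. The names `keep_probability` and `randomized_response` are hypothetical.

```python
import math
import random


def keep_probability(epsilon, domain_size):
    """Probability of reporting the true value under k-ary randomized response."""
    if domain_size < 2:
        raise ValueError("domain must contain at least two values")
    e = math.exp(epsilon)
    return e / (e + domain_size - 1)


def randomized_response(value, domain, epsilon, rng):
    """Report `value` truthfully with probability p, else a uniformly chosen other value.

    With p = e^eps / (e^eps + k - 1), the ratio between the probabilities of
    any output under any two inputs is at most e^eps.
    """
    if value not in domain:
        raise ValueError("value must belong to the domain")
    p = keep_probability(epsilon, len(domain))
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)
```

A smaller `epsilon` pushes the keep probability toward `1 / k` (pure noise), while a larger `epsilon` keeps more of the true values, which is the privacy versus usefulness trade-off the abstract refers to.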
Affiliation(s)
- Luca Bonomi
- Vanderbilt University Medical Center, Nashville, TN, USA
- Sepand Gousheh
- University of North Carolina at Charlotte, Charlotte, NC, USA
- Liyue Fan
- University of North Carolina at Charlotte, Charlotte, NC, USA
|
11
|
Peppes N, Tsakanikas P, Daskalakis E, Alexakis T, Adamopoulou E, Demestichas K. FoGGAN: Generating Realistic Parkinson's Disease Freezing of Gait Data Using GANs. SENSORS (BASEL, SWITZERLAND) 2023; 23:8158. [PMID: 37836988 PMCID: PMC10574838 DOI: 10.3390/s23198158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 09/23/2023] [Accepted: 09/27/2023] [Indexed: 10/15/2023]
Abstract
Data scarcity in the healthcare domain is a major drawback for most state-of-the-art technologies engaging artificial intelligence. The unavailability of quality data, due both to the difficulty of gathering and labeling them and to their sensitive nature, creates a breeding ground for data augmentation solutions. Parkinson's Disease (PD), which can have a wide range of symptoms including motor impairments, constitutes a very challenging case for quality data acquisition. Generative Adversarial Networks (GANs) can help alleviate such data availability issues. In this light, this study focuses on a data augmentation solution engaging GANs using a freezing of gait (FoG) symptom dataset as input. The data generated by the so-called FoGGAN architecture presented in this study are almost identical to the original, as concluded by a variety of similarity metrics. This highlights the significance of such solutions, as they can provide credible synthetically generated data which can be utilized as training dataset inputs to AI applications. Additionally, a DNN classifier's performance is evaluated using three different evaluation datasets, and the accuracy results were quite encouraging, highlighting that the FoGGAN solution could help alleviate the data shortage.
Affiliation(s)
- Nikolaos Peppes
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Panagiotis Tsakanikas
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Emmanouil Daskalakis
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Theodoros Alexakis
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Evgenia Adamopoulou
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Konstantinos Demestichas
- Department of Agricultural Economics and Rural Development, Agricultural University of Athens, 11855 Athens, Greece
|
12
|
Al Hadithy ZA, Al Lawati A, Al-Zadjali R, Al Sinawi H. Knowledge, Attitudes, and Perceptions of Artificial Intelligence in Healthcare Among Medical Students at Sultan Qaboos University. Cureus 2023; 15:e44887. [PMID: 37814766 PMCID: PMC10560391 DOI: 10.7759/cureus.44887] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Accepted: 09/08/2023] [Indexed: 10/11/2023]
Abstract
Background Artificial intelligence (AI) is increasingly used in healthcare, but more data are needed about the knowledge, perceptions, attitudes, and preparedness of medical students in Oman towards this technology. This study aimed to investigate these aspects among clinical-year medical students at Sultan Qaboos University. Methodology A web-based validated exploratory questionnaire adapted from a study conducted at the University of Toronto was distributed to all clinical year (phase III) medical students at Sultan Qaboos University. The questionnaire collected demographic and background information, tested students' knowledge of AI, and assessed their perceptions and attitudes toward it. The data were analyzed using the Statistical Package for Social Sciences (SPSS, IBM Corp., Armonk, NY). Results A total of 221 out of 368 clinical-year medical students (60%) completed the survey. Most respondents were in their junior clerkship year (n = 94, 42.5%). Most students (n = 167, 75.4%) had no prior exposure to AI in healthcare, with a median knowledge score of 3.25 out of 5 in AI, and showed no improvement over the years. However, they overall had positive perceptions and attitudes towards AI. Students also had concerns about the impact of AI on employment prospects and ethical issues but were generally receptive to incorporating AI into medical school curricula, as 174 students (78.7%) believed every medical trainee should receive training on AI competencies. Conclusion This study provides valuable insights into the knowledge, perceptions, attitudes, and preparedness of medical students in Oman toward AI in healthcare. Medical educators in Oman should consider incorporating AI into medical school curricula to prepare future physicians for using this technology in healthcare.
Affiliation(s)
- Zinah A Al Hadithy
- College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, OMN
- Abdullah Al Lawati
- College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, OMN
- Riham Al-Zadjali
- General Practice, Sultan Qaboos University Hospital, Muscat, OMN
- Hamed Al Sinawi
- Psychiatry and Behavioral Sciences, Sultan Qaboos University Hospital, Muscat, OMN
|
13
|
Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun 2023; 14:5305. [PMID: 37652934 PMCID: PMC10471716 DOI: 10.1038/s41467-023-41093-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 08/23/2023] [Indexed: 09/02/2023]
Abstract
Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving offer alternatives to real EHRs for machine learning (ML) and statistical analysis. However, generating high-fidelity EHR data in its original, high-dimensional form poses challenges for existing methods. We propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO generates a probability density function over medical codes, clinical visits, and patient records, allowing for generating realistic EHR data without requiring variable selection or aggregation. Extensive experiments demonstrated that HALO can generate high-fidelity data with high-dimensional disease code probabilities closely mirroring (R² above 0.9) real EHR data. HALO also enhances the accuracy of predictive modeling and enables downstream ML models to attain accuracy similar to that of models trained on genuine data.
Affiliation(s)
- Brandon Theodorou
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
- Medisyn Inc., Las Vegas, NV, USA
- Cao Xiao
- Medisyn Inc., Las Vegas, NV, USA
- Jimeng Sun
- University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL, USA
- Medisyn Inc., Las Vegas, NV, USA
|
14
|
Wolfien M, Ahmadi N, Fitzer K, Grummt S, Heine KL, Jung IC, Krefting D, Kühn A, Peng Y, Reinecke I, Scheel J, Schmidt T, Schmücker P, Schüttler C, Waltemath D, Zoch M, Sedlmayr M. Ten Topics to Get Started in Medical Informatics Research. J Med Internet Res 2023; 25:e45948. [PMID: 37486754 PMCID: PMC10407648 DOI: 10.2196/45948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Revised: 03/29/2023] [Accepted: 04/11/2023] [Indexed: 07/25/2023]
Abstract
The vast and heterogeneous data being constantly generated in clinics can provide great wealth for patients and research alike. The quickly evolving field of medical informatics research has contributed numerous concepts, algorithms, and standards to facilitate this development. However, these difficult relationships, complex terminologies, and multiple implementations can present obstacles for people who want to get active in the field. With a particular focus on medical informatics research conducted in Germany, we present in our Viewpoint a set of 10 important topics to improve the overall interdisciplinary communication between different stakeholders (eg, physicians, computational experts, experimentalists, students, patient representatives). This may lower the barriers to entry and offer a starting point for collaborations at different levels. The suggested topics are briefly introduced, then general best practice guidance is given, and further resources for in-depth reading or hands-on tutorials are recommended. In addition, the topics are set to cover current aspects and open research gaps of the medical informatics domain, including data regulations and concepts; data harmonization and processing; and data evaluation, visualization, and dissemination. In addition, we give an example on how these topics can be integrated in a medical informatics curriculum for higher education. By recognizing these topics, readers will be able to (1) set clinical and research data into the context of medical informatics, understanding what is possible to achieve with data or how data should be handled in terms of data privacy and storage; (2) distinguish current interoperability standards and obtain first insights into the processes leading to effective data transfer and analysis; and (3) value the use of newly developed technical approaches to utilize the full potential of clinical data.
Affiliation(s)
- Markus Wolfien
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
- Najia Ahmadi
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Kai Fitzer
- Core Unit Data Integration Center, University Medicine Greifswald, Greifswald, Germany
- Sophia Grummt
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Kilian-Ludwig Heine
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Ian-C Jung
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Dagmar Krefting
- Department of Medical Informatics, University Medical Center, Goettingen, Germany
- Andreas Kühn
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Yuan Peng
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Ines Reinecke
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Julia Scheel
- Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
- Tobias Schmidt
- Institute for Medical Informatics, University of Applied Sciences Mannheim, Mannheim, Germany
- Paul Schmücker
- Institute for Medical Informatics, University of Applied Sciences Mannheim, Mannheim, Germany
- Christina Schüttler
- Central Biobank Erlangen, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Dagmar Waltemath
- Core Unit Data Integration Center, University Medicine Greifswald, Greifswald, Germany
- Department of Medical Informatics, University Medicine Greifswald, Greifswald, Germany
- Michele Zoch
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Martin Sedlmayr
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
- Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
|
15
|
Ghosheh GO, Thwaites CL, Zhu T. Synthesizing Electronic Health Records for Predictive Models in Low-Middle-Income Countries (LMICs). Biomedicines 2023; 11:1749. [PMID: 37371844 PMCID: PMC10295936 DOI: 10.3390/biomedicines11061749] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Revised: 06/12/2023] [Accepted: 06/15/2023] [Indexed: 06/29/2023]
Abstract
The spread of machine learning models, coupled with the growing adoption of electronic health records (EHRs), has opened the door for developing clinical decision support systems. However, despite the great promise of machine learning for healthcare in low-middle-income countries (LMICs), many data-specific limitations, such as small size and irregular sampling, hinder progress in such applications. Recently, deep generative models have been proposed to generate realistic-looking synthetic data, including EHRs, by learning the underlying data distribution without compromising patient privacy. In this study, we first use a deep generative model to generate synthetic data based on a small dataset (364 patients) from an LMIC setting. Next, we use synthetic data to build models that predict the onset of hospital-acquired infections based on minimal information collected at patient ICU admission. The diagnostic model trained on the synthetic data outperformed models trained on the original data and on data oversampled using techniques such as SMOTE. We also experiment with varying the size of the synthetic data and observe the impact on the performance and interpretability of the models. Our results show the promise of using deep generative models in enabling healthcare data owners to develop and validate models that serve their needs and applications, despite limitations in dataset size.
Affiliation(s)
- Ghadeer O. Ghosheh
- Department of Engineering Sciences, University of Oxford, Oxford OX1 3PJ, UK
- C. Louise Thwaites
- Oxford University Clinical Research Unit (OUCRU), Ho Chi Minh City 710400, Vietnam
- Centre for Global Health and Tropical Medicine, University of Oxford, Oxford OX3 7LG, UK
- Tingting Zhu
- Department of Engineering Sciences, University of Oxford, Oxford OX1 3PJ, UK
|
16
|
Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV, Al Muhanna D, Al-Muhanna FA. A Review of the Role of Artificial Intelligence in Healthcare. J Pers Med 2023; 13:951. [PMID: 37373940 PMCID: PMC10301994 DOI: 10.3390/jpm13060951] [Citation(s) in RCA: 32] [Impact Index Per Article: 32.0] [Received: 03/17/2023] [Revised: 05/11/2023] [Accepted: 05/12/2023] [Indexed: 06/29/2023]
Abstract
Artificial intelligence (AI) applications have transformed healthcare. This study is based on a general literature review uncovering the role of AI in healthcare and focuses on the following key aspects: (i) medical imaging and diagnostics, (ii) virtual patient care, (iii) medical research and drug discovery, (iv) patient engagement and compliance, (v) rehabilitation, and (vi) other administrative applications. The impact of AI is observed in detecting clinical conditions in medical imaging and diagnostic services, controlling the outbreak of coronavirus disease 2019 (COVID-19) with early diagnosis, providing virtual patient care using AI-powered tools, managing electronic health records, augmenting patient engagement and compliance with the treatment plan, reducing the administrative workload of healthcare professionals (HCPs), discovering new drugs and vaccines, spotting medical prescription errors, extensive data storage and analysis, and technology-assisted rehabilitation. Nevertheless, integrating AI into healthcare faces several technical, ethical, and social challenges, including privacy, safety, the right to decide and try, costs, information and consent, access, and efficacy. The governance of AI applications is crucial for patient safety and accountability and for strengthening HCPs' trust, which in turn enhances acceptance and improves health outcomes. Effective governance is a prerequisite to precisely address regulatory, ethical, and trust issues while advancing the acceptance and implementation of AI. Since COVID-19 hit the global health system, AI has brought about a revolution in healthcare, and this shift could be another step forward in meeting future healthcare needs.
Affiliation(s)
- Ahmed Al Kuwaiti
- Department of Dental Education, College of Dentistry, Deanship of Quality and Academic Accreditation, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Khalid Nazer
- Department of Information and Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Health Information Department, King Fahad Hospital of the University, Al-Khobar 31952, Saudi Arabia
- Abdullah Al-Reedy
- Department of Information and Technology, Family and Community Medicine Department, Family and Community Medicine Centre, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Shaher Al-Shehri
- Faculty of Medicine, Family and Community Medicine Department, Family and Community Medicine Centre, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Afnan Al-Muhanna
- Breast Imaging Division, Department of Radiology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Radiology Department, King Fahad Hospital of the University, Al-Khobar 31952, Saudi Arabia
- Arun Vijay Subbarayalu
- Quality Studies and Research Unit, Vice Deanship of Quality, Deanship of Quality and Academic Accreditation, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Dhoha Al Muhanna
- Directorate of Quality and Patient Safety, Family and Community Medicine Center, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Fahad A. Al-Muhanna
- Nephrology Division, Department of Internal Medicine, Faculty of Medicine, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
- Medicine Department, King Fahad Hospital of the University, Al-Khobar 31952, Saudi Arabia
|
17
|
Sun C, van Soest J, Dumontier M. Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy. J Biomed Inform 2023:104404. [PMID: 37268168 DOI: 10.1016/j.jbi.2023.104404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Revised: 04/25/2023] [Accepted: 05/21/2023] [Indexed: 06/04/2023]
Abstract
A large amount of personal health data that is highly valuable to the scientific community is still not accessible or requires a lengthy request process due to privacy concerns and legal restrictions. As a solution, synthetic data has been studied and proposed as a promising alternative. However, generating realistic and privacy-preserving synthetic personal health data still poses challenges such as simulating the characteristics of the patients' data that are in the minority classes, capturing the relations among variables in imbalanced data and transferring them to the synthetic data, and preserving individual patients' privacy. In this paper, we propose a differentially private conditional Generative Adversarial Network model (DP-CGANS) consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving personal data. Our model distinguishes categorical and continuous variables and transforms them into latent space separately for better training performance. We tackle the unique challenges of generating synthetic patient data due to the special data characteristics of personal health data. For example, patients with a certain disease are typically the minority in the dataset, and it is crucial to preserve the relations among variables. Our model is structured with a conditional vector as an additional input to present the minority class in the imbalanced data and maximally capture the dependency between variables. Moreover, we inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model with state-of-the-art generative models on personal socio-economic datasets and real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. 
We demonstrate that our model outperforms other comparable models, especially in capturing the dependence between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structures and characteristics of real-world personal health data such as imbalanced classes, abnormal distributions, and data sparsity.
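The gradient-noise step described above can be sketched in the style of standard DP-SGD: clip each per-example gradient, sum, add Gaussian noise, and average. This is a generic illustration under assumed names (`clip_gradient`, `private_mean_gradient`, `max_norm`, `noise_multiplier`), not the DP-CGANS training code, and it omits the privacy accounting needed to report an actual epsilon.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale `grad` down so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Average clipped per-example gradients after adding Gaussian noise.

    Noise with standard deviation noise_multiplier * max_norm is added to
    each coordinate of the summed clipped gradients before averaging.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [
        sum(g[i] for g in clipped) + rng.gauss(0.0, sigma) for i in range(dim)
    ]
    return [v / len(clipped) for v in noisy_sum]
```

Clipping bounds how much any single record can move the update, and the noise scale is tied to that bound. A larger `noise_multiplier` gives stronger privacy at the cost of noisier training, which is one face of the utility and privacy balance the abstract discusses.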
Affiliation(s)
- Chang Sun
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
- Johan van Soest
- Brightlands Institute of Smart Society, Faculty of Science and Engineering, Maastricht University, Heerlen, The Netherlands; Department of Radiation Oncology (Maastro), GROW School for Oncology and Reproduction, Maastricht University Medical Centre, Maastricht, The Netherlands
- Michel Dumontier
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands; Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
|
18
|
Li J, Cairns BJ, Li J, Zhu T. Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. NPJ Digit Med 2023; 6:98. [PMID: 37244963 DOI: 10.1038/s41746-023-00834-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/16/2022] [Accepted: 05/05/2023] [Indexed: 05/29/2023]
Abstract
The recent availability of electronic health records (EHRs) has provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, current generative models are limited, as they generate only a single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making, which encompasses various data types and sources, in this study we propose a generative adversarial network (GAN) entitled EHR-M-GAN that simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly available intensive care unit databases with records from a total of 141,488 unique patients, and performed privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when training data was augmented with EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy.
Affiliation(s)
- Jin Li
- Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, China
- Department of Engineering Science, University of Oxford, Oxford, UK
- Benjamin J Cairns
- Clinical Trial Service Unit and Epidemiological Studies, Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Jingsong Li
- Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering & Instrument Science, Zhejiang University, Hangzhou, China
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Tingting Zhu
- Department of Engineering Science, University of Oxford, Oxford, UK
|
19
|
Nikolentzos G, Vazirgiannis M, Xypolopoulos C, Lingman M, Brandt EG. Synthetic electronic health records generated with variational graph autoencoders. NPJ Digit Med 2023; 6:83. [PMID: 37120594 PMCID: PMC10148837 DOI: 10.1038/s41746-023-00822-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 04/05/2023] [Indexed: 05/01/2023]
Abstract
Data-driven medical care delivery must always respect patient privacy, a requirement that is not easily met. This issue has impeded improvements to healthcare software and has delayed the long-predicted prevalence of artificial intelligence in healthcare. Until now, it has been very difficult to share data between healthcare organizations, resulting in poor statistical models due to unrepresentative patient cohorts. Synthetic data, i.e., artificial but realistic electronic health records, could overcome the data drought that is troubling the healthcare sector. Deep neural network architectures, in particular, have shown an incredible ability to learn from complex data sets and generate large amounts of unseen data points with the same statistical properties as the training data. Here, we present a generative neural network model that can create synthetic health records with realistic timelines. These clinical trajectories are generated on a per-patient basis and are represented as linear-sequence graphs of clinical events over time. We use a variational graph autoencoder (VGAE) to generate synthetic samples from real-world electronic health records. Our approach generates health records not seen in the training data. We show that these artificial patient trajectories are realistic and preserve patient privacy and can therefore support the safe sharing of data across organizations.
Affiliation(s)
- Giannis Nikolentzos
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Michalis Vazirgiannis
- LIX, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France
- Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
- Markus Lingman
- Department of Molecular and Clinical Medicine/Cardiology, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Center for Applied Intelligent Systems Research, Halmstad University, Halmstad, Sweden
|
20
|
Davis SE, Ssemaganda H, Koola JD, Mao J, Westerman D, Speroff T, Govindarajulu US, Ramsay CR, Sedrakyan A, Ohno-Machado L, Resnic FS, Matheny ME. Simulating complex patient populations with hierarchical learning effects to support methods development for post-market surveillance. BMC Med Res Methodol 2023; 23:89. [PMID: 37041457 PMCID: PMC10088292 DOI: 10.1186/s12874-023-01913-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2021] [Accepted: 04/04/2023] [Indexed: 04/13/2023]
Abstract
BACKGROUND Validating new algorithms, such as methods to disentangle intrinsic treatment risk from risk associated with experiential learning of novel treatments, often requires knowing the ground truth for data characteristics under investigation. Since the ground truth is inaccessible in real world data, simulation studies using synthetic datasets that mimic complex clinical environments are essential. We describe and evaluate a generalizable framework for injecting hierarchical learning effects within a robust data generation process that incorporates the magnitude of intrinsic risk and accounts for known critical elements in clinical data relationships. METHODS We present a multi-step data generating process with customizable options and flexible modules to support a variety of simulation requirements. Synthetic patients with nonlinear and correlated features are assigned to provider and institution case series. The probability of treatment and outcome assignment are associated with patient features based on user definitions. Risk due to experiential learning by providers and/or institutions when novel treatments are introduced is injected at various speeds and magnitudes. To further reflect real-world complexity, users can request missing values and omitted variables. We illustrate an implementation of our method in a case study using MIMIC-III data for reference patient feature distributions. RESULTS Realized data characteristics in the simulated data reflected specified values. Apparent deviations in treatment effects and feature distributions, though not statistically significant, were most common in small datasets (n < 3000) and attributable to random noise and variability in estimating realized values in small samples. 
When learning effects were specified, synthetic datasets exhibited changes in the probability of adverse outcomes as cases accrued for the treatment group impacted by learning and stable probabilities as cases accrued for the treatment group not affected by learning. CONCLUSIONS Our framework extends clinical data simulation techniques beyond generation of patient features to incorporate hierarchical learning effects. This enables the complex simulation studies required to develop and rigorously test algorithms developed to disentangle treatment safety signals from the effects of experiential learning. By supporting such efforts, this work can help identify training opportunities, avoid unwarranted restriction of access to medical advances, and hasten treatment improvements.
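The learning effect described in this abstract, where the probability of an adverse outcome changes as cases accrue, can be illustrated with a minimal sketch. The exponential decay form, the function name `adverse_outcome_probability`, and all parameter values are assumptions chosen for illustration; the abstract states only that learning is injected "at various speeds and magnitudes" and does not specify a functional form.

```python
import math


def adverse_outcome_probability(case_number, baseline_risk,
                                learning_magnitude, learning_speed):
    """Risk for the n-th case handled by one provider.

    ASSUMPTION: excess risk from inexperience decays exponentially with the
    number of prior cases. This is an illustrative form, not the one used
    in the cited framework.
    """
    prior_cases = case_number - 1
    excess_risk = learning_magnitude * math.exp(-learning_speed * prior_cases)
    # Clamp so the result is always a valid probability.
    return min(1.0, baseline_risk + excess_risk)


# Hypothetical settings: 5% intrinsic risk, 10% extra risk on the first case.
for n in (1, 10, 50, 200):
    print(n, round(adverse_outcome_probability(n, 0.05, 0.10, 0.02), 4))
```

Under these assumed settings the risk starts at the sum of the intrinsic and learning components and approaches the intrinsic risk as experience accumulates. A treatment group unaffected by learning corresponds to `learning_magnitude = 0`.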
Affiliation(s)
- Sharon E Davis
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
- Henry Ssemaganda
- Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, 41 Mall Road, Burlington, MA, 01803, USA
- Jejo D Koola
- UC Health Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Dr. MC 0728, La Jolla, San Diego, CA, 92093-0728, USA
- Jialin Mao
- Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
- Dax Westerman
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
- Theodore Speroff
- Departments of Medicine and Biostatistics, Vanderbilt University Medical Center, 1313 21St Avenue South, Oxford House, Room 209, Nashville, TN, 37232, USA
- Usha S Govindarajulu
- Center for Biostatistics, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1077, New York, NY, 10029, USA
- Craig R Ramsay
- Health Services Research Unit, University of Aberdeen, Health Sciences Building, Foresterhill, 3rd Floor, Aberdeen, AB25 2ZD, UK
- Art Sedrakyan
- Department of Population Health Sciences, Weill Cornell Medicine, 1300 York Avenue, New York, NY, 10065, USA
- Lucila Ohno-Machado
- Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, New Haven, CT, 06510, USA
- Frederic S Resnic
- Division of Cardiovascular Medicine and Comparative Effectiveness Research Institute, Lahey Hospital and Medical Center, Tufts University School of Medicine, 41 Burlington Mall Road, Burlington, MA, 01805, USA
- Michael E Matheny
- Departments of Biomedical Informatics, Biostatistics, and Medicine, Vanderbilt University Medical Center, 2525 West End Ave, Suite 1475, Nashville, TN, 37203, USA
- Geriatric Research Education and Clinical Care Center, Tennessee Valley Healthcare System VA, 1310 24th Avenue South, Nashville, TN, 37212, USA
21
Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT. A method for generating synthetic longitudinal health data. BMC Med Res Methodol 2023; 23:67. [PMID: 36959532 PMCID: PMC10034254 DOI: 10.1186/s12874-023-01869-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 02/19/2023] [Indexed: 03/25/2023] [Open Access]
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that assess the structure and general patterns in the data as well as by recreating a specific analysis in the real data commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments that used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027), attributes (mean 0.0417), Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2: mean Hellinger distance 0.2195, sd: 0.2724), the Hellinger distance between the joint distributions was 0.352, and the similarity of random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating synthetic data produces similar analytic results to real data. 
The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes thereby alleviating concerns associated with the sharing of real data in some circumstances.
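The utility assessments in this abstract rely on the Hellinger distance between real and synthetic distributions. As a minimal sketch of that metric for discrete distributions, the function below uses the standard definition, which ranges from 0 (identical) to 1 (no overlap). The function name and the example frequencies are hypothetical and are not taken from the paper.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are equal-length sequences of probabilities over the same
    categories, each summing to 1.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    squared_gap = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_gap / 2.0)


# Hypothetical event-type frequencies, for illustration only.
real_freq = [0.50, 0.30, 0.20]
synthetic_freq = [0.48, 0.33, 0.19]
print(round(hellinger(real_freq, synthetic_freq), 4))
```

Values near 0, such as the 0.027 reported for event types, indicate that the synthetic marginal distribution closely tracks the real one.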
Affiliation(s)
- Lucy Mosquera
- Replica Analytics Ltd, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- Khaled El Emam
- Replica Analytics Ltd, Ottawa, ON, Canada
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
- Lei Ding
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Vishal Sharma
- School of Public Health, University of Alberta, Edmonton, AB, Canada
- Samer El Kababji
- Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON, K1J 8L1, Canada
- Dan Palfrey
- Institute of Health Economics, Edmonton, Alberta, Canada
- Linglong Kong
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Bei Jiang
- Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada
- Dean T Eurich
- School of Public Health, University of Alberta, Edmonton, AB, Canada
22
Theodorou B, Xiao C, Sun J. Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model. Research Square 2023:rs.3.rs-2644725. [PMID: 36945542 PMCID: PMC10029081 DOI: 10.21203/rs.3.rs-2644725/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]
Abstract
Synthetic electronic health records (EHRs) that are both realistic and preserve privacy can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity and granular EHR data in its original, high-dimensional form poses challenges for existing methods due to the complexities inherent in high-dimensional data. In this paper, we propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal high-dimensional EHR, which preserves the statistical properties of real EHR and can be used to train accurate ML models without privacy concerns. Our HALO method, designed as a hierarchical autoregressive model, generates a probability density function of medical codes, clinical visits, and patient records, allowing for the generation of realistic EHR data in its original, unaggregated form without the need for variable selection or aggregation. Additionally, our model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conducted extensive experiments and demonstrate that HALO can generate high-fidelity EHR data with high-dimensional disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000) and achieve an R² correlation above 0.9 in comparison to real EHR data. In comparison to the leading baseline, HALO improves predictive modeling by over 17% in its predictive accuracy and perplexity on a held-out test set of real EHR data. This performance then enables downstream ML models trained on its synthetic data to achieve comparable accuracy to models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, using a combination of real and synthetic data enhances the accuracy of ML models beyond that achieved by using only real EHR data.
23
Khan B, Fatima H, Qureshi A, Kumar S, Hanan A, Hussain J, Abdullah S. Drawbacks of Artificial Intelligence and Their Potential Solutions in the Healthcare Sector. Biomedical Materials & Devices (New York, N.Y.) 2023; 1:1-8. [PMID: 36785697 PMCID: PMC9908503 DOI: 10.1007/s44174-023-00063-2] [Citation(s) in RCA: 42] [Impact Index Per Article: 42.0] [Received: 11/18/2022] [Accepted: 01/19/2023] [Indexed: 02/10/2023]
Abstract
Artificial intelligence (AI) has the potential to make substantial progress toward the goal of making healthcare more personalized, predictive, preventative, and interactive. We believe AI will continue its present path and ultimately become a mature and effective tool for the healthcare sector. However, AI-based systems raise concerns regarding data security and privacy. Because health records are important and vulnerable, hackers often target them during data breaches. The absence of standard guidelines for the ethical use of AI and ML in healthcare has only served to worsen the situation. There is debate about how far AI may be utilized ethically in healthcare settings since there are no universal guidelines for its use. Therefore, maintaining the confidentiality of medical records is crucial. This study highlights the possible drawbacks of implementing AI in the healthcare sector and potential solutions to overcome them.
Affiliation(s)
- Bangul Khan
- Hong Kong Centre for Cerebro-Cardiovascular Health Engineering (COCHE), Shatin, Hong Kong
- Riphah International University, Lahore, Pakistan
- Hajira Fatima
- Mehran University of Engineering and Technology, Jamshoro, Pakistan
- Abdul Hanan
- Mehran University of Engineering and Technology, Jamshoro, Pakistan
- Saad Abdullah
- Riphah International University, Lahore, Pakistan
- Mälardalen University, Västerås, Sweden
24
Yang W, Zou H, Wang M, Zhang Q, Li S, Liang H. Mortality prediction among ICU inpatients based on MIMIC-III database results from the conditional medical generative adversarial network. Heliyon 2023; 9:e13200. [PMID: 36798767 PMCID: PMC9925961 DOI: 10.1016/j.heliyon.2023.e13200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 01/18/2023] [Accepted: 01/19/2023] [Indexed: 01/26/2023] [Open Access]
Abstract
Background and aims Improved mortality prediction among intensive care unit (ICU) inpatients is a valuable and challenging task. Limited clinical data, especially with appropriate labels, are an important element restricting accurate predictions. Generative adversarial networks (GANs) are excellent generative models and have shown great potential for data simulation. However, there have been no relevant studies using GANs to predict mortality among ICU inpatients. In this study, we aim to evaluate the predictive performance of a variant of GAN called conditional medical GAN (c-med GAN) compared with some baseline models, including simplified acute physiology score II (SAPS II), support vector machine (SVM), and multilayer perceptron (MLP). Methods Data from a publicly available intensive care database, the Medical Information Mart for Intensive Care III (MIMIC-III) database (v1.4), were included in this study. The area under the precision-recall curve (PR-AUC), area under the receiver operating characteristic curve (ROC-AUC), and F1 score were used to evaluate the predictive performance. In addition, the size of the dataset was artificially reduced, and the performance of the c-med GAN was compared in different size datasets. Results The results showed that c-med GAN achieves the best PR-AUC, ROC-AUC, and F1 score compared with SAPS II, SVM, and MLP when training in the full MIMIC-III dataset. When the size of the dataset was reduced, the prediction performances of both MLP and c-med GAN were affected. However, the c-med GAN still outperformed MLP on smaller datasets and had less degradation. Conclusion The prediction of in-hospital mortality based on the c-med GAN for ICU patients showed better performance than the baseline models. Despite some inadequacies, this model may have a promising future in clinical applications which will be explored by further research.
Affiliation(s)
- Wei Yang
- Department of Urology, The General Hospital of Western Theater Command (Chengdu Military General Hospital), Chengdu, 610083, China
- Hong Zou
- Department of General Surgery, The General Hospital of Western Theater Command (Chengdu Military General Hospital), Chengdu, 610083, China
- Department of Liver Surgery & Liver Transplantation, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University and Collaborative Innovation Center of Biotherapy, Chengdu 610044, Sichuan Province, China
- Meng Wang
- Department of Traditional Chinese Medicine, The General Hospital of Western Theater Command (Chengdu Military General Hospital), Chengdu, 610083, China
- Qin Zhang
- Department of Gastroenterology, The 77th Army Hospital, Jiajiang, 614100, China
- Shadan Li
- Department of Urology, The General Hospital of Western Theater Command (Chengdu Military General Hospital), Chengdu, 610083, China
- Hongyin Liang (corresponding author)
- Department of General Surgery, The General Hospital of Western Theater Command (Chengdu Military General Hospital), Chengdu, 610083, China
25
Hernadez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods Inf Med 2023. [PMID: 36623830 DOI: 10.1055/s-0042-1760247] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/11/2023]
Abstract
BACKGROUND Synthetic tabular data generation is a potentially valuable technology with great promise for data augmentation and privacy preservation. However, prior to adoption, an empirical assessment of generated synthetic tabular data is required across dimensions relevant to the target application to determine its efficacy. A lack of standardized and objective evaluation and benchmarking strategy for synthetic tabular data in the health domain has been found in the literature. OBJECTIVE The aim of this paper is to identify key dimensions, per dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development and to provide a strategy to orchestrate them. METHODS Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation are orchestrated into a complete evaluation pipeline. This way, a guided and comparative assessment of generated synthetic tabular data can be done, categorizing its quality into three categories ("Excellent," "Good," and "Poor"). Six health care-related datasets and four synthetic tabular data generation approaches have been chosen to conduct an analysis and evaluation to verify the utility of the proposed evaluation pipeline. RESULTS The synthetic tabular data generated with the four selected approaches has maintained resemblance, utility, and privacy for most datasets and synthetic tabular data generation approach combination. In several datasets, some approaches have outperformed others, while in other datasets, more than one approach has yielded the same performance. CONCLUSION The results have shown that the proposed pipeline can effectively be used to evaluate and benchmark the synthetic tabular data generated by various synthetic tabular data generation approaches. 
Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
Affiliation(s)
- Mikel Hernadez
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
- Gorka Epelde
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
- eHealth Group, Biodonostia Health Research Institute, Donostia-San Sebastian, Spain
- Ane Alberdi
- Biomedical Engineering Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Spain
- Rodrigo Cilla
- Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain
- Debbie Rankin
- School of Computing, Engineering and Intelligent Systems, Ulster University, Derry-Londonderry, United Kingdom
26
Kroes SKS, van Leeuwen M, Groenwold RHH, Janssen MP. Generating synthetic mixed discrete-continuous health records with mixed sum-product networks. J Am Med Inform Assoc 2022; 30:16-25. [PMID: 36228120 PMCID: PMC9748584 DOI: 10.1093/jamia/ocac184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2022] [Revised: 09/09/2022] [Accepted: 10/01/2022] [Indexed: 12/15/2022] [Open Access]
Abstract
OBJECTIVE Privacy is a concern whenever individual patient health data is exchanged for scientific research. We propose using mixed sum-product networks (MSPNs) as private representations of data and take samples from the network to generate synthetic data that can be shared for subsequent statistical analysis. This anonymization method was evaluated with respect to privacy and information loss. MATERIALS AND METHODS Using a simulation study, information loss was quantified by assessing whether synthetic data could reproduce regression parameters obtained from the original data. Predictor variable types were varied between continuous, count, categorical, and mixed discrete-continuous. Additionally, we measured whether the MSPN approach successfully anonymizes the data by removing associations between background and sensitive information for these datasets. RESULTS The synthetic data generated with MSPNs yielded regression results highly similar to those generated with original data, differing less than 5% in most simulation scenarios. Standard errors increased compared to the original data. Particularly for smaller datasets (1000 records), this resulted in a discrepancy between the estimated and empirical standard errors. Sensitive values could no longer be inferred from background information for at least 99% of tested individuals. DISCUSSION The proposed anonymization approach yields very promising results. Further research is required to evaluate its performance with other types of data and analyses, and to predict how user parameter choices affect a bias-privacy trade-off. CONCLUSION Generating synthetic data from MSPNs is a promising, easy-to-use approach for anonymization of sensitive individual health data that yields informative and private data.
Affiliation(s)
- Shannon K S Kroes
- Transfusion Technology Assessment Group, Donor Medicine Research Department, Sanquin Research, Amsterdam, The Netherlands
- Leiden Institute of Advanced Computer Science, Computer Science, Leiden University, Leiden, The Netherlands
- Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands
- Matthijs van Leeuwen
- Leiden Institute of Advanced Computer Science, Computer Science, Leiden University, Leiden, The Netherlands
- Rolf H H Groenwold
- Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Mart P Janssen
- Transfusion Technology Assessment Group, Donor Medicine Research Department, Sanquin Research, Amsterdam, The Netherlands
- Leiden Institute of Advanced Computer Science, Computer Science, Leiden University, Leiden, The Netherlands
27
Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, Mooney SD, Malin BA. A multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022; 13:7609. [PMID: 36494374 PMCID: PMC9734113 DOI: 10.1038/s41467-022-35295-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/05/2022] [Accepted: 11/28/2022] [Indexed: 12/13/2022] [Open Access]
Abstract
Synthetic health data have the potential to mitigate privacy concerns in supporting biomedical research and healthcare applications. Modern approaches for data generation continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a systematic benchmarking framework to appraise key characteristics with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic health data and further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Yao Yan
- Sage Bionetworks, Seattle, WA, USA
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Larsson Omberg
- Sage Bionetworks, Seattle, WA, USA
- Justin Guinney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Tempus Labs, Chicago, IL, USA
- Sean D. Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
- Bradley A. Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
28
Halfpenny W, Baxter SL. Towards effective data sharing in ophthalmology: data standardization and data privacy. Curr Opin Ophthalmol 2022; 33:418-424. [PMID: 35819893 PMCID: PMC9357189 DOI: 10.1097/icu.0000000000000878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
PURPOSE OF REVIEW The purpose of this review is to provide an overview of updates in data standardization and data privacy in ophthalmology. These topics represent two key aspects of medical information sharing and are important knowledge areas given trends in data-driven healthcare. RECENT FINDINGS Standardization and privacy can be seen as complementary aspects that pertain to data sharing. Standardization promotes the ease and efficacy through which data is shared. Privacy considerations ensure that data sharing is appropriate and sufficiently controlled. There is active development in both areas, including government regulations and common data models to advance standardization, and application of technologies such as blockchain and synthetic data to help tackle privacy issues. These advancements have seen use in ophthalmology, but there are areas where further work is required. SUMMARY Information sharing is fundamental to both research and care delivery, and standardization/privacy are key constituent considerations. Therefore, widespread engagement with, and development of, data standardization and privacy ecosystems stand to offer great benefit to ophthalmology.
Affiliation(s)
- Sally L. Baxter
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA, USA
- Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
29
Zhang Z, Yan C, Malin BA. Keeping synthetic patients on track: feedback mechanisms to mitigate performance drift in longitudinal health data simulation. J Am Med Inform Assoc 2022; 29:1890-1898. [PMID: 35927974 PMCID: PMC9552284 DOI: 10.1093/jamia/ocac131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Revised: 06/25/2022] [Accepted: 07/22/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
OBJECTIVE Synthetic data are increasingly relied upon to share electronic health record (EHR) data while maintaining patient privacy. Current simulation methods can generate longitudinal data, but the results are unreliable for several reasons. First, the synthetic data drifts from the real data distribution over time. Second, the typical approach to quality assessment, which is based on the extent to which real records can be distinguished from synthetic records using a critic model, often fails to recognize poor simulation results. In this article, we introduce a longitudinal simulation framework, called LS-EHR, which addresses these issues. MATERIALS AND METHODS LS-EHR enhances simulation through conditional fuzzing and regularization, rejection sampling, and prior knowledge embedding. We compare LS-EHR to the state-of-the-art using data from 60 000 EHRs from Vanderbilt University Medical Center (VUMC) and the All of Us Research Program. We assess discrimination between real and synthetic data over time. We evaluate the generation process and critic model using the area under the receiver operating characteristic curve (AUROC). For the critic, a higher value indicates a more robust model for quality assessment. For the generation process, a lower value indicates better synthetic data quality. RESULTS The LS-EHR critic improves discrimination AUROC from 0.655 to 0.909 and 0.692 to 0.918 for VUMC and All of Us data, respectively. By using the new critic, the LS-EHR generation model reduces the AUROC from 0.909 to 0.758 and 0.918 to 0.806. CONCLUSION LS-EHR can substantially improve the usability of simulated longitudinal EHR data.
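This abstract reads AUROC in two directions: a higher value means a stronger critic, while a lower value for a fixed critic means synthetic records that are harder to tell apart from real ones. The sketch below computes AUROC from critic scores using the rank-based (Mann-Whitney) formulation. The function name and the scores are hypothetical, and the code does not reproduce the LS-EHR critic itself.

```python
def discrimination_auroc(real_scores, synthetic_scores):
    """AUROC of a critic that should score real records above synthetic ones.

    Equals the probability that a randomly chosen real record receives a
    higher critic score than a randomly chosen synthetic record, with ties
    counted as one half.
    """
    if not real_scores or not synthetic_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for r in real_scores:
        for s in synthetic_scores:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(real_scores) * len(synthetic_scores))


# Hypothetical critic outputs, for illustration only.
print(discrimination_auroc([0.9, 0.4], [0.6, 0.1]))
```

A value of 0.5 means the critic cannot separate the two sets at all, and a value of 1.0 means it separates them perfectly. The quadratic pairwise loop is kept for readability; a sort-based ranking would scale better on large record sets.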
Affiliation(s)
- Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Bradley A Malin
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
30
Hahn W, Schütte K, Schultz K, Wolkenhauer O, Sedlmayr M, Schuler U, Eichler M, Bej S, Wolfien M. Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care. J Pers Med 2022; 12:1278. [PMID: 36013227 PMCID: PMC9409663 DOI: 10.3390/jpm12081278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 07/29/2022] [Accepted: 08/01/2022] [Indexed: 11/23/2022] [Open Access]
Abstract
AI model development for synthetic data generation to improve Machine Learning (ML) methodologies is an integral part of research in Computer Science and is currently being transferred to related medical fields, such as Systems Medicine and Medical Informatics. In general, the idea of personalized decision-making support based on patient data has driven the motivation of researchers in the medical domain for more than a decade, but the overall sparsity and scarcity of data are still major limitations. This is in contrast to currently applied technology that allows us to generate and analyze patient data in diverse forms, such as tabular data on health records, medical images, genomics data, or even audio and video. One solution arising to overcome these data limitations in relation to medical records is the synthetic generation of tabular data based on real world data. Consequently, ML-assisted decision-support can be interpreted more conveniently, using more relevant patient data at hand. At a methodological level, several state-of-the-art ML algorithms generate and derive decisions from such data. However, there remain key issues that hinder a broad practical implementation in real-life clinical settings. In this review, we will give for the first time insights towards current perspectives and potential impacts of using synthetic data generation in palliative care screening because it is a challenging prime example of highly individualized, sparsely available patient information. Taken together, the reader will obtain initial starting points and suitable solutions relevant for generating and using synthetic data for ML-based screenings in palliative care and beyond.
Affiliation(s)
- Waldemar Hahn
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
- Katharina Schütte
- University Palliative Center, University Hospital Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
- Kristian Schultz
- Department of Systems Biology and Bioinformatics, University of Rostock, Universitätsplatz 1, 18051 Rostock, Germany
- Olaf Wolkenhauer
- Department of Systems Biology and Bioinformatics, University of Rostock, Universitätsplatz 1, 18051 Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University Munich, 85354 Freising, Germany
- Stellenbosch Institute of Advanced Study, Wallenberg Research Centre, Stellenbosch University, Stellenbosch 7602, South Africa
- Martin Sedlmayr
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
- Ulrich Schuler
- University Palliative Center, University Hospital Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
- Martin Eichler
- National Center for Tumor Diseases Dresden (NCT/UCC), Fetscherstraße 74, 01307 Dresden, Germany
- German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Faculty of Medicine, University Hospital Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
- Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Bautzner Landstraße 400, 01328 Dresden, Germany
- Saptarshi Bej
- Department of Systems Biology and Bioinformatics, University of Rostock, Universitätsplatz 1, 18051 Rostock, Germany
- Leibniz-Institute for Food Systems Biology, Technical University Munich, 85354 Freising, Germany
- Markus Wolfien
- Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Fetscherstraße 74, 01307 Dresden, Germany
|
31
|
Javidi H, Mariam A, Khademi G, Zabor EC, Zhao R, Radivoyevitch T, Rotroff DM. Identification of robust deep neural network models of longitudinal clinical measurements. NPJ Digit Med 2022; 5:106. [PMID: 35896817 PMCID: PMC9329311 DOI: 10.1038/s41746-022-00651-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2022] [Accepted: 07/06/2022] [Indexed: 11/09/2022] Open
Abstract
Deep learning (DL) from electronic health records holds promise for disease prediction, but systematic methods for learning from simulated longitudinal clinical measurements have yet to be reported. We compared nine DL frameworks using simulated body mass index (BMI), glucose, and systolic blood pressure trajectories, independently isolated shape and magnitude changes, and evaluated model performance across various parameters (e.g., irregularity, missingness). Overall, discrimination based on variation in shape was more challenging than magnitude. Time-series forest-convolutional neural networks (TSF-CNN) and Gramian angular field (GAF)-CNN outperformed other approaches (P < 0.05) with overall area-under-the-curve (AUCs) of 0.93 for both models, and 0.92 and 0.89 for variation in magnitude and shape with up to 50% missing data. Furthermore, in a real-world assessment, the TSF-CNN model predicted T2D with AUCs reaching 0.72 using only BMI trajectories. In conclusion, we performed an extensive evaluation of DL approaches and identified robust modeling frameworks for disease prediction based on longitudinal clinical measurements.
Affiliation(s)
- Hamed Javidi
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, OH, USA
- Arshiya Mariam
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Gholamreza Khademi
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Emily C Zabor
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Ran Zhao
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Tomas Radivoyevitch
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Daniel M Rotroff
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA.
- Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, OH, USA.
- Endocrinology and Metabolism Institute, Cleveland Clinic, Cleveland, OH, USA.
- Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH, USA.
|
32
|
GAN-Based Approaches for Generating Structured Data in the Medical Domain. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12147075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Modern machine and deep learning methods require large datasets to achieve reliable and robust results. This requirement is often difficult to meet in the medical field, due to data sharing limitations imposed by privacy regulations or the presence of a small number of patients (e.g., rare diseases). To address this data scarcity and to improve the situation, novel generative models such as Generative Adversarial Networks (GANs) have been widely used to generate synthetic data that mimic real data by representing features that reflect health-related information without reference to real patients. In this paper, we consider several GAN models to generate synthetic data used for training binary (malignant/benign) classifiers, and compare their performances in terms of classification accuracy with cases where only real data are considered. We aim to investigate how synthetic data can improve classification accuracy, especially when a small amount of data is available. To this end, we have developed and implemented an evaluation framework where binary classifiers are trained on extended datasets containing both real and synthetic data. The results show improved accuracy for classifiers trained with generated data from more advanced GAN models, even when limited amounts of original data are available.
|
33
|
Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.053] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/18/2022]
|
34
|
André A, Peyrou B, Carpentier A, Vignaux JJ. Feasibility and Assessment of a Machine Learning-Based Predictive Model of Outcome After Lumbar Decompression Surgery. Global Spine J 2022; 12:894-908. [PMID: 33207969 PMCID: PMC9344503 DOI: 10.1177/2192568220969373] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/17/2022] Open
Abstract
STUDY DESIGN Retrospective study at a single center. OBJECTIVE The aim of this study is twofold: to develop a virtual patient model for lumbar decompression surgery and to evaluate the precision of an artificial neural network (ANN) model designed to accurately predict the clinical outcomes of lumbar decompression surgery. METHODS We performed a retrospective study of complete Electronic Health Records (EHR) to identify potential unfavorable criteria for spine surgery (predictors). A cohort of synthetic EHR was created to classify patients by surgical success (green zone) or partial failure (orange zone) using an Artificial Neural Network which screens all the available predictors. RESULTS In the actual cohort, we included 60 patients with complete EHR allowing efficient analysis; 26 patients were in the orange zone (43.4%) and 34 were in the green zone (56.6%). The average positive criteria amount for actual patients was 8.62 for the green zone (SD ± 3.09) and 10.92 for the orange zone (SD ± 3.38). The classifier (a neural network) was trained using 10,000 virtual patients and 2000 virtual patients were used for test purposes. The 12,000 virtual patients were generated from the 60 EHR, of which half were in the green zone and half in the orange zone. The model showed an accuracy of 72% and a ROC score of 0.78. The sensitivity was 0.885 and the specificity 0.59. CONCLUSION Our method can be used to predict a favorable patient to have lumbar decompression surgery. However, there is still a need to further develop its ability to analyze patients in the "failure of treatment" zone to offer precise management of patient health before spinal surgery.
Affiliation(s)
- Arthur André
- Ramsay santé, Clinique Geoffroy Saint-Hilaire, Paris, France; Neurosurgery Department, Pitié-Salpêtrière University Hospital, Paris, France; Cortexx Medical Intelligence, Paris, France; Arthur André, Cortexx Medical Intelligence, 156 Boulevard Haussmann, 75008 Paris.
|
35
|
Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain. ELECTRONICS 2022. [DOI: 10.3390/electronics11050812] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 01/27/2023]
Abstract
To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how the adoption of state-of-the-art synthetic data generation techniques can be applied for real-world applications.
|
36
|
Torfi A, Fox EA, Reddy CK. Differentially private synthetic medical data generation using convolutional GANs. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.12.018] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 01/08/2023]
|
37
|
Gupta M, Poulain R, Phan TLT, Bunnell HT, Beheshti R. Flexible-Window Predictions on Electronic Health Records. PROCEEDINGS OF THE ... AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE. AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE 2022; 36:12510-12516. [PMID: 36312212 PMCID: PMC9610888 DOI: 10.1609/aaai.v36i11.21520] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/16/2023]
Abstract
Various types of machine learning techniques are available for analyzing electronic health records (EHRs). For predictive tasks, most existing methods either explicitly or implicitly divide these time-series datasets into predetermined observation and prediction windows. Patients have different lengths of medical history and the desired predictions (for purposes such as diagnosis or treatment) are required at different times in the future. In this paper, we propose a method that uses a sequence-to-sequence generator model to transfer an input sequence of EHR data to a sequence of user-defined target labels, providing the end-users with "flexible" observation and prediction windows to define. We use adversarial and semi-supervised approaches in our design, where the sequence-to-sequence model acts as a generator and a discriminator distinguishes between the actual (observed) and generated labels. We evaluate our models through an extensive series of experiments using two large EHR datasets from adult and pediatric populations. In an obesity predicting case study, we show that our model can achieve superior results in flexible-window prediction tasks, after being trained once and even with large missing rates on the input EHR data. Moreover, using a number of attention analysis experiments, we show that the proposed model can effectively learn more relevant features in different prediction tasks.
|
38
|
Dinh TQ, Xiong Y, Huang Z, Vo T, Mishra A, Kim WH, Ravi SN, Singh V. Performing Group Difference Testing on Graph Structured Data From GANs: Analysis and Applications in Neuroimaging. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:877-889. [PMID: 32763848 PMCID: PMC7867665 DOI: 10.1109/tpami.2020.3013433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Generative adversarial networks (GANs) have emerged as a powerful generative model in computer vision. Given their impressive abilities in generating highly realistic images, they are also being used in novel ways in applications in the life sciences. This raises an interesting question when GANs are used in scientific or biomedical studies. Consider the setting where we are restricted to only using the samples from a trained GAN for downstream group difference analysis (and do not have direct access to the real data). Will we obtain similar conclusions? In this work, we explore if "generated" data, i.e., sampled from such GANs can be used for performing statistical group difference tests in cases versus controls studies, common across many scientific disciplines. We provide a detailed analysis describing regimes where this may be feasible. We complement the technical results with an empirical study focused on the analysis of cortical thickness on brain mesh surfaces in an Alzheimer's disease dataset. To exploit the geometric nature of the data, we use simple ideas from spectral graph theory to show how adjustments to existing GANs can yield improvements. We also give a generalization error bound by extending recent results on Neural Network Distance. To our knowledge, our work offers the first analysis assessing whether the Null distribution in "healthy versus diseased subjects" type statistical testing using data generated from the GANs coincides with the one obtained from the same analysis with real data. The code is available at https://github.com/yyxiongzju/GLapGAN.
|
39
|
Postpartum pelvic organ prolapse assessment via adversarial feature complementation in heterogeneous data. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06869-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
40
|
Nie Y, Huang C, Liang H, Xu H. Adversarial and Implicit Modality Imputation with Applications to Depression Early Detection. ARTIF INTELL 2022. [DOI: 10.1007/978-3-031-20500-2_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/02/2023]
|
41
|
Foomani FH, Anisuzzaman DM, Niezgoda J, Niezgoda J, Guns W, Gopalakrishnan S, Yu Z. Synthesizing time-series wound prognosis factors from electronic medical records using generative adversarial networks. J Biomed Inform 2021; 125:103972. [PMID: 34920125 DOI: 10.1016/j.jbi.2021.103972] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 09/20/2021] [Accepted: 12/03/2021] [Indexed: 11/26/2022]
Abstract
Wound prognostic models not only provide an estimate of wound healing time to motivate patients to follow up their treatments but also can help clinicians to decide whether to use a standard care or adjuvant therapies and to assist them with designing clinical trials. However, collecting prognosis factors from Electronic Medical Records (EMR) of patients is challenging due to privacy, sensitivity, and confidentiality. In this study, we developed time series medical generative adversarial networks (GANs) to generate synthetic wound prognosis factors using very limited information collected during routine care in a specialized wound care facility. The generated prognosis variables are used in developing a predictive model for chronic wound healing trajectory. Our novel medical GAN can produce both continuous and categorical features from EMR. Moreover, we applied temporal information to our model by considering data collected from the weekly follow-ups of patients. Conditional training strategies were utilized to enhance training and generate classified data in terms of healing or non-healing. The ability of the proposed model to generate realistic EMR data was evaluated by TSTR (test on the synthetic, train on the real), discriminative accuracy, and visualization. We utilized samples generated by our proposed GAN in training a prognosis model to demonstrate its real-life application. Using the generated samples in training predictive models improved the classification accuracy by 6.66-10.01% compared to the previous EMR-GAN. Additionally, the suggested prognosis classifier has achieved the area under the curve (AUC) of 0.875, 0.810, and 0.647 when training the network using data from the first three visits, first two visits, and first visit, respectively. These results indicate a significant improvement in wound healing prediction compared to the previous prognosis models.
Affiliation(s)
- Farnaz H Foomani
- Department of Electrical Engineering, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
- D M Anisuzzaman
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
- William Guns
- AZH Wound and Vascular Center, Milwaukee, WI, United States
- Zeyun Yu
- Department of Electrical Engineering, University of Wisconsin-Milwaukee, Milwaukee, WI, United States; Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States.
|
42
|
Skandarani Y, Lalande A, Afilalo J, Jodoin PM. Generative Adversarial Networks in Cardiology. Can J Cardiol 2021; 38:196-203. [PMID: 34780990 DOI: 10.1016/j.cjca.2021.11.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/17/2021] [Revised: 11/04/2021] [Accepted: 11/08/2021] [Indexed: 01/18/2023] Open
Abstract
Generative Adversarial Networks (GANs) are state-of-the-art neural network models used to synthesize images and other data. GANs brought a considerable improvement to the quality of synthetic data, quickly becoming the standard for data generation tasks. In this work, we summarize the applications of GANs in the field of cardiology, including generation of realistic cardiac images, electrocardiography signals, and synthetic electronic health records. The utility of GAN-generated data is discussed with respect to research, clinical care, and academia. Moreover, we present illustrative examples of our GAN-generated cardiac magnetic resonance and echocardiography images, showing the evolution in image quality across six different models, which has become almost indistinguishable from real images. Finally, we discuss future applications, such as modality translation or patient trajectory modeling. Moreover, we discuss the pending challenges that GANs need to overcome, namely their training dynamics, the medical fidelity or the data regulations and ethics questions, to become integrated in cardiology workflows.
Affiliation(s)
- Alain Lalande
- Laboratoire ImVIA, Université de Bourgogne, 64 rue Sully, 21000 Dijon, France; Medical Imaging Department, University Hospital of Dijon, 1 Bld Jeanne d'Arc, 21079, Dijon, France
- Jonathan Afilalo
- Jewish General Hospital, McGill University, 3755 Côte Ste-Catherine Road, Montreal, Qc, Canada, H3T 1E2
- Pierre-Marc Jodoin
- Université de Sherbrooke, 2500 Boul. de l'Universite, Sherbrooke, Qc, Canada, J1K 2R1
|
43
|
Zuo Z, Watson M, Budgen D, Hall R, Kennelly C, Al Moubayed N. Data Anonymization for Pervasive Health Care: Systematic Literature Mapping Study. JMIR Med Inform 2021; 9:e29871. [PMID: 34652278 PMCID: PMC8556642 DOI: 10.2196/29871] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/23/2021] [Revised: 06/21/2021] [Accepted: 08/02/2021] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Data science offers an unparalleled opportunity to identify new insights into many aspects of human life with recent advances in health care. Using data science in digital health raises significant challenges regarding data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example, the European Union's General Data Protection Regulation (2016) and the United Kingdom's Data Protection Act (2018). For health care providers, legal use of the electronic health record (EHR) is permitted only in clinical care cases. Any other use of the data requires thoughtful considerations of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized to be used for research purposes, with risk assessment for reidentification and utility. Although health care organizations have internal policies defined for information governance, there is a significant lack of practical tools and intuitive guidance about the use of data for research and modeling. Off-the-shelf data anonymization tools are developed frequently, but privacy-related functionalities are often incomparable with regard to use in different problem domains. In addition, tools to support measuring the risk of the anonymized data with regard to reidentification against the usefulness of the data exist, but there are question marks over their efficacy. OBJECTIVE In this systematic literature mapping study, we aim to alleviate the aforementioned issues by reviewing the landscape of data anonymization for digital health care. METHODS We used Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Noteworthy gray literature was also used to initialize the search. 
We focused on review questions covering 5 bottom-up aspects: basic anonymization operations, privacy models, reidentification risk and usability metrics, off-the-shelf anonymization tools, and the lawful basis for EHR data anonymization. RESULTS We identified 239 eligible studies, of which 60 were chosen for general background information; 16 were selected for 7 basic anonymization operations; 104 covered 72 conventional and machine learning-based privacy models; four and 19 papers included seven and 15 metrics, respectively, for measuring the reidentification risk and degree of usability; and 36 explored 20 data anonymization software tools. In addition, we also evaluated the practical feasibility of performing anonymization on EHR data with reference to their usability in medical decision-making. Furthermore, we summarized the lawful basis for delivering guidance on practical EHR data anonymization. CONCLUSIONS This systematic literature mapping study indicates that anonymization of EHR data is theoretically achievable; yet, it requires more research efforts in practical implementations to balance privacy preservation and usability to ensure more reliable health care applications.
Affiliation(s)
- Zheming Zuo
- Department of Computer Science, Durham University, Durham, United Kingdom
- Matthew Watson
- Department of Computer Science, Durham University, Durham, United Kingdom
- David Budgen
- Department of Computer Science, Durham University, Durham, United Kingdom
- Robert Hall
- Cievert Ltd, Newcastle upon Tyne, United Kingdom
- Noura Al Moubayed
- Department of Computer Science, Durham University, Durham, United Kingdom
|
44
|
|
45
|
Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics 2021; 22:122. [PMID: 34525993 PMCID: PMC8442400 DOI: 10.1186/s12910-021-00687-3] [Citation(s) in RCA: 100] [Impact Index Per Article: 33.3] [Received: 03/16/2021] [Accepted: 08/25/2021] [Indexed: 12/15/2022] Open
Abstract
Background Advances in healthcare artificial intelligence (AI) are occurring rapidly and there is a growing discussion about managing its development. Many AI technologies end up owned and controlled by private entities. The nature of the implementation of AI could mean such corporations, clinics and public bodies will have a greater than typical role in obtaining, utilizing and protecting patient health information. This raises privacy issues relating to implementation and data security.
Main body The first set of concerns includes access, use and control of patient data in private hands. Some recent public–private partnerships for implementing AI have resulted in poor protection of privacy. As such, there have been calls for greater systemic oversight of big data health research. Appropriate safeguards must be in place to maintain privacy and patient agency. Private custodians of data can be impacted by competing goals and should be structurally encouraged to ensure data protection and to deter alternative use thereof. Another set of concerns relates to the external risk of privacy breaches through AI-driven methods. The ability to deidentify or anonymize patient health data may be compromised or even nullified in light of new algorithms that have successfully reidentified such data. This could increase the risk to patient data under private custodianship. Conclusions We are currently in a familiar situation in which regulation and oversight risk falling behind the technologies they govern. Regulation should emphasize patient agency and consent, and should encourage increasingly sophisticated methods of data anonymization and protection.
Affiliation(s)
- Blake Murdoch
- Health Law Institute, Faculty of Law, University of Alberta, Edmonton, AB, T6G 2H5, Canada.
|
46
|
Wolterink JM, Mukhopadhyay A, Leiner T, Vogl TJ, Bucher AM, Išgum I. Generative Adversarial Networks: A Primer for Radiologists. Radiographics 2021; 41:840-857. [PMID: 33891522 DOI: 10.1148/rg.2021200151] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 12/22/2022]
Abstract
Artificial intelligence techniques involving the use of artificial neural networks-that is, deep learning techniques-are expected to have a major effect on radiology. Some of the most exciting applications of deep learning in radiology make use of generative adversarial networks (GANs). GANs consist of two artificial neural networks that are jointly optimized but with opposing goals. One neural network, the generator, aims to synthesize images that cannot be distinguished from real images. The second neural network, the discriminator, aims to distinguish these synthetic images from real images. These deep learning models allow, among other applications, the synthesis of new images, acceleration of image acquisitions, reduction of imaging artifacts, efficient and accurate conversion between medical images acquired with different modalities, and identification of abnormalities depicted on images. The authors provide an introduction to GANs and adversarial deep learning methods. In addition, the different ways in which GANs can be used for image synthesis and image-to-image translation tasks, as well as the principles underlying conditional GANs and cycle-consistent GANs, are described. Illustrated examples of GAN applications in radiologic image analysis for different imaging modalities and different tasks are provided. The clinical potential of GANs, future clinical GAN applications, and potential pitfalls and caveats that radiologists should be aware of also are discussed in this review. The online slide presentation from the RSNA Annual Meeting is available for this article. ©RSNA, 2021.
Affiliation(s)
- Jelmer M Wolterink
- From the Department of Applied Mathematics, Faculty of Electrical Engineering, Mathematics and Computer Science, Technical Medical Centre, University of Twente, Zilverling, PO Box 217, 7500 AE Enschede, the Netherlands (J.M.W.); Department of Biomedical Engineering and Physics (J.M.W., I.I.) and Department of Radiology and Nuclear Medicine (I.I.), Amsterdam University Medical Center, Amsterdam, the Netherlands; Department of Informatics, Technische Universität Darmstadt, Darmstadt, Germany (A.M.); Department of Radiology, Utrecht University Medical Center, Utrecht, the Netherlands (T.L.); and Institute of Diagnostic and Interventional Radiology, Universitätsklinikum Frankfurt, Frankfurt, Germany (T.J.V., A.M.B.)
- Anirban Mukhopadhyay
- Tim Leiner
- Thomas J Vogl
- Andreas M Bucher
- Ivana Išgum
|
47
|
Shen L, Kann BH, Taylor RA, Shung DL. The Clinician's Guide to the Machine Learning Galaxy. Front Physiol 2021; 12:658583. [PMID: 33889088 PMCID: PMC8056037 DOI: 10.3389/fphys.2021.658583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/26/2021] [Accepted: 03/10/2021] [Indexed: 11/13/2022]
Affiliation(s)
- Lin Shen
- Department of Medicine, Brigham and Women's Hospital, Boston, MA, United States; Division of Gastroenterology, Hepatology and Endoscopy, Brigham and Women's Hospital, Boston, MA, United States
- Benjamin H Kann
- Department of Radiation Oncology, Dana-Farber Cancer Institute/Brigham and Women's Hospital and Harvard Medical School, Boston, MA, United States; Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, MA, United States
- R Andrew Taylor
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
- Dennis L Shung
- Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, CT, United States
|
48
|
Kaur D, Sobiesk M, Patil S, Liu J, Bhagat P, Gupta A, Markuzon N. Application of Bayesian networks to generate synthetic health data. J Am Med Inform Assoc 2021; 28:801-811. [PMID: 33367620 DOI: 10.1093/jamia/ocaa303] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/17/2020] [Accepted: 11/16/2020] [Indexed: 01/08/2023]
Abstract
OBJECTIVE: This study seeks to develop a fully automated method of generating synthetic data from a real dataset that could be employed by medical organizations to distribute health data to researchers, reducing the need for access to real data. We hypothesize the application of Bayesian networks will improve upon the predominant existing method, medGAN, in handling the complexity and dimensionality of healthcare data.
MATERIALS AND METHODS: We employed Bayesian networks to learn probabilistic graphical structures and simulated synthetic patient records from the learned structure. We used the University of California Irvine (UCI) heart disease and diabetes datasets as well as the MIMIC-III diagnoses database. We evaluated our method through statistical tests, machine learning tasks, preservation of rare events, disclosure risk, and the ability of a machine learning classifier to discriminate between the real and synthetic data.
RESULTS: Our Bayesian network model outperformed or equaled medGAN in all key metrics. Notable improvement was achieved in capturing rare variables and preserving association rules.
DISCUSSION: Bayesian networks generated data sufficiently similar to the original data with minimal risk of disclosure, while offering additional transparency, computational efficiency, and capacity to handle more data types in comparison to existing methods. We hope this method will allow healthcare organizations to efficiently disseminate synthetic health data to researchers, enabling them to generate hypotheses and develop analytical tools.
CONCLUSION: We conclude the application of Bayesian networks is a promising option for generating realistic synthetic health data that preserves the features of the original data without compromising data privacy.
Affiliation(s)
- Dhamanpreet Kaur
- Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Matthew Sobiesk
- Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Shubham Patil
- Rochester Institute of Technology, Rochester, New York, USA
- Jin Liu
- Clinical Informatics, Philips Research North America, Cambridge, Massachusetts, USA
- Puran Bhagat
- Clinical Informatics, Philips Research North America, Cambridge, Massachusetts, USA
- Amar Gupta
- Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Natasha Markuzon
- Clinical Informatics, Philips Research North America, Cambridge, Massachusetts, USA
|
49
|
Haendel MA, Chute CG, Bennett TD, Eichmann DA, Guinney J, Kibbe WA, Payne PRO, Pfaff ER, Robinson PN, Saltz JH, Spratt H, Suver C, Wilbanks J, Wilcox AB, Williams AE, Wu C, Blacketer C, Bradford RL, Cimino JJ, Clark M, Colmenares EW, Francis PA, Gabriel D, Graves A, Hemadri R, Hong SS, Hripcsak G, Jiao D, Klann JG, Kostka K, Lee AM, Lehmann HP, Lingrey L, Miller RT, Morris M, Murphy SN, Natarajan K, Palchuk MB, Sheikh U, Solbrig H, Visweswaran S, Walden A, Walters KM, Weber GM, Zhang XT, Zhu RL, Amor B, Girvin AT, Manna A, Qureshi N, Kurilla MG, Michael SG, Portilla LM, Rutter JL, Austin CP, Gersing KR. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J Am Med Inform Assoc 2021; 28:427-443. [PMID: 32805036 PMCID: PMC7454687 DOI: 10.1093/jamia/ocaa196] [Citation(s) in RCA: 304] [Impact Index Per Article: 101.3] [Received: 07/14/2020] [Accepted: 08/14/2020] [Indexed: 01/12/2023]
Abstract
Objective: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers.
Materials and Methods: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics.
Results: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access.
Conclusions: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19.
Affiliation(s)
- Melissa A Haendel
- Oregon Clinical and Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA; Translational and Integrative Sciences Center, Department of Molecular Toxicology, Oregon State University, Corvallis, Oregon, USA
- Christopher G Chute
- Schools of Medicine, Public Health, and Nursing, Johns Hopkins University, Baltimore, Maryland, USA
- Tellen D Bennett
- Section of Informatics and Data Science, Department of Pediatrics, University of Colorado School of Medicine, University of Colorado, Aurora, Colorado, USA
- David A Eichmann
- School of Library and Information Science, The University of Iowa, Iowa City, Iowa, USA
- Philip R O Payne
- Institute for Informatics, Washington University in St. Louis, Saint Louis, Missouri, USA
- Emily R Pfaff
- North Carolina Translational and Clinical Sciences Institute (NC TraCS), University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Joel H Saltz
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, New York, USA
- Heidi Spratt
- University of Texas Medical Branch, Galveston, Texas, USA
- Andrew E Williams
- Tufts Medical Center Clinical and Translational Science Institute, Tufts Medical Center, Boston, Massachusetts, USA
- Chunlei Wu
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
- Clair Blacketer
- Janssen Research and Development, LLC, Raritan, New Jersey, USA
- Robert L Bradford
- North Carolina Translational and Clinical Sciences Institute (NC TraCS), University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- James J Cimino
- University of Alabama-Birmingham, Birmingham, Alabama, USA
- Marshall Clark
- North Carolina Translational and Clinical Sciences Institute (NC TraCS), University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Evan W Colmenares
- Department of Pharmaceutical Outcomes and Policy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Davera Gabriel
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Alexis Graves
- University of Iowa Institute for Clinical and Translational Science, The University of Iowa, Iowa City, Iowa, USA
- Raju Hemadri
- National Center for Advancing Translational Sciences, Bethesda, Maryland, USA
- Stephanie S Hong
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Dazhi Jiao
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Adam M Lee
- University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Harold P Lehmann
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Robert T Miller
- Tufts Clinical and Translational Science Institute, Tufts University, Boston, Massachusetts, USA
- Michele Morris
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Usman Sheikh
- National Center for Advancing Translational Sciences, Bethesda, Maryland, USA
- Harold Solbrig
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Shyam Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Anita Walden
- Oregon Clinical and Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA; Sage Bionetworks, Seattle, Washington, USA
- Kellie M Walters
- North Carolina Translational and Clinical Sciences Institute (NC TraCS), University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Richard L Zhu
- Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Amin Manna
- Palantir Technologies, Palo Alto, California, USA
- Michael G Kurilla
- Division of Clinical Innovation, National Center for Advancing Translational Sciences, Bethesda, Maryland, USA
- Sam G Michael
- National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, Maryland, USA
- Lili M Portilla
- Office of Strategic Alliances, National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, Maryland, USA
- Joni L Rutter
- Office of the Director, National Center for Advancing Translational Sciences, Bethesda, Maryland, USA
- Christopher P Austin
- National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, Maryland, USA
- Ken R Gersing
- National Center for Advancing Translational Sciences, Bethesda, Maryland, USA
|
50
|
Zhang Z, Yan C, Mesa DA, Sun J, Malin BA. Ensuring electronic medical record simulation through better training, modeling, and evaluation. J Am Med Inform Assoc 2020; 27:99-108. [PMID: 31592533 DOI: 10.1093/jamia/ocz161] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 06/14/2019] [Revised: 07/29/2019] [Accepted: 08/15/2019] [Indexed: 12/15/2022]
Abstract
OBJECTIVE: Electronic medical records (EMRs) can support medical research and discovery, but privacy risks limit the sharing of such data on a wide scale. Various approaches have been developed to mitigate risk, including record simulation via generative adversarial networks (GANs). While showing promise in certain application domains, GANs lack a principled approach for EMR data, which leads to subpar simulation. In this article, we improve EMR simulation through a novel pipeline that (1) enhances the learning model, (2) incorporates evaluation criteria for data utility that inform learning, and (3) refines the training process.
MATERIALS AND METHODS: We propose a new electronic health record generator using a GAN with a Wasserstein divergence and layer normalization techniques. We designed 2 utility measures to characterize similarity in the structural properties of real and simulated EMRs in the original and latent space, respectively. We applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts. We evaluated the new and existing GANs with utility and privacy measures (membership and disclosure attacks) using billing codes from over 1 million EMRs at Vanderbilt University Medical Center.
RESULTS: The proposed model outperformed the state-of-the-art approaches with significant improvement in retaining the nature of real records, including prediction performance and structural properties, without sacrificing privacy. Additionally, the filtering strategy achieved higher utility when the EMR training dataset was small.
CONCLUSIONS: These findings illustrate that EMR simulation through GANs can be substantially improved through more appropriate training, modeling, and evaluation criteria.
Affiliation(s)
- Ziqi Zhang
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Chao Yan
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Diego A Mesa
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jimeng Sun
- College of Computing, Georgia Institute of Technology, Atlanta, Georgia, USA
- Bradley A Malin
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
|