1
Jamal A, Singh S, Qureshi F. Synthetic data as an investigative tool in hypertension and renal diseases research. World J Methodol 2025; 15:98626. [DOI: 10.5662/wjm.v15.i1.98626] [Received: 07/01/2024] [Revised: 08/15/2024] [Accepted: 08/29/2024] [Indexed: 09/29/2024] Open
Abstract
There is a growing body of clinical research on the utility of synthetic data derivatives, an emerging research tool in medicine. In nephrology, clinicians can use machine learning and artificial intelligence as powerful aids in their clinical decision-making while also preserving patient privacy. This is especially important given the epidemiology of chronic kidney disease, renal oncology, and hypertension worldwide. However, there remains a need to create a framework for guidance regarding how to better utilize synthetic data as a practical application in this research.
Affiliation(s)
- Aleena Jamal
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, United States
- Som Singh
- School of Medicine, University of Missouri Kansas City, Kansas, MO 64106, United States
- Fawad Qureshi
- Division of Nephrology and Hypertension, Mayo Clinic, Rochester, MN 55905, United States
2
Álvarez-Chaves H, Spruit M, R-Moreno MD. Improving ED admissions forecasting by using generative AI: An approach based on DGAN. Comput Methods Programs Biomed 2024; 256:108363. [PMID: 39182250 DOI: 10.1016/j.cmpb.2024.108363] [Received: 05/08/2024] [Revised: 07/05/2024] [Accepted: 08/01/2024] [Indexed: 08/27/2024]
Abstract
BACKGROUND AND OBJECTIVE: Generative Deep Learning has emerged in recent years as a significant player in the Artificial Intelligence field. Synthesizing new data while maintaining the features of reality has revolutionized the field of Deep Learning, proving to be particularly useful in contexts where obtaining data is challenging. The objective of this study is to employ the DoppelGANger algorithm, a cutting-edge approach based on Generative Adversarial Networks for time series, to enhance patient admissions forecasting in a hospital Emergency Department.
METHODS: We employed the DoppelGANger algorithm in a sequential methodology, conditioning generated time series with unique attributes to optimize data utilization. After confirming the successful creation of synthetic data with new attribute values, we adopted the Train-Synthetic-Test-Real framework to ensure the reliability of our synthetic data validation. We then augmented the original series with synthetic data to enhance the Prophet model's performance. This process was applied to two datasets derived from the original: one with four years of training followed by one year of testing, and another with three years of training and two years of testing.
RESULTS: The experimental results show that the generative model outperformed Prophet on the forecasting task, improving the SMAPE from 7.30 to 6.99 with the four-year training set, and from 22.84 to 7.41 for the three-year training set, all in daily aggregations. For the data replacement task, the Prophet SMAPE values decreased to 6.84 and 7.18 for four and three-year sets on the same aggregation. Additionally, data augmentation reduced the SMAPE to 6.79 for a one-year test set and achieved 8.56 for the two-year test set, surpassing the performance achieved by the same Prophet model when trained only on real data. Results for the remaining aggregations were consistent.
CONCLUSIONS: The findings of this study suggest that employing a generative algorithm to extend a training dataset can effectively enhance predictive models within the domain of Emergency Department admissions. The improvement can lead to more efficient resource allocation and patient management.
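The SMAPE values quoted in this abstract are symmetric mean absolute percentage errors. As a rough illustration of how such a figure is computed, the sketch below uses one common definition, the mean of |F - A| / ((|A| + |F|) / 2) expressed in percent. The abstract does not state which SMAPE variant the authors used, so this definition is an assumption, and the function name `smape` and the sample admission counts are hypothetical.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent (range 0 to 200).

    Assumed variant: mean of |F - A| / ((|A| + |F|) / 2). Other variants
    exist; the one used in the cited study is not given in the abstract.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Convention chosen here: a point where both values are zero adds no error.
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical daily admission counts and forecasts, for illustration only.
    real_admissions = [120, 135, 128, 140]
    predicted = [118, 130, 133, 137]
    print(f"SMAPE: {smape(real_admissions, predicted):.2f}%")
```

Under a Train-Synthetic-Test-Real setup, the same function would be applied to forecasts from a model trained on synthetic series and scored against held-out real admissions.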
Affiliation(s)
- Marco Spruit
- Leiden University Medical Center, Department of Public Health and Primary Care, 2333 ZA, Leiden, The Netherlands.
- María D R-Moreno
- Universidad de Alcalá, Escuela Politécnica Superior, 28805, Madrid, Spain.
3
Fragkouli SC, Solanki D, Castro LJ, Psomopoulos FE, Queralt-Rosinach N, Cirillo D, Crossman LC. Synthetic data: how could it be used in infectious disease research? Future Microbiol 2024:1-6. [PMID: 39345126 DOI: 10.1080/17460913.2024.2400853] [Received: 07/01/2024] [Accepted: 09/02/2024] [Indexed: 10/01/2024] Open
Affiliation(s)
- Styliani-Christina Fragkouli
- Department of Biology, National & Kapodistrian University of Athens, Athens, 15772, Greece
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thessaloniki, 57001, Greece
- Dhwani Solanki
- ZB MED Information Centre for Life Sciences, Cologne, 50931, Germany
- Leyla J Castro
- ZB MED Information Centre for Life Sciences, Cologne, 50931, Germany
- Fotis E Psomopoulos
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thessaloniki, 57001, Greece
- Núria Queralt-Rosinach
- Department of Human Genetics, Leiden University Medical Center, Leiden, 2333, The Netherlands
- Davide Cirillo
- Barcelona Supercomputing Center (BSC), Barcelona, E-08034, Spain
- Lisa C Crossman
- SequenceAnalysis.co.uk, Norwich Research Park, Norwich, NR4 7UG, UK
- School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
4
Tian M, Chen B, Guo A, Jiang S, Zhang AR. Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models. J Am Med Inform Assoc 2024:ocae229. [PMID: 39222376 DOI: 10.1093/jamia/ocae229] [Received: 12/30/2023] [Revised: 08/04/2024] [Accepted: 08/12/2024] [Indexed: 09/04/2024] Open
Abstract
OBJECTIVE: Electronic health records (EHRs) are rich sources of patient-level data, offering valuable resources for medical data analysis. However, privacy concerns often restrict access to EHRs, hindering downstream analysis. Current EHR deidentification methods are flawed and can lead to potential privacy leakage. Additionally, existing publicly available EHR databases are limited, preventing the advancement of medical research using EHR. This study aims to overcome these challenges by generating realistic and privacy-preserving synthetic EHR time series efficiently.
MATERIALS AND METHODS: We introduce a new method for generating diverse and realistic synthetic EHR time series data using denoising diffusion probabilistic models. We conducted experiments on 6 databases: Medical Information Mart for Intensive Care III and IV, the eICU Collaborative Research Database (eICU), and non-EHR datasets on Stocks and Energy. We compared our proposed method with 8 existing methods.
RESULTS: Our results demonstrate that our approach significantly outperforms all existing methods in terms of data fidelity while requiring less training effort. Additionally, data generated by our method yield a lower discriminative accuracy compared to other baseline methods, indicating the proposed method can generate data with less privacy risk.
DISCUSSION: The proposed model utilizes a mixed diffusion process to generate realistic synthetic EHR samples that protect patient privacy. This method could be useful in tackling data availability issues in the field of healthcare by reducing barriers to EHR access and supporting research in machine learning for health.
CONCLUSION: The proposed diffusion model-based method can reliably and efficiently generate synthetic EHR time series, which facilitates downstream medical data analysis. Our numerical results show the superiority of the proposed method over all other existing methods.
Affiliation(s)
- Muhang Tian
- Department of Computer Science, Duke University, Durham, NC 27708, United States
- Bernie Chen
- Department of Electrical & Computer Engineering, Duke University, Durham, NC 27708, United States
- Allan Guo
- Department of Computer Science, Duke University, Durham, NC 27708, United States
- Shiyi Jiang
- Department of Electrical & Computer Engineering, Duke University, Durham, NC 27708, United States
- Anru R Zhang
- Department of Computer Science, Duke University, Durham, NC 27708, United States
- Department of Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States
5
Koetzier LR, Wu J, Mastrodicasa D, Lutz A, Chung M, Koszek WA, Pratap J, Chaudhari AS, Rajpurkar P, Lungren MP, Willemink MJ. Generating Synthetic Data for Medical Imaging. Radiology 2024; 312:e232471. [PMID: 39254456 DOI: 10.1148/radiol.232471] [Indexed: 09/11/2024]
Abstract
Artificial intelligence (AI) models for medical imaging tasks, such as classification or segmentation, require large and diverse datasets of images. However, due to privacy and ethical issues, as well as data sharing infrastructure barriers, these datasets are scarce and difficult to assemble. Synthetic medical imaging data generated by AI from existing data could address this challenge by augmenting and anonymizing real imaging data. In addition, synthetic data enable new applications, including modality translation, contrast synthesis, and professional training for radiologists. However, the use of synthetic data also poses technical and ethical challenges. These challenges include ensuring the realism and diversity of the synthesized images while keeping data unidentifiable, evaluating the performance and generalizability of models trained on synthetic data, and high computational costs. Since existing regulations are not sufficient to guarantee the safe and ethical use of synthetic images, it becomes evident that updated laws and more rigorous oversight are needed. Regulatory bodies, physicians, and AI developers should collaborate to develop, maintain, and continually refine best practices for synthetic data. This review aims to provide an overview of the current knowledge of synthetic data in medical imaging and highlights current key challenges in the field to guide future research and development.
Affiliation(s)
- Lennart R Koetzier
- Jie Wu
- Domenico Mastrodicasa
- Aline Lutz
- Matthew Chung
- W Adam Koszek
- Jayanth Pratap
- Akshay S Chaudhari
- Pranav Rajpurkar
- Matthew P Lungren
- Martin J Willemink
- From the Delft University of Technology, Delft, the Netherlands (L.R.K.); Segmed, 3790 El Camino Real #810, Palo Alto, CA 94306 (J.W., A.L., M.C., W.A.K., J.P., M.J.W.); Department of Radiology, University of Washington, Seattle, Wash (D.M.); Department of Radiology, OncoRad/Tumor Imaging Metrics Core, Seattle, Wash (D.M.); Harvard University, Cambridge, Mass (J.P.); Department of Radiology, Stanford University School of Medicine, Palo Alto, Calif (A.S.C.); Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, Calif (A.S.C.); Department of Biomedical Informatics, Harvard Medical School, Boston, Mass (P.R.); Microsoft, Redmond, Wash (M.P.L.); and Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, Calif (M.P.L.)
6
Tang CC, Nagesh S, Fussell DA, Glavis-Bloom J, Mishra N, Li C, Cortes G, Hill R, Zhao J, Gordon A, Wright J, Troutt H, Tarrago R, Chow DS. Generating colloquial radiology reports with large language models. J Am Med Inform Assoc 2024:ocae223. [PMID: 39178375 DOI: 10.1093/jamia/ocae223] [Received: 04/22/2024] [Revised: 08/05/2024] [Accepted: 08/08/2024] [Indexed: 08/25/2024] Open
Abstract
OBJECTIVES: Patients are increasingly being given direct access to their medical records. However, radiology reports are written for clinicians and typically contain medical jargon, which can be confusing. One solution is for radiologists to provide a "colloquial" version that is accessible to the layperson. Because manually generating these colloquial translations would represent a significant burden for radiologists, a way to automatically produce accurate, accessible patient-facing reports is desired. We propose a novel method to produce colloquial translations of radiology reports by providing specialized prompts to a large language model (LLM).
MATERIALS AND METHODS: Our method automatically extracts and defines medical terms and includes their definitions in the LLM prompt. Using our method and a naive strategy, translations were generated at 4 different reading levels for 100 de-identified neuroradiology reports from an academic medical center. Translations were evaluated by a panel of radiologists for accuracy, likability, harm potential, and readability.
RESULTS: Our approach translated the Findings and Impression sections at the 8th-grade level with accuracies of 88% and 93%, respectively. Across all grade levels, our approach was 20% more accurate than the baseline method. Overall, translations were more readable than the original reports, as evaluated using standard readability indices.
CONCLUSION: We find that our translations at the eighth-grade level strike an optimal balance between accuracy and readability. Notably, this corresponds to nationally recognized recommendations for patient-facing health communication. We believe that using this approach to draft patient-accessible reports will benefit patients without significantly increasing the burden on radiologists.
Affiliation(s)
- Cynthia Crystal Tang
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Supriya Nagesh
- Amazon Web Services, East Palo Alto, CA 94303, United States
- David A Fussell
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Justin Glavis-Bloom
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Nina Mishra
- Amazon Web Services, East Palo Alto, CA 94303, United States
- Charles Li
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Gillean Cortes
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Robert Hill
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Jasmine Zhao
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Angellica Gordon
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Joshua Wright
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Hayden Troutt
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
- Rod Tarrago
- Amazon Web Services, Seattle, WA 98121, United States
- Daniel S Chow
- Department of Radiological Sciences, University of California, Irvine, Irvine, CA 92868, United States
7
Viswanathan VS, Parmar V, Madabhushi A. Towards equitable AI in oncology. Nat Rev Clin Oncol 2024; 21:628-637. [PMID: 38849530 DOI: 10.1038/s41571-024-00909-8] [Accepted: 05/21/2024] [Indexed: 06/09/2024]
Abstract
Artificial intelligence (AI) stands at the threshold of revolutionizing clinical oncology, with considerable potential to improve early cancer detection and risk assessment, and to enable more accurate personalized treatment recommendations. However, a notable imbalance exists in the distribution of the benefits of AI, which disproportionately favour those living in specific geographical locations and in specific populations. In this Perspective, we discuss the need to foster the development of equitable AI tools that are both accurate in and accessible to a diverse range of patient populations, including those in low-income to middle-income countries. We also discuss some of the challenges and potential solutions in attaining equitable AI, including addressing the historically limited representation of diverse populations in existing clinical datasets and the use of inadequate clinical validation methods. Additionally, we focus on extant sources of inequity including the type of model approach (such as deep learning, and feature engineering-based methods), the implications of dataset curation strategies, the need for rigorous validation across a variety of populations and settings, and the risk of introducing contextual bias that comes with developing tools predominantly in high-income countries.
Affiliation(s)
- Vani Parmar
- Department of Breast Surgical Oncology, Punyashlok Ahilyadevi Holkar Head & Neck Cancer Institute of India, Mumbai, India
- Anant Madabhushi
- Department of Biomedical Engineering, Emory University and Georgia Institute of Technology, Atlanta, GA, USA.
- Atlanta Veterans Administration Medical Center, Atlanta, GA, USA.
8
Bandyopadhyay A, Oks M, Sun H, Prasad B, Rusk S, Jefferson F, Malkani RG, Haghayegh S, Sachdeva R, Hwang D, Agustsson J, Mignot E, Summers M, Fabbri D, Deak M, Anastasi M, Sampson A, Van Hout S, Seixas A. Strengths, weaknesses, opportunities, and threats of using AI-enabled technology in sleep medicine: a commentary. J Clin Sleep Med 2024; 20:1183-1191. [PMID: 38533757 PMCID: PMC11217619 DOI: 10.5664/jcsm.11132] [Received: 03/20/2024] [Accepted: 03/20/2024] [Indexed: 03/28/2024]
Abstract
Over the past few years, artificial intelligence (AI) has emerged as a powerful tool used to efficiently automate several tasks across multiple domains. Sleep medicine is perfectly positioned to leverage this tool due to the wealth of physiological signals obtained through sleep studies or sleep tracking devices and abundance of accessible clinical data through electronic medical records. However, caution must be applied when utilizing AI, due to intrinsic challenges associated with novel technology. The Artificial Intelligence in Sleep Medicine Committee of the American Academy of Sleep Medicine reviews advancements in AI within the sleep medicine field. In this article, the Artificial Intelligence in Sleep Medicine committee members provide a commentary on the scope of AI technology in sleep medicine. The commentary identifies 3 pivotal areas in sleep medicine that can benefit from AI technologies: clinical care, lifestyle management, and population health management. This article provides a detailed analysis of the strengths, weaknesses, opportunities, and threats associated with using AI-enabled technologies in each pivotal area. Finally, the article broadly reviews barriers and challenges associated with using AI-enabled technologies and offers possible solutions. CITATION Bandyopadhyay A, Oks M, Sun H, et al. Strengths, weaknesses, opportunities, and threats of using AI-enabled technology in sleep medicine: a commentary. J Clin Sleep Med. 2024;20(7):1183-1191.
Affiliation(s)
- Anuja Bandyopadhyay
- Department of Pediatrics, Indiana University School of Medicine, Indianapolis, Indiana
- Margarita Oks
- Department of Medicine, Northwell Health System, New York, New York
- Haoqi Sun
- Department of Neurology, Beth Israel Deaconess Medical Center, Boston, Massachusetts
- Bharati Prasad
- Department of Medicine, University of Illinois, Chicago, Illinois
- Sam Rusk
- EnsoData Research, EnsoData, Madison, Wisconsin
- Felicia Jefferson
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada
- Roneil Gopal Malkani
- Department of Neurology, Northwestern University Feinberg School of Medicine, Chicago, Illinois
- Neurology Service, Jesse Brown Veterans Affairs Medical Center, Chicago, Illinois
- Shahab Haghayegh
- Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts
- Ramesh Sachdeva
- Children’s Hospital of Michigan and Central Michigan University College of Medicine, Detroit, Michigan
- Dennis Hwang
- Kaiser Permanente Southern California, Los Angeles, California
- Emmanuel Mignot
- Stanford University, School of Medicine, Stanford, California
- Michael Summers
- Division of Pulmonary, Critical Care, and Sleep Medicine, University of Nebraska Medical Center, Omaha, Nebraska
- Azizi Seixas
- Department of Informatics and Health Data Science, University of Miami Miller School of Medicine, Miami, Florida
9
Dyachkova Y, Dunger-Baldauf C, Barbier N, Devenport J, Franzén S, Kazeem G, Künzel T, Mancini P, Mordenti G, Richert K, Ridolfi A, Saure D. Do You Want to Stay Single? Considerations on Single-Arm Trials in Drug Development and the Postregulatory Space. Pharm Stat 2024. [PMID: 38923796 DOI: 10.1002/pst.2412] [Received: 12/06/2023] [Revised: 04/03/2024] [Accepted: 05/23/2024] [Indexed: 06/28/2024]
Abstract
Single-arm trials (SATs), while not preferred, remain in use throughout the drug development cycle. They may be accepted by regulators in particular contexts (e.g., in oncology or rare diseases) when the potential effects of new treatments are very large and placebo treatment is unethical. However, in the postregulatory space, SATs are common, and perhaps even more poorly suited to address the questions of interest. In this manuscript, we review regulatory and health technology assessment (HTA) positions on SATs; challenges posed by SATs in addressing research questions beyond the regulatory setting; evolving statistical methods to provide context for SATs; case studies where SATs could and could not address questions of interest; and communication strategies to influence decision making and optimize study design to address evidence needs.
Affiliation(s)
- Daniel Saure
- Boehringer Ingelheim Europe GmbH, Ingelheim, Germany
10
Akiya I, Ishihara T, Yamamoto K. Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study. JMIR Med Inform 2024; 12:e55118. [PMID: 38889082 PMCID: PMC11196245 DOI: 10.2196/55118] [Received: 12/03/2023] [Revised: 04/06/2024] [Accepted: 05/08/2024] [Indexed: 05/24/2024] Open
Abstract
Background Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and conditional tabular generative adversarial network (CTGAN), have been used for this purpose, but their performance in reflecting actual patient survival data remains under investigation. Objective The aim of this study was to determine the most suitable SPD generation method for oncology trials, specifically focusing on both progression-free survival (PFS) and overall survival (OS), which are the primary evaluation end points in oncology trials. To achieve this goal, we conducted a comparative simulation of 4 generation methods, including CART, RF, BN, and the CTGAN, and the performance of each method was evaluated. Methods Using multiple clinical trial data sets, 1000 data sets were generated by using each method for each clinical trial data set and evaluated as follows: (1) median survival time (MST) of PFS and OS; (2) hazard ratio distance (HRD), which indicates the similarity between the actual survival function and a synthetic survival function; and (3) visual analysis of Kaplan-Meier (KM) plots. Each method's ability to mimic the statistical properties of real patient data was evaluated from these multiple angles. Results In most simulation cases, CART demonstrated high percentages of MSTs for synthetic data falling within the 95% CI range of the MST of the actual data. These percentages ranged from 88.8% to 98.0% for PFS and from 60.8% to 96.1% for OS. In the evaluation of HRD, CART revealed that HRD values were concentrated at approximately 0.9. Conversely, for the other methods, no consistent trend was observed for either PFS or OS.
CART demonstrated better similarity than RF because CART overfit the data, whereas RF (an ensemble learning approach) prevented overfitting. In SPD generation, the statistical properties close to the actual data should be the focus, not a well-generalized prediction model. Neither the BN nor the CTGAN method could accurately reflect the statistical properties of the actual data because these methods are not suited to small data sets. Conclusions As a method for generating SPD for survival data from small data sets, such as clinical trial data, CART proved to be the most effective method compared with RF, BN, and CTGAN. Additionally, it is possible to improve CART-based generation methods by incorporating feature engineering and other methods in future work.
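The first evaluation criterion in this abstract, the share of synthetic median survival times (MSTs) that fall within the 95% CI of the actual MST, can be illustrated with a short sketch. This is not the authors' code: the function name `fraction_within_ci` and the toy values are hypothetical, and the MSTs and CI bounds are assumed to have been computed beforehand (for example, from Kaplan-Meier estimates).

```python
def fraction_within_ci(synthetic_medians, ci_low, ci_high):
    """Return the fraction of synthetic MSTs inside [ci_low, ci_high].

    synthetic_medians: one median survival time per synthetic data set.
    ci_low, ci_high: bounds of the 95% CI of the actual MST (assumed given).
    """
    if not synthetic_medians:
        raise ValueError("at least one synthetic median is required")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    inside = sum(ci_low <= m <= ci_high for m in synthetic_medians)
    return inside / len(synthetic_medians)


# Hypothetical toy values (months), not taken from the study.
toy_medians = [10.2, 11.0, 12.5, 9.1]
toy_fraction = fraction_within_ci(toy_medians, ci_low=9.5, ci_high=12.0)
```

In the study design described above, such a fraction would be computed over the 1000 synthetic data sets generated per method and per trial, separately for PFS and OS.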
Affiliation(s)
- Ippei Akiya
- Biometrics, ICON Clinical Research GK, Tokyo, Japan
| | - Takuma Ishihara
- Innovative and Clinical Research Promotion Center, Gifu University Hospital, Gifu, Japan
| | - Keiichi Yamamoto
- Division of Data Science, Center for Industrial Research and Innovation, Translational Research Institute for Medical Innovation, Osaka Dental University, Osaka, Japan
| |
|
11
|
Moukheiber D, Restrepo D, Cajas SA, Montoya MPA, Celi LA, Kuo KT, López DM, Moukheiber L, Moukheiber M, Moukheiber S, Osorio-Valencia JS, Purkayastha S, Paddo AR, Wu C, Kuo PC. A multimodal framework for extraction and fusion of satellite images and public health data. Sci Data 2024; 11:634. [PMID: 38879585 PMCID: PMC11180113 DOI: 10.1038/s41597-024-03366-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 05/10/2024] [Indexed: 06/19/2024] Open
Abstract
In low- and middle-income countries, the substantial costs associated with traditional data collection pose an obstacle to facilitating decision-making in the field of public health. Satellite imagery offers a potential solution, but image extraction and analysis can be costly and require specialized expertise. We introduce SatelliteBench, a scalable framework for satellite image extraction and vector embeddings generation. We also propose a novel multimodal fusion pipeline that utilizes a series of satellite imagery and metadata. The framework was evaluated by generating a dataset with a collection of 12,636 images and embeddings accompanied by comprehensive metadata, from 81 municipalities in Colombia between 2016 and 2018. The dataset was then evaluated in 3 tasks: dengue case prediction, poverty assessment, and access to education. The performance showcases the versatility and practicality of SatelliteBench, offering a reproducible, accessible and open tool to enhance decision-making in public health.
Affiliation(s)
- Dana Moukheiber
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - David Restrepo
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
- Departamento de Telemática, Universidad del Cauca, Popayán, Cauca, Colombia.
| | - Sebastián Andrés Cajas
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Boston, Massachusetts, USA
- School of Computer Science, University College Dublin, Dublin, Ireland
| | - Leo Anthony Celi
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Kuan-Ting Kuo
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Diego M López
- Departamento de Telemática, Universidad del Cauca, Popayán, Cauca, Colombia
| | - Lama Moukheiber
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Mira Moukheiber
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Sulaiman Moukheiber
- Department of Computer Science, Worcester Polytechnic Institute, Worcester, Massachusetts, USA
| | - Saptarshi Purkayastha
- Department of BioHealth Informatics, Indiana University Luddy School of Informatics, Computing, and Engineering, Indianapolis, Indiana, USA
| | - Atika Rahman Paddo
- Department of BioHealth Informatics, Indiana University Luddy School of Informatics, Computing, and Engineering, Indianapolis, Indiana, USA
| | - Chenwei Wu
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
| | - Po-Chih Kuo
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan.
| |
|
12
|
Nortje N, Palmer A, Enck G, Masciari CF, Neumann J, Gallagher CM. Evolving Landscape of Ethics in Oncology: A Journey Through the Past, Present, and Future. Am Soc Clin Oncol Educ Book 2024; 44:e100043. [PMID: 38788171 DOI: 10.1200/edbk_100043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
Providing a brief overview of past, present, and future ethics issues in oncology, this article begins with historical contexts, including the paternalistic approach to cancer care. It delves into present-day challenges such as navigating cancer treatment during pregnancy and addressing health care disparities faced by LGBTQ+ individuals. It also explores the ethical implications of emerging technologies, notably artificial intelligence and Big Data, in clinical decision making and medical education.
Affiliation(s)
- Nico Nortje
- University of Texas MD Anderson Cancer Center, Houston, TX
| | - Amitabha Palmer
- Department of Critical Care Medicine, Section of Integrated Ethics, The University of Texas MD Anderson Cancer Center, Houston, TX
| | - Gavin Enck
- Department of Critical Care Medicine, Section of Integrated Ethics, The University of Texas MD Anderson Cancer Center, Houston, TX
| | - Christopher Frank Masciari
- Department of Critical Care Medicine, Section of Integrated Ethics, The University of Texas MD Anderson Cancer Center, Houston, TX
| | - Joyce Neumann
- Department of Critical Care Medicine, Section of Integrated Ethics, The University of Texas MD Anderson Cancer Center, Houston, TX
| | - Colleen Mary Gallagher
- Department of Critical Care Medicine, Section of Integrated Ethics, The University of Texas MD Anderson Cancer Center, Houston, TX
| |
|
13
|
Jeanson F, Farkouh ME, Godoy LC, Minha S, Tzuman O, Marcus G. Medical calculators derived synthetic cohorts: a novel method for generating synthetic patient data. Sci Rep 2024; 14:11437. [PMID: 38763934 PMCID: PMC11102910 DOI: 10.1038/s41598-024-61721-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 05/08/2024] [Indexed: 05/21/2024] Open
Abstract
This study shows that we can use synthetic cohorts created from medical risk calculators to gain insights into how risk estimations, clinical reasoning, data-driven subgrouping, and the confidence in risk calculator scores are connected. When prediction variables aren't evenly distributed in these synthetic cohorts, they can be used to group similar cases together, revealing new insights about how cohorts behave. We also found that the confidence in predictions made by these calculators can vary depending on patient characteristics. This suggests that it might be beneficial to include a "normalized confidence" score in future versions of these calculators for healthcare professionals. We plan to explore this idea further in our upcoming research.
Affiliation(s)
- Michael E Farkouh
- Peter Munk Cardiac Centre and Heart and Stroke Richard Lewar Centre, University of Toronto, Toronto, Canada
| | - Lucas C Godoy
- Peter Munk Cardiac Centre and Heart and Stroke Richard Lewar Centre, University of Toronto, Toronto, Canada
| | - Sa'ar Minha
- Department of Cardiology, Shamir Medical Center, Zeriffin, Israel
- Tel Aviv University Faculty of Medicine, Tel Aviv, Israel
| | - Oran Tzuman
- Department of Cardiology, Shamir Medical Center, Zeriffin, Israel
- Tel Aviv University Faculty of Medicine, Tel Aviv, Israel
| | - Gil Marcus
- Department of Cardiology, Shamir Medical Center, Zeriffin, Israel
- Tel Aviv University Faculty of Medicine, Tel Aviv, Israel
| |
|
14
|
Pickering JW, Young JM, George PM, Watson AS, Aldous SJ, Verryt T, Troughton RW, Pemberton CJ, Richards AM, Cullen LA, Apple FS, Than MP. Derivation and Validation of Thresholds Using Synthetic Data Methods for Single-Test Screening of Emergency Department Patients with Possible Acute Myocardial Infarction Using a Point-of-Care Troponin Assay. J Appl Lab Med 2024; 9:526-539. [PMID: 38442340 DOI: 10.1093/jalm/jfae001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 11/17/2023] [Indexed: 03/07/2024]
Abstract
BACKGROUND Single-sample (screening) rule-out of acute myocardial infarction (AMI) with troponin requires derivation of a single-test screening threshold. In data sets with small event numbers, the lowest one or two concentrations of myocardial infarction (MI) patients dictate the threshold. This is not optimal. We aimed to demonstrate a process incorporating both real and synthetic data for deriving such thresholds using a novel pre-production high-precision point-of-care assay. METHODS cTnI concentrations were measured from thawed plasma using the Troponin I Next (TnI-Nx) assay (i-STAT; Abbott) in adults on arrival to the emergency department with symptoms suggestive of AMI. The primary outcome was an AMI or cardiac death within 30 days. We used internal-external validation with synthetic data production based on clinical and demographic data, plus the measured TnI-Nx concentration, to derive and validate decision thresholds for TnI-Nx. The target low-risk threshold was a sensitivity of 99% and a high-risk threshold specificity of >95%. RESULTS In total, 1356 patients were included, of whom 191 (14.1%) had the primary outcome. A total of 500 synthetic data sets were constructed. The mean low-risk threshold was determined to be 5 ng/L. This categorized 38% (95% CI, 6%-68%) to low-risk with a sensitivity of 99.0% (95% CI, 98.6%-99.5%) and a negative predictive value of 99.4% (95% CI, 97.6%-99.8%). A similarly derived high-risk threshold of 25 ng/L had a specificity of 95.0% (95% CI, 94.8%-95.1%) and a positive predictive value of 74.8% (95% CI, 71.5%-78.0%). CONCLUSIONS With the TnI-Nx assay, we successfully demonstrated an approach using synthetic data generation to derive low-risk thresholds for safe and effective screening.
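The performance measures reported in this abstract (sensitivity, specificity, negative and positive predictive value at a chosen threshold) follow from a 2x2 table of test result against outcome. The sketch below shows that arithmetic only; it is not the authors' derivation procedure. The function name `screening_metrics`, the toy data, and the convention that a concentration at or above the threshold counts as a positive test are all assumptions made for illustration.

```python
import math


def screening_metrics(concentrations, outcomes, threshold):
    """Compute 2x2-table metrics for a single-threshold screening rule.

    Assumption: a test is "positive" when concentration >= threshold;
    patients below the threshold are treated as low-risk (test negative).
    outcomes: True if the patient had the primary outcome, else False.
    """
    tp = fp = tn = fn = 0
    for conc, event in zip(concentrations, outcomes):
        positive = conc >= threshold
        if positive and event:
            tp += 1
        elif positive and not event:
            fp += 1
        elif not positive and event:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical toy data (ng/L), not taken from the study.
toy_conc = [2, 3, 4, 6, 8, 30, 40, 50]
toy_event = [False, False, False, False, True, True, False, True]
toy_metrics = screening_metrics(toy_conc, toy_event, threshold=5)
```

In a workflow like the one described, such metrics would be recomputed on each synthetic data set and summarized across all of them to obtain a stable threshold.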
Affiliation(s)
- John W Pickering
- Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand
- Christchurch Heart Institute, University of Otago Christchurch, Christchurch, New Zealand
| | - Joanna M Young
- Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand
| | - Antony S Watson
- Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand
| | - Sally J Aldous
- Cardiology Department, Christchurch Hospital, Christchurch, New Zealand
| | - Toby Verryt
- Cardiology Department, Christchurch Hospital, Christchurch, New Zealand
| | - Richard W Troughton
- Christchurch Heart Institute, University of Otago Christchurch, Christchurch, New Zealand
- Cardiology Department, Christchurch Hospital, Christchurch, New Zealand
| | - A Mark Richards
- Christchurch Heart Institute, University of Otago Christchurch, Christchurch, New Zealand
- Cardiovascular Research Institute, National University of Singapore, Singapore
| | - Louise A Cullen
- Emergency Department, Royal Brisbane and Women's Hospital, Brisbane, Australia
| | - Fred S Apple
- Department of Laboratory Medicine and Pathology, Hennepin County Medical Center of Hennepin Healthcare and University of Minnesota Minneapolis, Minneapolis, MN, United States
| | - Martin P Than
- Department of Emergency Medicine, Christchurch Hospital, Christchurch, New Zealand
| |
|
15
|
Shanley D, Hogenboom J, Lysen F, Wee L, Lobo Gomes A, Dekker A, Meacham D. Getting real about synthetic data ethics: Are AI ethics principles a good starting point for synthetic data ethics? EMBO Rep 2024; 25:2152-2155. [PMID: 38388694 PMCID: PMC11094102 DOI: 10.1038/s44319-024-00101-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 02/13/2024] [Indexed: 02/24/2024] Open
Abstract
Synthetic data promises to be a viable alternative when data collection and data sharing may not be feasible or cost effective, but it raises distinct ethical issues that merit serious consideration.
Affiliation(s)
- Flora Lysen
- Maastricht University, Maastricht, The Netherlands
| | - Leonard Wee
- Maastricht University, Maastricht, The Netherlands
| | - Andre Dekker
- Maastricht University, Maastricht, The Netherlands
| |
|
16
|
Yan C, Zhang Z, Nyemba S, Li Z. Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial. JMIR AI 2024; 3:e52615. [PMID: 38875595 PMCID: PMC11074891 DOI: 10.2196/52615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 01/24/2024] [Accepted: 03/07/2024] [Indexed: 06/16/2024]
Abstract
Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.
Affiliation(s)
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Ziqi Zhang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Zhuohang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| |
|
17
|
Naik K, Goyal RK, Foschini L, Chak CW, Thielscher C, Zhu H, Lu J, Lehár J, Pacanoswki MA, Terranova N, Mehta N, Korsbo N, Fakhouri T, Liu Q, Gobburu J. Current Status and Future Directions: The Application of Artificial Intelligence/Machine Learning for Precision Medicine. Clin Pharmacol Ther 2024; 115:673-686. [PMID: 38103204 DOI: 10.1002/cpt.3152] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/16/2023] [Accepted: 11/28/2023] [Indexed: 12/18/2023]
Abstract
Technological innovations, such as artificial intelligence (AI) and machine learning (ML), have the potential to expedite the goal of precision medicine, especially when combined with increased capacity for voluminous data from multiple sources and expanded therapeutic modalities; however, they also present several challenges. In this communication, we first discuss the goals of precision medicine, and contextualize the use of AI in precision medicine by showcasing innovative applications (e.g., prediction of tumor growth and overall survival, biomarker identification using biomedical images, and identification of patient population for clinical practice) which were presented during the February 2023 virtual public workshop entitled "Application of Artificial Intelligence and Machine Learning for Precision Medicine," hosted by the US Food and Drug Administration (FDA) and University of Maryland Center of Excellence in Regulatory Science and Innovation (M-CERSI). Next, we put forward challenges brought about by the multidisciplinary nature of AI, particularly highlighting the need for AI to be trustworthy. To address such challenges, we subsequently note practical approaches, viz., differential privacy, synthetic data generation, and federated learning. The proposed strategies - some of which are highlighted presentations from the workshop - are for the protection of personal information and intellectual property. In addition, methods such as the risk-based management approach and the need for an agile regulatory ecosystem are discussed. Finally, we lay out a call for action that includes sharing of data and algorithms, development of regulatory guidance documents, and pooling of expertise from a broad-spectrum of stakeholders to enhance the application of AI in precision medicine.
Affiliation(s)
- Kunal Naik
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - Rahul K Goyal
- Center for Translational Medicine, University of Maryland School of Pharmacy, Baltimore, Maryland, USA
| | - Hao Zhu
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - James Lu
- Modeling & Simulation/Clinical Pharmacology, Genentech Inc., South San Francisco, California, USA
| | - Michael A Pacanoswki
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - Nadia Terranova
- Quantitative Pharmacology, Ares Trading S.A. (an affiliate of Merck KGaA, Darmstadt, Germany), Lausanne, Switzerland
| | - Neha Mehta
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - Tala Fakhouri
- Office of Medical Policy, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - Qi Liu
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
| | - Jogarao Gobburu
- Center for Translational Medicine, University of Maryland School of Pharmacy, Baltimore, Maryland, USA
| |
|
18
|
Eckardt JN, Hahn W, Röllig C, Stasik S, Platzbecker U, Müller-Tidow C, Serve H, Baldus CD, Schliemann C, Schäfer-Eckart K, Hanoun M, Kaufmann M, Burchert A, Thiede C, Schetelig J, Sedlmayr M, Bornhäuser M, Wolfien M, Middeke JM. Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. NPJ Digit Med 2024; 7:76. [PMID: 38509224 PMCID: PMC10954666 DOI: 10.1038/s41746-024-01076-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2023] [Accepted: 03/07/2024] [Indexed: 03/22/2024] Open
Abstract
Clinical research relies on high-quality patient data; however, obtaining big data sets is costly and access to existing data is often hindered by privacy and regulatory concerns. Synthetic data generation holds the promise of effectively bypassing these boundaries, allowing for simplified data accessibility and the prospect of synthetic control cohorts. We employed two different methodologies of generative artificial intelligence - CTAB-GAN+ and normalizing flows (NFlow) - to synthesize patient data derived from 1606 patients with acute myeloid leukemia, a heterogeneous hematological malignancy, who were treated within four multicenter clinical trials. Both generative models accurately captured distributions of demographic, laboratory, molecular and cytogenetic variables, as well as patient outcomes, yielding high performance scores regarding fidelity and usability of both synthetic cohorts (n = 1606 each). Survival analysis demonstrated close resemblance of survival curves between original and synthetic cohorts. Inter-variable relationships were preserved in univariable outcome analysis, enabling explorative analysis in our synthetic data. Additionally, training sample privacy is safeguarded, mitigating possible patient re-identification, which we quantified using Hamming distances. We provide not only a proof-of-concept for synthetic data generation in multimodal clinical data for rare diseases, but also full public access to synthetic data sets to foster further research.
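The abstract mentions quantifying re-identification risk with Hamming distances. One common way to use this metric is to measure, for each synthetic record, how many fields differ from its closest real record; a distance of 0 means the synthetic record duplicates a real one. The sketch below illustrates that idea under stated assumptions and is not the authors' pipeline: the function names, the categorical encoding, and the toy records are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    return sum(x != y for x, y in zip(a, b))


def nearest_real_distances(synthetic, real):
    """For each synthetic record, the Hamming distance to its closest real record."""
    if not real:
        raise ValueError("at least one real record is required")
    return [min(hamming(s, r) for r in real) for s in synthetic]


# Hypothetical toy records: (sex, risk group, mutation status).
toy_real = [("M", "high", "mut"), ("F", "low", "wt")]
toy_synth = [("M", "high", "mut"), ("F", "high", "wt")]
toy_distances = nearest_real_distances(toy_synth, toy_real)
```

Here the first synthetic record is an exact copy of a real record (distance 0), which is the case a privacy check of this kind is meant to flag. Continuous variables would need to be binned or handled with a different distance before this comparison applies.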
Affiliation(s)
- Jan-Niklas Eckardt
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany.
- Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany.
| | - Waldemar Hahn
- Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig, Germany
- Institute for Medical Informatics and Biometry, Technical University Dresden, Dresden, Germany
| | - Christoph Röllig
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Sebastian Stasik
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Uwe Platzbecker
- Medical Clinic and Policlinic I Hematology and Cell Therapy, University Hospital, Leipzig, Germany
| | - Hubert Serve
- Department of Medicine 2, Hematology and Oncology, Goethe University Frankfurt, Frankfurt, Germany
| | - Claudia D Baldus
- Department of Hematology and Oncology, University Hospital Schleswig Holstein, Kiel, Germany
| | - Kerstin Schäfer-Eckart
- Department of Internal Medicine V, Paracelsus Medizinische Privatuniversität and University Hospital Nürnberg, Nürnberg, Germany
| | - Maher Hanoun
- Department of Hematology, University Hospital Essen, Essen, Germany
| | - Martin Kaufmann
- Department of Hematology, Oncology and Palliative Care, Robert-Bosch-Hospital, Stuttgart, Germany
| | - Andreas Burchert
- Department of Hematology, Oncology and Immunology, Philipps-University-Marburg, Marburg, Germany
| | - Christian Thiede
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Johannes Schetelig
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Martin Sedlmayr
- Institute for Medical Informatics and Biometry, Technical University Dresden, Dresden, Germany
| | - Martin Bornhäuser
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
- German Consortium for Translational Cancer Research DKFZ, Heidelberg, Germany
- National Center for Tumor Diseases (NCT), Dresden, Germany
| | - Markus Wolfien
- Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig, Germany
- Institute for Medical Informatics and Biometry, Technical University Dresden, Dresden, Germany
| | - Jan Moritz Middeke
- Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany
- Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
| |
|
19
|
Liu X, Reigle J, Prasath VBS, Dhaliwal J. Artificial intelligence image-based prediction models in IBD exhibit high risk of bias: A systematic review. Comput Biol Med 2024; 171:108093. [PMID: 38354499 DOI: 10.1016/j.compbiomed.2024.108093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 01/04/2024] [Accepted: 01/30/2024] [Indexed: 02/16/2024]
Abstract
BACKGROUND There has been an increase in the development of both machine learning (ML) and deep learning (DL) prediction models in Inflammatory Bowel Disease. We aim in this systematic review to assess the methodological quality and risk of bias of ML and DL IBD image-based prediction studies. METHODS We searched three databases, PubMed, Scopus and Embase, to identify ML and DL diagnostic or prognostic predictive models using imaging data in IBD, to Dec 31, 2022. We restricted our search to include studies that primarily used conventional imaging data, were undertaken in human participants, and published in English. Two reviewers independently reviewed the abstracts. The methodological quality of the studies was determined, and risk of bias evaluated using the prediction risk of bias assessment tool (PROBAST). RESULTS Forty studies were included, thirty-nine developed diagnostic models. Seven studies utilized ML approaches, six were retrospective and none used multicenter data for model development. Thirty-three studies utilized DL approaches, ten were prospective, and twelve multicenter studies. Overall, all studies demonstrated high risk of bias. ML studies were evaluated in 4 domains all rated as high risk of bias: participants (6/7), predictors (1/7), outcome (3/7), and analysis (7/7), and DL studies evaluated in 3 domains: participants (24/33), outcome (10/33), and analysis (18/33). The majority of image-based studies used colonoscopy images. CONCLUSION The risk of bias was high in AI IBD image-based prediction models, owing to insufficient sample size, unreported missingness and lack of an external validation cohort. Models with a high risk of bias are unlikely to be generalizable and suitable for clinical implementation.
Affiliation(s)
- Xiaoxuan Liu
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, USA; Department of Pediatrics, University of Cincinnati, College of Medicine, Cincinnati, OH, USA
| | - James Reigle
- Department of Pediatrics, University of Cincinnati, College of Medicine, Cincinnati, OH, USA; Cincinnati Children's Hospital Medical Center, Division of Gastroenterology, Hepatology and Nutrition, USA
| | - V B Surya Prasath
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, USA; Department of Pediatrics, University of Cincinnati, College of Medicine, Cincinnati, OH, USA; Cincinnati Children's Hospital Medical Center, Division of Gastroenterology, Hepatology and Nutrition, USA
| | - Jasbir Dhaliwal
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, USA; Department of Pediatrics, University of Cincinnati, College of Medicine, Cincinnati, OH, USA; Cincinnati Children's Hospital Medical Center, Division of Gastroenterology, Hepatology and Nutrition, USA.
| |
|
20
|
Umer F, Adnan N. Generative artificial intelligence: synthetic datasets in dentistry. BDJ Open 2024; 10:13. [PMID: 38429258 PMCID: PMC10907705 DOI: 10.1038/s41405-024-00198-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 02/15/2024] [Accepted: 02/16/2024] [Indexed: 03/03/2024] Open
Abstract
INTRODUCTION Artificial Intelligence (AI) algorithms, particularly Deep Learning (DL) models are known to be data intensive. This has increased the demand for digital data in all domains of healthcare, including dentistry. The main hindrance in the progress of AI is access to diverse datasets which train DL models ensuring optimal performance, comparable to subject experts. However, administration of these traditionally acquired datasets is challenging due to privacy regulations and the extensive manual annotation required by subject experts. Biases such as ethical, socioeconomic and class imbalances are also incorporated during the curation of these datasets, limiting their overall generalizability. These challenges prevent their accrual at a larger scale for training DL models. METHODS Generative AI techniques can be useful in the production of Synthetic Datasets (SDs) that can overcome issues affecting traditionally acquired datasets. Variational autoencoders, generative adversarial networks and diffusion models have been used to generate SDs. The following text is a review of these generative AI techniques and their operations. It discusses the chances of SDs and challenges with potential solutions which will improve the understanding of healthcare professionals working in AI research. CONCLUSION Synthetic data customized to the need of researchers can be produced to train robust AI models. These models, having been trained on such a diverse dataset will be applicable for dissemination across countries. However, there is a need for the limitations associated with SDs to be better understood, and attempts made to overcome those concerns prior to their widespread use.
Affiliation(s)
- Fahad Umer
- Operative Dentistry and Endodontics, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
- Niha Adnan
- Operative Dentistry and Endodontics, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
21
de-la-Torre R, Oña ED, Victores JG, Jardón A. SpasticSim: a synthetic data generation method for upper limb spasticity modelling in neurorehabilitation. Sci Rep 2024; 14:1646. [PMID: 38238475 PMCID: PMC10796340 DOI: 10.1038/s41598-024-51993-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 01/11/2024] [Indexed: 01/22/2024] Open
Abstract
In neurorehabilitation, assessment of functional problems is essential to define optimal rehabilitation treatments. Usually, this assessment process requires distinguishing between impaired and non-impaired behavior of limbs. One of the common muscle motor disorders affecting limbs is spasticity, which is complicated to quantify objectively due to the complex nature of motor control. Thus, the lack of heterogeneous patient samples constituting an acceptable amount of data is an obstacle to understanding the behavior of spasticity and, consequently, to quantifying it. In this article, we use the 3D creation suite Blender combined with the MBLab add-on to generate synthetic samples of human body models, aiming to make them as representative as possible of real human samples. Exporting these samples to OpenSim and performing four specific upper limb movements, we analyze the muscle behavior by simulating the six degrees of spasticity contemplated by the Modified Ashworth Scale (MAS). The complete dataset of patients and movements is open-source and available for future research. This approach advocates the potential to generate synthetic data for testing and validating musculoskeletal models.
Affiliation(s)
- Rubén de-la-Torre
- Department of Systems Engineering and Automation, Universidad Carlos III de Madrid, Avda. de la Universidad 30, Leganés, 28911, Madrid, Spain
- Edwin Daniel Oña
- Department of Systems Engineering and Automation, Universidad Carlos III de Madrid, Avda. de la Universidad 30, Leganés, 28911, Madrid, Spain
- Juan G Victores
- Department of Systems Engineering and Automation, Universidad Carlos III de Madrid, Avda. de la Universidad 30, Leganés, 28911, Madrid, Spain
- Alberto Jardón
- Department of Systems Engineering and Automation, Universidad Carlos III de Madrid, Avda. de la Universidad 30, Leganés, 28911, Madrid, Spain
22
Kim H, Jang WS, Sim WS, Kim HS, Choi JE, Baek ES, Park YR, Shin SJ. Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer. JCO Clin Cancer Inform 2024; 8:e2300201. [PMID: 38271642 PMCID: PMC10830088 DOI: 10.1200/cci.23.00201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Revised: 11/19/2023] [Accepted: 12/07/2023] [Indexed: 01/27/2024] Open
Abstract
PURPOSE In artificial intelligence-based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models. MATERIALS AND METHODS A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network-based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method. RESULTS A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Comparison of synthetic data sets with different epsilon parameters from the original data sets showed improved performance >0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures demonstrated that the epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state. CONCLUSION The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.
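The Hellinger distance mentioned in this abstract can be illustrated with a minimal sketch for a single categorical feature. The toy stage labels and the helper names `category_probs` and `hellinger` are assumptions made for illustration only; they are not taken from the study, which compared 93 clinical features.

```python
import math
from collections import Counter

def category_probs(values, categories):
    """Relative frequency of each category in a list of observed values."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions
    with no overlapping support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)

# Hypothetical toy data: disease stage in a real and a synthetic cohort.
stages = ["I", "II", "III", "IV"]
real = ["I", "II", "II", "III", "IV", "II", "III", "I"]
synthetic = ["I", "II", "III", "III", "IV", "II", "III", "I"]

d = hellinger(category_probs(real, stages), category_probs(synthetic, stages))
print(f"Hellinger distance for stage: {d:.3f}")
```

A per-feature distance like this would typically be averaged or inspected across all features; the 0.3 cutoff quoted in the abstract is the study's own criterion, not a universal threshold.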
Affiliation(s)
- Hyunwook Kim
- Division of Medical Oncology, Department of Internal Medicine, Yonsei Cancer Center, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
- Won Seok Jang
- Miner School of Computer & Information Sciences, University of Massachusetts Lowell, Lowell, MA
- Woo Seob Sim
- Medical Informatics Collaboration Unit, Department of Research Affairs, Yonsei University College of Medicine, Seoul, South Korea
- Han Sang Kim
- Division of Medical Oncology, Department of Internal Medicine, Yonsei Cancer Center, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
- Jeong Eun Choi
- Office of Data Services at Division of Digital Health, Yonsei University Health System, Seoul, South Korea
- Eun Sil Baek
- Songdang Institute for Cancer Research, Yonsei University College of Medicine, Seoul, South Korea
- Yu Rang Park
- Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, South Korea
- Sang Joon Shin
- Division of Medical Oncology, Department of Internal Medicine, Yonsei Cancer Center, Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
23
Weberpals J, Wang SV. The FAIRification of research in real-world evidence: A practical introduction to reproducible analytic workflows using Git and R. Pharmacoepidemiol Drug Saf 2024; 33:e5740. [PMID: 38173166 DOI: 10.1002/pds.5740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 01/05/2024]
Abstract
Transparency and reproducibility are major prerequisites for conducting meaningful real-world evidence (RWE) studies that are fit for decision-making. Many advances have been made in the documentation and reporting of study protocols and results, but the principles for version control and sharing of analytic code in RWE are not yet as established as in other quantitative disciplines like computational biology and health informatics. In this practical tutorial, we aim to give an introduction to distributed version control systems (VCS) tailored toward the FAIR (Findable, Accessible, Interoperable, and Reproducible) implementation of RWE studies. To ease adoption, we provide detailed step-by-step instructions with practical examples on how the Git VCS and R programming language can be implemented into RWE study workflows to facilitate reproducible analyses. We further discuss and showcase how these tools can be used to track changes, collaborate, disseminate, and archive RWE studies through dedicated project repositories that maintain a complete audit trail of all relevant study documents.
Affiliation(s)
- Janick Weberpals
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Shirley V Wang
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
24
Moore JH, Li X, Chang JH, Tatonetti NP, Theodorescu D, Chen Y, Asselbergs FW, Venkatesan M, Wang ZP. SynTwin: A graph-based approach for predicting clinical outcomes using digital twins derived from synthetic patients. Pacific Symposium on Biocomputing 2024; 29:96-107. [PMID: 38160272 PMCID: PMC10827004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
The concept of a digital twin came from the engineering, industrial, and manufacturing domains to create virtual objects or machines that could inform the design and development of real objects. This idea is appealing for precision medicine where digital twins of patients could help inform healthcare decisions. We have developed a methodology for generating and using digital twins for clinical outcome prediction. We introduce a new approach that combines synthetic data and network science to create digital twins (i.e. SynTwin) for precision medicine. First, our approach starts by estimating the distance between all subjects based on their available features. Second, the distances are used to construct a network with subjects as nodes and edges defining distance less than the percolation threshold. Third, communities or cliques of subjects are defined. Fourth, a large population of synthetic patients are generated using a synthetic data generation algorithm that models the correlation structure of the data to generate new patients. Fifth, digital twins are selected from the synthetic patient population that are within a given distance defining a subject community in the network. Finally, we compare and contrast community-based prediction of clinical endpoints using real subjects, digital twins, or both within and outside of the community. Key to this approach are the digital twins defined using patient similarity that represent hypothetical unobserved patients with patterns similar to nearby real patients as defined by network distance and community structure. We apply our SynTwin approach to predicting mortality in a population-based cancer registry (n=87,674) from the Surveillance, Epidemiology, and End Results (SEER) program from the National Cancer Institute (USA). 
Our results demonstrate that nearest network neighbor prediction of mortality in this study is significantly improved with digital twins (AUROC=0.864, 95% CI=0.857-0.872) over just using real data alone (AUROC=0.791, 95% CI=0.781-0.800). These results suggest a network-based digital twin strategy using synthetic patients may add value to precision medicine efforts.
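The network steps listed in this abstract (pairwise distance, thresholded network, communities, twin selection) can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the SynTwin implementation: the fixed `threshold` stands in for the percolation threshold, connected components stand in for community detection, the synthetic patients are hand-written rather than generated, and all function names are hypothetical.

```python
import math

def build_network(points, threshold):
    """Connect every pair of subjects whose distance is below the threshold."""
    adjacency = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency

def find_communities(adjacency):
    """Connected components, used here as a stand-in for community detection."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(members))
    return components

def select_twins(community, real_points, synthetic_points, radius):
    """Keep synthetic patients within `radius` of any real community member."""
    return [s for s in synthetic_points
            if any(math.dist(s, real_points[i]) <= radius for i in community)]

# Hypothetical two-feature subjects and pre-generated synthetic patients.
real = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
synthetic = [(0.5, 0.5), (9.5, 10.5), (5, 5)]

comps = find_communities(build_network(real, threshold=2.0))
twins = select_twins(comps[0], real, synthetic, radius=1.0)
print(comps, twins)
```

In this sketch the synthetic patient at (5, 5) is assigned to no community, which mirrors the idea that digital twins are only those synthetic patients lying close to real subjects in the network.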
Affiliation(s)
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, United States
- Cedars-Sinai Cancer, Cedars-Sinai Medical Center, Los Angeles, CA, United States
25
Mannstadt I, Mehta B. Large language models and the future of rheumatology: assessing impact and emerging opportunities. Curr Opin Rheumatol 2024; 36:46-51. [PMID: 37729050 DOI: 10.1097/bor.0000000000000981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) have grown rapidly in size and capabilities as more training data and compute power have become available. Since the release of ChatGPT in late 2022, there has been growing interest and exploration around potential applications of LLM technology. Numerous examples and pilot studies demonstrating the capabilities of these tools have emerged across several domains. For rheumatology professionals and patients, LLMs have the potential to transform current practices in medicine. RECENT FINDINGS Recent studies have begun exploring capabilities of LLMs that can assist rheumatologists in clinical practice, research, and medical education, though applications are still emerging. In clinical settings, LLMs have shown promise in assisting healthcare professionals, enabling more personalized medicine or generating routine documentation like notes and letters. Challenges remain around integrating LLMs into clinical workflows, the accuracy of LLMs, and ensuring patient data confidentiality. In research, early experiments demonstrate that LLMs can offer analysis of datasets, with quality control as a critical piece. Lastly, LLMs could supplement medical education by providing personalized learning experiences and integration into established curriculums. SUMMARY As these powerful tools continue evolving at a rapid pace, rheumatology professionals should stay informed on how they may impact the field.
Affiliation(s)
- Bella Mehta
- Weill Cornell Medicine
- Hospital for Special Surgery, New York, New York, USA
26
Rafiei A, Ghiasi Rad M, Sikora A, Kamaleswaran R. Improving mixed-integer temporal modeling by generating synthetic data using conditional generative adversarial networks: A case study of fluid overload prediction in the intensive care unit. Comput Biol Med 2024; 168:107749. [PMID: 38011778 DOI: 10.1016/j.compbiomed.2023.107749] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/16/2023] [Revised: 10/29/2023] [Accepted: 11/20/2023] [Indexed: 11/29/2023]
Abstract
OBJECTIVE The challenge of mixed-integer temporal data, which is particularly prominent for medication use in the critically ill, limits the performance of predictive models. The purpose of this evaluation was to pilot test integrating synthetic data within an existing dataset of complex medication data to improve machine learning model prediction of fluid overload. MATERIALS AND METHODS This retrospective cohort study evaluated patients admitted to an ICU ≥ 72 h. Four machine learning algorithms to predict fluid overload after 48-72 h of ICU admission were developed using the original dataset. Then, two distinct synthetic data generation methodologies (synthetic minority over-sampling technique (SMOTE) and conditional tabular generative adversarial network (CTGAN)) were used to create synthetic data. Finally, a stacking ensemble technique designed to train a meta-learner was established. Models underwent training in three scenarios of varying qualities and quantities of datasets. RESULTS Training machine learning algorithms on the combined synthetic and original dataset overall increased the performance of the predictive models compared to training on the original dataset. The highest performing model was the meta-model trained on the combined dataset with 0.83 AUROC while it managed to significantly enhance the sensitivity across different training scenarios. DISCUSSION The integration of synthetically generated data is the first time such methods have been applied to ICU medication data and offers a promising solution to enhance the performance of machine learning models for fluid overload, which may be translated to other ICU outcomes. A meta-learner was able to make a trade-off between different performance metrics and improve the ability to identify the minority class.
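The core idea behind SMOTE, one of the two generation methods named in this abstract, is interpolation between a minority-class sample and one of its nearest minority neighbours. The sketch below is a stripped-down illustration of that idea for purely numeric features; the function name `smote_like`, the toy points, and the parameter defaults are assumptions, and it does not handle the mixed-integer medication data the study addresses.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new points by interpolating between a random minority sample
    and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest other minority samples, by Euclidean distance
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every new point lies on a segment between two existing minority samples, this kind of oversampling cannot create values outside the observed range, which is one reason generative models such as CTGAN are often evaluated alongside it.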
Affiliation(s)
- Alireza Rafiei
- Department of Computer Science and Informatics, Emory University, Ste. W302, 400 Dowman Dr., Atlanta, GA, 30322, USA
- Milad Ghiasi Rad
- Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Andrea Sikora
- University of Georgia College of Pharmacy, Department of Clinical and Administrative Pharmacy, Augusta, GA, USA
- Rishikesan Kamaleswaran
- Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA; Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA, USA
27
Prasanna A, Jing B, Plopper G, Miller KK, Sanjak J, Feng A, Prezek S, Vidyaprakash E, Thovarai V, Maier EJ, Bhattacharya A, Naaman L, Stephens H, Watford S, Boscardin WJ, Johanson E, Lienau A. Synthetic Health Data Can Augment Community Research Efforts to Better Inform the Public During Emerging Pandemics. medRxiv 2023:2023.12.11.23298687. [PMID: 38168217 PMCID: PMC10760275 DOI: 10.1101/2023.12.11.23298687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
The COVID-19 pandemic had disproportionate effects on the Veteran population due to the increased prevalence of medical and environmental risk factors. Synthetic electronic health record (EHR) data can help meet the acute need for Veteran population-specific predictive modeling efforts by avoiding the strict barriers to access currently present within Veteran Health Administration (VHA) datasets. The U.S. Food and Drug Administration (FDA) and the VHA launched the precisionFDA COVID-19 Risk Factor Modeling Challenge to develop COVID-19 diagnostic and prognostic models; identify Veteran population-specific risk factors; and test the usefulness of synthetic data as a substitute for real data. The use of synthetic data boosted challenge participation by providing a dataset that was accessible to all competitors. Models trained on synthetic data showed performance metrics similar to, but systematically higher than, those of models trained on real data. The important risk factors identified in the synthetic data largely overlapped with those identified from the real data, and both sets of risk factors were validated in the literature. Tradeoffs exist between synthetic data generation approaches based on whether a real EHR dataset is required as input. Synthetic data generated directly from real EHR input will more closely align with the characteristics of the relevant cohort. This work shows that synthetic EHR data will have practical value to the Veterans' health research community for the foreseeable future.
Affiliation(s)
- Bocheng Jing
- Northern California Institute for Research and Education
- San Francisco VA Medical Center
- Sean Watford
- Booz Allen Hamilton
- Currently U.S. Environmental Protection Agency
- W John Boscardin
- University of California, San Francisco, Department of Medicine
- University of California, San Francisco, Department of Epidemiology & Biostatistics
28
Timbie JW, Reynolds KA, Evans EL, Brown DS, Cohen JW, Darien G, DeVoe JE, Grosse SD, Holve E, Meltzer DO, Merritt JG, Neumann PJ, Yabroff KR, Smith SR. Advancing Data Capacity for Economic Outcomes in Patient-Centered Outcomes Research: Challenges and Opportunities. Med Care 2023; 61:S161-S165. [PMID: 37963036 PMCID: PMC10635327 DOI: 10.1097/mlr.0000000000001901] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/16/2023]
Affiliation(s)
- Emily L. Evans
- US Department of Health and Human Services, Office of the Assistant Secretary for Planning and Evaluation, Washington, DC
- Derek S. Brown
- Brown School, Washington University in St. Louis, St. Louis, MO
- Joel W. Cohen
- Agency for Healthcare Research and Quality, Rockville, MD
- Gwen Darien
- National Patient Advocate Foundation, Washington, DC
- Jennifer E. DeVoe
- Department of Family Medicine, Oregon Health & Science University, Portland, OR
- Scott D. Grosse
- National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, GA
- Erin Holve
- Patient Centered Outcomes Research Institute, Washington, DC
- David O. Meltzer
- Departments of Economics and Medicine, University of Chicago Harris School of Public Policy, Chicago, IL
- Scott R. Smith
- US Department of Health and Human Services, Office of the Assistant Secretary for Planning and Evaluation, Washington, DC
29
Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko M, Ryu KS. Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy. JMIR Med Inform 2023; 11:e47859. [PMID: 37999942 DOI: 10.2196/47859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 08/02/2023] [Accepted: 10/28/2023] [Indexed: 11/25/2023] Open
Abstract
BACKGROUND Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. OBJECTIVE This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. METHODS The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. RESULTS The synthetic data of the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). 
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. CONCLUSIONS This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
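The Cramer V criterion used in step (1) of the partitioning can be illustrated with a short sketch that computes the statistic for pairs of categorical columns and picks the most strongly associated pair. This is the plain (uncorrected) Cramér's V formula; the toy table, its column names, and the helper `cramers_v` are assumptions for illustration and do not reproduce the study's data or code.

```python
import math
from collections import Counter
from itertools import combinations

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical columns (0 to 1)."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    chi2 = 0.0
    for r, row_count in row_totals.items():
        for c, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Hypothetical toy table with three categorical columns.
table = {
    "smoker": ["y", "y", "n", "n", "y", "n"],
    "stage": ["late", "late", "early", "early", "late", "early"],
    "sex": ["f", "m", "f", "m", "m", "f"],
}

# Column pair with the strongest association, a candidate partitioning key.
best = max(combinations(table, 2),
           key=lambda pair: cramers_v(table[pair[0]], table[pair[1]]))
print(best, cramers_v(table[best[0]], table[best[1]]))
```

In a divide-and-conquer setup such as the one described, the data would then be split into subsets along the selected column before each subset is passed to the generator and the outputs are recombined.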
Affiliation(s)
- Ha Ye Jin Kang
- Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
- Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- Erdenebileg Batbaatar
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Dong-Woo Choi
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Kui Son Choi
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
- Department of Cancer Control and Policy, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- Minsam Ko
- Department of Applied Artificial Intelligence, Hanyang University, Ansan, Republic of Korea
- Department of Human-Computer Interaction, Hanyang University, Ansan, Republic of Korea
- Kwang Sun Ryu
- Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Gyeonggi-do, Republic of Korea
- National Cancer Data Center, National Cancer Control Institute, National Cancer Center, Gyeonggi-do, Republic of Korea
30
Choi J, Marwaha JS. Clinical prediction tool pitfalls and considerations: Data and algorithms. Surgery 2023; 174:1270-1272. [PMID: 37709646 DOI: 10.1016/j.surg.2023.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 08/02/2023] [Accepted: 08/08/2023] [Indexed: 09/16/2023]
Abstract
In recent years, many surgical prediction models have been developed and published to augment surgeon decision-making, predict postoperative patient trajectories, and more. Collectively underlying all of these models is a wide variety of data sources and algorithms. Each data set and algorithm has its unique strengths, weaknesses, and type of prediction task for which it is best suited. The purpose of this piece is to highlight important characteristics of common data sources and algorithms used in surgical prediction model development so that future researchers interested in developing models of their own may be able to critically evaluate them and select the optimal ones for their study.
Affiliation(s)
- Jeff Choi
- Department of Surgery, Stanford University, Stanford, CA. https://www.twitter.com/JeffChoi01
- Jayson S Marwaha
- Department of Surgery, Georgetown University Medical Center, Washington, DC
31
Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. NPJ Digit Med 2023; 6:186. [PMID: 37813960 PMCID: PMC10562365 DOI: 10.1038/s41746-023-00927-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 04/15/2023] [Accepted: 09/14/2023] [Indexed: 10/11/2023] Open
Abstract
Data-driven decision-making in modern healthcare underpins innovation and predictive analytics in public health and clinical research. Synthetic data has shown promise in finance and economics to improve risk assessment, portfolio optimization, and algorithmic trading. However, higher stakes, potential liabilities, and healthcare practitioner distrust make clinical use of synthetic data difficult. This paper explores the potential benefits and limitations of synthetic data in the healthcare analytics context. We begin with real-world healthcare applications of synthetic data that inform government policy, enhance data privacy, and augment datasets for predictive analytics. We then preview future applications of synthetic data in the emergent field of digital twin technology. We explore the issues of data quality and data bias in synthetic data, which can limit applicability across different applications in the clinical context, and privacy concerns stemming from data misuse and risk of re-identification. Finally, we evaluate the role of regulatory agencies in promoting transparency and accountability and propose strategies for risk mitigation such as Differential Privacy (DP) and a dataset chain of custody to maintain data integrity, traceability, and accountability. Synthetic data can improve healthcare, but measures to protect patient well-being and maintain ethical standards are key to promoting responsible use.
Affiliation(s)
- Mauro Giuffrè
- Department of Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, CT, USA
- Department of Medical, Surgical and Health Science, University of Trieste, Trieste, Italy
- Dennis L Shung
- Department of Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, CT, USA
32
Tagmatova Z, Abdusalomov A, Nasimov R, Nasimova N, Dogru AH, Cho YI. New Approach for Generating Synthetic Medical Data to Predict Type 2 Diabetes. Bioengineering (Basel) 2023; 10:1031. [PMID: 37760133 PMCID: PMC10525473 DOI: 10.3390/bioengineering10091031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 08/28/2023] [Accepted: 08/30/2023] [Indexed: 09/29/2023] Open
Abstract
The lack of medical databases is currently the main barrier to the development of artificial intelligence-based algorithms in medicine. This issue can be partially resolved by developing a reliable high-quality synthetic database. In this study, an easy and reliable method for developing a synthetic medical database based only on statistical data is proposed. This method changes the primary database developed based on statistical data using a special shuffle algorithm to achieve a satisfactory result and evaluates the resulting dataset using a neural network. Using the proposed method, a database was developed to predict the risk of developing type 2 diabetes 5 years in advance. This dataset consisted of data from 172,290 patients. The prediction accuracy reached 94.45% during neural network training of the dataset.
Affiliation(s)
- Zarnigor Tagmatova
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
- Akmalbek Abdusalomov
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
- Rashid Nasimov
- Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
- Nigorakhon Nasimova
- Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
- Ali Hikmet Dogru
- Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249-0667, USA
- Young-Im Cho
- Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-Si 461-701, Republic of Korea
33
Jacobs F, D'Amico S, Benvenuti C, Gaudio M, Saltalamacchia G, Miggiano C, De Sanctis R, Della Porta MG, Santoro A, Zambelli A. Opportunities and Challenges of Synthetic Data Generation in Oncology. JCO Clin Cancer Inform 2023; 7:e2300045. [PMID: 37535875 DOI: 10.1200/cci.23.00045] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/23/2023] [Revised: 05/05/2023] [Accepted: 05/25/2023] [Indexed: 08/05/2023] Open
Abstract
Widespread interest in artificial intelligence (AI) in health care has focused mainly on deductive systems that analyze available real-world data to discover patterns not otherwise visible. Generative adversarial networks, a new type of inductive AI, have recently evolved to generate high-fidelity virtual synthetic data (SD) after training on relatively limited real-world information. The AI system is fed a collection of real data and learns to generate new augmented data while maintaining the general characteristics of the original data set. The use of SD to enhance clinical research and protect patient privacy has drawn considerable interest in medicine and in the complex field of oncology. This article summarizes the main characteristics of this innovative technology and critically discusses how it can be used to accelerate data access for secondary purposes, providing an overview of the opportunities and challenges of SD generation for clinical cancer research and health care.
Affiliation(s)
- Flavia Jacobs
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Benvenuti
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Mariangela Gaudio
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Miggiano
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Rita De Sanctis
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Matteo Giovanni Della Porta
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Armando Santoro
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Alberto Zambelli
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
|
34
|
McDonnell KJ. Leveraging the Academic Artificial Intelligence Silecosystem to Advance the Community Oncology Enterprise. J Clin Med 2023; 12:4830. [PMID: 37510945 PMCID: PMC10381436 DOI: 10.3390/jcm12144830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/05/2023] [Accepted: 07/07/2023] [Indexed: 07/30/2023] Open
Abstract
Over the last 75 years, artificial intelligence has evolved from a theoretical concept and novel paradigm describing the role that computers might play in our society to a tool with which we daily engage. In this review, we describe AI in terms of its constituent elements, the synthesis of which we refer to as the AI Silecosystem. Herein, we provide an historical perspective of the evolution of the AI Silecosystem, conceptualized and summarized as a Kuhnian paradigm. This manuscript focuses on the role that the AI Silecosystem plays in oncology and its emerging importance in the care of the community oncology patient. We observe that this important role arises out of a unique alliance between the academic oncology enterprise and community oncology practices. We provide evidence of this alliance by illustrating the practical establishment of the AI Silecosystem at the City of Hope Comprehensive Cancer Center and its team utilization by community oncology providers.
Affiliation(s)
- Kevin J McDonnell
- Center for Precision Medicine, Department of Medical Oncology & Therapeutics Research, City of Hope Comprehensive Cancer Center, Duarte, CA 91010, USA
|
35
|
Rafiei A, Rad MG, Sikora A, Kamaleswaran R. Improving irregular temporal modeling by integrating synthetic data to the electronic medical record using conditional GANs: a case study of fluid overload prediction in the intensive care unit. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.06.20.23291680. [PMID: 37425768 PMCID: PMC10327174 DOI: 10.1101/2023.06.20.23291680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Objective: The challenge of irregular temporal data, which is particularly prominent for medication use in the critically ill, limits the performance of predictive models. The purpose of this evaluation was to pilot test integrating synthetic data within an existing dataset of complex medication data to improve machine learning model prediction of fluid overload. Materials and Methods: This retrospective cohort study evaluated patients admitted to an ICU for ≥ 72 hours. Four machine learning algorithms to predict fluid overload after 48-72 hours of ICU admission were developed using the original dataset. Then, two distinct synthetic data generation methodologies (synthetic minority over-sampling technique (SMOTE) and conditional tabular generative adversarial network (CT-GAN)) were used to create synthetic data. Finally, a stacking ensemble technique designed to train a meta-learner was established. Models underwent training in three scenarios of varying qualities and quantities of datasets. Results: Training machine learning algorithms on the combined synthetic and original dataset overall increased the performance of the predictive models compared to training on the original dataset. The highest performing model was the metamodel trained on the combined dataset, with an AUROC of 0.83; it also significantly enhanced sensitivity across the different training scenarios. Discussion: This integration of synthetically generated data is the first application of such methods to ICU medication data and offers a promising way to enhance the performance of machine learning models for fluid overload, an approach that may translate to other ICU outcomes. A meta-learner was able to make a trade-off between different performance metrics and improve the ability to identify the minority class.
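The abstract names SMOTE without describing it. As a rough illustration only, the interpolation idea behind SMOTE can be sketched as below. This is not the authors' implementation: the function name `smote_sketch`, the neighbour count `k`, the seed, and the toy two-feature data are all hypothetical, and a real study would more likely rely on an established library.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two numeric features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point is a convex combination of two real minority samples, the generated values stay within the range spanned by the observed minority class, which is one reason the technique is commonly used to rebalance a rare outcome before training.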
|