1. McDarby M, Mroz EL, Hahne J, Malling CD, Carpenter BD, Parker PA. "Hospice Care Could Be a Compassionate Choice": ChatGPT Responses to Questions About Decision Making in Advanced Cancer. J Palliat Med 2024. PMID: 39263979. DOI: 10.1089/jpm.2024.0256.
Abstract
Background: Patients with cancer use the internet to inform medical decision making. Objective: To examine the content of ChatGPT responses to a hypothetical patient question about decision making in advanced cancer. Design: We developed a medical advice-seeking vignette in English about a patient with metastatic melanoma. When inputting this vignette, we varied five characteristics (patient age, race, ethnicity, insurance status, and preexisting recommendation of hospice/the opinion of an adult daughter regarding the recommendation). ChatGPT responses (N = 96) were coded for mentions of: hospice care, palliative care, financial implications of treatment, second opinions, clinical trials, discussing the decision with loved ones, and discussing the decision with care providers. We conducted additional analyses to understand how ChatGPT described hospice and referenced the adult daughter. Data were analyzed using descriptive statistics and chi-square analysis. Results: Responses more frequently mentioned clinical trials for vignettes describing 45-year-old patients compared with 65- and 85-year-old patients. When vignettes mentioned a preexisting recommendation for hospice, responses more frequently mentioned seeking a second opinion and hospice care. ChatGPT's descriptions of hospice focused primarily on its ability to provide comfort and support. When vignettes referenced the daughter's opinion on the hospice recommendation, approximately one third of responses also referenced this, stating the importance of talking to her about treatment preferences and values. Conclusion: ChatGPT responses to questions about advanced cancer decision making can be heterogeneous based on demographic and clinical characteristics. Findings underscore the possible impact of this heterogeneity on treatment decision making in patients with cancer.
Affiliation(s)
- Meghan McDarby: Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Emily L Mroz: Section of Geriatrics, Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut, USA; Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, Georgia, USA
- Jessica Hahne: Department of Psychological and Brain Sciences, Washington University in St. Louis, St. Louis, Missouri, USA
- Charlotte D Malling: Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Brian D Carpenter: Department of Psychological and Brain Sciences, Washington University in St. Louis, St. Louis, Missouri, USA
- Patricia A Parker: Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, USA
2. Kayastha A, Lakshmanan K, Valentine MJ, Nguyen A, Dholakia K, Wang D. Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT. N Am Spine Soc J 2024; 19:100333. PMID: 39040948. PMCID: PMC11261487. DOI: 10.1016/j.xnsj.2024.100333.
Abstract
Background ChatGPT is an advanced language AI able to generate responses to clinical questions regarding lumbar disc herniation with radiculopathy. Artificial intelligence (AI) tools are increasingly being considered to assist clinicians in decision-making. This study compared ChatGPT-3.5 and ChatGPT-4.0 responses to established NASS clinical guidelines and evaluated concordance. Methods ChatGPT-3.5 and ChatGPT-4.0 were prompted with fifteen questions from the 2012 NASS Clinical Guidelines for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Clinical questions, organized into categories, were entered verbatim as queries into ChatGPT. Language output was assessed by two independent authors on September 26, 2023 against four operationally defined parameters: accuracy, over-conclusiveness, supplementary information, and incompleteness. ChatGPT-3.5 and ChatGPT-4.0 performance was compared via chi-square analyses. Results Among the fifteen responses produced by ChatGPT-3.5, 7 (47%) were accurate, 7 (47%) were over-conclusive, 15 (100%) were supplementary, and 6 (40%) were incomplete. For ChatGPT-4.0, 10 (67%) were accurate, 5 (33%) were over-conclusive, 10 (67%) were supplementary, and 6 (40%) were incomplete. There was a statistically significant difference in supplementary information (100% vs. 67%; p=.014) between ChatGPT-3.5 and ChatGPT-4.0. Accuracy (47% vs. 67%; p=.269), over-conclusiveness (47% vs. 33%; p=.456), and incompleteness (40% vs. 40%; p=1.000) did not differ significantly between the two models. Both ChatGPT-3.5 and ChatGPT-4.0 yielded 100% accuracy in the definition and the history and physical examination categories. Diagnostic testing yielded 0% accuracy for ChatGPT-3.5 and 100% accuracy for ChatGPT-4.0. Nonsurgical interventions had 50% accuracy for ChatGPT-3.5 and 63% accuracy for ChatGPT-4.0. Surgical interventions resulted in 0% accuracy for ChatGPT-3.5 and 33% accuracy for ChatGPT-4.0.
Conclusions ChatGPT-4.0 provided less supplementary information and higher overall accuracy across question categories than ChatGPT-3.5. ChatGPT showed reasonable concordance with NASS guidelines, but clinicians should be cautious about using ChatGPT in its current state, as it does not safeguard against misinformation.
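The reported supplementary-information comparison (15/15 responses for ChatGPT-3.5 vs. 10/15 for ChatGPT-4.0) can be reproduced as a 2x2 chi-square test; a minimal sketch, assuming the study used an uncorrected (non-Yates) chi-square, which is what matches the reported p = .014:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table for the "supplementary information" parameter:
# rows = model (ChatGPT-3.5, ChatGPT-4.0), cols = (supplementary, not supplementary)
observed = [[15, 0],   # ChatGPT-3.5: 15/15 responses supplementary
            [10, 5]]   # ChatGPT-4.0: 10/15 responses supplementary

# correction=False (no Yates continuity correction) reproduces the reported p = .014
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")  # chi2 = 6.00, p = 0.014
```

The same construction applies to the accuracy, over-conclusiveness, and incompleteness comparisons, none of which reach significance at these sample sizes.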
Affiliation(s)
- Anh Nguyen: Kansas City University, Kansas City, MO, United States
- Daniel Wang: MedStar Health, Baltimore, MD, United States; Georgetown University Medical Center, Washington DC, United States
3. Liang Z, Li J, Tang Y, Zhang Y, Chen C, Li S, Wang X, Xu X, Zhuang Z, He S, Deng B. Predicting the risk category of thymoma with machine learning-based computed tomography radiomics signatures and their between-imaging phase differences. Sci Rep 2024; 14:19215. PMID: 39160177. PMCID: PMC11333573. DOI: 10.1038/s41598-024-69735-3.
Abstract
The aim of this study was to develop a medical imaging and comprehensive stacked learning-based method for predicting high- and low-risk thymoma. A total of 126 patients with thymomas and 5 patients with thymic carcinoma treated at our institution, comprising 65 low-risk patients and 66 high-risk patients, were retrospectively recruited. Among them, 78 patients composed the training cohort, while the remaining 53 patients formed the validation cohort. We extracted 1702 features each from the patients' arterial-, venous-, and plain-phase images. Pairwise subtraction of these features yielded 1702 arterial-venous, arterial-plain, and venous-plain difference features each. The Mann-Whitney U test, least absolute shrinkage and selection operator (LASSO), and SelectKBest methods were employed to select the best features from the training set. Six basic imaging models were built with a stacked ensemble learning algorithm: for each of the six feature sets, three machine learning algorithms (XGBoost, multilayer perceptron (MLP), and random forest) were combined by an XGBoost meta-learner. The XGBoost algorithm was then applied to the six basic imaging models to construct a combined radiomic model. Finally, the radiomic model was combined with clinical information to create a nomogram that could easily be used in clinical practice to predict the thymoma risk category. The areas under the curve (AUCs) of the combined radiomic model in the training and validation cohorts were 0.999 (95% CI 0.988-1.000) and 0.967 (95% CI 0.916-1.000), respectively, while those of the nomogram were 0.999 (95% CI 0.996-1.000) and 0.983 (95% CI 0.990-1.000). This study describes the application of CT-based radiomics in thymoma patients and proposes a nomogram for predicting the risk category for this disease, which could be advantageous for clinical decision-making for affected patients.
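The stacking scheme described (three base learners combined by a boosting meta-learner) can be sketched with scikit-learn. This is a minimal illustration, not the authors' pipeline: synthetic data stands in for the 1702 CT radiomics features, and `GradientBoostingClassifier` stands in for XGBoost, which is not assumed to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for one phase's feature set (the study had 78 training
# and 53 validation patients, roughly balanced low-/high-risk).
X, y = make_classification(n_samples=131, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=53,
                                                  random_state=0)

# Base learners combined by a boosting meta-learner, mirroring the described
# XGBoost + MLP + random forest stack.
stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("mlp", MLPClassifier(max_iter=2000, random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    final_estimator=GradientBoostingClassifier(random_state=0),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)
stack.fit(X_train, y_train)
print(f"validation accuracy: {stack.score(X_val, y_val):.2f}")
```

The `cv=5` argument matters for the technique: the meta-learner is trained on out-of-fold predictions of the base models, which is what prevents the stack from simply memorizing the base learners' training-set fit.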
Affiliation(s)
- Zhu Liang: Department of Cardiothoracic Surgery, Affiliated Hospital of Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Jiamin Li: Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Yihan Tang: Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Yaxuan Zhang: Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Chunyuan Chen: Department of Cardiothoracic Surgery, Affiliated Hospital of Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Siyuan Li: Sun Yat-Sen University, Yuexiu District, Guangzhou, Guangdong, China
- Xuefeng Wang: Department of Cardiothoracic Surgery, Affiliated Hospital of Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Xinyan Xu: Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Ziye Zhuang: Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
- Shuyan He: Guangzhou Medical University, Panyu District, Guangzhou, Guangdong, China; Department of Radiology, Guangdong Women and Children Hospital, Guangzhou, China
- Biao Deng: Department of Cardiothoracic Surgery, Affiliated Hospital of Guangdong Medical University, Xiashan District, Zhanjiang, Guangdong, China
4. Young CC, Enichen E, Rao A, Hilker S, Butler A, Laird-Gion J, Succi MD. Pilot Study of Large Language Models as an Age-Appropriate Explanatory Tool for Chronic Pediatric Conditions. medRxiv [preprint] 2024:2024.08.06.24311544. PMID: 39148860. PMCID: PMC11326333. DOI: 10.1101/2024.08.06.24311544.
Abstract
A gap exists in patient education resources for children with chronic conditions. This pilot study assesses the capacity of large language models (LLMs) to deliver developmentally appropriate explanations of chronic conditions to pediatric patients. Two commonly used LLMs generated responses that accurately, appropriately, and effectively communicated complex medical information, making them a potentially valuable tool for enhancing patient understanding and engagement in clinical settings.
Affiliation(s)
- Cameron C. Young: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA
- Elizabeth Enichen: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA
- Arya Rao: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA
- Sidney Hilker: Harvard Medical School, Boston, MA; Boston Children’s Hospital, Boston, MA
- Alex Butler: Harvard Medical School, Boston, MA; Boston Children’s Hospital, Boston, MA
- Jessica Laird-Gion: Harvard Medical School, Boston, MA; Boston Children’s Hospital, Boston, MA
- Marc D. Succi: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA; Department of Radiology, Massachusetts General Hospital, Boston, MA
5. Haltaufderheide J, Ranisch R. The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs). NPJ Digit Med 2024; 7:183. PMID: 38977771. PMCID: PMC11231310. DOI: 10.1038/s41746-024-01157-x.
Abstract
With the introduction of ChatGPT, Large Language Models (LLMs) have received enormous attention in healthcare. Despite potential benefits, researchers have underscored various ethical implications. While individual instances have garnered attention, a systematic and comprehensive overview of practical applications currently researched and ethical issues connected to them is lacking. Against this background, this work maps the ethical landscape surrounding the current deployment of LLMs in medicine and healthcare through a systematic review. Electronic databases and preprint servers were queried using a comprehensive search strategy which generated 796 records. Studies were screened and extracted following a modified rapid review approach. Methodological quality was assessed using a hybrid approach. For 53 records, a meta-aggregative synthesis was performed. Four general fields of applications emerged showcasing a dynamic exploration phase. Advantages of using LLMs are attributed to their capacity in data analysis, information provisioning, support in decision-making or mitigating information loss and enhancing information accessibility. However, our study also identifies recurrent ethical concerns connected to fairness, bias, non-maleficence, transparency, and privacy. A distinctive concern is the tendency to produce harmful or convincing but inaccurate content. Calls for ethical guidance and human oversight are recurrent. We suggest that the ethical guidance debate should be reframed to focus on defining what constitutes acceptable human oversight across the spectrum of applications. This involves considering the diversity of settings, varying potentials for harm, and different acceptable thresholds for performance and certainty in healthcare. Additionally, critical inquiry is needed to evaluate the necessity and justification of LLMs' current experimental use.
Affiliation(s)
- Joschka Haltaufderheide: Faculty of Health Sciences Brandenburg, University of Potsdam, Am Mühlenberg 9, Potsdam, 14476, Germany
- Robert Ranisch: Faculty of Health Sciences Brandenburg, University of Potsdam, Am Mühlenberg 9, Potsdam, 14476, Germany
6. Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res 2024; 26:e56110. PMID: 38976865. PMCID: PMC11263899. DOI: 10.2196/56110.
Abstract
BACKGROUND OpenAI's ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Additionally, recent studies have demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated. OBJECTIVE This study compares the diagnostic accuracy of ChatGPT (GPT-3.5 and GPT-4) with that of the primary treating resident physicians in an ED setting. METHODS Among 100 adults admitted to our ED in January 2023 with internal medicine issues, diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those made by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system for grading accuracy. RESULTS The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Furthermore, across various disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians. It demonstrated significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01). However, in other categories, the differences were not statistically significant. CONCLUSIONS In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design of the study and its limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings.
Affiliation(s)
- Matthias K Auer: Department of Medicine IV, LMU University Hospital, Munich, Germany
- Anna Strüven: Department of Medicine I, LMU University Hospital, Munich, Germany; Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
- Steffen Massberg: Department of Medicine I, LMU University Hospital, Munich, Germany; Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
- Christopher Stremmel: Department of Medicine I, LMU University Hospital, Munich, Germany; Munich Heart Alliance Partner Site, Deutsches Zentrum für Herz-Kreislaufforschung (German Centre for Cardiovascular Research), LMU University Hospital, Munich, Germany
7. Mohammadi SS, Nguyen QD. A User-friendly Approach for the Diagnosis of Diabetic Retinopathy Using ChatGPT and Automated Machine Learning. Ophthalmol Sci 2024; 4:100495. PMID: 38690313. PMCID: PMC11059323. DOI: 10.1016/j.xops.2024.100495.
Abstract
Purpose To assess the capabilities of Chat Generative Pre-trained Transformer (ChatGPT) and Vertex AI in executing code-free preprocessing, training machine learning (ML) models, and analyzing data. Design Evaluation of diagnostic test or technology. Participants ChatGPT and Vertex AI, as a publicly available large language model and an ML platform, respectively. Methods ChatGPT was employed to improve the resolution of fundus photography images from the Methods to Evaluate Segmentation and Indexing Techniques in the field of Retinal Ophthalmology (Messidor-2) open-source dataset using the Contrast Limited Adaptive Histogram Equalization (CLAHE) technique in Fiji software. Subsequently, Vertex AI, an automated ML (AutoML) platform, was utilized to develop two classification models. The first model served as a binary classifier for detecting the presence of diabetic retinopathy (DR), while the second determined its severity. Finally, ChatGPT was used to provide scripts for the R and Python programming languages for data analysis and was also directly employed to analyze the data in a code-free manner. Main Outcome Measures Evaluating the utility of ChatGPT in generating scripts for preprocessing images using Fiji and analyzing data in Python and R, and assessing its potential for code-free data analysis. Investigating the capability of Vertex AI to train image classification models for detection of DR and its severity. Results Two ML models were trained using 1740 images from the Messidor-2 database. The first model, designed to detect the severity of DR, achieved an area under the precision-recall curve (AUPRC) of 0.81, with a precision of 81.81% and recall of 72.83%. The second model, tailored for detection of the presence of DR, recorded a precision and recall of 84.48% with an AUPRC of 0.90.
Conclusions ChatGPT and Vertex AI have the potential to enable physicians without coding expertise to preprocess images, analyze data, and train ML models.
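The precision and recall figures reported above follow directly from the classifier's confusion counts. A minimal sketch, with hypothetical true-positive/false-positive/false-negative counts chosen only to illustrate how a value like 84.48% for both metrics can arise:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall (sensitivity) = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical confusion counts for a binary DR-presence classifier.
# With fp == fn, precision and recall coincide, as in the reported 84.48%.
precision, recall = precision_recall(tp=49, fp=9, fn=9)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")  # 84.48% each
```

The AUPRC summarizes this trade-off across all classification thresholds rather than at the single operating point shown here.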
Affiliation(s)
- S. Saeed Mohammadi: Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California
- Quan Dong Nguyen: Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, California
8. Mitchell J, Bennett TD. Navigating Complexity: Enhancing Pediatric Diagnostics With Large Language Models. Pediatr Crit Care Med 2024; 25:577-580. PMID: 38836714. PMCID: PMC11160974. DOI: 10.1097/pcc.0000000000003483.
Affiliation(s)
- James Mitchell: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO
- Tellen D Bennett: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO; Department of Pediatrics (Critical Care Medicine), University of Colorado School of Medicine, Aurora, CO
9. Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, Ribeira R, Rose C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med Inform 2024; 12:e53787. PMID: 38728687. PMCID: PMC11127144. DOI: 10.2196/53787.
Abstract
BACKGROUND Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. OBJECTIVE Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. METHODS Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. RESULTS A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. 
We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. CONCLUSIONS LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
Affiliation(s)
- Carl Preiksaitis, Nicholas Ashenburg, Gabrielle Bunney, Andrew Chu, Rana Kabeer, Fran Riley, Ryan Ribeira, Christian Rose: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
10. Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. PMID: 38365990. DOI: 10.1007/s00405-024-08509-z.
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology and to compare its performance to that of medical experts. METHODS We conducted a cross-sectional comparative study in which 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, not significantly different from that of ENT physicians or ENT residents. However, correctness rates differed significantly between ChatGPT and FM specialists (49.8%, p < 0.001) and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement on the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2, and Med3. However, it showed limitations in identifying the most critical diagnosis.
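Inter-rater agreement of the kind graded "poor or fair" above is commonly quantified with Cohen's kappa, which corrects observed agreement for agreement expected by chance. A minimal sketch with hypothetical diagnosis labels (not the study's data), comparing one top diagnosis per case between two raters:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each label's marginal frequencies.
    chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    return (observed - chance) / (1 - chance)

# Hypothetical top diagnoses for 8 cases from ChatGPT and an ENT physician
gpt = ["otitis", "rhinitis", "polyp", "otitis", "laryngitis", "polyp", "otitis", "rhinitis"]
ent = ["otitis", "rhinitis", "otitis", "otitis", "laryngitis", "polyp", "polyp", "rhinitis"]
print(round(cohens_kappa(gpt, ent), 2))  # 0.65
```

By common benchmarks (e.g., Landis and Koch), kappa below 0.20 is "poor/slight" and 0.21-0.40 "fair", the range the study reports between ChatGPT and the participant groups.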
Affiliation(s)
- Mikhael Makhoul, Antoine E Melkane, Patrick El Khoury, Christopher El Hadi, Nayla Matar: Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
11. Ha LT, Kelley KD. Artificial Intelligence: Promise or Pitfalls? A Clinical Vignette of Real-Life ChatGPT Implementation in Perioperative Medicine. J Gen Intern Med 2024; 39:1063-1067. PMID: 38252252. DOI: 10.1007/s11606-024-08611-2.
Affiliation(s)
- Leslie Thienly Ha: Department of Internal Medicine, University of California, Davis, Davis, USA; Sacramento, USA
- Kristen D Kelley: Department of Internal Medicine, University of California, Davis, Davis, USA
12. Rao A, Kim J, Lie W, Pang M, Fuh L, Dreyer KJ, Succi MD. Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care. J Med Syst 2024; 48:41. PMID: 38632172. DOI: 10.1007/s10916-024-02058-y.
Abstract
Polypharmacy remains an important challenge for patients with extensive medical complexity. Given the primary care shortage and the aging population, effective polypharmacy management is crucial to address the increasing burden of care. The capacity of large language model (LLM)-based artificial intelligence to aid in polypharmacy management has yet to be evaluated. Here, we evaluate ChatGPT's performance in polypharmacy management via its deprescribing decisions in standardized clinical vignettes. We inputted several clinical vignettes, originally from a study of general practitioners' deprescribing decisions, into ChatGPT 3.5, a publicly available LLM, and evaluated its capacity for yes/no binary deprescribing decisions as well as list-based prompts in which the model was asked to choose which of several medications to deprescribe. We recorded ChatGPT responses to yes/no binary deprescribing prompts and the number and types of medications deprescribed. In yes/no binary deprescribing decisions, ChatGPT universally recommended deprescribing medications regardless of activities of daily living (ADL) status in patients with no underlying cardiovascular disease (CVD) history; in patients with CVD history, ChatGPT's answers varied by technical replicate. The total number of medications deprescribed ranged from 2.67 to 3.67 (out of 7) and did not vary with CVD status but increased linearly with severity of ADL impairment. Among medication types, ChatGPT preferentially deprescribed pain medications. ChatGPT's deprescribing decisions vary along the axes of ADL status, CVD history, and medication type, indicating some concordance of internal logic between general practitioners and the model. These results indicate that specifically trained LLMs may provide useful clinical support in polypharmacy management for primary care physicians.
Collapse
Affiliation(s)
- Arya Rao
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA
| | - John Kim
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA
| | - Winston Lie
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA
| | - Michael Pang
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA
| | - Lanting Fuh
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA
| | - Keith J Dreyer
- Harvard Medical School, Boston, MA, USA
- Data Science Office, Mass General Brigham, Boston, MA, USA
| | - Marc D Succi
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA.
- Massachusetts General Hospital, Department of Radiology, 55 Fruit Street, Boston, MA, 02114, USA.
| |
Collapse
|
13
|
Sievert M, Aubreville M, Mueller SK, Eckstein M, Breininger K, Iro H, Goncalves M. Diagnosis of malignancy in oropharyngeal confocal laser endomicroscopy using GPT 4.0 with vision. Eur Arch Otorhinolaryngol 2024; 281:2115-2122. [PMID: 38329525 DOI: 10.1007/s00405-024-08476-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Accepted: 01/11/2024] [Indexed: 02/09/2024]
Abstract
PURPOSE Confocal laser endomicroscopy (CLE) is an imaging tool that has demonstrated potential for intraoperative, real-time, non-invasive, microscopic assessment of surgical margins of oropharyngeal squamous cell carcinoma (OPSCC). However, interpreting CLE images remains challenging. This study investigates the application of OpenAI's Generative Pretrained Transformer (GPT) 4.0 with Vision capabilities for automated classification of CLE images in OPSCC. METHODS CLE images of histologically confirmed SCC or healthy mucosa were retrieved and anonymized from a database of 12,809 CLE images from 5 patients with OPSCC. Using a training set of 16 images, a validation set of 139 images, comprising SCC (83 images, 59.7%) and healthy normal mucosa (56 images, 40.3%), was classified via the application programming interface (API) of GPT-4.0. The same set of images was also classified by CLE experts (two surgeons and one pathologist), who were blinded to the histology. Diagnostic metrics, the reliability of GPT, and inter-rater reliability were assessed. RESULTS Overall accuracy of the GPT model was 71.2%; the intra-rater agreement was κ = 0.837, indicating almost perfect agreement across the three runs of GPT-generated results. Human experts achieved an accuracy of 88.5% with a substantial level of agreement (κ = 0.773). CONCLUSIONS Though limited to a specific clinical framework, patient cohort, and image set, this study sheds light on some previously unexplored diagnostic capabilities of large language models using few-shot prompting. It suggests the model's ability to extrapolate information and classify CLE images with minimal example data. Whether future versions of the model can achieve clinically relevant diagnostic accuracy, especially on uncurated data sets, remains to be investigated.
Collapse
Affiliation(s)
- Matti Sievert
- Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
| | | | - Sarina Katrin Mueller
- Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
| | - Markus Eckstein
- Institute of Pathology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital, Erlangen, Germany
| | - Katharina Breininger
- Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Heinrich Iro
- Department of Otorhinolaryngology, Head and Neck Surgery, Friedrich Alexander University of Erlangen-Nuremberg, Erlangen University Hospital, Erlangen, Germany
| | - Miguel Goncalves
- Department of Otorhinolaryngology, Plastic and Aesthetic Operations, University Hospital Würzburg, Joseph-Schneider-Straße 11, 97080, Würzburg, Germany.
| |
Collapse
|
14
|
Ahmed W, Saturno M, Rajjoub R, Duey AH, Zaidat B, Hoang T, Restrepo Mejia M, Gallate ZS, Shrestha N, Tang J, Zapolsky I, Kim JS, Cho SK. ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis. Eur Spine J 2024:10.1007/s00586-024-08198-6. [PMID: 38489044 DOI: 10.1007/s00586-024-08198-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Revised: 02/01/2024] [Accepted: 02/17/2024] [Indexed: 03/17/2024]
Abstract
BACKGROUND CONTEXT Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision making. Recent advancements in large language models and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, known as ChatGPT, can quickly synthesize information and generate responses grounded in medical literature, which may prove to be a useful tool in clinical decision making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision making with regard to degenerative spondylolisthesis. PURPOSE The study aimed to compare ChatGPT's concordance with the recommendations set forth by the North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and to assess ChatGPT's accuracy within the context of the most recent literature. METHODS ChatGPT-3.5 and ChatGPT-4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points made in the NASS recommendation. Responses graded "nonconcordant" were further stratified into two subcategories, "insufficient" or "over-conclusive," to provide further insight into grading rationale. Responses between GPT-3.5 and GPT-4.0 were compared using chi-squared tests. RESULTS ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). 
Categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% for clinical questions on which NASS did not provide a clear recommendation (7/19). A further breakdown of ChatGPT-3.5's nonconcordance with the guidelines revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held up at 68.4% for clinical questions on which NASS did not provide a clear recommendation (13/19, P = 0.104). CONCLUSIONS This study sheds light on the duality of LLM applications within clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions for which NASS offered recommendations. However, for questions without NASS best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and even fabricated data and citations. Thus, clinicians should exercise extreme caution when consulting ChatGPT for clinical recommendations, taking care to ensure its reliability within the context of recent literature.
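The reported GPT-3.5 versus GPT-4.0 comparison (13/28 vs. 19/28 concordant, P = 0.177) is consistent with a chi-squared test with Yates continuity correction on the 2x2 concordance table. A minimal standard-library sketch, assuming that test variant (the abstract does not specify which one was used):

```python
import math

def yates_chi2_p(a, b, c, d):
    """Chi-squared test with Yates continuity correction for the
    2x2 table [[a, b], [c, d]]; returns the two-sided p-value (1 df)."""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = num / den
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Rows: GPT-3.5 (13 concordant, 15 nonconcordant), GPT-4.0 (19, 9)
p = yates_chi2_p(13, 15, 19, 9)
print(round(p, 3))  # ~0.177, matching the reported P value
```

Without the continuity correction the same table gives p of roughly 0.105, so the correction appears to account for the published value.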
Collapse
Affiliation(s)
- Wasil Ahmed
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | | - Rami Rajjoub
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Akiro H Duey
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Bashar Zaidat
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Timothy Hoang
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | | | | - Nancy Shrestha
- Chicago Medical School at Rosalind Franklin University, North Chicago, IL, USA
| | - Justin Tang
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ivan Zapolsky
- Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
| | - Jun S Kim
- Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA
| | - Samuel K Cho
- Department of Orthopedics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY, 10029, USA.
| |
Collapse
|
15
|
Park YJ, Pillai A, Deng J, Guo E, Gupta M, Paget M, Naugler C. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024; 24:72. [PMID: 38475802 DOI: 10.1186/s12911-024-02459-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Accepted: 02/12/2024] [Indexed: 03/14/2024] Open
Abstract
IMPORTANCE Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base. OBJECTIVE This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications. EVIDENCE REVIEW We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations. FINDINGS Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility. CONCLUSIONS AND RELEVANCE This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.
Collapse
Affiliation(s)
- Ye-Jean Park
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada.
| | - Abhinav Pillai
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Jiawen Deng
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, M5S 1A8, Toronto, ON, Canada
| | - Eddie Guo
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mehul Gupta
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Mike Paget
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| | - Christopher Naugler
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, T2N 4N1, Calgary, AB, Canada
| |
Collapse
|
16
|
Tunçer G, Güçlü KG. How Reliable is ChatGPT as a Novel Consultant in Infectious Diseases and Clinical Microbiology? Infect Dis Clin Microbiol 2024; 6:55-59. [PMID: 38633442 PMCID: PMC11020004 DOI: 10.36519/idcm.2024.286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 10/24/2023] [Accepted: 12/14/2023] [Indexed: 04/19/2024]
Abstract
Objective The study aimed to investigate the reliability of ChatGPT's answers to medical questions, including those sourced from patients and guideline recommendations. The focus was on evaluating ChatGPT's accuracy in responding to various types of infectious disease questions. Materials and Methods The study was conducted using 200 questions sourced from social media, experts, and guidelines related to various infectious diseases, including urinary tract infection, pneumonia, HIV, various types of hepatitis, COVID-19, skin infections, and tuberculosis. The questions were screened for clarity and consistency, and repetitive or unclear ones were excluded. Reference answers were based on guidelines from reputable sources such as the Infectious Diseases Society of America (IDSA), Centers for Disease Control and Prevention (CDC), European Association for the Study of the Liver (EASL), and Joint United Nations Programme on HIV/AIDS (UNAIDS) AIDSinfo. According to the scoring system, completely correct answers were given 1 point, and completely incorrect ones were given 4 points. To assess reproducibility, each question was posed twice on separate computers. Repeatability was determined by the consistency of the answers' scores. Results ChatGPT was asked 200 questions: 107 from social media platforms and 93 from guidelines. The questions covered a range of topics: urinary tract infections (n=18), pneumonia (n=22), HIV (n=39), hepatitis B and C (n=53), COVID-19 (n=11), skin and soft tissue infections (n=38), and tuberculosis (n=19). The lowest accuracy, 72%, was for urinary tract infection questions. ChatGPT answered 92% of social media platform questions correctly (scored 1 point) versus 69% of guideline questions (p=0.001; OR=5.48, 95% CI=2.29-13.11). Conclusion Artificial intelligence is widely used in the medical field by both healthcare professionals and patients. Although ChatGPT answers questions from social media platforms quite accurately, we recommend that healthcare professionals exercise caution when using it.
Collapse
Affiliation(s)
- Gülşah Tunçer
- Bilecik Training and Research Hospital, Bilecik, Türkiye
| | | |
Collapse
|
17
|
Zhou Y, Moon C, Szatkowski J, Moore D, Stevens J. Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop Surg Traumatol 2024; 34:927-955. [PMID: 37776392 PMCID: PMC10858115 DOI: 10.1007/s00590-023-03742-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 09/18/2023] [Indexed: 10/02/2023]
Abstract
PURPOSE The integration of artificial intelligence (AI) tools, such as ChatGPT, in clinical medicine and medical education has gained significant attention due to their potential to support decision-making and improve patient care. However, there is a need to evaluate the benefits and limitations of these tools in specific clinical scenarios. METHODS This study used a case study approach within the field of orthopaedic surgery. A clinical case report featuring a 53-year-old male with a femoral neck fracture was used as the basis for evaluation. ChatGPT, a large language model, was asked to respond to clinical questions related to the case. The responses generated by ChatGPT were evaluated qualitatively, considering their relevance, justification, and alignment with the responses of real clinicians. Alternative dialogue protocols were also employed to assess the impact of additional prompts and contextual information on ChatGPT responses. RESULTS ChatGPT generally provided clinically appropriate responses to the questions posed in the clinical case report. However, the level of justification and explanation varied across the generated responses. Occasionally, clinically inappropriate responses and inconsistencies were observed across different dialogue protocols and on separate days. CONCLUSIONS The findings of this study highlight both the potential and limitations of using ChatGPT in clinical practice. While ChatGPT demonstrated the ability to provide relevant clinical information, the lack of consistent justification and occasional clinically inappropriate responses raise concerns about its reliability. These results underscore the importance of careful consideration and validation when using AI tools in healthcare. Further research and clinician training are necessary to effectively integrate AI tools like ChatGPT, ensuring their safe and reliable use in clinical decision-making.
Collapse
Affiliation(s)
- Yushy Zhou
- Department of Surgery, The University of Melbourne, St. Vincent's Hospital Melbourne, 29 Regent Street, Clinical Sciences Block Level 2, Melbourne, VIC, 3010, Australia.
- Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Australia.
| | - Charles Moon
- Department of Orthopaedic Surgery, Cedars-Sinai Medical Centre, Los Angeles, CA, USA
| | - Jan Szatkowski
- Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, IN, USA
| | - Derek Moore
- Santa Barbara Orthopedic Associates, Santa Barbara, CA, USA
| | - Jarrad Stevens
- Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Australia
| |
Collapse
|
18
|
Tenner ZM, Cottone MC, Chavez MR. Harnessing the open access version of ChatGPT for enhanced clinical opinions. PLOS Digit Health 2024; 3:e0000355. [PMID: 38315648 PMCID: PMC10843476 DOI: 10.1371/journal.pdig.0000355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 01/11/2024] [Indexed: 02/07/2024]
Abstract
With the advent of Large Language Models (LLMs) like ChatGPT, the integration of Generative Artificial Intelligence (GAI) into clinical medicine is becoming increasingly feasible. This study aimed to evaluate the ability of the freely available ChatGPT-3.5 to generate complex differential diagnoses, comparing its output to case records of the Massachusetts General Hospital published in the New England Journal of Medicine (NEJM). Forty case records were presented to ChatGPT-3.5, prompting it to provide a differential diagnosis and then narrow it down to the most likely diagnosis. The results indicated that the final diagnosis was included in ChatGPT-3.5's original differential list in 42.5% of the cases. After narrowing, ChatGPT correctly determined the final diagnosis in 27.5% of the cases, demonstrating a decrease in accuracy compared to previous studies using common chief complaints. These findings emphasize the necessity for further investigation into the capabilities and limitations of LLMs in clinical scenarios while highlighting the potential role of GAI as an augmented clinical opinion. As GAI tools like ChatGPT grow and improve, physicians and other healthcare workers will likely find increasing support in generating differential diagnoses. However, continued exploration and regulation are essential to ensure the safe and effective integration of GAI into healthcare practice. Future studies may seek to compare newer versions of ChatGPT or investigate patient outcomes when physicians integrate this GAI technology. Understanding and expanding GAI's capabilities, particularly in differential diagnosis, may foster innovation and provide additional resources, especially in underserved areas of medicine.
Collapse
Affiliation(s)
- Zachary M. Tenner
- New York University Grossman Long Island School of Medicine, Mineola, New York, United States of America
| | - Michael C. Cottone
- New York University Grossman Long Island School of Medicine, Mineola, New York, United States of America
| | - Martin R. Chavez
- New York University Grossman Long Island School of Medicine, Mineola, New York, United States of America
- Department of Obstetrics and Gynecology, New York University Langone Health–Long Island, Mineola, New York, United States of America
| |
Collapse
|
19
|
Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls : A Narrative Review. Ann Intern Med 2024; 177:210-220. [PMID: 38285984 DOI: 10.7326/m23-2772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/31/2024] Open
Abstract
Large language models (LLMs) are artificial intelligence models trained on vast text data to generate humanlike outputs. They have been applied to various tasks in health care, ranging from answering medical examination questions to generating clinical reports. With increasing institutional partnerships between companies producing LLMs and health systems, the real-world clinical application of these models is nearing realization. As these models gain traction, health care practitioners must understand what LLMs are, their development, their current and potential applications, and the associated pitfalls in a medical setting. This review, coupled with a tutorial, provides a comprehensive yet accessible overview of these areas with the aim of familiarizing health care professionals with the rapidly changing landscape of LLMs in medicine. Furthermore, the authors highlight active research areas in the field that promise to improve LLMs' usability in health care contexts.
Collapse
Affiliation(s)
- Jesutofunmi A Omiye
- Department of Dermatology and Department of Biomedical Data Science, Stanford University, Stanford, California (J.A.O., R.D.)
| | - Haiwen Gui
- Department of Dermatology, Stanford University, Stanford, California (H.G., S.J.R.)
| | - Shawheen J Rezaei
- Department of Dermatology, Stanford University, Stanford, California (H.G., S.J.R.)
| | - James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, California (J.Z.)
| | - Roxana Daneshjou
- Department of Dermatology and Department of Biomedical Data Science, Stanford University, Stanford, California (J.A.O., R.D.)
| |
Collapse
|
20
|
Padovan M, Cosci B, Petillo A, Nerli G, Porciatti F, Scarinci S, Carlucci F, Dell’Amico L, Meliani N, Necciari G, Lucisano VC, Marino R, Foddis R, Palla A. ChatGPT in Occupational Medicine: A Comparative Study with Human Experts. Bioengineering (Basel) 2024; 11:57. [PMID: 38247934 PMCID: PMC10813435 DOI: 10.3390/bioengineering11010057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Revised: 01/01/2024] [Accepted: 01/04/2024] [Indexed: 01/23/2024] Open
Abstract
The objective of this study is to evaluate ChatGPT's accuracy and reliability in answering complex medical questions related to occupational health and to explore the implications and limitations of AI in occupational health medicine. The study also provides recommendations for future research in this area and informs decision-makers about AI's impact on healthcare. A group of physicians was enlisted to create a dataset of questions and answers on Italian occupational medicine legislation. The physicians were divided into two teams, and each team member was assigned a different subject area. ChatGPT was used to generate answers for each question, with and without legislative context. The two teams then evaluated the human- and AI-generated answers in a blinded fashion, with each group reviewing the other group's work. Occupational physicians outperformed ChatGPT in generating accurate answers, as rated on a 5-point Likert scale, while the answers provided by ChatGPT with access to legislative texts were comparable to those of professional doctors. Still, we found that users tend to prefer answers generated by humans, indicating that while ChatGPT is useful, users still value the opinions of occupational medicine professionals.
Collapse
Affiliation(s)
- Martina Padovan
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Bianca Cosci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Armando Petillo
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gianluca Nerli
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Porciatti
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Sergio Scarinci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Francesco Carlucci
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Letizia Dell’Amico
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Niccolò Meliani
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Gabriele Necciari
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Vincenzo Carmelo Lucisano
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Riccardo Marino
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | - Rudy Foddis
- Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, 56126 Pisa, Italy; (M.P.); (B.C.); (A.P.); (G.N.); (F.P.); (S.S.); (F.C.); (L.D.); (N.M.); (G.N.); (R.M.)
| | | |
Collapse
|
21
|
Morales-Ramirez P, Mishek H, Dasgupta A. The Genie Is Out of the Bottle: What ChatGPT Can and Cannot Do for Medical Professionals. Obstet Gynecol 2024; 143:e1-e6. [PMID: 37944140 DOI: 10.1097/aog.0000000000005446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 10/12/2023] [Indexed: 11/12/2023]
Abstract
ChatGPT is a cutting-edge artificial intelligence technology that was released for public use in November 2022. Its rapid adoption has raised questions about capabilities, limitations, and risks. This article presents an overview of ChatGPT, and it highlights the current state of this technology for the medical field. The article seeks to provide a balanced perspective on what the model can and cannot do in three specific domains: clinical practice, research, and medical education. It also provides suggestions on how to optimize the use of this tool.
Collapse
|
22
|
Koranteng E, Rao A, Flores E, Lev M, Landman A, Dreyer K, Succi M. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Med Educ 2023; 9:e51199. [PMID: 38153778 PMCID: PMC10884892 DOI: 10.2196/51199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2023] [Revised: 10/01/2023] [Accepted: 10/14/2023] [Indexed: 12/29/2023]
Abstract
The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment.
Collapse
Affiliation(s)
- Arya Rao: Harvard Medical School, Boston, MA, United States
- Efren Flores: Harvard Medical School, Boston, MA, United States
- Michael Lev: Harvard Medical School, Boston, MA, United States
- Adam Landman: Harvard Medical School, Boston, MA, United States
- Keith Dreyer: Harvard Medical School, Boston, MA, United States
- Marc Succi: Massachusetts General Hospital, Boston, United States
23.
Alkhaaldi SMI, Kassab CH, Dimassi Z, Oyoun Alsoud L, Al Fahim M, Al Hageh C, Ibrahim H. Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study. JMIR Med Educ 2023; 9:e51302. [PMID: 38133911 PMCID: PMC10770787 DOI: 10.2196/51302]
Abstract
BACKGROUND Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology's capabilities, potential, and risks, there is a gap in studying the perspective of end users. OBJECTIVE The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers. METHODS A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies. RESULTS Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively). 
CONCLUSIONS The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine.
Affiliation(s)
- Saif M I Alkhaaldi: Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
- Carl H Kassab: Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
- Zakia Dimassi: Department of Medical Science, Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
- Leen Oyoun Alsoud: Department of Medical Science, Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
- Maha Al Fahim: Education Institute, Sheikh Khalifa Medical City, Abu Dhabi, United Arab Emirates
- Cynthia Al Hageh: Department of Medical Science, Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
- Halah Ibrahim: Department of Medical Science, Khalifa University College of Medicine and Health Sciences, Abu Dhabi, United Arab Emirates
24.
Bagde H, Dhopte A, Alam MK, Basri R. A systematic review and meta-analysis on ChatGPT and its utilization in medical and dental research. Heliyon 2023; 9:e23050. [PMID: 38144348 PMCID: PMC10746423 DOI: 10.1016/j.heliyon.2023.e23050]
Abstract
Since its release, ChatGPT has taken the world by storm with its utilization in various fields of life. This review's main goal was to offer a thorough and fact-based evaluation of ChatGPT's potential as a tool for medical and dental research, which could direct subsequent research and influence clinical practice. METHODS Several online databases were searched for relevant articles in accordance with the study objectives. A team of reviewers was assembled to devise the methodological framework for article inclusion and meta-analysis. RESULTS Eleven descriptive studies were considered for this review that evaluated the accuracy of ChatGPT in answering medical queries related to different domains such as systematic reviews, cancer, liver diseases, diagnostic imaging, education, and COVID-19 vaccination. The studies reported different accuracy ranges, from 18.3% to 100%, across various datasets and specialties. The meta-analysis showed an odds ratio (OR) of 2.25 and a relative risk (RR) of 1.47 with a 95% confidence interval (CI), indicating that the accuracy of ChatGPT in providing correct responses was significantly higher compared with the total responses for queries. However, significant heterogeneity was present among the studies, suggesting considerable variability in the effect sizes across the included studies. CONCLUSION The observations indicate that ChatGPT can provide appropriate solutions to questions in the medical and dental fields, but researchers and clinicians should cautiously assess its responses because they might not always be dependable. Overall, the importance of this study rests in shedding light on ChatGPT's accuracy in the medical and dental fields and emphasizing the need for additional investigation to enhance its performance.
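The pooled OR and RR above are standard effect measures computed from a 2x2 table. As a quick illustration of the two formulas, a minimal sketch with hypothetical correct/incorrect counts for two groups (not the review's data):

```python
# Odds ratio and relative risk from a 2x2 table [[a, b], [c, d]],
# where a/b are correct/incorrect counts in one group and c/d in the other.
# The counts below are hypothetical, for illustration only.

def odds_ratio(a, b, c, d):
    # OR = (a/b) / (c/d): ratio of the odds of a correct response
    return (a / b) / (c / d)

def relative_risk(a, b, c, d):
    # RR = (a/(a+b)) / (c/(c+d)): ratio of the proportions correct
    return (a / (a + b)) / (c / (c + d))

a, b, c, d = 60, 40, 40, 60  # hypothetical counts
print(round(odds_ratio(a, b, c, d), 2))     # 2.25
print(round(relative_risk(a, b, c, d), 2))  # 1.5
```

With these made-up counts the OR (2.25) sits further from 1 than the RR (1.5), the usual pattern when the outcome is common, which is worth keeping in mind when reading pooled ORs as if they were risks.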
Affiliation(s)
- Hiroj Bagde: Department of Periodontology, Chhattisgarh Dental College and Research Institute, Rajnandgaon, Chhattisgarh, India
- Ashwini Dhopte: Department of Oral Medicine and Radiology, Chhattisgarh Dental College and Research Institute, Rajnandgaon, Chhattisgarh, India
- Mohammad Khursheed Alam: Preventive Dentistry Department, College of Dentistry, Jouf University, Sakaka, 72345, Saudi Arabia; Department of Dental Research Cell, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences, Chennai, India; Department of Public Health, Faculty of Allied Health Sciences, Daffodil International University, Dhaka, Bangladesh
- Rehana Basri: Department of Internal Medicine, College of Medicine, Jouf University, Sakaka, 72345, Saudi Arabia
25.
Pagano S, Holzapfel S, Kappenschneider T, Meyer M, Maderbacher G, Grifka J, Holzapfel DE. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol 2023; 24:61. [PMID: 38015298 PMCID: PMC10684473 DOI: 10.1186/s10195-023-00740-4]
Abstract
BACKGROUND The spread of artificial intelligence (AI) has led to transformative advancements in diverse sectors, including healthcare. Specifically, generative writing systems have shown potential in various applications, but their effectiveness in clinical settings has barely been investigated. In this context, we evaluated the proficiency of ChatGPT-4 in diagnosing gonarthrosis and coxarthrosis and recommending appropriate treatments compared with orthopaedic specialists. METHODS A retrospective review was conducted using anonymized medical records of 100 patients previously diagnosed with either knee or hip arthrosis. ChatGPT-4 was employed to analyse these historical records, formulating both a diagnosis and potential treatment suggestions. Subsequently, a comparative analysis was conducted to assess the concordance between the AI's conclusions and the original clinical decisions made by the physicians. RESULTS In diagnostic evaluations, ChatGPT-4 consistently aligned with the conclusions previously drawn by physicians. In terms of treatment recommendations, there was an 83% agreement between the AI and orthopaedic specialists. The therapeutic concordance was verified by a Cohen's kappa coefficient of 0.580 (p < 0.001), indicating a moderate-to-good level of agreement. In recommendations pertaining to surgical treatment, the AI demonstrated a sensitivity and specificity of 78% and 80%, respectively. Multivariable logistic regression demonstrated that reduced quality of life (OR 49.97, p < 0.001) and start-up pain (OR 12.54, p = 0.028) influenced ChatGPT-4's recommendation for surgery. CONCLUSION This study emphasises ChatGPT-4's notable potential in diagnosing conditions such as gonarthrosis and coxarthrosis and in aligning its treatment recommendations with those of orthopaedic specialists.
However, it is crucial to acknowledge that AI tools such as ChatGPT-4 are not meant to replace the nuanced expertise and clinical judgment of seasoned orthopaedic surgeons, particularly in complex decision-making scenarios regarding treatment indications. Due to the exploratory nature of the study, further research with larger patient populations and more complex diagnoses is necessary to validate the findings and explore the broader potential of AI in healthcare. LEVEL OF EVIDENCE Level III evidence.
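The agreement statistics in this abstract (Cohen's kappa, sensitivity, specificity) all derive from a 2x2 confusion table comparing the AI's call with the specialists' decision. The sketch below is illustrative only: the counts are hypothetical, chosen to yield values of the same magnitude as those reported, and are not the study's data.

```python
# Agreement statistics for two binary raters (e.g. AI vs. specialist,
# surgery yes/no), from a 2x2 confusion table. Counts are hypothetical.

def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fn + fp + tn
    po = (tp + tn) / n                          # observed agreement
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)   # chance agreement on "yes"
    p_no = ((fp + tn) / n) * ((fn + tn) / n)    # chance agreement on "no"
    pe = p_yes + p_no                           # total chance agreement
    return (po - pe) / (1 - pe)

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical confusion table (not the study's data).
tp, fn, fp, tn = 39, 11, 10, 40
print(round(cohens_kappa(tp, fn, fp, tn), 3))      # 0.58
print(sensitivity(tp, fn), specificity(tn, fp))    # 0.78 0.8
```

Kappa discounts the agreement two raters would reach by chance, which is why an observed agreement of 79% here collapses to a "moderate" kappa of 0.58.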
Affiliation(s)
- Stefano Pagano: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Sabrina Holzapfel: Department of Neonatology, University Children's Hospital Regensburg, Hospital St. Hedwig of the Order of St. John, University of Regensburg, Regensburg, Germany
- Tobias Kappenschneider: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Matthias Meyer: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Günther Maderbacher: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Joachim Grifka: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
- Dominik Emanuel Holzapfel: Department of Orthopaedic Surgery, University of Regensburg, Asklepios Klinikum, Bad Abbach, Germany
26.
Wong RSY, Ming LC, Raja Ali RA. The Intersection of ChatGPT, Clinical Medicine, and Medical Education. JMIR Med Educ 2023; 9:e47274. [PMID: 37988149 DOI: 10.2196/47274]
Abstract
As we progress deeper into the digital age, the robust development and application of advanced artificial intelligence (AI) technology, specifically generative language models like ChatGPT (OpenAI), have potential implications in all sectors including medicine. This viewpoint article aims to present the authors' perspective on the integration of AI models such as ChatGPT in clinical medicine and medical education. The unprecedented capacity of ChatGPT to generate human-like responses, refined through Reinforcement Learning with Human Feedback, could significantly reshape the pedagogical methodologies within medical education. Through a comprehensive review and the authors' personal experiences, this viewpoint article elucidates the pros, cons, and ethical considerations of using ChatGPT within clinical medicine and notably, its implications for medical education. This exploration is crucial in a transformative era where AI could potentially augment human capability in the process of knowledge creation and dissemination, potentially revolutionizing medical education and clinical practice. The importance of maintaining academic integrity and professional standards is highlighted. The relevance of establishing clear guidelines for the responsible and ethical use of AI technologies in clinical medicine and medical education is also emphasized.
Affiliation(s)
- Rebecca Shin-Yee Wong: Department of Medical Education, School of Medical and Life Sciences, Sunway University, Selangor, Malaysia; Faculty of Medicine, Nursing and Health Sciences, SEGi University, Petaling Jaya, Malaysia
- Long Chiau Ming: School of Medical and Life Sciences, Sunway University, Selangor, Malaysia
- Raja Affendi Raja Ali: School of Medical and Life Sciences, Sunway University, Selangor, Malaysia; GUT Research Group, Faculty of Medicine, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
27.
Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, Nebelung S. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep 2023; 13:20159. [PMID: 37978240 PMCID: PMC10656559 DOI: 10.1038/s41598-023-47500-2]
Abstract
Large language models (LLMs) have shown potential in various applications, including clinical practice. However, their accuracy and utility in providing treatment recommendations for orthopedic conditions remain to be investigated. Thus, this pilot study aims to evaluate the validity of treatment recommendations generated by GPT-4 for common knee and shoulder orthopedic conditions using anonymized clinical MRI reports. A retrospective analysis was conducted using 20 anonymized clinical MRI reports, with varying severity and complexity. Treatment recommendations were elicited from GPT-4 and evaluated by two board-certified specialty-trained senior orthopedic surgeons. Their evaluation focused on semiquantitative gradings of accuracy and clinical utility and potential limitations of the LLM-generated recommendations. GPT-4 provided treatment recommendations for 20 patients (mean age, 50 years ± 19 [standard deviation]; 12 men) with acute and chronic knee and shoulder conditions. The LLM produced largely accurate and clinically useful recommendations. However, limited awareness of a patient's overall situation, a tendency to incorrectly appreciate treatment urgency, and largely schematic and unspecific treatment recommendations were observed and may reduce its clinical usefulness. In conclusion, LLM-based treatment recommendations are largely adequate and not prone to 'hallucinations', yet inadequate in particular situations. Critical guidance by healthcare professionals is obligatory, and independent use by patients is discouraged, given the dependency on precise data input.
Grants
- ODELIA, 101057091 European Union's Horizon Europe programme
- COMFORT, 101079894 European Union's Horizon Europe programme
- TR 1700/7-1 Deutsche Forschungsgemeinschaft
- NE 2136/3-1 Deutsche Forschungsgemeinschaft
- DEEP LIVER, ZMVI1-2520DAT111 Bundesministerium für Gesundheit
- #70113864 Max-Eder-Programme of the German Cancer Aid
- PEARL, 01KD2104C German Federal Ministry of Education and Research
- CAMINO, 01EO2101 German Federal Ministry of Education and Research
- SWAG, 01KD2215A German Federal Ministry of Education and Research
- TRANSFORM LIVER, 031L0312A German Federal Ministry of Education and Research
- TANGERINE, 01KT2302 through ERA-NET Transcan German Federal Ministry of Education and Research
- SECAI, 57616814 Deutscher Akademischer Austauschdienst
- Transplant.KI, 01VSF21048 German Federal Joint Committee
- GENIAL, 101096312 European Union's Horizon Europe and innovation programme
- NIHR, NIHR213331 National Institute for Health and Care Research
- RWTH Aachen University (3131)
Affiliation(s)
- Daniel Truhn: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Christian D Weber: Department of Orthopaedics and Trauma Surgery, University Hospital RWTH Aachen, Aachen, Germany
- Benedikt J Braun: University Hospital Tuebingen on Behalf of the Eberhard-Karls-University Tuebingen, BG Hospital, Schnarrenbergstr. 95, Tübingen, Germany
- Keno Bressem: Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Hindenburgdamm 30, 12203, Berlin, Germany
- Jakob N Kather: Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany; Department of Medicine I, University Hospital Dresden, Dresden, Germany; Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany; Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
- Christiane Kuhl: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Sven Nebelung: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
28.
Gödde D, Nöhl S, Wolf C, Rupert Y, Rimkus L, Ehlers J, Breuckmann F, Sellmann T. A SWOT (Strengths, Weaknesses, Opportunities, and Threats) Analysis of ChatGPT in the Medical Literature: Concise Review. J Med Internet Res 2023; 25:e49368. [PMID: 37865883 PMCID: PMC10690535 DOI: 10.2196/49368]
Abstract
BACKGROUND ChatGPT is a 175-billion-parameter natural language processing model that is already involved in scientific content and publications. Its influence ranges from providing quick access to information on medical topics and assisting in generating medical and scientific articles and papers, to performing medical data analyses and even interpreting complex data sets. OBJECTIVE The future role of ChatGPT became a matter of debate shortly after its release and remains uncertain. This review aimed to analyze the role of ChatGPT in the medical literature during the first 3 months after its release. METHODS We performed a concise review of literature published in PubMed from December 1, 2022, to March 31, 2023. To find all publications related to ChatGPT or considering ChatGPT, the search term was kept simple ("ChatGPT" in AllFields). All publications available as full text in German or English were included. All accessible publications were evaluated according to specifications by the author team (eg, impact factor, publication modus, article type, publication speed, and type of ChatGPT integration or content). The conclusions of the articles were used for later SWOT (strengths, weaknesses, opportunities, and threats) analysis. All data were analyzed on a descriptive basis. RESULTS Of 178 studies in total, 160 met the inclusion criteria and were evaluated. The average impact factor was 4.423 (range 0-96.216), and the average publication speed was 16 (range 0-83) days. Among the articles, there were 77 editorials (48.1%), 43 essays (26.9%), 21 studies (13.1%), 6 reviews (3.8%), 6 case reports (3.8%), 6 news items (3.8%), and 1 meta-analysis (0.6%). Of those, 54.4% (n=87) were published as open access, with 5% (n=8) provided on preprint servers. Over 400 quotes with information on strengths, weaknesses, opportunities, and threats were detected. By far the largest share (n=142, 34.8%) related to weaknesses.
ChatGPT excels in its ability to express ideas clearly and formulate general contexts comprehensibly. It performs so well that even experts in the field have difficulty identifying abstracts generated by ChatGPT. However, the time-limited scope and the need for corrections by experts were mentioned as weaknesses and threats of ChatGPT. Opportunities include assistance in formulating medical issues for nonnative English speakers, as well as the possibility of timely participation in the development of such artificial intelligence tools since it is in its early stages and can therefore still be influenced. CONCLUSIONS Artificial intelligence tools such as ChatGPT are already part of the medical publishing landscape. Despite their apparent opportunities, policies and guidelines must be implemented to ensure benefits in education, clinical practice, and research and protect against threats such as scientific misconduct, plagiarism, and inaccuracy.
Affiliation(s)
- Daniel Gödde: Department of Pathology and Molecularpathology, Helios University Hospital Wuppertal, Witten/Herdecke University, Witten, Germany
- Sophia Nöhl: Faculty of Health, Witten/Herdecke University, Witten, Germany
- Carina Wolf: Faculty of Health, Witten/Herdecke University, Witten, Germany
- Yannick Rupert: Faculty of Health, Witten/Herdecke University, Witten, Germany
- Lukas Rimkus: Faculty of Health, Witten/Herdecke University, Witten, Germany
- Jan Ehlers: Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Frank Breuckmann: Department of Cardiology and Vascular Medicine, West German Heart and Vascular Center Essen, University Duisburg-Essen, Essen, Germany; Department of Cardiology, Pneumology, Neurology and Intensive Care Medicine, Klinik Kitzinger Land, Kitzingen, Germany
- Timur Sellmann: Department of Anaesthesiology I, Witten/Herdecke University, Witten, Germany; Department of Anaesthesiology and Intensive Care Medicine, Evangelisches Krankenhaus BETHESDA zu Duisburg, Duisburg, Germany
29.
Yu P, Xu H, Hu X, Deng C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare (Basel) 2023; 11:2776. [PMID: 37893850 PMCID: PMC10606429 DOI: 10.3390/healthcare11202776]
Abstract
Generative artificial intelligence (AI) and large language models (LLMs), exemplified by ChatGPT, are promising for revolutionizing data and information management in healthcare and medicine. However, there is scant literature guiding their integration for non-AI professionals. This study conducts a scoping literature review to address the critical need for guidance on integrating generative AI and LLMs into healthcare and medical practices. It elucidates the distinct mechanisms underpinning these technologies, such as reinforcement learning from human feedback (RLHF) and techniques such as few-shot learning and chain-of-thought reasoning, which differentiate them from traditional, rule-based AI systems. Realizing their benefits requires an inclusive, collaborative co-design process that engages all pertinent stakeholders, including clinicians and consumers. Although global research is examining both opportunities and challenges, including ethical and legal dimensions, LLMs offer promising advancements in healthcare by enhancing data management, information retrieval, and decision-making processes. Continued innovation in data acquisition, model fine-tuning, prompt strategy development, evaluation, and system implementation is imperative for realizing the full potential of these technologies. Organizations should proactively engage with these technologies to improve healthcare quality, safety, and efficiency, adhering to ethical and legal guidelines for responsible application.
Affiliation(s)
- Ping Yu: School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
- Hua Xu: Section of Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, Fl 9, New Haven, CT 06510, USA
- Xia Hu: Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- Chao Deng: School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, NSW 2522, Australia
30.
Miao H, Li C, Wang J. A Future of Smarter Digital Health Empowered by Generative Pretrained Transformer. J Med Internet Res 2023; 25:e49963. [PMID: 37751243 PMCID: PMC10565615 DOI: 10.2196/49963]
Abstract
Generative pretrained transformer (GPT) tools have been thriving, as ignited by the remarkable success of OpenAI's recent chatbot product. GPT technology offers countless opportunities to significantly improve or renovate current health care research and practice paradigms, especially digital health interventions and digital health-enabled clinical care, and a future of smarter digital health can thus be expected. In particular, GPT technology can be incorporated through various digital health platforms in homes and hospitals embedded with numerous sensors, wearables, and remote monitoring devices. In this viewpoint paper, we highlight recent research progress that depicts the future picture of a smarter digital health ecosystem through GPT-facilitated centralized communications, automated analytics, personalized health care, and instant decision-making.
Affiliation(s)
- Hongyu Miao: College of Nursing, Florida State University, Tallahassee, FL, United States
- Chengdong Li: College of Nursing, Florida State University, Tallahassee, FL, United States
- Jing Wang: College of Nursing, Florida State University, Tallahassee, FL, United States
31.
Sallam M, Salim NA, Barakat M, Al-Mahzoum K, Al-Tammemi AB, Malaeb D, Hallit R, Hallit S. Assessing Health Students' Attitudes and Usage of ChatGPT in Jordan: Validation Study. JMIR Med Educ 2023; 9:e48254. [PMID: 37578934 PMCID: PMC10509747 DOI: 10.2196/48254]
Abstract
BACKGROUND ChatGPT is a conversational large language model that has the potential to revolutionize knowledge acquisition. However, the impact of this technology on the quality of education is still unknown considering the risks and concerns surrounding ChatGPT use. Therefore, it is necessary to assess the usability and acceptability of this promising tool. As an innovative technology, the intention to use ChatGPT can be studied in the context of the technology acceptance model (TAM). OBJECTIVE This study aimed to develop and validate a TAM-based survey instrument called TAME-ChatGPT (Technology Acceptance Model Edited to Assess ChatGPT Adoption) that could be employed to examine the successful integration and use of ChatGPT in health care education. METHODS The survey tool was created based on the TAM framework. It comprised 13 items for participants who had heard of ChatGPT but did not use it and 23 items for participants who used ChatGPT. Using a convenience sampling approach, the survey link was circulated electronically among university students between February and March 2023. Exploratory factor analysis (EFA) was used to assess the construct validity of the survey instrument. RESULTS The final sample comprised 458 respondents, most of them undergraduate students (n=442, 96.5%). Only 109 (23.8%) respondents had heard of ChatGPT prior to participation, and only 55 (11.3%) self-reported ChatGPT use before the study. EFA on the attitude and usage scales showed significant Bartlett tests of sphericity (P<.001) and adequate Kaiser-Meyer-Olkin measures (0.823 for the attitude scale and 0.702 for the usage scale), confirming the factorability of the correlation matrices. The EFA showed that 3 constructs explained a cumulative total of 69.3% of the variance in the attitude scale, and these subscales represented perceived risks, attitude to technology/social influence, and anxiety.
For the ChatGPT usage scale, EFA showed that 4 constructs explained a cumulative total of 72% variance in the data and comprised the perceived usefulness, perceived risks, perceived ease of use, and behavior/cognitive factors. All the ChatGPT attitude and usage subscales showed good reliability with Cronbach α values >.78 for all the deduced subscales. CONCLUSIONS The TAME-ChatGPT demonstrated good reliability, validity, and usefulness in assessing health care students' attitudes toward ChatGPT. The findings highlighted the importance of considering risk perceptions, usefulness, ease of use, attitudes toward technology, and behavioral factors when adopting ChatGPT as a tool in health care education. This information can aid the stakeholders in creating strategies to support the optimal and ethical use of ChatGPT and to identify the potential challenges hindering its successful implementation. Future research is recommended to guide the effective adoption of ChatGPT in health care education.
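The Cronbach α values quoted above summarize the internal consistency of each subscale. A minimal, self-contained sketch of the statistic, using made-up item scores rather than the survey data:

```python
# Cronbach's alpha from raw item scores. `items` is a list of per-item
# score lists (one inner list per questionnaire item, respondents in the
# same order in each). The example data below are made up.

def cronbach_alpha(items):
    k = len(items)        # number of items
    n = len(items[0])     # number of respondents

    def variance(xs):     # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_var = sum(variance(item) for item in items)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Three perfectly correlated items give the maximum alpha of 1.0;
# uncorrelated items would pull alpha toward 0.
print(round(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]), 6))
```

Values above roughly 0.7-0.8, like the >.78 reported for the TAME-ChatGPT subscales, are conventionally read as acceptable-to-good reliability.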
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Nesreen A Salim
- Prosthodontic Department, School of Dentistry, The University of Jordan, Amman, Jordan
- Prosthodontic Department, Jordan University Hospital, Amman, Jordan
- Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Middle East University Research Unit, Middle East University, Amman, Jordan
- Kholoud Al-Mahzoum
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Ala'a B Al-Tammemi
- Migration Health Division, International Organization for Migration, The United Nations Migration Agency, Amman, Jordan
- Diana Malaeb
- College of Pharmacy, Gulf Medical University, Ajman, United Arab Emirates
- Rabih Hallit
- School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, Jounieh, Lebanon
- Department of Infectious Disease, Bellevue Medical Center, Mansourieh, Lebanon
- Department of Infectious Disease, Notre Dame des Secours, University Hospital Center, Byblos, Lebanon
- Souheil Hallit
- School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, Jounieh, Lebanon
- Research Department, Psychiatric Hospital of the Cross, Jal Eddib, Lebanon
32
Russe MF, Fink A, Ngo H, Tran H, Bamberg F, Reisert M, Rau A. Performance of ChatGPT, human radiologists, and context-aware ChatGPT in identifying AO codes from radiology reports. Sci Rep 2023; 13:14215. [PMID: 37648742 PMCID: PMC10468502 DOI: 10.1038/s41598-023-41512-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2023] [Accepted: 08/28/2023] [Indexed: 09/01/2023] Open
Abstract
While radiologists can describe a fracture's morphology and complexity with ease, translation into classification systems such as the Arbeitsgemeinschaft Osteosynthesefragen (AO) Fracture and Dislocation Classification Compendium is more challenging. We tested the performance of generic chatbots and of chatbots provided with specific knowledge of the AO classification via a vector index, and compared both with human readers. On the 100 radiological reports we created based on random AO codes, chatbots provided AO codes significantly faster than humans (mean 3.2 s per case vs. 50 s per case, p < .001), though they did not reach human performance (maximum chatbot performance of 86% correct full AO codes vs. 95% for human readers). In general, chatbots based on GPT-4 outperformed those based on GPT-3.5-Turbo. Furthermore, we found that providing specific knowledge substantially enhances a chatbot's performance and consistency: the context-aware chatbot based on GPT-4 provided consistent correct full AO codes in 71% of cases, compared with 2% for the generic GPT-4 chatbot. This provides evidence that refining ChatGPT and providing it with specific context will be the next essential step in harnessing its power.
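The "vector index" approach summarized above, retrieving the most relevant classification snippets and prepending them to the prompt, can be sketched minimally. Everything below is an illustrative stand-in: the snippet texts are invented (not the real AO compendium), and the bag-of-words "embedding" with cosine ranking is a toy substitute for the learned dense embeddings such systems actually use:

```python
import math
from collections import Counter

# Hypothetical mini "vector index" of classification snippets (invented text)
SNIPPETS = [
    "2R3A: radius, diaphyseal segment, simple fracture",
    "2R3B: radius, diaphyseal segment, wedge fracture",
    "44B: malleolar segment, infrasyndesmotic fibula fracture",
]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use learned dense vectors
    return Counter(text.lower().replace(",", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(report: str, k: int = 1) -> str:
    # Rank snippets by similarity to the report, prepend the top k as context
    q = embed(report)
    top = sorted(SNIPPETS, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]
    context = "\n".join(top)
    return f"Relevant AO definitions:\n{context}\n\nReport: {report}\nAO code:"

print(build_prompt("Simple diaphyseal fracture of the radius"))
```

The point of the design is that the model no longer has to recall the classification from its training data; the definitions it needs arrive inside the prompt, which is consistent with the large consistency gain the study reports for the context-aware chatbot.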
Affiliation(s)
- Maximilian F Russe
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany.
- Anna Fink
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany
- Helen Ngo
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany
- Hien Tran
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany
- Fabian Bamberg
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany
- Marco Reisert
- Department of Stereotactic and Functional Neurosurgery, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Medical Physics, Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Alexander Rau
- Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacher Str. 64, 79106, Freiburg, Germany
- Department of Neuroradiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
33
Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, Landman A, Dreyer K, Succi MD. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023; 25:e48659. [PMID: 37606976 PMCID: PMC10481210 DOI: 10.2196/48659] [Citation(s) in RCA: 65] [Impact Index Per Article: 65.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 07/26/2023] [Accepted: 07/27/2023] [Indexed: 08/23/2023] Open
Abstract
BACKGROUND Large language model (LLM)-based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks, as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated. OBJECTIVE This study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. METHODS We entered all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured as the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the factors contributing to ChatGPT's performance on clinical tasks. RESULTS ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI 67.8%-86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%; P<.001) and clinical management (β=-7.4%; P=.02) question types. CONCLUSIONS ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT's training data set.
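Accuracy figures like those above are proportions with confidence intervals; a minimal sketch of the normal-approximation computation, using hypothetical counts (the study's exact question denominators are not restated here, so these numbers are illustrative only):

```python
import math

def proportion_ci(correct: int, total: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% CI for an accuracy proportion."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)  # half-width of the interval
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical example: 287 of 400 scored responses correct
p, lo, hi = proportion_ci(287, 400)
print(f"{p:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

Note that for small samples or proportions near 0 or 1, interval methods such as Wilson's score interval behave better than this simple normal approximation.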
Affiliation(s)
- Arya Rao
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Michael Pang
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- John Kim
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Meghana Kamineni
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Winston Lie
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Anoop K Prasad
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Adam Landman
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Brigham and Women's Hospital, Boston, MA, United States
- Keith Dreyer
- Harvard Medical School, Boston, MA, United States
- Data Science Office, Mass General Brigham, Boston, MA, United States
- Marc D Succi
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
- Harvard Medical School, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Mass General Brigham Innovation, Mass General Brigham, Boston, MA, United States
34
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023; 29:1930-1940. [PMID: 37460753 DOI: 10.1038/s41591-023-02448-8] [Citation(s) in RCA: 335] [Impact Index Per Article: 335.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Accepted: 06/08/2023] [Indexed: 08/17/2023]
Abstract
Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.
Affiliation(s)
- Arun James Thirunavukarasu
- University of Cambridge School of Clinical Medicine, Cambridge, UK
- Corpus Christi College, University of Cambridge, Cambridge, UK
- Darren Shu Jeng Ting
- Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
- Birmingham and Midland Eye Centre, Birmingham, UK
- Academic Ophthalmology, School of Medicine, University of Nottingham, Nottingham, UK
- Kabilan Elangovan
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Laura Gutierrez
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Ting Fang Tan
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Department of Ophthalmology and Visual Sciences, Duke-National University of Singapore Medical School, Singapore, Singapore
- Daniel Shu Wei Ting
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Department of Ophthalmology and Visual Sciences, Duke-National University of Singapore Medical School, Singapore, Singapore
- Byers Eye Institute, Stanford University, Palo Alto, CA, USA
35
Beaulieu-Jones BR, Shah S, Berrigan MT, Marwaha JS, Lai SL, Brat GA. Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments. medRxiv [Preprint] 2023:2023.07.16.23292743. [PMID: 37502981 PMCID: PMC10371188 DOI: 10.1101/2023.07.16.23292743] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
Background Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. Methods We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and assessed the stability of performance on repeat encounters. Results A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with a circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of inaccurately answered questions; the response accuracy changed for 6/16 questions. Conclusion Consistent with prior findings, we demonstrate robust near- or above-human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate substantial inconsistency in ChatGPT responses upon repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT could safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
36
Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res 2023; 25:e48568. [PMID: 37379067 PMCID: PMC10365580 DOI: 10.2196/48568] [Citation(s) in RCA: 79] [Impact Index Per Article: 79.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2023] [Revised: 05/29/2023] [Accepted: 06/15/2023] [Indexed: 06/29/2023] Open
Abstract
ChatGPT is receiving increasing attention and has a variety of application scenarios in clinical practice. In clinical decision support, ChatGPT has been used to generate accurate differential diagnosis lists, support clinical decision-making, optimize clinical decision support, and provide insights for cancer screening decisions. In addition, ChatGPT has been used for intelligent question-answering to provide reliable information about diseases and medical queries. In terms of medical documentation, ChatGPT has proven effective in generating patient clinical letters, radiology reports, medical notes, and discharge summaries, improving efficiency and accuracy for health care providers. Future research directions include real-time monitoring and predictive analytics, precision medicine and personalized treatment, the role of ChatGPT in telemedicine and remote health care, and integration with existing health care systems. Overall, ChatGPT is a valuable tool that complements the expertise of health care providers and improves clinical decision-making and patient care. However, ChatGPT is a double-edged sword, and we need to carefully consider and study its benefits and potential dangers. In this viewpoint, we discuss recent advances in ChatGPT research in clinical practice and suggest possible risks and challenges of using ChatGPT in clinical practice. This will help guide and support future research on ChatGPT-like artificial intelligence tools in health care.
Affiliation(s)
- Jialin Liu
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Medical Informatics, West China Medical School, Chengdu, China
- Department of Otolaryngology-Head and Neck Surgery, West China Hospital, Sichuan University, Chengdu, China
- Changyu Wang
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- West China College of Stomatology, Sichuan University, Chengdu, China
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
37
Temsah MH, Aljamaan F, Malki KH, Alhasan K, Altamimi I, Aljarbou R, Bazuhair F, Alsubaihin A, Abdulmajeed N, Alshahrani FS, Temsah R, Alshahrani T, Al-Eyadhy L, Alkhateeb SM, Saddik B, Halwani R, Jamal A, Al-Tawfiq JA, Al-Eyadhy A. ChatGPT and the Future of Digital Health: A Study on Healthcare Workers' Perceptions and Expectations. Healthcare (Basel) 2023; 11:1812. [PMID: 37444647 PMCID: PMC10340744 DOI: 10.3390/healthcare11131812] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Revised: 06/14/2023] [Accepted: 06/19/2023] [Indexed: 07/15/2023] Open
Abstract
This study aimed to assess the knowledge, attitudes, and intended practices of healthcare workers (HCWs) in Saudi Arabia towards ChatGPT, an artificial intelligence (AI) Chatbot, within the first three months after its launch. We also aimed to identify potential barriers to AI Chatbot adoption among healthcare professionals. A cross-sectional survey was conducted among 1057 HCWs in Saudi Arabia, distributed electronically via social media channels from 21 February to 6 March 2023. The survey evaluated HCWs' familiarity with ChatGPT-3.5, their satisfaction, intended future use, and perceived usefulness in healthcare practice. Of the respondents, 18.4% had used ChatGPT for healthcare purposes, while 84.1% of non-users expressed interest in utilizing AI Chatbots in the future. Most participants (75.1%) were comfortable with incorporating ChatGPT into their healthcare practice. HCWs perceived the Chatbot to be useful in various aspects of healthcare, such as medical decision-making (39.5%), patient and family support (44.7%), medical literature appraisal (48.5%), and medical research assistance (65.9%). A majority (76.7%) believed ChatGPT could positively impact the future of healthcare systems. Nevertheless, concerns about credibility and the source of information provided by AI Chatbots (46.9%) were identified as the main barriers. Although HCWs recognize ChatGPT as a valuable addition to digital health in the early stages of adoption, addressing concerns regarding accuracy, reliability, and medicolegal implications is crucial. Therefore, due to their unreliability, the current forms of ChatGPT and other Chatbots should not be used for diagnostic or treatment purposes without human expert oversight. Ensuring the trustworthiness and dependability of AI Chatbots is essential for successful implementation in healthcare settings. Future research should focus on evaluating the clinical outcomes of ChatGPT and benchmarking its performance against other AI Chatbots.
Affiliation(s)
- Mohamad-Hani Temsah
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Evidence-Based Health Care & Knowledge Translation Research Chair, King Saud University, Riyadh 11587, Saudi Arabia
- Fadi Aljamaan
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Critical Care Department, King Saud University Medical City, Riyadh 11411, Saudi Arabia
- Khalid H. Malki
- Research Chair of Voice, Swallowing, and Communication Disorders, ENT Department, College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Khalid Alhasan
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Solid Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh 11564, Saudi Arabia
- Ibraheem Altamimi
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Razan Aljarbou
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Faisal Bazuhair
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Abdulmajeed Alsubaihin
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Naif Abdulmajeed
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Pediatric Nephrology Department, Prince Sultan Military Medical City, Riyadh 12233, Saudi Arabia
- Fatimah S. Alshahrani
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Division of Infectious Diseases, Department of Internal Medicine, College of Medicine, King Saud University, Riyadh 11451, Saudi Arabia
- Reem Temsah
- College of Pharmacy, Alfaisal University, Riyadh 11533, Saudi Arabia
- Turki Alshahrani
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
- Lama Al-Eyadhy
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Basema Saddik
- Sharjah Institute of Medical Research, University of Sharjah, Sharjah 27272, United Arab Emirates
- Department of Community and Family Medicine, College of Medicine, University of Sharjah, Sharjah 27272, United Arab Emirates
- School of Population Health, Faculty of Medicine & Health, UNSW Sydney, Sydney, NSW 2052, Australia
- Rabih Halwani
- Sharjah Institute of Medical Research, University of Sharjah, Sharjah 27272, United Arab Emirates
- Department of Clinical Sciences, College of Medicine, University of Sharjah, Sharjah 27272, United Arab Emirates
- Amr Jamal
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Evidence-Based Health Care & Knowledge Translation Research Chair, King Saud University, Riyadh 11587, Saudi Arabia
- Department of Family and Community Medicine, King Saud University Medical City, Riyadh 11411, Saudi Arabia
- Jaffar A. Al-Tawfiq
- Specialty Internal Medicine and Quality Department, Johns Hopkins Aramco Healthcare, Dhahran 34465, Saudi Arabia
- Infectious Disease Division, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Infectious Disease Division, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21218, USA
- Ayman Al-Eyadhy
- College of Medicine, King Saud University, Riyadh 11587, Saudi Arabia
- Pediatric Department, King Saud University Medical City, King Saud University, Riyadh 11411, Saudi Arabia
38
Shoja MM, Van de Ridder JMM, Rajput V. The Emerging Role of Generative Artificial Intelligence in Medical Education, Research, and Practice. Cureus 2023; 15:e40883. [PMID: 37492829 PMCID: PMC10363933 DOI: 10.7759/cureus.40883] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2023] [Accepted: 06/24/2023] [Indexed: 07/27/2023] Open
Abstract
Recent breakthroughs in generative artificial intelligence (GAI) and the emergence of transformer-based large language models such as Chat Generative Pre-trained Transformer (ChatGPT) have the potential to transform healthcare education, research, and clinical practice. This article examines the current trends in using GAI models in medicine, outlining their strengths and limitations. It is imperative to develop further consensus-based guidelines to govern the appropriate use of GAI, not only in medical education but also in research, scholarship, and clinical practice.
Affiliation(s)
- Vijay Rajput
- Medical Education, Dr. Kiran C. Patel College of Allopathic Medicine, Nova Southeastern University, Fort Lauderdale, USA
39
Deik A. Potential Benefits and Perils of Incorporating ChatGPT to the Movement Disorders Clinic. J Mov Disord 2023; 16:158-162. [PMID: 37258279 PMCID: PMC10236019 DOI: 10.14802/jmd.23072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2023] [Revised: 04/18/2023] [Accepted: 04/21/2023] [Indexed: 06/02/2023] Open
Affiliation(s)
- Andres Deik
- Parkinson’s Disease and Movement Disorders Center, Department of Neurology, University of Pennsylvania, Philadelphia, PA, USA
40
Hamed E, Eid A, Alberry M. Exploring ChatGPT's Potential in Facilitating Adaptation of Clinical Guidelines: A Case Study of Diabetic Ketoacidosis Guidelines. Cureus 2023; 15:e38784. [PMID: 37303347 PMCID: PMC10249915 DOI: 10.7759/cureus.38784] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/09/2023] [Indexed: 06/13/2023] Open
Abstract
Background This study aimed to evaluate the efficacy of ChatGPT, an advanced natural language processing model, in adapting and synthesizing clinical guidelines for diabetic ketoacidosis (DKA) by comparing and contrasting different guideline sources. Methodology We employed a comprehensive comparison approach and examined three reputable guideline sources: Diabetes Canada Clinical Practice Guidelines Expert Committee (2018), Emergency Management of Hyperglycaemia in Primary Care, and Joint British Diabetes Societies (JBDS) 02 The Management of Diabetic Ketoacidosis in Adults. Data extraction focused on diagnostic criteria, risk factors, signs and symptoms, investigations, and treatment recommendations. We compared the synthesized guidelines generated by ChatGPT and identified any misreporting or non-reporting errors. Results ChatGPT was capable of generating a comprehensive table comparing the guidelines. However, multiple recurrent errors, including misreporting and non-reporting errors, were identified, rendering the results unreliable. Additionally, inconsistencies were observed in the repeated reporting of data. The study highlights the limitations of using ChatGPT for the adaptation of clinical guidelines without expert human intervention. Conclusions Although ChatGPT demonstrates the potential for the synthesis of clinical guidelines, the presence of multiple recurrent errors and inconsistencies underscores the need for expert human intervention and validation. Future research should focus on improving the accuracy and reliability of ChatGPT, as well as exploring its potential applications in other areas of clinical practice and guideline development.
Affiliation(s)
- Ehab Hamed
- Qatar University Health Center, Primary Health Care Corporation, Doha, QAT
- Ahmad Eid
- Umm Slal Health Center, Primary Health Care Corporation, Doha, QAT