1. Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024; 188:105474. [PMID: 38733640] [DOI: 10.1016/j.ijmedinf.2024.105474]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across three databases, 109 met inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "Providing rapid access to information and valuable insights" and "Improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliations
- Khadijeh Moulaei: Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
- Atiye Yadegari: Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
- Mahdi Baharestani: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Shayan Farzanbakhsh: Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Babak Sabet: Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Reza Afrash: Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran
2. Lucas MM, Yang J, Pomeroy JK, Yang CC. Reasoning with large language models for medical question answering. J Am Med Inform Assoc 2024:ocae131. [PMID: 38960731] [DOI: 10.1093/jamia/ocae131]
Abstract
OBJECTIVES To investigate approaches of reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency. MATERIALS AND METHODS We used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial LLMs and 1 open-source clinical LLM to evaluate our proposed ensemble reasoning approach. RESULTS On GPT-3.5 turbo and Med42-70B, our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Steps 1, 2, and 3 questions (+3.44%, +4.00%, and +2.54%) and (+2.3%, +5.00%, and +4.15%), respectively. With GPT-4 turbo, results were mixed, with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning. CONCLUSION The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.
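The baseline the authors compare against, zero-shot chain-of-thought with self-consistency, samples several reasoning paths at nonzero temperature and majority-votes the extracted answers. A minimal sketch of that baseline (not the paper's ensemble-reasoning code; the client, model name, and answer-extraction pattern are illustrative assumptions):

```python
# Hedged sketch: zero-shot chain-of-thought with self-consistency for
# multiple-choice medical questions. Model name and answer-extraction
# regex are assumptions, not the paper's configuration.
import re
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_self_consistency(question: str, n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the majority answer."""
    votes = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model name
            temperature=0.7,        # >0 so sampled reasoning paths differ
            messages=[{
                "role": "user",
                "content": f"{question}\n\nLet's think step by step, "
                           "then finish with 'Answer: <letter>'.",
            }],
        )
        text = response.choices[0].message.content or ""
        match = re.search(r"Answer:\s*([A-E])", text)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else "no answer"
```

The reported ensemble-reasoning approach goes further by iteratively refining the sampled reasoning rather than only voting on final answers.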
Affiliations
- Mary M Lucas: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
- Justin Yang: Department of Computer Science, University of Maryland, College Park, MD 20742, United States
- Jon K Pomeroy: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States; Penn Medicine, Philadelphia, PA 19104, United States
- Christopher C Yang: College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
3. Bortoli M, Fiore M, Tedeschi S, Oliveira V, Sousa R, Bruschi A, Campanacci DA, Viale P, De Paolis M, Sambri A. GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections. Musculoskelet Surg 2024. [PMID: 38954323] [DOI: 10.1007/s12306-024-00846-w]
Abstract
BACKGROUND Artificial intelligence chatbot tools might discern patterns and correlations that elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three versions of GPT-based chatbots on prosthetic joint infections (PJI). METHODS Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed with intraclass correlation statistics. RESULTS Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor. CONCLUSIONS On average, the quality of responses was rated positively by experts, but ratings frequently varied widely. This currently suggests that AI chatbot tools are still unreliable in the management of PJI.
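The inter-rater reliability analysis described here, several clinicians scoring each response on a five-point scale, can be reproduced with an intraclass correlation on long-format ratings. A minimal sketch, assuming invented column names and toy scores, using the pingouin package:

```python
# Hedged sketch: intraclass correlation for rater agreement on
# Likert-style chatbot ratings. Column names and scores are invented.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":    ["A", "B", "C"] * 3,
    "score":    [4, 5, 2, 3, 4, 1, 5, 5, 3],  # five-point scale
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # a low ICC implies poor agreement
```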
Affiliations
- M Bortoli: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- M Fiore: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy; Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy
- S Tedeschi: Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy; Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- V Oliveira: Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
- R Sousa: Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
- A Bruschi: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- D A Campanacci: Orthopedic Oncology Unit, Azienda Ospedaliera Universitaria Careggi, 50134, Florence, Italy
- P Viale: Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy; Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- M De Paolis: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
- A Sambri: Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
4. Luo S, Canavese F, Aroojis A, Andreacchio A, Anticevic D, Bouchard M, Castaneda P, De Rosa V, Fiogbe MA, Frick SL, Hui JH, Johari AN, Loro A, Lyu X, Matsushita M, Omeroglu H, Roye DP, Shah MM, Yong B, Li L. Are Generative Pretrained Transformer 4 Responses to Developmental Dysplasia of the Hip Clinical Scenarios Universal? An International Review. J Pediatr Orthop 2024; 44:e504-e511. [PMID: 38597198] [DOI: 10.1097/bpo.0000000000002682]
Abstract
OBJECTIVE There is increasing interest in applying artificial intelligence chatbots like generative pretrained transformer 4 (GPT-4) in the medical field. This study aimed to explore the universality of GPT-4 responses to simulated clinical scenarios of developmental dysplasia of the hip (DDH) across diverse global settings. METHODS Seventeen international experts with more than 15 years of experience in pediatric orthopaedics were selected for the evaluation panel. Eight simulated DDH clinical scenarios were created, covering 4 key areas: (1) initial evaluation and diagnosis, (2) initial examination and treatment, (3) nursing care and follow-up, and (4) prognosis and rehabilitation planning. Each scenario was completed independently in a new GPT-4 session. Interrater reliability was assessed using Fleiss kappa, and the quality, relevance, and applicability of GPT-4 responses were analyzed using median scores and interquartile ranges. Following scoring, experts met in Zoom sessions to generate Regional Consensus Assessment Scores, which were intended to represent a consistent regional assessment of the use of GPT-4 in pediatric orthopaedic care. RESULTS GPT-4's responses to the 8 clinical DDH scenarios received performance scores ranging from 44.3% to 98.9% of the 88-point maximum. The Fleiss kappa statistic of 0.113 (P = 0.001) indicated low agreement among experts in their ratings. When assessing the responses' quality, relevance, and applicability, the median scores were 3, with interquartile ranges of 3 to 4, 3 to 4, and 2 to 3, respectively. Significant differences were noted in the prognosis and rehabilitation domain scores (P < 0.05 for all). Regional consensus scores were 75 for Africa, 74 for Asia, 73 for India, 80 for Europe, and 65 for North America, with the Kruskal-Wallis test highlighting significant disparities between these regions (P = 0.034). CONCLUSIONS This study demonstrates the promise of GPT-4 in pediatric orthopaedic care, particularly in supporting preliminary DDH assessments and guiding treatment strategies for specialist care. However, effective integration of GPT-4 into clinical practice will require adaptation to specific regional health care contexts, highlighting the importance of a nuanced approach to health technology adaptation. LEVEL OF EVIDENCE Level IV.
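Fleiss' kappa, the agreement statistic reported above, can be computed from a subjects-by-raters matrix of categorical ratings. A minimal sketch with invented ratings, using statsmodels:

```python
# Hedged sketch: Fleiss' kappa for agreement among multiple experts.
# The ratings matrix below is invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = scenarios, columns = raters, values = assigned category (0-4)
ratings = np.array([
    [4, 3, 4, 2, 4],
    [1, 2, 1, 1, 0],
    [3, 3, 2, 4, 3],
    [0, 1, 0, 2, 1],
])

table, _ = aggregate_raters(ratings)  # per-scenario counts per category
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")  # near 0 = weak agreement
```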
Affiliations
- Shaoting Luo: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
- Federico Canavese: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Alaric Aroojis: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Antonio Andreacchio: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Darko Anticevic: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Pablo Castaneda: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Vincenzo De Rosa: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Steven L Frick: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- James H Hui: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Ashok N Johari: Pediatric Orthopedics Clinic of Pediatric Surgery and Orthopedics, Pediatric Institute of Southern Switzerland (IPSI), Via Athos Gallino, Bellinzona, Switzerland
- Antonio Loro: Ufuk University Faculty of Medicine, Ankara, Turkey
- Xuemin Lyu: Department of Orthopaedic Surgery, School of Medicine, Stanford University, Palo Alto, CA
- Masaki Matsushita: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- David P Roye: Department of Orthopaedic Surgery, Nagoya University Graduate School of Medicine, Nagoya, Aichi, Japan
- Bicheng Yong: Department of Pediatric Orthopaedics, Beit CURE Children's Hospital of Malawi, Chichiri Blantyre, Malawi
- Lianyong Li: Department of Pediatric Orthopaedics, Shengjing Hospital of China Medical University, Shenyang, Liaoning
5. Marshan A, Almutairi AN, Ioannou A, Bell D, Monaghan A, Arzoky M. MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain. Front Big Data 2024; 7:1371680. [PMID: 38988646] [PMCID: PMC11233734] [DOI: 10.3389/fdata.2024.1371680]
Abstract
Introduction In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs. Methods To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model on two optimizers, varying numbers of training epochs, and two datasets, MIMICSQL and WikiSQL. Results On the MIMICSQL dataset, the model demonstrates considerable effectiveness in generating SQL queries, achieving 80.63% exact-match accuracy, 98.937% approximate string-matching accuracy, and 90% accuracy on manual evaluation. On the WikiSQL dataset, the model achieves 44.2% exact-match accuracy and 94.26% approximate string-matching accuracy. Discussion Results indicate improved performance with increased training epochs. This work highlights the potential of a fine-tuned T5 model to convert medical questions written in natural language to Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
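As a rough illustration of this kind of setup (not the authors' configuration; the checkpoint, hyperparameters, and two-example dataset are assumptions), a T5 model can be fine-tuned on question-to-SQL pairs with Hugging Face transformers:

```python
# Hedged sketch: fine-tuning T5 on question -> SQL pairs, in the spirit
# of MedT5SQL. Checkpoint, hyperparameters, and data are illustrative.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

pairs = [
    ("translate question to SQL: how many patients are older than 80?",
     "SELECT COUNT(*) FROM demographic WHERE age > 80"),
    ("translate question to SQL: list admission ids for patient 123",
     "SELECT hadm_id FROM admissions WHERE subject_id = 123"),
]

class SQLPairs(Dataset):
    def __len__(self):
        return len(pairs)

    def __getitem__(self, i):
        question, sql = pairs[i]
        enc = tokenizer(question, truncation=True, max_length=128,
                        padding="max_length", return_tensors="pt")
        labels = tokenizer(sql, truncation=True, max_length=128,
                           padding="max_length", return_tensors="pt").input_ids
        labels[labels == tokenizer.pad_token_id] = -100  # mask pad in loss
        return {"input_ids": enc.input_ids[0],
                "attention_mask": enc.attention_mask[0],
                "labels": labels[0]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medt5sql-demo", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=SQLPairs(),
)
trainer.train()  # exact-match evaluation would compare generated SQL strings
```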
Affiliations
- Alaa Marshan: School of Computer Science and Electronic Engineering, University of Surrey, Guildford, United Kingdom
- Athina Ioannou: Surrey Business School, University of Surrey, Guildford, United Kingdom
- David Bell: Department of Computer Science, Brunel University London, London, United Kingdom
- Asmat Monaghan: School of Business and Management, Royal Holloway, University of London, London, United Kingdom
- Mahir Arzoky: Department of Computer Science, Brunel University London, London, United Kingdom
6. Pividori M, Greene CS. A publishing infrastructure for Artificial Intelligence (AI)-assisted academic authoring. J Am Med Inform Assoc 2024:ocae139. [PMID: 38879443] [DOI: 10.1093/jamia/ocae139]
Abstract
OBJECTIVE Investigate the use of advanced natural language processing models to streamline the time-consuming process of writing and revising scholarly manuscripts. MATERIALS AND METHODS For this purpose, we integrate large language models into the Manubot publishing ecosystem to suggest revisions for scholarly texts. Our AI-based revision workflow employs a prompt generator that incorporates manuscript metadata into templates, generating section-specific instructions for the language model. The model then generates revised versions of each paragraph for human authors to review. We evaluated this methodology through 5 case studies of existing manuscripts, including the revision of this manuscript. RESULTS Our results indicate that these models, despite some limitations, can grasp complex academic concepts and enhance text quality. All changes to the manuscript are tracked using a version control system, ensuring transparency in distinguishing between human- and machine-generated text. CONCLUSIONS Given the significant time researchers invest in crafting prose, incorporating large language models into the scholarly writing process can significantly improve the type of knowledge work performed by academics. Our approach also enables scholars to concentrate on critical aspects of their work, such as the novelty of their ideas, while automating tedious tasks like adhering to specific writing styles. Although the use of AI-assisted tools in scientific authoring is controversial, our approach, which focuses on revising human-written text and provides change-tracking transparency, can mitigate concerns regarding AI's role in scientific writing.
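The prompt generator described above, which folds manuscript metadata into section-specific templates, can be approximated in a few lines. The template wording and metadata fields below are invented for illustration, not Manubot's actual templates:

```python
# Hedged sketch: a section-aware prompt generator for LLM-assisted
# manuscript revision. Template text and metadata fields are invented.
SECTION_HINTS = {
    "abstract": "Make it a self-contained summary under 250 words.",
    "introduction": "Motivate the problem and state the contribution.",
    "methods": "Keep technical details precise and reproducible.",
}

def build_revision_prompt(title: str, keywords: list[str],
                          section: str, paragraph: str) -> str:
    hint = SECTION_HINTS.get(section, "Improve clarity and concision.")
    return (
        f"You are revising the {section} of a manuscript titled "
        f"'{title}' (keywords: {', '.join(keywords)}). {hint}\n"
        "Revise the following paragraph, preserving its meaning:\n\n"
        f"{paragraph}"
    )

print(build_revision_prompt(
    title="AI-assisted academic authoring",
    keywords=["large language models", "publishing"],
    section="introduction",
    paragraph="Writing manuscripts take a long time and is error prone.",
))
```

Each revised paragraph can then be committed separately, so version control preserves the human/machine provenance the authors emphasize.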
Affiliations
- Milton Pividori: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States; Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Casey S Greene: Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States; Center for Health AI, Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO 80045, United States
7. Wu Y, Wu M, Wang C, Lin J, Liu J, Liu S. Evaluating the Prevalence of Burnout Among Health Care Professionals Related to Electronic Health Record Use: Systematic Review and Meta-Analysis. JMIR Med Inform 2024; 12:e54811. [PMID: 38865188] [PMCID: PMC11208837] [DOI: 10.2196/54811]
Abstract
BACKGROUND Burnout among health care professionals is a significant concern, with detrimental effects on health care service quality and patient outcomes. The use of the electronic health record (EHR) system has been identified as a significant contributor to burnout among health care professionals. OBJECTIVE This systematic review and meta-analysis aims to assess the prevalence of burnout among health care professionals associated with the use of the EHR system, thereby providing evidence to improve health information systems and develop strategies to measure and mitigate burnout. METHODS We conducted a comprehensive search of the PubMed, Embase, and Web of Science databases for English-language peer-reviewed articles published between January 1, 2009, and December 31, 2022. Two independent reviewers applied inclusion and exclusion criteria, and study quality was assessed using the Joanna Briggs Institute checklist and the Newcastle-Ottawa Scale. Meta-analyses were performed using R (version 4.1.3; R Foundation for Statistical Computing), with EndNote X7 (Clarivate) for reference management. RESULTS The review included 32 cross-sectional studies and 5 case-control studies with a total of 66,556 participants, mainly physicians and registered nurses. The pooled prevalence of burnout among health care professionals in cross-sectional studies was 40.4% (95% CI 37.5%-43.2%). Case-control studies indicated a higher likelihood of burnout among health care professionals who spent more time on EHR-related tasks outside work (odds ratio 2.43, 95% CI 2.31-2.57). CONCLUSIONS The findings highlight the association between the increased use of the EHR system and burnout among health care professionals. Potential solutions include optimizing EHR systems, implementing automated dictation or note-taking, employing scribes to reduce documentation burden, and leveraging artificial intelligence to enhance EHR system efficiency and reduce the risk of burnout. TRIAL REGISTRATION PROSPERO International Prospective Register of Systematic Reviews CRD42021281173; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42021281173.
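The pooled prevalence reported above comes from a standard random-effects meta-analysis. A minimal sketch of DerSimonian-Laird pooling of study proportions, with invented study counts:

```python
# Hedged sketch: random-effects pooling of study prevalences
# (DerSimonian-Laird on raw proportions). Study counts are invented.
import numpy as np

events = np.array([120, 340, 80])   # professionals reporting burnout
totals = np.array([300, 900, 180])  # participants per study

p = events / totals
var = p * (1 - p) / totals           # binomial variance per study

w = 1 / var                          # fixed-effect weights
p_fixed = np.sum(w * p) / np.sum(w)
q = np.sum(w * (p - p_fixed) ** 2)   # Cochran's Q heterogeneity statistic
df = len(p) - 1
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)        # between-study variance estimate

w_re = 1 / (var + tau2)              # random-effects weights
p_re = np.sum(w_re * p) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"pooled prevalence = {p_re:.3f} "
      f"(95% CI {p_re - 1.96 * se:.3f} to {p_re + 1.96 * se:.3f})")
```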
Affiliations
- Yuxuan Wu: Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China
- Mingyue Wu: Information Center, West China Hospital, Sichuan University, Chengdu, China
- Changyu Wang: West China College of Stomatology, Sichuan University, Chengdu, China
- Jie Lin: Department of Oral Implantology, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Jialin Liu: Department of Medical Informatics, West China Hospital, Sichuan University, Chengdu, China; Information Center, West China Hospital, Sichuan University, Chengdu, China
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
8. Rao SJ, Isath A, Krishnan P, Tangsrivimol JA, Virk HUH, Wang Z, Glicksberg BS, Krittanawong C. ChatGPT: A Conceptual Review of Applications and Utility in the Field of Medicine. J Med Syst 2024; 48:59. [PMID: 38836893] [DOI: 10.1007/s10916-024-02075-x]
Abstract
Artificial intelligence, specifically advanced language models such as ChatGPT, has the potential to revolutionize various aspects of healthcare, medical education, and research. In this narrative review, we evaluate the myriad applications of ChatGPT in diverse healthcare domains. We discuss its potential role in clinical decision-making, exploring how it can assist physicians by providing rapid, data-driven insights for diagnosis and treatment. We review the benefits of ChatGPT in personalized patient care, particularly in geriatric care, medication management, weight loss and nutrition, and physical activity guidance. We further delve into its potential to enhance medical research through the analysis of large datasets and the development of novel methodologies. In the realm of medical education, we investigate the utility of ChatGPT as an information retrieval tool and personalized learning resource for medical students and professionals. There are numerous promising applications of ChatGPT that will likely induce paradigm shifts in healthcare practice, education, and research. The use of ChatGPT may come with several benefits in areas such as clinical decision making, geriatric care, medication management, weight loss and nutrition, physical fitness, scientific research, and medical education. Nevertheless, it is important to note that issues surrounding ethics, data privacy, transparency, inaccuracy, and inadequacy persist. Prior to widespread use in medicine, it is imperative to objectively evaluate the impact of ChatGPT in a real-world setting using a risk-based approach.
Affiliations
- Shiavax J Rao: Department of Medicine, MedStar Union Memorial Hospital, Baltimore, MD, USA
- Ameesh Isath: Department of Cardiology, Westchester Medical Center and New York Medical College, Valhalla, NY, USA
- Parvathy Krishnan: Department of Pediatrics, Westchester Medical Center and New York Medical College, Valhalla, NY, USA
- Jonathan A Tangsrivimol: Division of Neurosurgery, Department of Surgery, Chulabhorn Hospital, Chulabhorn Royal Academy, Bangkok, 10210, Thailand; Department of Neurological Surgery, Weill Cornell Medicine Brain and Spine Center, New York, NY, 10022, USA
- Hafeez Ul Hassan Virk: Harrington Heart & Vascular Institute, Case Western Reserve University, University Hospitals Cleveland Medical Center, Cleveland, OH, USA
- Zhen Wang: Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA; Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Benjamin S Glicksberg: Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Chayakrit Krittanawong: Cardiology Division, NYU Langone Health and NYU School of Medicine, 550 First Avenue, New York, NY, 10016, USA
9. Eltaybani S. The Transformative Role of Large Language Models in Post-Acute and Long-Term Care. J Am Med Dir Assoc 2024; 25:104982. [PMID: 38614135] [DOI: 10.1016/j.jamda.2024.03.002]
Affiliations
- Sameh Eltaybani: Global Nursing Research Center, The University of Tokyo, Tokyo, Japan
10. Mao C, Zhang T. A commentary on can ChatGPT assist urologists manage overactive bladder? Int J Surg 2024; 110:3970-3971. [PMID: 38446864] [PMCID: PMC11175744] [DOI: 10.1097/js9.0000000000001261]
Affiliations
- Changkun Mao: Department of Urology, Anhui Provincial Children’s Hospital, Hefei, Anhui, People’s Republic of China
11. Zhang A, Dimock E, Gupta R, Chen K. The new frontier: utilizing ChatGPT to expand craniofacial research. Arch Craniofac Surg 2024; 25:116-122. [PMID: 38977396] [PMCID: PMC11231409] [DOI: 10.7181/acfs.2024.00115]
Abstract
BACKGROUND Due to the importance of evidence-based research in plastic surgery, the authors of this study aimed to assess the accuracy of ChatGPT in generating novel systematic review ideas within the field of craniofacial surgery. METHODS ChatGPT was prompted to generate 20 novel systematic review ideas for 10 different subcategories within the field of craniofacial surgery. For each topic, the chatbot was told to give 10 "general" and 10 "specific" ideas related to the concept. To determine the accuracy of ChatGPT, a literature review was conducted using PubMed, CINAHL, Embase, and Cochrane. RESULTS In total, 200 systematic review research ideas were generated by ChatGPT. We found that the algorithm had an overall 57.5% accuracy at identifying novel systematic review ideas. ChatGPT was found to be 39% accurate for general topics and 76% accurate for specific topics. CONCLUSION Craniofacial surgeons should use ChatGPT as a tool. We found that ChatGPT provided more precise answers with specific research questions than with general questions and helped narrow the search scope, leading to more relevant and accurate responses. Beyond research purposes, ChatGPT can augment patient consultations, improve healthcare equity, and assist in clinical decision-making. With rapid advancements in artificial intelligence (AI), it is important for plastic surgeons to consider using AI in their clinical practice to improve patient-centered outcomes.
Affiliations
- Andi Zhang: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
- Ethan Dimock: Oakland University William Beaumont School of Medicine, Rochester, MI, USA
- Rohun Gupta: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
- Kevin Chen: Division of Plastic and Reconstructive Surgery, Saint Louis University School of Medicine, St. Louis, MO, USA
12. Balasanjeevi G, Surapaneni KM. Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios. Respir Med Res 2024; 85:101091. [PMID: 38657295] [DOI: 10.1016/j.resmer.2024.101091]
Abstract
Integration of ChatGPT in respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT versions 3.5 and 4 in respiratory medicine, emphasizing their potential in clinical decision support and medical education using clinical cases. Results indicate moderate performance, highlighting limitations in handling complex case scenarios. Compared to ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. While ChatGPT may serve clinically as a preliminary decision support tool, caution is advised, and ongoing validation is needed. Future research should refine its clinical capabilities for optimal integration into medical education and practice.
Affiliations
- Gayathri Balasanjeevi: Department of Tuberculosis & Respiratory Diseases, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
- Krishna Mohan Surapaneni: Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India; Department of Medical Education, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
13. Lorenzi A, Pugliese G, Maniaci A, Lechien JR, Allevi F, Boscolo-Rizzo P, Vaira LA, Saibene AM. Reliability of large language models for advanced head and neck malignancies management: a comparison between ChatGPT 4 and Gemini Advanced. Eur Arch Otorhinolaryngol 2024. [PMID: 38795148] [DOI: 10.1007/s00405-024-08746-2]
Abstract
PURPOSE This study evaluates the efficacy of two advanced Large Language Models (LLMs), OpenAI's ChatGPT 4 and Google's Gemini Advanced, in providing treatment recommendations for head and neck oncology cases. The aim is to assess their utility in supporting multidisciplinary oncological evaluations and decision-making processes. METHODS This comparative analysis examined the responses of ChatGPT 4 and Gemini Advanced to five hypothetical cases of head and neck cancer, each representing a different anatomical subsite. The responses were evaluated against the latest National Comprehensive Cancer Network (NCCN) guidelines by two blinded panels using the total disagreement score (TDS) and the artificial intelligence performance instrument (AIPI). Statistical assessments were performed using the Wilcoxon signed-rank test and the Friedman test. RESULTS Both LLMs produced relevant treatment recommendations with ChatGPT 4 generally outperforming Gemini Advanced regarding adherence to guidelines and comprehensive treatment planning. ChatGPT 4 showed higher AIPI scores (median 3 [2-4]) compared to Gemini Advanced (median 2 [2-3]), indicating better overall performance. Notably, inconsistencies were observed in the management of induction chemotherapy and surgical decisions, such as neck dissection. CONCLUSIONS While both LLMs demonstrated the potential to aid in the multidisciplinary management of head and neck oncology, discrepancies in certain critical areas highlight the need for further refinement. The study supports the growing role of AI in enhancing clinical decision-making but also emphasizes the necessity for continuous updates and validation against current clinical standards to integrate AI into healthcare practices fully.
Affiliations
- Andrea Lorenzi: Division of Otolaryngology, Department of Surgical Sciences, Università degli Studi di Torino, Turin, Italy
- Giorgia Pugliese: Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
- Antonino Maniaci: Faculty of Medicine and Surgery, "Kore" University of Enna, Enna, Italy; International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France
- Jerome R Lechien: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Department of Otolaryngology-Head and Neck Surgery, Foch Hospital, School of Medicine, University Paris Saclay, Paris, France
- Fabiana Allevi: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Maxillofacial Surgery Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
- Paolo Boscolo-Rizzo: Department of Medical, Surgical and Health Sciences, Section of Otolaryngology, University of Trieste, Trieste, Italy
- Luigi Angelo Vaira: International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France; Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy; Biomedical Science PhD School, Biomedical Science Department, University of Sassari, Sassari, Italy
- Alberto Maria Saibene: Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy; International Federation of Otorhinolaryngological Societies (YO-IFOS) Head and Neck Research Group, Paris, France
14. Yang J, Walker KC, Bekar-Cesaretli AA, Hao B, Bhadelia N, Joseph-McCarthy D, Paschalidis IC. Automating biomedical literature review for rapid drug discovery: Leveraging GPT-4 to expedite pandemic response. Int J Med Inform 2024; 189:105500. [PMID: 38815316] [DOI: 10.1016/j.ijmedinf.2024.105500]
Abstract
OBJECTIVE The rapid expansion of the biomedical literature challenges traditional review methods, especially during outbreaks of emerging infectious diseases when quick action is critical. Our study aims to explore the potential of ChatGPT to automate the biomedical literature review for rapid drug discovery. MATERIALS AND METHODS We introduce a novel automated pipeline that helps identify drugs for a given virus in response to a potential future global health threat. Our approach can be used to select PubMed articles identifying a drug target for the given virus. We tested our approach on two known pathogens: SARS-CoV-2, where the literature is vast, and Nipah, where the literature is sparse. Specifically, a panel of three experts reviewed a set of PubMed articles and labeled them as either describing a drug target for the given virus or not. The same task was given to the automated pipeline, and its performance was judged by whether it labeled the articles the same way as the human experts. We applied a number of prompt engineering techniques to improve the performance of ChatGPT. RESULTS Our best configuration used GPT-4 by OpenAI and achieved an out-of-sample validation performance with accuracy/F1-score/sensitivity/specificity of 92.87%/88.43%/83.38%/97.82% for SARS-CoV-2 and 87.40%/73.90%/74.72%/91.36% for Nipah. CONCLUSION These results highlight the utility of ChatGPT in drug discovery and development and reveal its potential to enable rapid drug target identification during a pandemic-level health emergency.
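At its core, the screening step is a constrained yes/no classification over abstracts. A hedged sketch of such a labeling call; the model name and prompt wording are assumptions, not the authors' exact pipeline:

```python
# Hedged sketch: zero-shot labeling of PubMed abstracts for whether they
# identify a drug target for a given virus. Prompt and model are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def describes_drug_target(abstract: str, virus: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4",   # placeholder model name
        temperature=0,   # deterministic labels suit screening
        messages=[{
            "role": "user",
            "content": (f"Does the following abstract identify a drug or "
                        f"drug target for {virus}? Answer YES or NO only.\n\n"
                        f"{abstract}"),
        }],
    )
    reply = (response.choices[0].message.content or "").strip().upper()
    return reply.startswith("YES")

print(describes_drug_target(
    "We show that compound X inhibits the viral polymerase in vitro...",
    "SARS-CoV-2"))
```

Accuracy, F1, sensitivity, and specificity then follow by comparing these labels with the expert panel's.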
Affiliations
- Jingmei Yang: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America
- Kenji C Walker: Department of Biomedical Engineering, Boston University, Boston, MA, United States of America
- Boran Hao: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America
- Nahid Bhadelia: Chobanian & Avedisian School of Medicine and Center for Emerging Infectious Diseases Policy and Research, Boston University, Boston, MA, United States of America
- Diane Joseph-McCarthy: Department of Biomedical Engineering, Boston University, Boston, MA, United States of America
- Ioannis Ch Paschalidis: Department of Electrical & Computer Engineering and Division of Systems Engineering, Boston University, Boston, MA, United States of America; Department of Biomedical Engineering, Boston University, Boston, MA, United States of America; Faculty of Computing & Data Sciences, Boston University, Boston, MA, United States of America
15. Naqvi WM, Shaikh SZ, Mishra GV. Large language models in physical therapy: time to adapt and adept. Front Public Health 2024; 12:1364660. [PMID: 38887241] [PMCID: PMC11182445] [DOI: 10.3389/fpubh.2024.1364660]
Abstract
Healthcare is experiencing a transformative phase driven by artificial intelligence (AI) and machine learning (ML). Physical therapists (PTs) stand on the brink of a paradigm shift in education, practice, and research. Rather than a threat, AI presents an opportunity for revolution. This paper examines how large language models (LLMs), such as ChatGPT and BioMedLM, driven by deep ML, can offer human-like performance yet face accuracy challenges given the vast data of PT and rehabilitation practice. PTs can benefit from developing and training an LLM specifically for streamlining administrative tasks, connecting globally, and customizing treatments. However, human touch and creativity remain invaluable. This paper urges PTs to engage in learning and shaping AI models, highlighting the need for ethical use and human supervision to address potential biases. Embracing AI as contributors, and not just users, is crucial: by integrating AI and fostering collaboration, PTs can help build a future in which AI enriches the field, provided data accuracy and the challenges of feeding AI models are sensitively addressed.
Affiliations
- Waqar M. Naqvi: Department of Interdisciplinary Sciences, Datta Meghe Institute of Higher Education and Research, Wardha, India; Department of Physiotherapy, College of Health Sciences, Gulf Medical University, Ajman, United Arab Emirates; NKP Salve Institute of Medical Sciences and Research Center, Nagpur, India
- Summaiya Zareen Shaikh: Department of Neuro-Physiotherapy, The SIA College of Health Sciences, College of Physiotherapy, Thane, India
- Gaurav V. Mishra: Department of Radiodiagnosis, Datta Meghe Institute of Higher Education and Research, Wardha, India
16. Liu S, McCoy AB, Wright AP, Nelson SD, Huang SS, Ahmad HB, Carro SE, Franklin J, Brogan J, Wright A. Why do users override alerts? Utilizing large language model to summarize comments and optimize clinical decision support. J Am Med Inform Assoc 2024; 31:1388-1396. [PMID: 38452289] [PMCID: PMC11105133] [DOI: 10.1093/jamia/ocae041]
Abstract
OBJECTIVES To evaluate the capability of using generative artificial intelligence (AI) in summarizing alert comments and to determine if the AI-generated summary could be used to improve clinical decision support (CDS) alerts. MATERIALS AND METHODS We extracted user comments to alerts generated from September 1, 2022 to September 1, 2023 at Vanderbilt University Medical Center. For a subset of 8 alerts, comment summaries were generated independently by 2 physicians and then separately by GPT-4. We surveyed 5 CDS experts to rate the human-generated and AI-generated summaries on a scale from 1 (strongly disagree) to 5 (strongly agree) for the 4 metrics: clarity, completeness, accuracy, and usefulness. RESULTS Five CDS experts participated in the survey. A total of 16 human-generated summaries and 8 AI-generated summaries were assessed. Among the top 8 rated summaries, 5 were generated by GPT-4. AI-generated summaries demonstrated high levels of clarity, accuracy, and usefulness, similar to the human-generated summaries. Moreover, AI-generated summaries exhibited significantly higher completeness and usefulness compared to the human-generated summaries (AI: 3.4 ± 1.2, human: 2.7 ± 1.2, P = .001). CONCLUSION End-user comments provide clinicians' immediate feedback to CDS alerts and can serve as a direct and valuable data resource for improving CDS delivery. Traditionally, these comments may not be considered in the CDS review process due to their unstructured nature, large volume, and the presence of redundant or irrelevant content. Our study demonstrates that GPT-4 is capable of distilling these comments into summaries characterized by high clarity, accuracy, and completeness. AI-generated summaries are equivalent and potentially better than human-generated summaries. These AI-generated summaries could provide CDS experts with a novel means of reviewing user comments to rapidly optimize CDS alerts both online and offline.
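Distilling large volumes of override comments into a reviewable summary, as evaluated above, amounts to batching the free text into a single summarization prompt. A minimal sketch with invented comments; the prompt wording and model name are assumptions:

```python
# Hedged sketch: summarizing clinician override comments for one CDS
# alert with an LLM. Comments, prompt, and model name are invented.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

comments = [
    "Patient already on an equivalent medication.",
    "Alert fired for a drug that was discontinued yesterday.",
    "Dose is intentional per nephrology recommendation.",
]

prompt = ("Summarize the recurring reasons clinicians gave for overriding "
          "this alert, in at most three bullet points:\n\n- "
          + "\n- ".join(comments))

summary = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    temperature=0,
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
print(summary)  # reviewers would rate clarity, completeness, accuracy
```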
Affiliations
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
- Allison B McCoy: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Aileen P Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Scott D Nelson: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Sean S Huang: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Hasan B Ahmad: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
- Sabrina E Carro: Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Jacob Franklin: Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- James Brogan: Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Adam Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
17. Iqbal U, Lee LTJ, Rahmanti AR, Celi LA, Li YCJ. Can large language models provide secondary reliable opinion on treatment options for dermatological diseases? J Am Med Inform Assoc 2024; 31:1341-1347. [PMID: 38578616] [PMCID: PMC11105123] [DOI: 10.1093/jamia/ocae067]
Abstract
OBJECTIVE To investigate the consistency and reliability of medication recommendations provided by ChatGPT for common dermatological conditions, highlighting the potential for ChatGPT to offer second opinions in patient treatment while also delineating possible limitations. MATERIALS AND METHODS In this mixed-methods study, we used survey questions in April 2023 for drug recommendations generated by ChatGPT with data from secondary databases, that is, Taiwan's National Health Insurance Research Database and a US medical center database, and validated by dermatologists. The methodology included preprocessing queries, executing them multiple times, and evaluating ChatGPT responses against the databases and dermatologists. The ChatGPT-generated responses were analyzed statistically in a disease-drug matrix, considering disease-medication associations (Q-value) and expert evaluation. RESULTS ChatGPT achieved a high 98.87% dermatologist approval rate for common dermatological medication recommendations. We evaluated its drug suggestions using the Q-value, showing that human expert validation agreement surpassed Q-value cutoff-based agreement. Varying cutoff values for disease-medication associations, a cutoff of 3 achieved 95.14% accurate prescriptions, 5 yielded 85.42%, and 10 resulted in 72.92%. While ChatGPT offered accurate drug advice, it occasionally included incorrect ATC codes, leading to issues like incorrect drug use and type, nonexistent codes, repeated errors, and incomplete medication codes. CONCLUSION ChatGPT provides medication recommendations as a second opinion in dermatology treatment, but its reliability and comprehensiveness need refinement for greater accuracy. In the future, integrating a medical domain-specific knowledge base for training and ongoing optimization will enhance the precision of ChatGPT's results.
Affiliations
- Usman Iqbal: School of Population Health, Faculty of Medicine and Health, University of New South Wales (UNSW), Sydney, NSW 2052, Australia; Department of Health, Tasmania 7000, Australia; Global Health and Health Security Department, College of Public Health, Taipei Medical University, Taipei 110, Taiwan
- Leon Tsung-Ju Lee: Graduate Institute of Clinical Medicine, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, Taipei Medical University Hospital, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, School of Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
- Annisa Ristya Rahmanti: Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; Department Health Policy and Management, Faculty of Medicine, Public Health and Nursing, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
- Leo Anthony Celi: Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA 02139, United States; Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA 02215, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
- Yu-Chuan Jack Li: Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan; Department of Dermatology, Taipei Municipal Wanfang Hospital, Taipei Medical University, Taipei 116, Taiwan; The International Medical Informatics Association (IMIA), Genève CH-1204, Switzerland
18. Liu S, McCoy AB, Wright AP, Carew B, Genkins JZ, Huang SS, Peterson JF, Steitz B, Wright A. Leveraging large language models for generating responses to patient messages-a subjective analysis. J Am Med Inform Assoc 2024; 31:1367-1379. [PMID: 38497958] [PMCID: PMC11105129] [DOI: 10.1093/jamia/ocae052]
Abstract
OBJECTIVE This study aimed to develop and assess the performance of fine-tuned large language models for generating responses to patient messages sent via an electronic health record patient portal. MATERIALS AND METHODS Utilizing a dataset of messages and responses extracted from the patient portal at a large academic medical center, we developed a model (CLAIR-Short) based on a pre-trained large language model (LLaMA-65B). In addition, we used the OpenAI API to update physician responses from an open-source dataset into a format with informative paragraphs that offered patient education while emphasizing empathy and professionalism. By combining our data with this dataset, we further fine-tuned our model (CLAIR-Long). To evaluate the fine-tuned models, we used 10 representative patient portal questions in primary care to generate responses. We asked primary care physicians to review generated responses from our models and ChatGPT and rate them for empathy, responsiveness, accuracy, and usefulness. RESULTS The dataset consisted of 499 794 pairs of patient messages and corresponding responses from the patient portal, with 5000 patient messages and ChatGPT-updated responses from an online platform. Four primary care physicians participated in the survey. CLAIR-Short exhibited the ability to generate concise responses similar to providers' responses. CLAIR-Long responses provided increased patient educational content compared to CLAIR-Short and were rated similarly to ChatGPT's responses, receiving positive evaluations for responsiveness, empathy, and accuracy, while receiving a neutral rating for usefulness. CONCLUSION This subjective analysis suggests that leveraging large language models to generate responses to patient messages demonstrates significant potential in facilitating communication between patients and healthcare providers.
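Full fine-tuning of a 65B-parameter model, as described above, is resource-intensive; the same idea can be sketched with parameter-efficient LoRA adapters on a small open checkpoint. The base model, adapter settings, and example pair below are assumptions, not the CLAIR training recipe:

```python
# Hedged sketch: LoRA fine-tuning of a small causal LM on a patient
# message/response pair. Checkpoint, settings, and text are invented.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "facebook/opt-125m"  # small stand-in; the study used LLaMA-65B
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
))
model.print_trainable_parameters()  # only the small adapters are trained

text = ("Patient: My incision is red and warm. Should I worry?\n"
        "Provider: Some warmth is common after surgery, but spreading "
        "redness, drainage, or fever warrants a same-day call to us.")
batch = tokenizer(text, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM loss
loss.backward()  # one illustrative step; optimizer and loop omitted
```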
Affiliations
- Siru Liu: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Allison B McCoy: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Aileen P Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Babatunde Carew: Department of General Internal Medicine and Public Health, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Julian Z Genkins: Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Sean S Huang: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Josh F Peterson: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Bryan Steitz: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Adam Wright: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
19. Kim H, Park H, Kang S, Kim J, Kim J, Jung J, Taira R. Evaluating the validity of the nursing statements algorithmically generated based on the International Classifications of Nursing Practice for respiratory nursing care using large language models. J Am Med Inform Assoc 2024; 31:1397-1403. [PMID: 38630586] [PMCID: PMC11105147] [DOI: 10.1093/jamia/ocae070]
Abstract
OBJECTIVE This study aims to facilitate the creation of quality standardized nursing statements in South Korea's hospitals using algorithmic generation based on the International Classifications of Nursing Practice (ICNP) and evaluation through large language models. MATERIALS AND METHODS We algorithmically generated 15 972 statements related to acute respiratory care using 117 concepts and concept composition models of ICNP. Human reviewers, Generative Pre-trained Transformers 4.0 (GPT-4.0), and Bio_Clinical Bidirectional Encoder Representations from Transformers (BERT) evaluated the generated statements for validity. The evaluation by GPT-4.0 and Bio_ClinicalBERT was conducted with and without contextual information and training. RESULTS Of the generated statements, 2207 were deemed valid by expert reviewers. GPT-4.0 showed a zero-shot AUC of 0.857, which worsened with contextual information. Bio_ClinicalBERT, after training, improved significantly, reaching an AUC of 0.998. CONCLUSION Bio_ClinicalBERT effectively validates auto-generated nursing statements, offering a promising solution to enhance and streamline healthcare documentation processes.
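Validating generated statements with Bio_ClinicalBERT, as above, is a binary sequence-classification task scored by AUC. A hedged sketch: the checkpoint name is the public one, but the classification head below is untrained and the statements and labels are invented; the study fine-tuned on expert-labeled statements before evaluating:

```python
# Hedged sketch: scoring nursing statements for validity with a
# Bio_ClinicalBERT classifier and computing AUC. Examples are invented,
# and this classification head is untrained (the study fine-tuned first).
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

statements = [
    "Assessing respiratory rate and effort every four hours",
    "Administering supplemental oxygen to the wound site",  # invalid
]
labels = [1, 0]  # expert validity labels (invented)

batch = tokenizer(statements, padding=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**batch).logits.softmax(-1)[:, 1]  # P(valid)
print("AUC:", roc_auc_score(labels, probs.numpy()))
```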
Affiliation(s)
- Hyeoneui Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- The Research Institute of Nursing Science, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Hyewon Park
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Samsung Medical Center, Seoul, 06351, Republic of Korea
| | - Sunghoon Kang
- The Department of Science Studies, Seoul National University, Seoul, 08826, Republic of Korea
| | - Jinsol Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Jeongha Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Asan Medical Center, Seoul, 05505, Republic of Korea
| | - Jinsun Jung
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
| | - Ricky Taira
- The Department of Radiological Science, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States
| |
|
20
|
Dos Santos FC, Johnson LG, Madandola OO, Priola KJB, Yao Y, Macieira TGR, Keenan GM. An example of leveraging AI for documentation: ChatGPT-generated nursing care plan for an older adult with lung cancer. J Am Med Inform Assoc 2024:ocae116. PMID: 38758655; DOI: 10.1093/jamia/ocae116.
Abstract
OBJECTIVE Our article demonstrates the effectiveness of using a validated framework to create a ChatGPT prompt that generates valid nursing care plan suggestions for a hypothetical older patient with lung cancer. METHOD This study describes the methodology for creating ChatGPT prompts that generate consistent care plan suggestions and its application to a lung cancer case scenario. After entering a nursing assessment of the patient's condition into ChatGPT, we asked it to generate care plan suggestions. Subsequently, we assessed the quality of the care plans produced by ChatGPT. RESULTS Although only 11 of the 16 suggested care plan terms utilized standardized nursing terminology, the ChatGPT-generated care plan closely matched the gold standard in scope and nature, correctly prioritizing oxygenation and ventilation needs. CONCLUSION Using a validated framework prompt to generate nursing care plan suggestions with ChatGPT demonstrates its potential value as a decision support tool for optimizing cancer care documentation.
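A framework-driven prompt of the kind this abstract describes can be approximated as a structured template filled from the nursing assessment. The sketch below is an assumed, simplified template, not the authors' validated framework; the field names and wording are illustrative.

```python
# Illustrative care-plan prompt builder. Template fields and wording are
# assumptions, not the validated framework used in the study.
def build_care_plan_prompt(assessment: dict) -> str:
    findings = "; ".join(f"{k}: {v}" for k, v in assessment["findings"].items())
    return (
        "You are assisting with nursing care planning.\n"
        f"Patient: {assessment['age']}-year-old with {assessment['diagnosis']}.\n"
        f"Assessment findings: {findings}\n"
        "Task: suggest a prioritized nursing care plan (diagnoses, expected "
        "outcomes, interventions), using standardized nursing terminology "
        "where possible."
    )

prompt = build_care_plan_prompt({
    "age": 72,
    "diagnosis": "stage III lung cancer",
    "findings": {"SpO2": "88% on room air", "respirations": "26/min, labored"},
})
print(prompt)
```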
Affiliation(s)
| | - Lisa G Johnson
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Olatunde O Madandola
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Karen J B Priola
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Yingwei Yao
- Department of Biobehavioral Nursing Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Tamara G R Macieira
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| | - Gail M Keenan
- Department of Family, Community, and Health Systems Science, College of Nursing, University of Florida, Gainesville, FL 32610, United States
| |
|
21
|
Grimm DR, Lee YJ, Hu K, Liu L, Garcia O, Balakrishnan K, Ayoub NF. The utility of ChatGPT as a generative medical translator. Eur Arch Otorhinolaryngol 2024. PMID: 38705894; DOI: 10.1007/s00405-024-08708-8.
Abstract
PURPOSE Large language models continue to dramatically change the medical landscape. We aimed to explore the utility of ChatGPT in providing accurate, actionable, and understandable generative medical translations in English, Spanish, and Mandarin pertaining to otolaryngology. METHODS Responses of GPT-4 to commonly asked patient questions listed in official otolaryngology clinical practice guidelines (CPGs) were evaluated with the Patient Education Materials Assessment Tool-Printable (PEMAT-P). Additional critical elements were identified a priori to evaluate ChatGPT's accuracy and thoroughness in its responses. Multiple fluent speakers of English, Mandarin, and Spanish evaluated each response generated by ChatGPT. RESULTS Total PEMAT-P scores differed between English, Mandarin, and Spanish GPT-4-generated responses, with a moderate effect size of language (eta-squared = 0.07) and scores ranging from 73 to 77 (P-value = 0.03). Overall understandability scores did not differ between English, Mandarin, and Spanish, with a small effect size of language (eta-squared = 0.02) and scores ranging from 76 to 79 (P-value = 0.17); nor did overall actionability scores (eta-squared = 0, scores ranging from 66 to 73, P-value = 0.44). Overall a priori procedure-specific responses similarly did not differ between English, Spanish, and Mandarin (eta-squared = 0.02, scores ranging from 61 to 78, P-value = 0.22). CONCLUSION GPT-4 produces accurate, understandable, and actionable outputs in English, Spanish, and Mandarin. Responses generated by GPT-4 in Spanish and Mandarin are comparable to their English counterparts, indicating a novel use for these models within otolaryngology and implications for bridging healthcare access and literacy gaps. LEVEL OF EVIDENCE IV.
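The language effect sizes reported above are eta-squared values, i.e. the between-group share of total variance in PEMAT-P scores. A small worked sketch with invented scores:

```python
# Eta-squared for a one-way comparison of PEMAT-P scores across languages:
# SS_between / SS_total. The score arrays are invented for illustration.
import numpy as np

groups = {
    "English": np.array([77, 79, 75, 78]),
    "Spanish": np.array([74, 73, 76, 72]),
    "Mandarin": np.array([73, 75, 74, 76]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_total = ((all_scores - grand_mean) ** 2).sum()
print(f"eta-squared = {ss_between / ss_total:.3f}")
```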
Affiliation(s)
- David R Grimm
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Yu-Jin Lee
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Katherine Hu
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Longsha Liu
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Omar Garcia
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Karthik Balakrishnan
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | - Noel F Ayoub
- Division of Pediatric Otolaryngology, Department of Otolaryngology-Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Division of Rhinology and Skull Base Surgery, Department of Otolaryngology-Head and Neck Surgery, Mass Eye and Ear, 243 Charles Street, Boston, MA, 02114, USA.
| |
|
22
|
Ferdush J, Begum M, Hossain ST. ChatGPT and Clinical Decision Support: Scope, Application, and Limitations. Ann Biomed Eng 2024; 52:1119-1124. PMID: 37516680; DOI: 10.1007/s10439-023-03329-4.
Abstract
This study examines ChatGPT's role in clinical decision support by analyzing its scope, applications, and limitations. By analyzing patient data and providing evidence-based recommendations, ChatGPT, an AI language model, can help healthcare professionals make well-informed decisions. The study examines ChatGPT's use in clinical decision support, including diagnosis and treatment planning; acknowledges limitations such as biases, lack of contextual understanding, and the need for human oversight; and proposes a framework for future clinical decision support systems. Understanding these factors will allow healthcare professionals to utilize ChatGPT effectively and make accurate clinical decisions. Further research is needed to understand the implications of using ChatGPT in healthcare settings and to develop safeguards for its responsible use.
Affiliation(s)
- Jannatul Ferdush
- Department of Computer Science and Engineering, Jashore University of Science and Technology, Jashore, 7408, Bangladesh.
| | - Mahbuba Begum
- Department of Computer Science and Engineering, Mawlana Bhashani Science and Technology University, Tangail, 1902, Bangladesh
| | - Sakib Tanvir Hossain
- Department of Mechanical Engineering, Khulna University of Engineering and Technology, Khulna, 9203, Bangladesh
| |
|
23
|
Wu J, Wu X, Qiu Z, Li M, Lin S, Zhang Y, Zheng Y, Yuan C, Yang J. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc 2024:ocae079. PMID: 38684792; DOI: 10.1093/jamia/ocae079.
Abstract
OBJECTIVES Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in the respective languages, a consequence of imbalanced training corpora. We systematically evaluated LLMs in the Chinese medical context and developed a novel in-context learning framework to enhance their performance. MATERIALS AND METHODS The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381 149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating medical knowledge into LLMs from 7 distinct perspectives. RESULTS Directly applying ChatGPT failed to qualify for the CNMLE-2022, with a score of 51. Coupled with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements. ChatGPT's performance surged to 70.04, and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. KFE also enabled the smaller Baichuan2-13B to pass the examination, showcasing its great potential in low-resource settings. DISCUSSION AND CONCLUSION This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities in LLM applications and ensuring global benefit in this field.
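In spirit, the KFE framework retrieves relevant external knowledge and few-shot examples and packs them into the prompt. The sketch below approximates this with TF-IDF retrieval over a toy knowledge base; the corpus, example, and prompt format are invented placeholders, not the published implementation.

```python
# Toy sketch of knowledge-enhanced in-context learning: retrieve the most
# relevant knowledge passages for a question, then prepend them and a worked
# example to the prompt. All content below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "First-line treatment for uncomplicated hypertension includes thiazides.",
    "Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "Amoxicillin is commonly used for streptococcal pharyngitis.",
]
few_shot_example = "Q: First-line drug for type 2 diabetes? A: Metformin."

def build_prompt(question: str, top_k: int = 2) -> str:
    vec = TfidfVectorizer().fit(knowledge_base + [question])
    sims = cosine_similarity(vec.transform([question]),
                             vec.transform(knowledge_base))[0]
    top = [knowledge_base[i] for i in sims.argsort()[::-1][:top_k]]
    context = "\n".join(f"- {passage}" for passage in top)
    return (f"Relevant knowledge:\n{context}\n\n"
            f"Example:\n{few_shot_example}\n\nQ: {question} A:")

print(build_prompt("What is the first-line treatment for hypertension?"))
```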
Affiliation(s)
- Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Xian Wu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Zhaopeng Qiu
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Minghui Li
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Shixu Lin
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
| | - Yingying Zhang
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Yefeng Zheng
- Jarvis Research Center, Tencent YouTu Lab, Beijing, 100101, China
| | - Changzheng Yuan
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Department of Nutrition, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States
| | - Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, United States
| |
|
24
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv [Preprint] 2024:2024.04.26.24306390. PMID: 38712148; PMCID: PMC11071576; DOI: 10.1101/2024.04.26.24306390.
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry and 58 (89%) expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not yet able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications bring about bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regulate their application in healthcare.
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
|
25
|
Pham C, Govender R, Tehami S, Chavez S, Adepoju OE, Liaw W. ChatGPT's Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study. J Med Internet Res 2024; 26:e55037. PMID: 38648098; DOI: 10.2196/55037.
Abstract
BACKGROUND ChatGPT is the most advanced large language model to date, with prior iterations having passed medical licensing examinations, provided clinical decision support, and improved diagnostics. Although limited, past studies of ChatGPT's performance found that artificial intelligence could pass the American Heart Association's advanced cardiovascular life support (ACLS) examinations with modifications. ChatGPT's accuracy has not been studied in more complex clinical scenarios. As heart disease and cardiac arrest remain leading causes of morbidity and mortality in the United States, finding technologies that help increase adherence to ACLS algorithms, which improves survival outcomes, is critical. OBJECTIVE This study aims to examine the accuracy of ChatGPT in following ACLS guidelines for bradycardia and cardiac arrest. METHODS We evaluated the accuracy of ChatGPT's responses to 2 simulations based on the 2020 American Heart Association ACLS guidelines, with 3 primary outcomes of interest: the mean individual step accuracy, the accuracy score per simulation attempt, and the accuracy score for each algorithm. For each simulation step, ChatGPT was scored for correctness (1 point) or incorrectness (0 points). Each simulation was conducted 20 times. RESULTS ChatGPT's median accuracy for each step was 85% (IQR 40%-100%) for cardiac arrest and 30% (IQR 13%-81%) for bradycardia. ChatGPT's median accuracy over 20 simulation attempts was 69% (IQR 67%-74%) for cardiac arrest and 42% (IQR 33%-50%) for bradycardia. We found that ChatGPT's outputs varied despite consistent input, the same actions were persistently missed, repetitive overemphasis hindered guidance, and erroneous medication information was presented. CONCLUSIONS This study highlights the need for consistent and reliable guidance to prevent potential medical errors and optimize the application of ChatGPT to enhance its reliability and effectiveness in clinical practice.
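The scoring scheme above reduces each simulation run to a vector of 0/1 step scores, summarized by medians and interquartile ranges. A minimal sketch with random stand-in data:

```python
# Each of 20 runs yields 0/1 correctness per algorithm step; summarize
# per-step and per-run accuracy with median and IQR. The binary matrix is
# random stand-in data, not the study's transcripts.
import numpy as np

rng = np.random.default_rng(0)
runs, steps = 20, 10
scores = rng.integers(0, 2, size=(runs, steps))  # 1 = step performed correctly

step_accuracy = scores.mean(axis=0) * 100        # % correct per step
run_accuracy = scores.mean(axis=1) * 100         # % correct per run

for name, acc in (("per-step", step_accuracy), ("per-run", run_accuracy)):
    q1, med, q3 = np.percentile(acc, [25, 50, 75])
    print(f"{name}: median {med:.0f}% (IQR {q1:.0f}%-{q3:.0f}%)")
```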
Affiliation(s)
- Cecilia Pham
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Romi Govender
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Salik Tehami
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
| | - Summer Chavez
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| | - Omolola E Adepoju
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| | - Winston Liaw
- Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States
- Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
| |
|
26
|
Deng QF, Bao YY, Yang YY, Mao CK. Re: David Musheyev, Alexander Pan, Stacy Loeb, Abdo E. Kabarriti. How Well Do Artificial Intelligence Chatbots Respond to the Top Search Queries About Urological Malignancies? Eur Urol 2023;85:13-6. Eur Urol 2024:S0302-2838(24)02312-1. PMID: 38644145; DOI: 10.1016/j.eururo.2024.02.031.
Affiliation(s)
- Qi-Fei Deng
- Department of Urology, Anhui Provincial Children's Hospital, Hefei, China
| | - Yuan-Yuan Bao
- Department of Electrocardiography, Anhui Maternal and Child Health Hospital, Hefei, China
| | - Yuan-Yuan Yang
- Department of Electrocardiography, Anhui Maternal and Child Health Hospital, Hefei, China.
| | - Chang-Kun Mao
- Department of Urology, Anhui Provincial Children's Hospital, Hefei, China.
| |
|
27
|
Wu J, Ma Y, Wang J, Xiao M. The Application of ChatGPT in Medicine: A Scoping Review and Bibliometric Analysis. J Multidiscip Healthc 2024; 17:1681-1692. PMID: 38650670; PMCID: PMC11034560; DOI: 10.2147/jmdh.s463128.
Abstract
Purpose ChatGPT has a wide range of applications in the medical field. This review therefore aims to define the key issues and provide a comprehensive view of the literature on the application of ChatGPT in medicine. Methods This scoping review follows Arksey and O'Malley's five-stage framework. A comprehensive literature search of publications (30 November 2022 to 16 August 2023) was conducted. Six databases were searched and relevant references were systematically catalogued. Attention was focused on the general characteristics of the articles, their fields of application, and the advantages and disadvantages of using ChatGPT. Descriptive statistics and narrative synthesis methods were used for data analysis. Results Of the 3426 studies, 247 met the criteria for inclusion in this review. The majority of articles (31.17%) were from the United States. Editorials (43.32%) ranked first, followed by experimental studies (11.74%). The potential applications of ChatGPT in medicine are varied, with the largest number of studies (45.75%) exploring clinical practice, including assisting with clinical decision support and providing disease information and medical advice. This was followed by medical education (27.13%) and scientific research (16.19%). In the discipline statistics, radiology, surgery, and dentistry stood out at the top of the list. However, ChatGPT in medicine also faces issues of data privacy, inaccuracy, and plagiarism. Conclusion The application of ChatGPT in medicine centers on distinct disciplines and general application scenarios. ChatGPT has a paradoxical nature: it offers significant advantages, but at the same time raises great concerns about its application in healthcare settings. Therefore, it is imperative to develop theoretical frameworks that not only address its widespread use in healthcare but also facilitate a comprehensive assessment. In addition, these frameworks should contribute to the development of strict and effective guidelines and regulatory measures.
Affiliation(s)
- Jie Wu
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Yingzhuo Ma
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Jun Wang
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Mingzhao Xiao
- Department of Urology, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
|
28
|
Liu S, McCoy AB, Peterson JF, Lasko TA, Sittig DF, Nelson SD, Andrews J, Patterson L, Cobb CM, Mulherin D, Morton CT, Wright A. Leveraging explainable artificial intelligence to optimize clinical decision support. J Am Med Inform Assoc 2024; 31:968-974. PMID: 38383050; PMCID: PMC10990514; DOI: 10.1093/jamia/ocae019.
Abstract
OBJECTIVE To develop and evaluate a data-driven process for generating suggestions to improve alert criteria using explainable artificial intelligence (XAI) approaches. METHODS We extracted data on alerts generated from January 1, 2019 to December 31, 2020, at Vanderbilt University Medical Center. We developed machine learning models to predict user responses to alerts. We applied XAI techniques to generate global and local explanations. We evaluated the generated suggestions by comparing them with the alerts' historical change logs and through stakeholder interviews. Suggestions that either matched (or partially matched) changes already made to the alert or were considered clinically correct were classified as helpful. RESULTS The final dataset included 2 991 823 firings with 2689 features. Among the 5 machine learning models, the LightGBM model achieved the highest area under the ROC curve: 0.919 [0.918, 0.920]. We identified 96 helpful suggestions. A total of 278 807 firings (9.3%) could have been eliminated. Some of the suggestions also revealed workflow and education issues. CONCLUSION We developed a data-driven process to generate suggestions for improving alert criteria using XAI techniques. Our approach could identify improvements regarding clinical decision support (CDS) that might be overlooked or delayed in manual reviews. It also unveils a secondary purpose for XAI: to improve quality by discovering scenarios where CDS alerts are not accepted due to workflow, education, or staffing issues.
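A hedged sketch of the pipeline this abstract describes: train a LightGBM model on alert firings, then use SHAP to produce global and local explanations from which criteria suggestions can be read off. Synthetic data and invented feature names stand in for the alert dataset; this is not the authors' code.

```python
# Sketch of the XAI pipeline under stated assumptions: LightGBM predicts the
# user response to an alert, and SHAP values flag features that drive
# non-acceptance (candidates for criteria changes). Data are synthetic.
import numpy as np
import lightgbm as lgb
import shap
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
feature_names = [f"criterion_{i}" for i in range(X.shape[1])]  # invented names

# y = 1 could represent "alert accepted", 0 = "overridden/ignored".
model = lgb.LGBMClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X)                # local (per-firing) explanations
sv = sv[1] if isinstance(sv, list) else sv   # binary models may return a list

# Global explanation: mean absolute SHAP value per feature.
importance = np.abs(sv).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1])[:3]:
    print(f"{name}: mean |SHAP| = {imp:.3f}")
```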
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Josh F Peterson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Thomas A Lasko
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
| | - Dean F Sittig
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, United States
| | - Scott D Nelson
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Jennifer Andrews
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Lorraine Patterson
- HealthIT, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Cheryl M Cobb
- Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - David Mulherin
- HealthIT, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Colleen T Morton
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| | - Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37203, United States
| |
|
29
|
Saibene AM, Allevi F, Calvo-Henriquez C, Maniaci A, Mayo-Yáñez M, Paderno A, Vaira LA, Felisati G, Craig JR. Reliability of large language models in managing odontogenic sinusitis clinical scenarios: a preliminary multidisciplinary evaluation. Eur Arch Otorhinolaryngol 2024; 281:1835-1841. PMID: 38189967; PMCID: PMC10943141; DOI: 10.1007/s00405-023-08372-4.
Abstract
PURPOSE This study aimed to evaluate the utility of large language model (LLM) artificial intelligence tools, Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4, in managing complex otolaryngological clinical scenarios, specifically the multidisciplinary management of odontogenic sinusitis (ODS). METHODS A prospective, structured multidisciplinary specialist evaluation was conducted using five ad hoc designed ODS-related clinical scenarios. LLM responses to these scenarios were critically reviewed by a multidisciplinary panel of eight specialist evaluators (2 ODS experts, 2 rhinologists, 2 general otolaryngologists, and 2 maxillofacial surgeons). Based on the level of disagreement from panel members, a Total Disagreement Score (TDS) was calculated for each LLM response, and TDS comparisons were made between ChatGPT3.5 and ChatGPT4, as well as between different evaluators. RESULTS While disagreement to some degree was demonstrated in 73/80 evaluator reviews of LLMs' responses, TDSs were significantly lower for ChatGPT4 than for ChatGPT3.5. The highest TDSs were found in the case of complicated ODS with orbital abscess, presumably due to increased case complexity, with dental, rhinologic, and orbital factors affecting diagnostic and therapeutic options. There were no statistically significant differences in TDSs between evaluators' specialties, though ODS experts and maxillofacial surgeons tended to assign higher TDSs. CONCLUSIONS LLMs like ChatGPT, especially newer versions, showed potential for complementing evidence-based clinical decision-making, but substantial disagreement was still demonstrated between LLMs and clinical specialists across most case examples, suggesting they are not yet optimal for aiding clinical management decisions. Future studies will be important to analyze LLMs' performance as they evolve over time.
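The abstract does not give the exact TDS formula, so the sketch below assumes one plausible form: each of the eight panelists rates agreement with a response on a 1-5 Likert scale, and the TDS sums the distances from full agreement. The ratings are invented.

```python
# Assumed TDS form (not the published formula): sum of (5 - rating) across
# the eight panel members, so 0 means unanimous full agreement.
def total_disagreement_score(ratings: list[int]) -> int:
    return sum(5 - r for r in ratings)

panel_gpt35 = [3, 4, 2, 3, 4, 3, 2, 4]  # hypothetical ratings, one per evaluator
panel_gpt4 = [5, 4, 4, 5, 4, 5, 4, 4]

print("TDS (ChatGPT3.5):", total_disagreement_score(panel_gpt35))
print("TDS (ChatGPT4):  ", total_disagreement_score(panel_gpt4))
```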
Affiliation(s)
- Alberto Maria Saibene
- Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy.
| | - Fabiana Allevi
- Maxillofacial Surgery Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy
| | - Christian Calvo-Henriquez
- Service of Otolaryngology, Rhinology Unit, Hospital Complex at the University of Santiago de Compostela, Santiago de Compostela, A Coruña, Spain
| | - Antonino Maniaci
- Department of Medical, Surgical Sciences and Advanced Technologies G.F. Ingrassia, University of Catania, Catania, Italy
| | - Miguel Mayo-Yáñez
- Otorhinolaryngology, Head and Neck Surgery Department, Complexo Hospitalario Universitario A Coruña (CHUAC), A Coruña, Galicia, Spain
| | - Alberto Paderno
- Department of Otorhinolaryngology, Head and Neck Surgery, University of Brescia, Brescia, Italy
| | - Luigi Angelo Vaira
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy
- Biomedical Science PhD School, Biomedical Science Department, University of Sassari, Sassari, Italy
| | - Giovanni Felisati
- Otolaryngology Unit, Santi Paolo E Carlo Hospital, Department of Health Sciences, Università Degli Studi Di Milano, Milan, Italy
| | - John R Craig
- Department of Otolaryngology-Head and Neck Surgery, Henry Ford Health, Detroit, MI, USA
| |
|
30
|
Braun EM, Juhasz-Böss I, Solomayer EF, Truhn D, Keller C, Heinrich V, Braun BJ. Will I soon be out of my job? Quality and guideline conformity of ChatGPT therapy suggestions to patient inquiries with gynecologic symptoms in a palliative setting. Arch Gynecol Obstet 2024; 309:1543-1549. PMID: 37975899; DOI: 10.1007/s00404-023-07272-6.
Abstract
PURPOSE The market and application possibilities for artificial intelligence are currently growing at high speed, and AI is increasingly finding its way into gynecology. While the medical side is well represented in the current literature, the patient's perspective still lags behind. The aim of this study was therefore to have experts evaluate ChatGPT's recommendations in response to patient inquiries about possible therapies for leading gynecological symptoms in a palliative setting. METHODS Case vignettes were constructed for 10 common concomitant symptoms of gynecologic oncology tumors in a palliative setting, and patient queries regarding therapy of these symptoms were generated as prompts for ChatGPT. Five experts in palliative care and gynecologic oncology evaluated the responses with respect to guideline adherence and applicability and identified advantages and disadvantages. RESULTS The overall rating of ChatGPT responses averaged 4.1 (5 = strongly agree; 1 = strongly disagree). The experts rated the guideline conformity of the therapy recommendations at an average of 4.0. ChatGPT sometimes omits relevant therapies and does not provide an individual assessment of the suggested therapies, but it does indicate that a physician consultation is additionally necessary. CONCLUSIONS Language models such as ChatGPT, in their freely available and thus in principle patient-accessible versions, can provide valid and largely guideline-compliant therapy recommendations. For a complete therapy recommendation, for the evaluation and individual adjustment of the suggested therapies, and for filtering out possible wrong recommendations, a medical expert's opinion remains indispensable.
Affiliation(s)
- Eva-Marie Braun
- Center for Integrative Oncology, Die Filderklinik, Im Haberschlai 7, 70794, Filderstadt-Bonlanden, Germany.
| | - Ingolf Juhasz-Böss
- Department of Gynecology, University Medical Center Freiburg, Hugstetter Straße 55, 79106, Freiburg, Germany
| | - Erich-Franz Solomayer
- Department of Gynecology, Obstetrics and Reproductive Medicine, Saarland University Hospital, Kirrberger Straße, Building 9, 66421, Homburg, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
| | - Christiane Keller
- Center for Palliative Medicine and Pediatric Pain Therapy, Saarland University Hospital, Kirrberger Straße, Building 69, 66421, Homburg, Germany
| | - Vanessa Heinrich
- Department of Radiation Oncology, University Hospital Tübingen, Crona Kliniken, Hoppe-Seyler-Str. 3, 72076, Tübingen, Germany
| | - Benedikt Johannes Braun
- Department of Trauma and Reconstructive Surgery at the Eberhard Karls University Tübingen, BG Unfallklinik Tübingen, Schnarrenbergstrasse 95, 72076, Tübingen, Germany
| |
|
31
|
Teixeira-Marques F, Medeiros N, Nazaré F, Alves S, Lima N, Ribeiro L, Gama R, Oliveira P. Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study. Eur Arch Otorhinolaryngol 2024; 281:2023-2030. PMID: 38345613; DOI: 10.1007/s00405-024-08498-z.
Abstract
PURPOSE Since the beginning of 2023, ChatGPT has emerged as a hot topic in healthcare research. Its potential to be a valuable tool in clinical practice is compelling, particularly for improving clinical decision support by helping physicians make clinical decisions based on the best medical knowledge available. We aim to investigate ChatGPT's ability to identify, diagnose, and manage patients with otorhinolaryngology-related symptoms. METHODS A prospective, cross-sectional study was designed, based on an idea suggested by ChatGPT, to assess the level of agreement between ChatGPT and five otorhinolaryngologists (ENTs) in 20 reality-inspired clinical cases. The clinical cases were presented to the chatbot on two different occasions (ChatGPT-1 and ChatGPT-2) to assess its temporal stability. RESULTS The mean score of ChatGPT-1 was 4.4 (SD 1.2; min 1, max 5) and of ChatGPT-2 was 4.15 (SD 1.3; min 1, max 5), while the ENTs' mean score was 4.91 (SD 0.3; min 3, max 5). The Mann-Whitney U test revealed a statistically significant difference (p < 0.001) between both ChatGPT scores and the ENTs' scores. ChatGPT-1 and ChatGPT-2 gave different answers on five occasions. CONCLUSIONS Artificial intelligence will be an important instrument in clinical decision-making in the near future, and ChatGPT is the most promising chatbot so far. Although it needs further development to be used safely, there is room for improvement and potential to aid otorhinolaryngology residents and specialists in making the most correct decision for the patient.
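The comparison above is a Mann-Whitney U test between the chatbot's and the specialists' per-case scores, which SciPy provides directly. The score lists below are invented stand-ins for the 20 cases:

```python
# Mann-Whitney U test comparing chatbot scores with specialist scores.
# The 1-5 scores below are invented, not the study's data.
from scipy.stats import mannwhitneyu

chatgpt_scores = [5, 4, 5, 3, 4, 5, 2, 5, 4, 5, 5, 3, 4, 5, 5, 1, 4, 5, 5, 4]
ent_scores = [5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 5, 5]

stat, p = mannwhitneyu(chatgpt_scores, ent_scores, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")
```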
Affiliation(s)
- Francisco Teixeira-Marques
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal.
| | - Nuno Medeiros
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Francisco Nazaré
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Sandra Alves
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Nuno Lima
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Leandro Ribeiro
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Rita Gama
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| | - Pedro Oliveira
- Department of Otorhinolaryngology, Centro Hospitalar de Vila Nova de Gaia/Espinho, Gaia (Porto), Portugal
| |
|
32
|
Aguero D, Nelson SD. The Potential Application of Large Language Models in Pharmaceutical Supply Chain Management. J Pediatr Pharmacol Ther 2024; 29:200-205. PMID: 38596417; PMCID: PMC11001215; DOI: 10.5863/1551-6776-29.2.200.
Affiliation(s)
- David Aguero
- Department of Pharmacy and Pharmaceutical Sciences (DA), St. Jude Children’s Research Hospital, Memphis, TN
| | - Scott D. Nelson
- Department of Biomedical Informatics (SDN), Vanderbilt University Medical Center, Nashville, TN
| |
|
33
|
Lin KC, Chen TA, Lin MH, Chen YC, Chen TJ. Integration and Assessment of ChatGPT in Medical Case Reporting: A Multifaceted Approach. Eur J Investig Health Psychol Educ 2024; 14:888-901. PMID: 38667812; PMCID: PMC11049282; DOI: 10.3390/ejihpe14040057.
Abstract
ChatGPT, a large language model, has gained significance in medical writing, particularly in case reports that document the course of an illness. This article explores how ChatGPT is being integrated into, and how it shapes, the process, product, and politics of medical writing in the real world. We conducted a bibliometric analysis of case reports utilizing ChatGPT and indexed in PubMed, encompassing publication information. Furthermore, an in-depth analysis was conducted to categorize the applications and limitations of ChatGPT and the publication trend of application categories. A total of 66 case reports utilizing ChatGPT were identified, with a predominant preference among authors for the online version and English input. The prevalent application categories were information retrieval and content generation. Notably, this trend remained consistent across different months. Within the subset of 32 articles addressing ChatGPT's limitations in case report writing, concerns related to inaccuracies and a lack of clinical context were prominently emphasized. These concerns underscore the important role of clinical thinking and professional expertise, the foundational tenets of medical education, while also accentuating the distinction between physicians and generative artificial intelligence.
Affiliation(s)
- Kuan-Chen Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Tsung-An Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Ming-Hwai Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
| | - Yu-Chun Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Institute of Hospital and Health Care Administration, School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Big Data Center, Taipei Veterans General Hospital, Taipei 11217, Taiwan
| | - Tzeng-Ji Chen
- Department of Family Medicine, Taipei Veterans General Hospital Hsinchu Branch, No. 81, Sec. 1, Zhongfeng Road, Zhudong Township, Hsinchu 310403, Taiwan
- Department of Post-Baccalaureate Medicine, National Chung Hsing University, No. 145, Xingda Road, South District, Taichung 402202, Taiwan
| |
|
34
|
Kim DW, Park JS, Sharma K, Velazquez A, Li L, Ostrominski JW, Tran T, Seitter Peréz RH, Shin JH. Qualitative evaluation of artificial intelligence-generated weight management diet plans. Front Nutr 2024; 11:1374834. PMID: 38577160; PMCID: PMC10991711; DOI: 10.3389/fnut.2024.1374834.
Abstract
Importance The transformative potential of artificial intelligence (AI), particularly via large language models, is increasingly being manifested in healthcare. Dietary interventions are foundational to weight management efforts, but whether AI techniques are presently capable of generating clinically applicable diet plans has not been evaluated. Objective Our study sought to evaluate the potential of personalized AI-generated weight-loss diet plans for clinical applications through a survey-based assessment conducted by experts in the fields of obesity medicine and clinical nutrition. Design, setting, and participants We utilized ChatGPT (4.0) to create weight-loss diet plans and selected two control diet plans from tertiary medical centers for comparison. Dietitians, physicians, and nurse practitioners specializing in obesity medicine or nutrition were invited to provide feedback on the AI-generated plans. Each plan was assessed blindly based on its effectiveness, balance, comprehensiveness, flexibility, and applicability. Personalized plans for hypothetical patients with specific health conditions were also evaluated. Main outcomes and measures The primary outcomes measured included the indistinguishability of the AI diet plan from human-created plans and the potential of personalized AI-generated diet plans for real-world clinical applications. Results Of 95 participants, 67 completed the survey and were included in the final analysis. No significant differences were found among the three weight-loss diet plans in any evaluation category. Among the 14 experts who believed that they could identify the AI plan, only five did so correctly. In an evaluation involving 57 experts, the AI-generated personalized weight-loss diet plan received scores above neutral for all evaluation variables. Several limitations of the AI-generated plans were highlighted, including conflicting dietary considerations, lack of affordability, and insufficient specificity in recommendations, such as exact portion sizes. These limitations suggest that refining inputs could enhance the quality and applicability of AI-generated diet plans. Conclusion Despite certain limitations, our study highlights the potential of AI-generated diet plans for clinical applications. AI-generated dietary plans were frequently indistinguishable from diet plans widely used at major tertiary medical centers. Although further refinement and prospective studies are needed, these findings illustrate the potential of AI in advancing personalized weight-centric care.
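The abstract reports no significant differences among the three plans but does not name the statistical test; assuming ordinal expert ratings across three independent groups, a Kruskal-Wallis test is one reasonable choice, sketched here with invented ratings.

```python
# Assumed analysis (the test is not named in the abstract): Kruskal-Wallis
# comparison of 1-5 expert ratings across the three diet plans. Invented data.
from scipy.stats import kruskal

ai_plan = [4, 3, 4, 5, 3, 4, 4, 3]
control_one = [4, 4, 3, 4, 3, 5, 4, 4]
control_two = [3, 4, 4, 4, 5, 3, 4, 4]

stat, p = kruskal(ai_plan, control_one, control_two)
print(f"H = {stat:.2f}, p = {p:.3f}")
```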
Affiliation(s)
- Dong Wook Kim
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Ji Seok Park
- Department of Gastroenterology, Hepatology & Nutrition, Cleveland Clinic, Cleveland, OH, United States
| | - Kavita Sharma
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Amanda Velazquez
- Department of Medicine, Weight Management and Metabolic Health Center, Cedars Sinai Hospital, Los Angeles, CA, United States
| | - Lu Li
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - John W. Ostrominski
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
| | - Tram Tran
- Division of Endocrinology, Diabetes and Hypertension, Center for Weight Management and Wellness, Brigham and Women's Hospital, Boston, MA, United States
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Robert H. Seitter Peréz
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
| | - Jeong-Hun Shin
- Department of Medicine, Section of Endocrinology, Diabetes, Nutrition & Weight Management, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, United States
- Department of Internal Medicine, Hanyang University College of Medicine, Seoul, Republic of Korea
| |
|
35
|
Rodriguez DV, Lawrence K, Gonzalez J, Brandfield-Harvey B, Xu L, Tasneem S, Levine DL, Mann D. Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study. JMIR Hum Factors 2024; 11:e52885. PMID: 38446539; PMCID: PMC10955400; DOI: 10.2196/52885.
Abstract
BACKGROUND Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. OBJECTIVE This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. METHODS We examined the capacity, advantages, and limitations of ChatGPT in supporting digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity with or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. RESULTS Most metrics received positive scores. We identified that ChatGPT can (1) help developers achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid, easy-to-build computational solutions for medical technologies. CONCLUSIONS ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. TRIAL REGISTRATION ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.
Affiliation(s)
- Danissa V Rodriguez
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Katharine Lawrence
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| | - Javier Gonzalez
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| | - Beatrix Brandfield-Harvey
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Lynn Xu
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Sumaiya Tasneem
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Defne L Levine
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
| | - Devin Mann
- Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
| |
|
36
|
Mu Y, He D. The Potential Applications and Challenges of ChatGPT in the Medical Field. Int J Gen Med 2024; 17:817-826. PMID: 38476626; PMCID: PMC10929156; DOI: 10.2147/ijgm.s456659.
Abstract
ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.
Affiliation(s)
- Yonglin Mu
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Dawei He
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
|
37
|
Spotnitz M, Idnay B, Gordon ER, Shyu R, Zhang G, Liu C, Cimino JJ, Weng C. A Survey of Clinicians' Views of the Utility of Large Language Models. Appl Clin Inform 2024; 15:306-312. PMID: 38442909; PMCID: PMC11023712; DOI: 10.1055/a-2281-7092.
Abstract
OBJECTIVES Large language models (LLMs) such as the generative pre-trained transformer-based ChatGPT are powerful algorithms that have been shown to produce human-like text from input data. Several potential clinical applications of this technology have been proposed and evaluated by biomedical informatics experts. However, few have surveyed health care providers for their opinions about whether the technology is fit for use. METHODS We distributed a validated mixed-methods survey to gauge practicing clinicians' comfort with LLMs for a breadth of tasks in clinical practice, research, and education, which were selected from the literature. RESULTS A total of 30 clinicians fully completed the survey. Of the 23 tasks, 16 were rated positively by more than 50% of the respondents. Based on our qualitative analysis, health care providers considered LLMs to have excellent synthesis skills and efficiency. However, our respondents had concerns that LLMs could generate false information and propagate training data bias. Our survey respondents were most comfortable with scenarios that allow LLMs to function in an assistive role, like a physician extender or trainee. CONCLUSION In a mixed-methods survey of clinicians about LLM use, health care providers were encouraging about the use of LLMs in health care for many tasks, especially in assistive roles. There is a need for continued human-centered development of both LLMs and artificial intelligence in general.
Collapse
Affiliation(s)
- Matthew Spotnitz
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Betina Idnay
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Emily R. Gordon
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
- Department of Dermatology, Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, New York, United States
| | - Rebecca Shyu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Gongbo Zhang
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| | - James J. Cimino
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
- Department of Biomedical Informatics and Data Science, Informatics Institute, Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, United States
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York, United States
| |
Collapse
|
38
|
Wei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inform 2024; 151:104620. [PMID: 38462064 DOI: 10.1016/j.jbi.2024.104620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 02/27/2024] [Accepted: 02/29/2024] [Indexed: 03/12/2024]
Abstract
OBJECTIVE Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and provide direction for future research. METHODS An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327. RESULTS A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall integrated accuracy of 56 % (95 % CI: 51 %-60 %, I² = 87 %) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details, such as the date of inquiry, version of ChatGPT, and inter-rater consistency. CONCLUSION This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting might affect the results' reliability. Our proposed evaluation framework provides insights for future study design and transparent reporting of LLMs in responding to medical questions.
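A pooled estimate of the kind quoted above (56 % accuracy, 95 % CI 51 %-60 %, I² = 87 %) is what a random-effects meta-analysis of proportions produces. The sketch below illustrates one common way to compute such a figure, DerSimonian-Laird pooling of logit-transformed accuracies; the per-study counts are hypothetical and the method is a standard stand-in, not the authors' documented procedure.

```python
import numpy as np
from scipy.special import expit  # inverse-logit back-transformation

# Hypothetical per-study results: k correct answers out of n questions.
k = np.array([55, 120, 40, 300, 75])
n = np.array([100, 200, 90, 500, 120])

p = k / n
y = np.log(p / (1 - p))        # logit-transformed proportions
v = 1 / k + 1 / (n - k)        # approximate variance of each logit

# Cochran's Q from fixed-effect weights
w = 1 / v
y_fixed = np.sum(w * y) / np.sum(w)
q = np.sum(w * (y - y_fixed) ** 2)
df = len(y) - 1

# DerSimonian-Laird between-study variance and the I^2 statistic
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) * 100

# Random-effects pooled accuracy with a 95% confidence interval
w_re = 1 / (v + tau2)
mu = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"pooled accuracy {expit(mu):.1%} "
      f"(95% CI {expit(mu - 1.96 * se):.1%}-{expit(mu + 1.96 * se):.1%}, "
      f"I2 = {i2:.0f}%)")
```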
Collapse
Affiliation(s)
- Qiuhong Wei
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China; Children Nutrition Research Center, Children's Hospital of Chongqing Medical University, Chongqing, China; National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
| | - Zhengxiong Yao
- Department of Neurology, Children's Hospital of Chongqing Medical University, Chongqing, China
| | - Ying Cui
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
| | - Bo Wei
- Department of Global Statistics and Data Science, BeiGene USA Inc., San Mateo, CA, USA
| | - Zhezhen Jin
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
| | - Ximing Xu
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China
| |
Collapse
|
39
|
Bužančić I, Belec D, Držaić M, Kummer I, Brkić J, Fialová D, Ortner Hadžiabdić M. Clinical decision-making in benzodiazepine deprescribing by healthcare providers vs. AI-assisted approach. Br J Clin Pharmacol 2024; 90:662-674. [PMID: 37949663 DOI: 10.1111/bcp.15963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2023] [Revised: 10/26/2023] [Accepted: 10/29/2023] [Indexed: 11/12/2023] Open
Abstract
AIMS The aim of this study was to compare the clinical decision-making for benzodiazepine deprescribing between a healthcare provider (HCP) and an artificial intelligence (AI) chatbot GPT4 (ChatGPT-4). METHODS We analysed real-world data from a Croatian cohort of community-dwelling benzodiazepine patients (n = 154) within the EuroAgeism H2020 ESR 7 project. HCPs evaluated the data using pre-established deprescribing criteria to assess benzodiazepine discontinuation potential. The research team devised and tested AI prompts to ensure consistency with HCP judgements. An independent researcher employed ChatGPT-4 with predetermined prompts to simulate clinical decisions for each patient case. Data derived from human-HCP and ChatGPT-4 decisions were compared for agreement rates and Cohen's kappa. RESULTS Both the HCP and ChatGPT-4 identified patients eligible for benzodiazepine deprescribing (96.1% and 89.6%, respectively), showing an agreement rate of 95% (κ = .200, P = .012). Agreement on four deprescribing criteria ranged from 74.7% to 91.3% (lack of indication κ = .352, P < .001; prolonged use κ = .088, P = .280; safety concerns κ = .123, P = .006; incorrect dosage κ = .264, P = .001). Important limitations of GPT-4 responses were identified, including ambiguous outputs (22.1%), generic answers, and inaccuracies, which pose risks of inappropriate clinical decision-making. CONCLUSIONS While AI-HCP agreement is substantial, sole reliance on AI poses a risk of unsuitable clinical decision-making. This study's findings reveal both strengths and areas for enhancement of ChatGPT-4 in deprescribing recommendations within a real-world sample. Our study underscores the need for additional research on chatbot functionality in patient therapy decision-making, further fostering the advancement of AI for optimal performance.
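The agreement statistics reported here are straightforward to reproduce. A minimal sketch with invented decisions (not the study's data) is shown below; note that Cohen's kappa corrects raw agreement for chance, which is why a 95% agreement rate can coexist with a modest κ = .200 when one decision category dominates, as it does in this cohort.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary deprescribing decisions (1 = deprescribe) for the
# same patients, one list per rater; not the study's actual data.
hcp_decisions = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
gpt_decisions = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1]

agreement = sum(a == b for a, b in zip(hcp_decisions, gpt_decisions)) / len(hcp_decisions)
kappa = cohen_kappa_score(hcp_decisions, gpt_decisions)
print(f"raw agreement {agreement:.0%}, Cohen's kappa {kappa:.3f}")
```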
Collapse
Affiliation(s)
- Iva Bužančić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- City Pharmacy Zagreb, Zagreb, Croatia
| | - Dora Belec
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
| | - Margita Držaić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
- City Pharmacy Zagreb, Zagreb, Croatia
| | - Ingrid Kummer
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
| | - Jovana Brkić
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
- Department of Social Pharmacy and Pharmaceutical Legislation, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia
| | - Daniela Fialová
- Department of Social and Clinical Pharmacy, Faculty of Pharmacy in Hradec Králové, Charles University, Hradec Králové, Czech Republic
- Department of Geriatrics and Gerontology, 1st Faculty of Medicine in Prague, Charles University, Prague, Czech Republic
| | - Maja Ortner Hadžiabdić
- Center for Applied Pharmacy, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
| |
Collapse
|
40
|
Ge J, Buenaventura A, Berrean B, Purvis J, Fontil V, Lai JC, Pletcher MJ. Applying human-centered design to the construction of a cirrhosis management clinical decision support system. Hepatol Commun 2024; 8:e0394. [PMID: 38407255 PMCID: PMC10898661 DOI: 10.1097/hc9.0000000000000394] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/17/2023] [Accepted: 12/13/2023] [Indexed: 02/27/2024] Open
Abstract
BACKGROUND Electronic health record (EHR)-based clinical decision support is a scalable way to help standardize clinical care. Clinical decision support systems have not been extensively investigated in cirrhosis management. Human-centered design (HCD) is an approach that engages with potential users in intervention development. In this study, we applied HCD to design the features and interface for a clinical decision support system for cirrhosis management, called CirrhosisRx. METHODS We conducted technical feasibility assessments to construct a visual blueprint that outlines the basic features of the interface. We then convened collaborative-design workshops with generalist and specialist clinicians. We elicited current workflows for cirrhosis management, assessed gaps in existing EHR systems, evaluated potential features, and refined the design prototype for CirrhosisRx. At the conclusion of each workshop, we analyzed recordings and transcripts. RESULTS Workshop feedback showed that the aggregation of relevant clinical data into 6 cirrhosis decompensation domains (defined as common inpatient clinical scenarios) was the most important feature. Automatic inference of clinical events from EHR data, such as gastrointestinal bleeding from hemoglobin changes, was not accepted due to accuracy concerns. Visualizations for risk stratification scores were deemed not necessary. Lastly, the HCD co-design workshops allowed us to identify the target user population (generalists). CONCLUSIONS This is one of the first applications of HCD to design the features and interface for an electronic intervention for cirrhosis management. The HCD process altered features, modified the design interface, and likely improved CirrhosisRx's overall usability. The finalized design for CirrhosisRx proceeded to development and production and will be tested for effectiveness in a pragmatic randomized controlled trial. This work provides a model for the creation of other EHR-based interventions in hepatology care.
Collapse
Affiliation(s)
- Jin Ge
- Department of Medicine, Division of Gastroenterology and Hepatology, University of California—San Francisco, San Francisco, California, USA
| | - Ana Buenaventura
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Beth Berrean
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Jory Purvis
- School of Medicine Technology Services, University of California—San Francisco, San Francisco, California, USA
| | - Valy Fontil
- Family Health Centers, NYU-Langone Medical Center, Brooklyn, New York, USA
| | - Jennifer C. Lai
- Department of Medicine, Division of Gastroenterology and Hepatology, University of California—San Francisco, San Francisco, California, USA
| | - Mark J. Pletcher
- Department of Epidemiology and Biostatistics, University of California—San Francisco, San Francisco, California, USA
| |
Collapse
|
41
|
Chandra A, Chakraborty A. Exploring the role of large language models in radiation emergency response. JOURNAL OF RADIOLOGICAL PROTECTION : OFFICIAL JOURNAL OF THE SOCIETY FOR RADIOLOGICAL PROTECTION 2024; 44:011510. [PMID: 38324900 DOI: 10.1088/1361-6498/ad270c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 02/07/2024] [Indexed: 02/09/2024]
Abstract
In recent times, the field of artificial intelligence (AI) has been transformed by the introduction of large language models (LLMs). These models, popularized by OpenAI's GPT-3, have demonstrated the emergent capabilities of AI in comprehending and producing text resembling human language, which has helped them transform several industries. However, their role has yet to be explored in the nuclear industry, specifically in managing radiation emergencies. The present work explores LLMs' contextual awareness, natural language interaction, and their capacity to comprehend diverse queries in a radiation emergency response setting. In this study, we identify different user types and their specific LLM use-cases in radiation emergencies. Their possible interactions with ChatGPT, a popular LLM, have also been simulated, and preliminary results are presented. Drawing on the insights gained from this exercise and to address concerns of reliability and misinformation, this study advocates for expert-guided and domain-specific LLMs trained on radiation safety protocols and historical data. This study aims to guide radiation emergency management practitioners and decision-makers in effectively incorporating LLMs into their decision support framework.
Collapse
Affiliation(s)
- Anirudh Chandra
- Radiation Safety Systems Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| | - Abinash Chakraborty
- Health Physics Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| |
Collapse
|
42
|
Abi-Rafeh J, Xu HH, Kazan R, Tevlin R, Furnas H. Large Language Models and Artificial Intelligence: A Primer for Plastic Surgeons on the Demonstrated and Potential Applications, Promises, and Limitations of ChatGPT. Aesthet Surg J 2024; 44:329-343. [PMID: 37562022 DOI: 10.1093/asj/sjad260] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Revised: 08/02/2023] [Accepted: 08/04/2023] [Indexed: 08/12/2023] Open
Abstract
BACKGROUND The rapidly evolving field of artificial intelligence (AI) holds great potential for plastic surgeons. ChatGPT, a recently released AI large language model (LLM), promises applications across many disciplines, including healthcare. OBJECTIVES The aim of this article was to provide a primer for plastic surgeons on AI, LLMs, and ChatGPT, including an analysis of current demonstrated and proposed clinical applications. METHODS A systematic review was performed identifying medical and surgical literature on ChatGPT's proposed clinical applications. Variables assessed included applications investigated, command tasks provided, user input information, AI-emulated human skills, output validation, and reported limitations. RESULTS The analysis included 175 articles reporting on 13 plastic surgery applications and 116 additional clinical applications, categorized by field and purpose. Thirty-four applications within plastic surgery are thus proposed, with relevance to different target audiences, including attending plastic surgeons (n = 17, 50%), trainees/educators (n = 8, 24%), researchers/scholars (n = 7, 21%), and patients (n = 2, 6%). The 15 identified limitations of ChatGPT were categorized by training data, algorithm, and ethical considerations. CONCLUSIONS Widespread use of ChatGPT in plastic surgery will depend on rigorous research of proposed applications to validate performance and address limitations. This systematic review aims to guide research, development, and regulation to safely adopt AI in plastic surgery.
Collapse
|
43
|
Stengel FC, Stienen MN, Ivanov M, Gandía-González ML, Raffa G, Ganau M, Whitfield P, Motov S. Can AI pass the written European Board Examination in Neurological Surgery? - Ethical and practical issues. BRAIN & SPINE 2024; 4:102765. [PMID: 38510593 PMCID: PMC10951784 DOI: 10.1016/j.bas.2024.102765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/27/2023] [Revised: 01/28/2024] [Accepted: 02/12/2024] [Indexed: 03/22/2024]
Abstract
Introduction Artificial intelligence (AI)-based large language models (LLMs) hold enormous potential in education and training. Recent publications demonstrated that they are able to outperform participants in written medical exams. Research question We aimed to explore the accuracy of AI in the written part of the EANS board exam. Material and methods Eighty-six representative single best answer (SBA) questions, included at least ten times in prior EANS board exams, were selected by the current EANS board exam committee. The questions' content was classified as 75 text-based (TB) and 11 image-based (IB), and their structure as 50 interpretation-weighted, 30 theory-based, and 6 true-or-false. Questions were tested with ChatGPT-3.5, Bing, and Bard. The AI and participant results were statistically analyzed through ANOVA tests with Stata SE 15 (StataCorp, College Station, TX). P-values of <0.05 were considered statistically significant. Results The Bard LLM achieved the highest accuracy with 62% correct questions overall and 69% excluding IB, outperforming the human exam participants, who scored 59% (p = 0.67) and 59% (p = 0.42), respectively. All LLMs scored highest in theory-based questions, excluding IB questions (ChatGPT: 79%; Bing: 83%; Bard: 86%), and significantly better than the human exam participants (60%; p = 0.03). None of the LLMs answered any IB question correctly. Discussion and conclusion AI passed the written EANS board exam based on representative SBA questions and achieved results close to or even better than the human exam participants. Our results raise several ethical and practical issues that may impact the current concept of the written EANS board exam.
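To make the accuracy comparison above concrete, a two-proportion test can check whether an LLM's score on the 86 questions differs from the human average. The authors report ANOVA in Stata, so this Python sketch with approximate counts (53/86 ≈ 62% vs. 51/86 ≈ 59%) is only an illustration of the comparison, not their analysis.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts of correct answers out of 86 representative SBA
# questions, approximating the reported 62% (Bard) vs. 59% (humans).
llm_correct, human_correct, n_questions = 53, 51, 86

stat, p = proportions_ztest([llm_correct, human_correct], [n_questions, n_questions])
print(f"z = {stat:.2f}, p = {p:.2f}")  # a 3-point gap on 86 items rarely reaches significance
```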
Collapse
Affiliation(s)
- Felix C. Stengel
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| | - Martin N. Stienen
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| | - Marcel Ivanov
- Royal Hallamshire Hospital, Sheffield, United Kingdom
| | | | - Giovanni Raffa
- Division of Neurosurgery, BIOMORF Department, University of Messina, Messina, Italy
| | - Mario Ganau
- Oxford University Hospitals NHS Foundation Trust, Oxford, United Kingdom
| | | | - Stefan Motov
- Department of Neurosurgery & Spine Center of Eastern Switzerland, Kantonsspital St. Gallen & Medical School of St.Gallen, St. Gallen, Switzerland
| |
Collapse
|
44
|
Abdullahi T, Singh R, Eickhoff C. Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR MEDICAL EDUCATION 2024; 10:e51391. [PMID: 38349725 PMCID: PMC10900078 DOI: 10.2196/51391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2023] [Revised: 11/07/2023] [Accepted: 12/11/2023] [Indexed: 02/15/2024]
Abstract
BACKGROUND Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. OBJECTIVE This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. METHODS We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. RESULTS Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. CONCLUSIONS Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
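The majority-voting strategy described here is commonly implemented as self-consistency: sample several reasoning paths from the same model and keep the modal final answer. A minimal sketch follows, in which `sample_fn` is a hypothetical wrapper around whichever LLM API is being evaluated; the study's actual prompts and sampling settings are not reproduced.

```python
from collections import Counter

def majority_vote(sample_fn, prompt: str, n_samples: int = 5) -> str:
    """Query the model n_samples times and return the modal answer.

    sample_fn is any callable that sends `prompt` to an LLM with sampling
    enabled (temperature > 0) and returns the extracted final diagnosis.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Raising `n_samples` trades API cost for a more stable modal answer, which is the mechanism the authors leverage to improve reliability.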
Collapse
Affiliation(s)
- Tassallah Abdullahi
- Department of Computer Science, Brown University, Providence, RI, United States
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI, United States
| | | |
Collapse
|
45
|
Wang Z, Zhang Z, Traverso A, Dekker A, Qian L, Sun P. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14:1602-1615. [PMID: 38415150 PMCID: PMC10895085 DOI: 10.21037/qims-23-1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 11/30/2023] [Indexed: 02/29/2024]
Abstract
Background As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans. Methods Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model. Results GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process. Conclusions GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.
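The Chain of Thought (CoT) method mentioned here amounts to prompting the model to externalize intermediate reasoning before committing to a diagnosis. The template below is a hypothetical reconstruction for illustration only; the study's actual prompt wording, and the internals of the "ThyroAIGuide" platform, are not reproduced.

```python
# A hypothetical chain-of-thought prompt for ultrasound report analysis;
# the prompt text and the FINAL DIAGNOSIS convention are assumptions.
COT_PROMPT = """You are assisting with thyroid ultrasound interpretation.
Report: {report_text}

Reason step by step before answering:
1. List the sonographic features described in the report.
2. Map each feature to its risk implication.
3. State the most likely diagnosis and a suggested next step.

End with: FINAL DIAGNOSIS: <diagnosis>"""

def build_prompt(report_text: str) -> str:
    """Fill the CoT template with one report's text."""
    return COT_PROMPT.format(report_text=report_text)
```

The design choice is that forcing enumerated intermediate steps makes the model's decision chain inspectable, which is what the authors use to deconstruct and audit GPT-4's reasoning.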
Collapse
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Zhen Zhang
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Alberto Traverso
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Andre Dekker
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Linxue Qian
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| | - Pengfei Sun
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| |
Collapse
|
46
|
Segal S, Saha AK, Khanna AK. Appropriateness of Answers to Common Preanesthesia Patient Questions Composed by the Large Language Model GPT-4 Compared to Human Authors. Anesthesiology 2024; 140:333-335. [PMID: 38193737 DOI: 10.1097/aln.0000000000004824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2024]
Affiliation(s)
- Scott Segal
- Wake Forest University School of Medicine, Atrium Health Wake Forest Baptist Medical Center, Winston-Salem, North Carolina (S.S.).
| | | | | |
Collapse
|
47
|
Liao Z, Wang J, Shi Z, Lu L, Tabata H. Revolutionary Potential of ChatGPT in Constructing Intelligent Clinical Decision Support Systems. Ann Biomed Eng 2024; 52:125-129. [PMID: 37332008 DOI: 10.1007/s10439-023-03288-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 06/13/2023] [Indexed: 06/20/2023]
Abstract
Recently, the Chat Generative Pre-trained Transformer (ChatGPT) has been recognized as a promising clinical decision support system (CDSS) in the medical field owing to its advanced text analysis capabilities and interactive design. However, ChatGPT primarily focuses on learning text semantics rather than learning complex data structures and conducting real-time data analysis, which typically necessitate the development of intelligent CDSS employing specialized machine learning algorithms. Although ChatGPT cannot directly execute specific algorithms, it aids in algorithm design for intelligent CDSS at the textual level. In this study, besides discussing the types of CDSS and their relationship with ChatGPT, we mainly investigate the benefits and drawbacks of employing ChatGPT as an auxiliary design tool for intelligent CDSS. Our findings indicate that by collaborating with human expertise, ChatGPT has the potential to revolutionize the development of robust and effective intelligent CDSS.
Collapse
Affiliation(s)
- Zhiqiang Liao
- Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan.
| | - Jian Wang
- Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan, 250012, People's Republic of China
| | - Zhuozheng Shi
- Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
| | - Lintao Lu
- Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan, 250012, People's Republic of China.
- Department of Orthopaedics, Qilu Hospital of Shandong University Dezhou Hospital, Dezhou, 253000, People's Republic of China.
| | - Hitoshi Tabata
- Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
- Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, 113-8656, Japan
| |
Collapse
|
48
|
Jin X, Frock A, Nagaraja S, Wallqvist A, Reifman J. AI algorithm for personalized resource allocation and treatment of hemorrhage casualties. Front Physiol 2024; 15:1327948. [PMID: 38332989 PMCID: PMC10851938 DOI: 10.3389/fphys.2024.1327948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Accepted: 01/09/2024] [Indexed: 02/10/2024] Open
Abstract
A deep neural network-based artificial intelligence (AI) model was assessed for its utility in predicting vital signs of hemorrhage patients and optimizing the management of fluid resuscitation in mass casualties. With the use of a cardio-respiratory computational model to generate synthetic data for hemorrhage casualties, an application was created where a limited data stream (the initial 10 min of vital-sign monitoring) could be used to predict the outcomes of different fluid resuscitation allocations 60 min into the future. The predicted outcomes were then used to select the optimal resuscitation allocation for various simulated mass-casualty scenarios. This allowed the assessment of the potential benefits of using an allocation method based on personalized predictions of future vital signs versus a static population-based method that only uses currently available vital-sign information. The theoretical benefits of this approach included up to 46% additional casualties restored to healthy vital signs and a 119% increase in fluid-utilization efficiency. Although the study is subject to the limitations of synthetic data generated under specific assumptions, the work demonstrated the potential for incorporating neural network-based AI technologies in hemorrhage detection and treatment. The simulated injury and treatment scenarios used delineated possible benefits and opportunities available for using AI in pre-hospital trauma care. The greatest benefit of this technology lies in its ability to provide personalized interventions that optimize clinical outcomes under resource-limited conditions, such as in civilian or military mass-casualty events, involving moderate and severe hemorrhage.
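The allocation step described here can be framed as a search over candidate fluid assignments scored by the predictive model. A minimal sketch under stated assumptions: `predict_outcome` stands in for the paper's neural network that forecasts vital signs 60 minutes ahead from the first 10 minutes of monitoring (its interface is invented here), and the exhaustive search is only tractable for small scenarios.

```python
from itertools import product

def best_allocation(predict_outcome, casualties, fluid_units: int):
    """Search fluid allocations across casualties and return the one with
    the highest total predicted benefit.

    predict_outcome(casualty, units) is a hypothetical scoring function,
    e.g. the probability that this casualty is restored to healthy vital
    signs given `units` of resuscitation fluid.
    """
    best, best_score = None, float("-inf")
    # Exhaustive enumeration: (fluid_units + 1) ** len(casualties) options,
    # so this sketch suits only small mass-casualty scenarios.
    for alloc in product(range(fluid_units + 1), repeat=len(casualties)):
        if sum(alloc) > fluid_units:
            continue  # respect the total resource limit
        score = sum(predict_outcome(c, u) for c, u in zip(casualties, alloc))
        if score > best_score:
            best, best_score = alloc, score
    return best, best_score
```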
Collapse
Affiliation(s)
- Xin Jin
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Andrew Frock
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Sridevi Nagaraja
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
- The Henry M. Jackson Foundation for the Advancement of Military Medicine Inc., Bethesda, MD, United States
| | - Anders Wallqvist
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
| | - Jaques Reifman
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, United States Army Medical Research and Development Command, Fort Detrick, MD, United States
| |
Collapse
|
49
|
Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, Ye J, Jin K, Yang J. Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study. J Med Internet Res 2024; 26:e51926. [PMID: 38252483 PMCID: PMC10845019 DOI: 10.2196/51926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 10/07/2023] [Accepted: 11/30/2023] [Indexed: 01/23/2024] Open
Abstract
BACKGROUND Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, have not been explored in depth. OBJECTIVE This study aimed to evaluate ChatGPT's diagnostic performance and inference abilities for retinal vascular diseases in a non-English clinical environment. METHODS In this cross-sectional study, we collected 1226 fundus fluorescein angiography reports and corresponding diagnoses written in Chinese and tested ChatGPT with 4 prompting strategies (direct diagnosis or diagnosis with a step-by-step reasoning process and in Chinese or English). RESULTS Compared with ChatGPT using Chinese prompts for direct diagnosis that achieved an F1-score of 70.47%, ChatGPT using English prompts for direct diagnosis achieved the best diagnostic performance (80.05%), which was inferior to ophthalmologists (89.35%) but close to ophthalmologist interns (82.69%). As for its inference abilities, although ChatGPT can derive a reasoning process with a low error rate (0.4 per report) for both Chinese and English prompts, ophthalmologists identified that the latter brought more reasoning steps with less incompleteness (44.31%), misinformation (1.96%), and hallucinations (0.59%) (all P<.001). Also, analysis of the robustness of ChatGPT with different language prompts indicated significant differences in the recall (P=.03) and F1-score (P=.04) between Chinese and English prompts. In short, when prompted in English, ChatGPT exhibited enhanced diagnostic and inference capabilities for retinal vascular disease classification based on Chinese fundus fluorescein angiography reports. CONCLUSIONS ChatGPT can serve as a helpful medical assistant to provide diagnosis in non-English clinical environments, but there are still performance gaps, language disparities, and errors compared to professionals, which demonstrate the potential limitations and the need to continually explore more robust large language models in ophthalmology practice.
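The F1-scores reported above summarize multi-class diagnostic performance against the reference diagnoses. A small sketch of that computation with invented labels (DR, RVO, and RAO as stand-in retinal vascular disease classes; the study used 1226 fundus fluorescein angiography reports):

```python
from sklearn.metrics import f1_score

# Hypothetical reference diagnoses vs. model outputs for a few reports.
y_true = ["DR", "RVO", "DR", "RAO", "RVO", "DR"]
y_pred = ["DR", "RVO", "RVO", "RAO", "RVO", "DR"]

# Macro-averaging weights each disease class equally, which matters when
# rarer diagnoses such as RAO appear far less often than DR.
print(f"macro F1 = {f1_score(y_true, y_pred, average='macro'):.2%}")
```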
Collapse
Affiliation(s)
- Xiaocong Liu
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| | - Jiageng Wu
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| | - An Shao
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Wenyue Shen
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Panpan Ye
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Yao Wang
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Juan Ye
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Kai Jin
- Eye Center, The Second Affiliated Hospital, Zhejiang University, Zhejiang, China
| | - Jie Yang
- School of Public Health, Zhejiang University School of Medicine, Zhejiang, China
| |
Collapse
|
50
|
Harrington L. ChatGPT Is Trending: Trust but Verify. AACN Adv Crit Care 2023; 34:280-286. [PMID: 37619604 DOI: 10.4037/aacnacc2023129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/26/2023]
Affiliation(s)
- Linda Harrington
- Linda Harrington is an Independent Consultant, Health Informatics and Digital Strategy, and Adjunct Faculty at Texas Christian University, 2800 South University Drive, Fort Worth, TX 76109
| |
Collapse
|