1
Roustan D, Bastardot F. The Clinicians' Guide to Large Language Models: A General Perspective With a Focus on Hallucinations. Interact J Med Res 2025; 14:e59823. [PMID: 39874574] [DOI: 10.2196/59823]
Abstract
Large language models (LLMs) are artificial intelligence tools with the prospect of profoundly changing how we practice all aspects of medicine. Given the considerable potential of LLMs in medicine and the interest of many health care stakeholders in implementing them into routine practice, it is essential that clinicians be aware of the basic risks associated with their use. Chief among these risks is the potential of LLMs to generate hallucinations. Hallucinations (false information) generated by LLMs arise from a multitude of causes, including factors related to the training dataset and the models' autoregressive nature. The implications for clinical practice range from the generation of inaccurate diagnostic and therapeutic information to the reinforcement of flawed diagnostic reasoning pathways, as well as a lack of reliability if the models are not used properly. To mitigate this risk, we developed a general technical framework for approaching LLMs in general clinical practice, as well as for implementation on a larger institutional scale.
Affiliation(s)
- Dimitri Roustan
- Emergency Medicine Department, Cliniques Universitaires Saint-Luc, Brussels, Belgium
- François Bastardot
- Medical Directorate, Lausanne University Hospital, Lausanne, Switzerland
2
Yang X, Li T, Su Q, Liu Y, Kang C, Lyu Y, Zhao L, Nie Y, Pan Y. Application of large language models in disease diagnosis and treatment. Chin Med J (Engl) 2025; 138:130-142. [PMID: 39722188] [PMCID: PMC11745858] [DOI: 10.1097/cm9.0000000000003456]
Abstract
Large language models (LLMs) such as ChatGPT, Claude, Llama, and Qwen are emerging as transformative technologies for the diagnosis and treatment of various diseases. With their exceptional long-context reasoning capabilities, LLMs are proficient in clinically relevant tasks, particularly in medical text analysis and interactive dialogue. They can enhance diagnostic accuracy by processing vast amounts of patient data and medical literature, and they have demonstrated utility in diagnosing common diseases and facilitating the identification of rare diseases by recognizing subtle patterns in symptoms and test results. Building on their image-recognition abilities, multimodal LLMs (MLLMs) show promising potential for diagnosis based on radiography, chest computed tomography (CT), electrocardiography (ECG), and common pathological images. These models can also assist in treatment planning by suggesting evidence-based interventions and improving clinical decision support systems through integrated analysis of patient records. Despite these promising developments, significant challenges persist regarding the use of LLMs in medicine, including algorithmic bias, the potential for hallucinations, and the need for rigorous clinical validation. Ethical considerations also underscore the importance of maintaining human supervision in clinical practice. This paper highlights the rapid advancements in research on the diagnostic and therapeutic applications of LLMs across medical disciplines and emphasizes the importance of policymaking, ethical supervision, and multidisciplinary collaboration in promoting more effective and safer clinical applications of LLMs. Future directions include the integration of proprietary clinical knowledge, the investigation of open-source and customized models, and the evaluation of real-time effects in clinical diagnosis and treatment practices.
Affiliation(s)
- Xintian Yang, Tongxin Li, Qin Su, Yaling Liu, Chenxi Kang, Yong Lyu, Yongzhan Nie, Yanglin Pan
- State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
- Lina Zhao
- Department of Radiotherapy, Xijing Hospital, Fourth Military Medical University, Xi’an, Shaanxi 710032, China
3
Ayers AT, Ho CN, Kerr D, Cichosz SL, Mathioudakis N, Wang M, Najafi B, Moon SJ, Pandey A, Klonoff DC. Artificial Intelligence to Diagnose Complications of Diabetes. J Diabetes Sci Technol 2025; 19:246-264. [PMID: 39578435] [DOI: 10.1177/19322968241287773]
Abstract
Artificial intelligence (AI) is increasingly being used to diagnose complications of diabetes. Artificial intelligence is technology that enables computers and machines to simulate human intelligence and solve complicated problems. In this article, we address current and likely future applications of AI to diabetes and its complications, including pharmacoadherence to therapy, diagnosis of hypoglycemia, diabetic eye disease, diabetic kidney disease, diabetic neuropathy, diabetic foot ulcers, and heart failure in diabetes. Artificial intelligence is advantageous because it can handle large and complex datasets from a variety of sources. With each additional type of data incorporated into the clinical picture of a patient, the calculation becomes increasingly complex and specific. Artificial intelligence is the foundation of emerging medical technologies and will power the future of diagnosing diabetes complications.
Affiliation(s)
- Cindy N Ho
- Diabetes Technology Society, Burlingame, CA, USA
- David Kerr
- Center for Health Systems Research, Sutter Health, Santa Barbara, CA, USA
- Simon Lebech Cichosz
- Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Michelle Wang
- University of California, San Francisco, San Francisco, CA, USA
- Bijan Najafi
- Michael E. DeBakey Department of Surgery, Baylor College of Medicine, Houston, TX, USA
- Center for Advanced Surgical and Interventional Technology (CASIT), Department of Surgery, Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA, USA
- Sun-Joon Moon
- Division of Endocrinology and Metabolism, Department of Internal Medicine, Kangbuk Samsung Hospital, School of Medicine, Sungkyunkwan University, Seoul, Republic of Korea
- Ambarish Pandey
- Division of Cardiology and Geriatrics, Department of Internal Medicine, UT Southwestern Medical Center, Dallas, TX, USA
- David C Klonoff
- Diabetes Technology Society, Burlingame, CA, USA
- Diabetes Research Institute, Mills-Peninsula Medical Center, San Mateo, CA, USA
4
Savage T, Wang J, Gallo R, Boukil A, Patel V, Safavi-Naini SAA, Soroush A, Chen JH. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J Am Med Inform Assoc 2025; 32:139-149. [PMID: 39396184] [DOI: 10.1093/jamia/ocae254]
Abstract
INTRODUCTION The inability of large language models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to estimate uncertainty in ways that are useful to physician-users. OBJECTIVE To evaluate the ability of uncertainty proxies to quantify LLM confidence in diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration. METHODS We examined confidence elicitation (CE), token-level probability (TLP), and sample consistency (SC) proxies across GPT-3.5, GPT-4, Llama 2, and Llama 3. Uncertainty proxies were evaluated against 3 datasets of open-ended patient scenarios. RESULTS SC discrimination outperformed the TLP and CE methods. SC by sentence embedding achieved the highest discriminative performance (ROC AUC 0.68-0.79), yet with poor calibration. SC by GPT annotation achieved the second-best discrimination (ROC AUC 0.66-0.74) with accurate calibration. Verbalized confidence (CE) consistently overestimated model confidence. DISCUSSION AND CONCLUSIONS SC is the most effective of the proxies evaluated for estimating LLM uncertainty. SC by sentence embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while SC by GPT annotation is the more effective method if the user lacks reference cases and requires accurate raw calibration. Our results confirm that LLMs are consistently over-confident when verbalizing their confidence (CE).
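The sample-consistency idea lends itself to a compact illustration. Below is a minimal sketch of SC via sentence embeddings, assuming the sentence-transformers package; the embedding model name and the hard-coded toy answers are illustrative stand-ins for the study's protocol, in which the answers would come from repeated stochastic LLM calls on one case.

```python
# Minimal sketch of sample consistency (SC) via sentence embeddings:
# sample several answers from an LLM at temperature > 0, embed them, and
# use mean pairwise cosine similarity as a confidence proxy (higher
# agreement across samples -> higher estimated confidence).
import numpy as np
from sentence_transformers import SentenceTransformer

def sample_consistency(answers: list[str]) -> float:
    """Mean pairwise cosine similarity of sampled answers."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    emb = model.encode(answers, normalize_embeddings=True)  # unit-norm vectors
    sims = emb @ emb.T  # cosine similarity matrix
    upper = np.triu_indices(len(answers), k=1)  # each unordered pair once
    return float(sims[upper].mean())

# In practice these would come from n stochastic LLM calls on one case;
# hard-coded here so the sketch runs standalone.
answers = ["Acute pancreatitis", "Acute pancreatitis", "Cholecystitis"]
print(round(sample_consistency(answers), 3))
```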
Affiliation(s)
- Thomas Savage
- Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States
- John Wang
- Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Division of Gastroenterology and Hepatology, Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Robert Gallo
- Palo Alto Veterans Affairs Medical Center, Palo Alto, CA 94304, United States
- Department of Health Policy, Stanford University, Stanford, CA 94304, United States
- Vishwesh Patel
- M.P. Shah Government Medical College, Jamnagar, Gujarat 361008, India
- Seyed Amir Ahmad Safavi-Naini
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Ali Soroush
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Jonathan H Chen
- Department of Medicine, Stanford University, Stanford, CA 94304, United States
- Division of Hospital Medicine, Stanford University, Stanford, CA 94304, United States
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94304, United States
- Clinical Excellence Research Center, Stanford University, Stanford, CA 94304, United States
5
El Gharib K, Jundi B, Furfaro D, Abdulnour REE. AI-assisted human clinical reasoning in the ICU: beyond "to err is human". Front Artif Intell 2024; 7:1506676. [PMID: 39712469] [PMCID: PMC11659639] [DOI: 10.3389/frai.2024.1506676]
Abstract
Diagnostic errors pose a significant public health challenge, affecting nearly 800,000 Americans annually, with even higher rates globally. In the ICU, these errors are particularly prevalent, leading to substantial morbidity and mortality. The clinical reasoning process aims to reduce diagnostic uncertainty and establish a plausible differential diagnosis but is often hindered by cognitive load, patient complexity, and clinician burnout. These factors contribute to cognitive biases that compromise diagnostic accuracy. Emerging technologies like large language models (LLMs) offer potential solutions to enhance clinical reasoning and improve diagnostic precision. In this perspective article, we explore the roles of LLMs, such as GPT-4, in addressing diagnostic challenges in critical care settings through a case study of a critically ill patient managed with LLM assistance.
Affiliation(s)
- Khalil El Gharib
- Division of Pulmonary and Critical Care Medicine, Rutgers Robert Wood Johnson Medical School, New Brunswick, NJ, United States
- Bakr Jundi
- Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
- David Furfaro
- Division of Pulmonary and Critical Care Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, United States
- Raja-Elie E. Abdulnour
- Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
6
Kim J, Kincaid JWR, Rao AS, Lie W, Fuh L, Landman AB, Succi MD. Risk stratification of potential drug interactions involving common over-the-counter medications and herbal supplements by a large language model. J Am Pharm Assoc (2003) 2024:102304. [PMID: 39613295] [DOI: 10.1016/j.japh.2024.102304]
Abstract
BACKGROUND As polypharmacy and the use of over-the-counter (OTC) drugs and herbal supplements become increasingly prevalent, the potential for adverse drug-drug interactions (DDIs) poses significant challenges to patient safety and health care outcomes. OBJECTIVE This study evaluates the capacity of Generative Pre-trained Transformer (GPT) models to accurately assess DDIs involving prescription drugs (Rx) with OTC medications and herbal supplements. METHODS Leveraging a popular subscription-based tool (Lexicomp), we compared the risk ratings assigned by these models to 43 Rx-OTC and 30 Rx-herbal supplement pairs. RESULTS All models generally underperformed, with accuracies below 50% and poor agreement with the Lexicomp standard as measured by Cohen's kappa. Notably, GPT-4 and GPT-4o demonstrated a modest improvement over GPT-3.5 in identifying higher-risk interactions. CONCLUSION These results highlight the challenges and limitations of using off-the-shelf large language models for guidance in DDI assessment.
7
Park HJ, Huh JY, Chae G, Choi MG. Extraction of clinical data on major pulmonary diseases from unstructured radiologic reports using a large language model. PLoS One 2024; 19:e0314136. [PMID: 39585830] [PMCID: PMC11588275] [DOI: 10.1371/journal.pone.0314136]
Abstract
Despite significant strides in big data technology, extracting information from unstructured clinical data remains a formidable challenge. This study investigated the utility of large language models (LLMs) for extracting clinical data from unstructured radiology reports without additional training. In this retrospective study, 1800 radiology reports, 600 from each of three university hospitals, were collected, and seven pulmonary outcomes were defined. Three pulmonology-trained specialists discerned the presence or absence of each disease. Data extraction from the reports was executed using Google Gemini Pro 1.0, OpenAI's GPT-3.5, and GPT-4. The gold standard was predicated on agreement between at least two pulmonologists. The study evaluated the performance of the three LLMs in identifying seven pulmonary diseases (active tuberculosis, emphysema, interstitial lung disease, lung cancer, pleural effusion, pneumonia, and pulmonary edema) in chest radiography and computed tomography reports. All models exhibited high accuracy (0.85-1.00) for most conditions. GPT-4 consistently outperformed its counterparts, demonstrating a sensitivity of 0.71-1.00, a specificity of 0.89-1.00, and accuracies of 0.89 and 0.99 across the two modalities, underscoring its superior capability in interpreting radiology reports. Notably, accuracy for pleural effusion and emphysema on chest radiographs and for pulmonary edema on chest computed tomography scans reached 0.99. The proficiency of LLMs, particularly GPT-4, in accurately classifying unstructured radiology data hints at their potential as alternatives to the traditional manual chart reviews conducted by clinicians.
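As a rough illustration of this kind of zero-shot extraction, the sketch below sends one report to an LLM and requests present/absent labels for the seven outcomes. The prompt wording, JSON output format, and model choice are assumptions for illustration, not the study's protocol; it requires the openai package and an API key.

```python
# Sketch of zero-shot extraction of the seven pulmonary outcomes from a
# free-text radiology report. Prompt wording and JSON format are
# illustrative assumptions; requires OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

OUTCOMES = ["active tuberculosis", "emphysema", "interstitial lung disease",
            "lung cancer", "pleural effusion", "pneumonia", "pulmonary edema"]

def extract_findings(report_text: str) -> dict:
    client = OpenAI()
    prompt = (
        "Based only on the radiology report below, answer strictly 'present' "
        f"or 'absent' for each condition: {', '.join(OUTCOMES)}. "
        "Return a JSON object mapping each condition to its answer.\n\n"
        f"Report:\n{report_text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for reproducible extraction
    )
    return json.loads(resp.choices[0].message.content)  # assumes clean JSON back

print(extract_findings("Bilateral pleural effusions with basal atelectasis. No consolidation."))
```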
Affiliation(s)
- Hyung Jun Park
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Shihwa Medical Center, Siheung, Korea
- Jin-Young Huh
- Department of Internal Medicine, Division of Pulmonary, Allergy and Critical Care Medicine, Chung-Ang University Gwangmyeong Hospital, Gwangmyeong, Korea
- Ganghee Chae
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Ulsan University Hospital, University of Ulsan College of Medicine, Ulsan, Korea
- Myeong Geun Choi
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Mokdong Hospital, College of Medicine, Ewha Womans University, Seoul, Korea
8
Chen A, Chen DO, Tian L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J Am Med Inform Assoc 2024; 31:2084-2088. [PMID: 38109889] [PMCID: PMC11339504] [DOI: 10.1093/jamia/ocad245]
Abstract
OBJECTIVE This study evaluates ChatGPT's symptom-checking accuracy across a broad range of diseases using the Mayo Clinic Symptom Checker patient service as a benchmark. METHODS We prompted ChatGPT with symptoms of 194 distinct diseases. By comparing its predictions with expectations, we calculated a relative comparative score (RCS) to gauge accuracy. RESULTS ChatGPT's GPT-4 model achieved an average RCS of 78.8%, outperforming GPT-3.5-turbo by 10.5%. Some specialties scored above 90%. DISCUSSION The test set, although extensive, was not exhaustive. Future studies should include a more comprehensive disease spectrum. CONCLUSION ChatGPT exhibits high symptom-checking accuracy across a broad range of diseases, showcasing its potential as a medical training tool in learning health systems to enhance care quality and address health disparities.
Affiliation(s)
- Anjun Chen
- Health Sciences, ELHS Institute, Palo Alto, CA 94306, United States
- LHS Tech Forum Initiative, Learning Health Community, Palo Alto, CA 94306, United States
- Drake O Chen
- LHS Tech Forum Initiative, Learning Health Community, Palo Alto, CA 94306, United States
- Lu Tian
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States
9
Du X, Zhou Z, Wang Y, Chuang YW, Yang R, Zhang W, Wang X, Zhang R, Hong P, Bates DW, Zhou L. Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review. medRxiv [Preprint] 2024:2024.08.11.24311828. [PMID: 39228726] [PMCID: PMC11370524] [DOI: 10.1101/2024.08.11.24311828]
Abstract
Background Generative large language models (LLMs) represent a significant advancement in natural language processing, achieving state-of-the-art performance across various tasks. However, their application in clinical settings using real electronic health records (EHRs) is still rare and presents numerous challenges. Objective This study aims to systematically review the use of generative LLMs and the effectiveness of relevant techniques in patient care-related topics involving EHRs, to summarize the challenges faced, and to suggest future directions. Methods A Boolean search for peer-reviewed articles was conducted on May 19, 2024, using PubMed and Web of Science to include research articles published since 2023, approximately one month after the release of ChatGPT. The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and conducted data extraction. Only studies utilizing generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings as reported by the included studies and proposed future directions. Results The initial search identified 6,328 unique studies, with 76 studies included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting, and five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement. Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias, with one detecting no bias and the other finding that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricated patient names in structured thyroid ultrasound reports. Additional challenges included the impersonal tone of LLM consultations, which made patients uncomfortable, and the difficulty patients had in understanding LLM responses. Conclusion Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diverse evaluation metrics used highlight the need for standardization. LLMs currently cannot replace physicians due to challenges such as bias, hallucinations, and impersonal responses.
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
- Zhengyang Zhou
- Department of Computer Science, Brandeis University, Waltham, MA 02453
- Yifei Wang
- Department of Computer Science, Brandeis University, Waltham, MA 02453
- Ya-Wen Chuang
- Division of Nephrology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan, 407219
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan, 402202
- School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan, 404328
- Richard Yang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Wenyu Zhang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
- Xinyi Wang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
- Rui Zhang
- Division of Computational Health Sciences, University of Minnesota, Minneapolis, MN 55455
- Pengyu Hong
- Department of Computer Science, Brandeis University, Waltham, MA 02453
- David W. Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115
- Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
10
Shah-Mohammadi F, Finkelstein J. Accuracy Evaluation of GPT-Assisted Differential Diagnosis in Emergency Department. Diagnostics (Basel) 2024; 14:1779. [PMID: 39202267] [PMCID: PMC11354035] [DOI: 10.3390/diagnostics14161779]
Abstract
In emergency department (ED) settings, rapid and precise diagnostic evaluation is critical to ensure better patient outcomes and efficient healthcare delivery. This study assesses the accuracy of differential diagnosis lists generated by the third-generation ChatGPT (ChatGPT-3.5) and the fourth-generation ChatGPT (ChatGPT-4) based on electronic health record notes recorded within the first 24 hours of ED admission. These models process unstructured text to formulate a ranked list of potential diagnoses. Their accuracy was benchmarked against the actual discharge diagnoses to evaluate their utility as diagnostic aids. Results indicated that both GPT-3.5 and GPT-4 predicted diagnoses at the body-system level with reasonable accuracy, with GPT-4 slightly outperforming its predecessor. However, their performance at the more granular category level was inconsistent, often showing decreased precision. Notably, GPT-4 demonstrated improved accuracy in several critical categories, underscoring its advanced capabilities in managing complex clinical scenarios.
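A toy sketch of the underlying benchmark logic follows, scoring a ranked differential list against the discharge diagnosis. Plain string matching and a top-k criterion are simplifying assumptions; the study matched at body-system and category levels, and the case data below are fabricated placeholders.

```python
# Toy scoring of a ranked differential-diagnosis list against the discharge
# diagnosis; string matching and top-k are simplifying assumptions.
def top_k_hit(ranked_differentials: list[str], discharge_dx: str, k: int = 5) -> bool:
    """True if the discharge diagnosis appears in the model's top-k list."""
    return discharge_dx in ranked_differentials[:k]

cases = [  # (model's ranked list, actual discharge diagnosis) - toy data
    (["pneumonia", "CHF exacerbation", "pulmonary embolism"], "CHF exacerbation"),
    (["appendicitis", "diverticulitis"], "cholecystitis"),
]
hits = sum(top_k_hit(dx_list, truth) for dx_list, truth in cases)
print(f"top-5 accuracy: {hits / len(cases):.0%}")  # 50% on this toy sample
```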
Affiliation(s)
- Joseph Finkelstein
- Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT 84112, USA
11
Kaczmarczyk R, Wilhelm TI, Martin R, Roos J. Evaluating multimodal AI in medical diagnostics. NPJ Digit Med 2024; 7:205. [PMID: 39112822] [PMCID: PMC11306783] [DOI: 10.1038/s41746-024-01208-3]
Abstract
This study evaluates multimodal AI models' accuracy and responsiveness in answering NEJM Image Challenge questions, juxtaposed with human collective intelligence, underscoring AI's potential and current limitations in clinical diagnostics. Anthropic's Claude 3 family demonstrated the highest accuracy among the evaluated AI models, surpassing the average human accuracy, while collective human decision-making outperformed all AI models. GPT-4 Vision Preview exhibited selectivity, responding more readily to easier questions, questions with smaller images, and longer questions.
Affiliation(s)
- Robert Kaczmarczyk
- Department of Dermatology and Allergy, School of Medicine, Technical University of Munich, Munich, Germany
- Ron Martin
- Clinic of Plastic, Hand and Aesthetic Surgery, Burn Center, BG Clinic Bergmannstrost, Halle (Saale), Germany
- Jonas Roos
- Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Bonn, Germany
12
Kavanagh KT, Pontus C, Cormier LE. Healthcare Violence and the Potential Promises and Harms of Artificial Intelligence. J Patient Saf 2024; 20:307-313. [PMID: 38860829] [DOI: 10.1097/pts.0000000000001245]
Abstract
Currently, the healthcare workplace is one of the most dangerous in the United States. Over a 3-month period in 2022, two nurses were assaulted every hour. Artificial intelligence (AI) has the potential to prevent workplace violence by developing unique patient insights through accessing, almost instantly, a patient's medical history, past institutional encounters, and possibly even their social media posts. De-escalating dialog can then be formulated and hot-button topics avoided. AIs can also monitor patients in waiting areas for potential confrontational behavior. Many have concerns about implementing AIs in healthcare. AIs are not expected to be 100% accurate; their performance is not compared with that of a computer but instead measured against humans. However, AIs are outperforming humans in many tasks. They are especially adept at taking standardized examinations, such as board exams, the Uniform Bar Exam, and the SAT and Graduate Record Exam. AIs are also performing diagnosis: initial reports found that newer models equal or outperform physicians in diagnostic accuracy and in the conveyance of empathy. In the area of interdiction, AI robots can both navigate and monitor for confrontational and illegal behavior; a human security agent would then be notified to resolve the situation. Our military is fielding autonomous AI robots to counter potential adversaries. For many, this new arms race has grave implications because of the potential of fielding this same security technology in healthcare and other civil settings. The healthcare delivery sector must determine the future roles of AI in relation to human workers. AIs should only be used to support a human employee; AIs should not be the primary caregiver, and a single human should not be monitoring multiple AIs simultaneously. Similar to not being copyrightable, disinformation produced by AIs should not be afforded 'free speech' protections. Any increase in the productivity of an AI will equate to a loss of jobs. We need to ask: if all business sectors utilize AIs, will there be enough paid workers purchasing services and products to keep our economy and society afloat?
Affiliation(s)
- Christine Pontus
- Health Watch USA, Massachusetts Nurses Association, Canton, Massachusetts
13
Edalati S, Vasan V, Cheng CP, Patel Z, Govindaraj S, Iloreta AM. Can GPT-4 revolutionize otolaryngology? Navigating opportunities and ethical considerations. Am J Otolaryngol 2024; 45:104303. [PMID: 38678799] [DOI: 10.1016/j.amjoto.2024.104303]
Abstract
Otolaryngologists can enhance workflow efficiency, provide better patient care, and advance medical research and education by integrating artificial intelligence (AI) into their practices. GPT-4 is a revolutionary, contemporary example of AI that may apply to otolaryngology. The knowledge of otolaryngologists should be supplemented, not replaced, when using GPT-4 to make critical medical decisions and provide individualized patient care. In this examination, we explore the potential uses of GPT-4 in otolaryngology, covering its potential benefits and technical boundaries, and we delve into the ethical dilemmas that emerge when incorporating it into clinical practice. Our stance is that GPT-4 has the potential to be very helpful. Its capabilities, which include aid in clinical decision-making, patient care, and the automation of administrative tasks, present exciting possibilities for improving patient outcomes, boosting the efficiency of healthcare delivery, and enhancing patient experiences. Although obstacles and limitations remain, the progress made so far shows that GPT-4 can be a valuable tool for modern medicine. As the technology develops, GPT-4 may play a more significant role in clinical practice, helping medical professionals deliver high-quality care tailored to each patient's unique needs.
Affiliation(s)
- Shaun Edalati
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Vikram Vasan
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Christopher P Cheng
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Zara Patel
- Department of Otolaryngology-Head & Neck Surgery, Stanford University School of Medicine, Stanford, CA, USA
- Satish Govindaraj
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Alfred Marc Iloreta
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
14
AlRyalat SA, Musleh AM, Kahook MY. Evaluating the strengths and limitations of multimodal ChatGPT-4 in detecting glaucoma using fundus images. Front Ophthalmol 2024; 4:1387190. [PMID: 38984105] [PMCID: PMC11182172] [DOI: 10.3389/fopht.2024.1387190]
Abstract
Overview This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma from color fundus photographs (CFPs) using a benchmark dataset, without prior training or fine-tuning. Methods The publicly accessible Retinal Fundus Glaucoma Challenge "REFUGE" dataset was used for the analyses. The input data consisted of the entire 400-image testing set. The task involved classifying each fundus image as either 'Likely Glaucomatous' or 'Likely Non-Glaucomatous'. We constructed a confusion matrix to visualize the results of ChatGPT-4's predictions, focusing on the accuracy of the binary classification (glaucoma vs non-glaucoma). Results ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. Sensitivity was 50% (95% CI: 34.51%-65.49%), while specificity was 94.44% (95% CI: 92.08%-96.81%). Precision was 50% (95% CI: 34.51%-65.49%), and the F1 score was 0.50. Conclusion ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine-tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, advanced AI techniques such as LLMs might require less training data than other forms of AI, with potential savings in time and financial resources. They may also pave the way for innovative tools to support specialized medical care, particularly care dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
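The reported figures are mutually consistent: with the 400-image REFUGE test split and 40 glaucomatous cases, the stated sensitivity and specificity imply the confusion-matrix cells below. Note that TP=20, FN=20, FP=20, TN=340 is an inference from the percentages, not cell counts published by the authors.

```python
# Consistency check of the reported metrics from the implied confusion matrix.
tp, fn, fp, tn = 20, 20, 20, 340

sensitivity = tp / (tp + fn)                          # 0.50
specificity = tn / (tn + fp)                          # ~0.9444
accuracy = (tp + tn) / (tp + fn + fp + tn)            # 0.90
precision = tp / (tp + fp)                            # 0.50
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # 0.50

print(f"sens={sensitivity:.2%}  spec={specificity:.2%}  "
      f"acc={accuracy:.2%}  prec={precision:.2%}  F1={f1:.2f}")
```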
Affiliation(s)
- Saif Aldeen AlRyalat
- Department of Ophthalmology, The University of Jordan, Amman, Jordan
- Department of Ophthalmology, Houston Methodist Hospital, Houston, TX, United States
- Malik Y. Kahook
- Department of Ophthalmology, University of Colorado School of Medicine, Sue Anschutz-Rodgers Eye Center, Aurora, CO, United States
15
Harada Y, Suzuki T, Harada T, Sakamoto T, Ishizuka K, Miyagami T, Kawamura R, Kunitomo K, Nagano H, Shimizu T, Watari T. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: an analysis of 545 case reports of diagnostic errors. BMJ Open Qual 2024; 13:e002654. [PMID: 38830730] [PMCID: PMC11149143] [DOI: 10.1136/bmjoq-2023-002654]
Abstract
BACKGROUND Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, it requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text given suitable prompts and could therefore assist manual chart reviews in detecting diagnostic errors. OBJECTIVE This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and their possible contributing factors based on case presentations. METHODS We analysed 545 published case reports that included diagnostic errors. We input the texts of the case presentations and the final diagnoses, together with some original prompts, into ChatGPT (GPT-4) to generate responses comprising a judgement of diagnostic error and its contributing factors. Contributing factors were coded according to three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). The contributing factors identified by ChatGPT were compared with those identified by physicians. RESULTS ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more contributing factors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The most frequent contributing factors coded by ChatGPT were 'failure/delay in considering the diagnosis' (315, 57.8%) in DEER, 'atypical presentation' (365, 67.0%) in RDC and 'atypical presentation' (264, 48.4%) in GDP. CONCLUSION ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially 'atypical presentation'.
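For readers wanting to reproduce the shape of this comparison, the sketch below contrasts per-case factor counts between two raters. The paired Wilcoxon signed-rank test and the Poisson toy counts are assumptions for illustration only; the abstract does not state which test was used, and no study data are reproduced.

```python
# Shape of the per-case comparison of contributing-factor counts between
# ChatGPT and physicians. Test choice and toy data are assumptions.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
chatgpt_counts = rng.poisson(5, size=545)     # e.g., DEER factors coded per case
physician_counts = rng.poisson(1, size=545)   # fabricated placeholder counts

stat, p = wilcoxon(chatgpt_counts, physician_counts)  # paired, per-case test
print(f"median ChatGPT={np.median(chatgpt_counts):.0f}, "
      f"median physician={np.median(physician_counts):.0f}, p={p:.3g}")
```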
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Taku Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Nerima Hikarigaoka Hospital, Nerima-ku, Tokyo, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Kosuke Ishizuka
- Yokohama City University School of Medicine Graduate School of Medicine, Yokohama, Kanagawa, Japan
- Taiju Miyagami
- Department of General Medicine, Faculty of Medicine, Juntendo University, Bunkyo-ku, Tokyo, Japan
- Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Hiroyuki Nagano
- Department of General Internal Medicine, Tenri Hospital, Tenri, Nara, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga-gun, Tochigi, Japan
- Takashi Watari
- Integrated Clinical Education Center, Kyoto University Hospital, Kyoto, Kyoto, Japan
16
Kim H, Park H, Kang S, Kim J, Kim J, Jung J, Taira R. Evaluating the validity of the nursing statements algorithmically generated based on the International Classifications of Nursing Practice for respiratory nursing care using large language models. J Am Med Inform Assoc 2024; 31:1397-1403. [PMID: 38630586] [PMCID: PMC11105147] [DOI: 10.1093/jamia/ocae070]
Abstract
OBJECTIVE This study aims to facilitate the creation of quality standardized nursing statements in South Korea's hospitals through algorithmic generation based on the International Classifications of Nursing Practice (ICNP) and evaluation with large language models. MATERIALS AND METHODS We algorithmically generated 15 972 statements related to acute respiratory care using 117 concepts and the concept composition models of the ICNP. Human reviewers, Generative Pre-trained Transformer 4.0 (GPT-4.0), and Bio_Clinical Bidirectional Encoder Representations from Transformers (Bio_ClinicalBERT) evaluated the generated statements for validity. The evaluations by GPT-4.0 and Bio_ClinicalBERT were conducted with and without contextual information and training. RESULTS Of the generated statements, 2207 were deemed valid by expert reviewers. GPT-4.0 showed a zero-shot AUC of 0.857, which worsened with the addition of contextual information. Bio_ClinicalBERT, after training, improved significantly, reaching an AUC of 0.998. CONCLUSION Bio_ClinicalBERT effectively validates auto-generated nursing statements, offering a promising solution to enhance and streamline healthcare documentation processes.
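A minimal sketch of the Bio_ClinicalBERT validation step follows, treating statement validity as binary text classification. The HuggingFace checkpoint name emilyalsentzer/Bio_ClinicalBERT refers to the public model; the two example statements and the valid/invalid label scheme are illustrative assumptions, and the classification head only becomes meaningful after fine-tuning on reviewer-labeled statements, as in the study.

```python
# Sketch of statement validation as binary text classification with
# Bio_ClinicalBERT. The classification head is randomly initialized here
# and must be fine-tuned on labeled statements before its scores mean anything.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "emilyalsentzer/Bio_ClinicalBERT"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

statements = ["Monitoring respiratory status using pulse oximetry",
              "Positioning airway device into intravenous line"]  # toy valid/invalid pair
batch = tok(statements, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
print(probs[:, 1])  # P(valid) per statement, once the head has been trained
```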
Affiliation(s)
- Hyeoneui Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- The Research Institute of Nursing Science, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Hyewon Park
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Samsung Medical Center, Seoul, 06351, Republic of Korea
- Sunghoon Kang
- The Department of Science Studies, Seoul National University, Seoul, 08826, Republic of Korea
- Jinsol Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Jeongha Kim
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Asan Medical Center, Seoul, 05505, Republic of Korea
- Jinsun Jung
- College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Center for Human-Caring Nurse Leaders for the Future by Brain Korea 21 (BK 21) Four Project, College of Nursing, Seoul National University, Seoul, 03080, Republic of Korea
- Ricky Taira
- The Department of Radiological Science, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States
17
Lakhnati Y, Pascher M, Gerken J. Exploring a GPT-based large language model for variable autonomy in a VR-based human-robot teaming simulation. Front Robot AI 2024; 11:1347538. [PMID: 38633059] [PMCID: PMC11021771] [DOI: 10.3389/frobt.2024.1347538]
Abstract
In a rapidly evolving digital landscape, autonomous tools and robots are becoming commonplace. Recognizing the significance of this development, this paper explores the integration of large language models (LLMs) like the Generative Pre-trained Transformer (GPT) into human-robot teaming environments to facilitate variable autonomy through verbal human-robot communication. We introduce a novel simulation framework for such a GPT-powered multi-robot testbed environment, based on a Unity Virtual Reality (VR) setting. This system allows users to interact with simulated robot agents through natural language, each agent powered by an individual GPT core. By means of OpenAI's function calling, we bridge the gap between unstructured natural-language input and structured robot actions. A user study with 12 participants explores the effectiveness of GPT-4 and, more importantly, user strategies when given the opportunity to converse in natural language within a simulated multi-robot environment. Our findings suggest that users may have preconceived expectations of how to converse with robots and seldom try to explore the actual language and cognitive capabilities of their simulated robot collaborators. Still, those users who did explore benefited from a much more natural flow of communication and human-like back-and-forth. We provide a set of lessons learned for future research and technical implementations of similar systems.
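A minimal sketch of the natural-language-to-robot-action bridge via OpenAI function calling follows, assuming the openai package; the move_robot tool schema is an illustrative stand-in for the authors' actual robot interface.

```python
# Sketch of mapping an unstructured verbal command to a structured robot
# action with OpenAI function calling. Tool schema is an illustrative assumption.
import json
from openai import OpenAI

tools = [{
    "type": "function",
    "function": {
        "name": "move_robot",
        "description": "Move a named robot to a target location in the scene.",
        "parameters": {
            "type": "object",
            "properties": {
                "robot_id": {"type": "string"},
                "target": {"type": "string"},
            },
            "required": ["robot_id", "target"],
        },
    },
}]

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Robot two, please go to the red shelf."}],
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:  # the model chose a structured action rather than prose
    call = msg.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```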
Affiliation(s)
- Younes Lakhnati
- Inclusive Human-Robot-Interaction, TU Dortmund University, Dortmund, NW, Germany
- Max Pascher
- Inclusive Human-Robot-Interaction, TU Dortmund University, Dortmund, NW, Germany
- Human-Computer Interaction, University of Duisburg-Essen, Essen, NW, Germany
- Jens Gerken
- Inclusive Human-Robot-Interaction, TU Dortmund University, Dortmund, NW, Germany
18
Li J, Dada A, Puladi B, Kleesiek J, Egger J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245:108013. [PMID: 38262126] [DOI: 10.1016/j.cmpb.2024.108013]
Abstract
The recent release of ChatGPT, a chatbot research project/product of natural language processing (NLP) by OpenAI, stirred up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the 'productization' of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience in artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the 'status quo' of ChatGPT in medical applications for general readers, healthcare professionals, and NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword 'ChatGPT'. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. The review finds that the current release of ChatGPT achieves only moderate or 'passing' performance in a variety of tests and is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.
Affiliation(s)
- Jianning Li
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Amin Dada
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Behrus Puladi
- Institute of Medical Informatics, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
- Jens Kleesiek
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; TU Dortmund University, Department of Physics, Otto-Hahn-Straße 4, 44227 Dortmund, Germany
- Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), University Hospital Essen, University Medicine Essen, Hufelandstraße 55, 45147 Essen, Germany
19
Luk DWA, Ip WCT, Shea YF. Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties. J Chin Med Assoc 2024; 87:259-260. [PMID: 38305423] [DOI: 10.1097/jcma.0000000000001064]
Abstract
Artificial intelligence has demonstrated promising potential for diagnosing complex medical cases, with Generative Pre-Trained Transformer 4 (GPT-4) being the most recent advancement in this field. This study evaluated the diagnostic performance of GPT-4 in comparison with that of its predecessor, GPT-3.5, using 81 complex medical case records from the New England Journal of Medicine. The cases were categorized as cognitive impairment, infectious disease, rheumatology, or drug reactions. GPT-4 achieved a primary diagnostic accuracy of 38.3%, which improved to 71.6% when differential diagnoses were included. In 84.0% of cases, primary diagnoses were made by conducting the investigations suggested by GPT-4. GPT-4 outperformed GPT-3.5 in all subspecialties except drug reactions. GPT-4 demonstrated the highest performance in infectious diseases and drug reactions, whereas it underperformed in cases of cognitive impairment. These findings indicate that GPT-4 can provide reasonably accurate diagnoses, comprehensive differential diagnoses, and appropriate investigations. However, its performance varies across subspecialties.
Affiliation(s)
- Dik Wai Anderson Luk
- Department of Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
20
Kuk LY, Kwong DLW, Chan WLW, Shea YF. Limitations of GPT-4 as a geriatrician in geri-oncology case conference: A case series. J Chin Med Assoc 2024; 87:148-150. [PMID: 38051043] [DOI: 10.1097/jcma.0000000000001032]
Abstract
Generative pre-trained transformer 4 (GPT-4) is an artificial intelligence (AI) system with a chat interface, and the number of studies testing GPT-4 in clinical applications is increasing. We hypothesized that GPT-4 would be able to suggest management strategies for medical issues in elderly oncology patients similar to those provided by geriatricians. We compared the responses of GPT-4 with those of a geriatrician for four oncology patients. After these case conferences, none of the patients required admission for medical consultation. In three of the four scenarios, GPT-4 offered a multidisciplinary approach in the first prompt. In all three of these scenarios, GPT-4 identified medication-related side effects and suggested appropriate medications in the first prompt. However, GPT-4 was unable to suggest initial medication dosages in the first prompt and was unable to suggest a more humanistic, non-pharmacological approach to anorexia, even with a follow-up prompt. In conclusion, GPT-4 may be used as a screening tool to provide rudimentary directions for management, which can then be reviewed by medical professionals before considering a formal consultation for more tailored and refined specialist opinions.
Affiliation(s)
- Ling-Yuk Kuk
- Geriatrics Division, Department of Medicine, Queen Mary Hospital, Hong Kong, China
- Dora Lai-Wan Kwong
- Department of Clinical Oncology, Center of Cancer Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
- Wing-Lok Wendy Chan
- Department of Clinical Oncology, Center of Cancer Medicine, Queen Mary Hospital, University of Hong Kong, Hong Kong, China
- Yat-Fung Shea
- Geriatrics Division, Department of Medicine, Queen Mary Hospital, Hong Kong, China
21
Infante A, Gaudino S, Orsini F, Del Ciello A, Gullì C, Merlino B, Natale L, Iezzi R, Sala E. Large language models (LLMs) in the evaluation of emergency radiology reports: performance of ChatGPT-4, Perplexity, and Bard. Clin Radiol 2024; 79:102-106. [PMID: 38087683] [DOI: 10.1016/j.crad.2023.11.011]
Affiliation(s)
- A Infante
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- S Gaudino
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
- F Orsini
- Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
- A Del Ciello
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- C Gullì
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
- B Merlino
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
- L Natale
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
- R Iezzi
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
- E Sala
- ARC Advanced Radiology Center (ARC), Department of Oncological Radiotherapy, and Hematology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy; Università Cattolica del Sacro Cuore, Facoltà di Medicina e Chirurgia, Rome, Italy
22
Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, Nebelung S. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep 2023; 13:20159. [PMID: 37978240] [PMCID: PMC10656559] [DOI: 10.1038/s41598-023-47500-2]
Abstract
Large language models (LLMs) have shown potential in various applications, including clinical practice. However, their accuracy and utility in providing treatment recommendations for orthopedic conditions remain to be investigated. This pilot study therefore evaluates the validity of treatment recommendations generated by GPT-4 for common knee and shoulder orthopedic conditions using anonymized clinical MRI reports. A retrospective analysis was conducted using 20 anonymized clinical MRI reports of varying severity and complexity. Treatment recommendations were elicited from GPT-4 and evaluated by two board-certified, specialty-trained senior orthopedic surgeons. Their evaluation focused on semiquantitative grading of the accuracy and clinical utility of the recommendations and on the potential limitations of the LLM-generated output. GPT-4 provided treatment recommendations for 20 patients (mean age, 50 years ± 19 [standard deviation]; 12 men) with acute and chronic knee and shoulder conditions. The LLM produced largely accurate and clinically useful recommendations. However, limited awareness of a patient's overall situation, a tendency to incorrectly judge treatment urgency, and largely schematic, unspecific treatment recommendations were observed and may reduce its clinical usefulness. In conclusion, LLM-based treatment recommendations are largely adequate and not prone to 'hallucinations', yet inadequate in particular situations. Critical guidance by healthcare professionals is obligatory, and independent use by patients is discouraged, given the dependency on precise data input.
Grants
- ODELIA, 101057091 European Union's Horizon Europe programme
- COMFORT, 101079894 European Union's Horizon Europe programme
- TR 1700/7-1 Deutsche Forschungsgemeinschaft
- NE 2136/3-1 Deutsche Forschungsgemeinschaft
- DEEP LIVER, ZMVI1-2520DAT111 Bundesministerium für Gesundheit
- #70113864 Max-Eder-Programme of the German Cancer Aid
- PEARL, 01KD2104C German Federal Ministry of Education and Research
- CAMINO, 01EO2101 German Federal Ministry of Education and Research
- SWAG, 01KD2215A German Federal Ministry of Education and Research
- TRANSFORM LIVER, 031L0312A German Federal Ministry of Education and Research
- TANGERINE, 01KT2302 through ERA-NET Transcan German Federal Ministry of Education and Research
- SECAI, 57616814 Deutscher Akademischer Austauschdienst
- Transplant.KI, 01VSF21048 German Federal Joint Committee
- ODELIA, 101057091 European Union's Horizon Europe and innovation programme
- GENIAL, 101096312 European Union's Horizon Europe and innovation programme
- NIHR, NIHR213331 National Institute for Health and Care Research
- European Union’s Horizon Europe programme
- European Union’s Horizon Europe and innovation programme
- RWTH Aachen University (3131)
Affiliation(s)
- Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Christian D Weber
- Department of Orthopaedics and Trauma Surgery, University Hospital RWTH Aachen, Aachen, Germany
- Benedikt J Braun
- University Hospital Tuebingen on Behalf of the Eberhard-Karls-University Tuebingen, BG Hospital, Schnarrenbergstr. 95, Tübingen, Germany
- Keno Bressem
- Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Hindenburgdamm 30, 12203, Berlin, Germany
- Jakob N Kather
- Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Department of Medicine I, University Hospital Dresden, Dresden, Germany
- Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany
- Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
- Christiane Kuhl
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany
- Sven Nebelung
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwels Street 30, 52074, Aachen, Germany