1
Armoundas AA, Ahmad FS, Attia ZI, Doudesis D, Khera R, Kyriakoulis KG, Stergiou GS, Tang WHW. Controversy in Hypertension: Pro-Side of the Argument Using Artificial Intelligence for Hypertension Diagnosis and Management. Hypertension 2025; 82:929-944. [PMID: 40091745] [DOI: 10.1161/hypertensionaha.124.22349]
Abstract
Hypertension presents the largest modifiable public health challenge due to its high prevalence, its intimate relationship to cardiovascular diseases, and its complex pathogenesis and pathophysiology. Low awareness of blood pressure elevation and suboptimal hypertension diagnosis serve as the major hurdles in effective hypertension management. Advances in artificial intelligence in hypertension have permitted the integrative analysis of large data sets including omics, clinical (with novel sensor and wearable technologies), health-related, social, behavioral, and environmental sources, and hold transformative potential in achieving large-scale, data-driven approaches toward personalized diagnosis, treatment, and long-term management. However, although the emerging artificial intelligence science may advance the concept of precision hypertension in discovery, drug targeting and development, patient care, and management, its clinical adoption at scale today is lacking. Recognizing that clinical implementation of artificial intelligence-based solutions needs evidence generation, this opinion statement examines a clinician-centric perspective of the state of the art in using artificial intelligence in the management of hypertension and puts forward recommendations toward equitable precision hypertension care.
Affiliation(s)
- Antonis A Armoundas
- Cardiovascular Research Center, Massachusetts General Hospital and Broad Institute, Massachusetts Institute of Technology, Boston (A.A.A.)
- Faraz S Ahmad
- Division of Cardiology, Department of Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL (F.S.A.)
- Zachi I Attia
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN (Z.I.A.)
- Dimitrios Doudesis
- British Heart Foundation (BHF) Centre for Cardiovascular Science, University of Edinburgh, United Kingdom (D.D.)
- Rohan Khera
- Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine (R.K.)
- Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT (R.K.)
- Konstantinos G Kyriakoulis
- Hypertension Center STRIDE-7, National and Kapodistrian University of Athens, School of Medicine, Third Department of Medicine, Athens, Greece (K.G.K., G.S.S.)
- George S Stergiou
- Hypertension Center STRIDE-7, National and Kapodistrian University of Athens, School of Medicine, Third Department of Medicine, Athens, Greece (K.G.K., G.S.S.)
- W H Wilson Tang
- Heart Vascular and Thoracic Institute, Cleveland Clinic, Cleveland, OH (W.H.W.T.)
2
Güvel MC, Kıyak YS, Varan HD, Sezenöz B, Coşkun Ö, Uluoğlu C. Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation. Eur J Clin Pharmacol 2025; 81:875-883. [PMID: 40205076] [DOI: 10.1007/s00228-025-03838-2]
Abstract
PURPOSE This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions compared to expert educators. METHODS Using one-shot prompting, 60 questions (20 per model) addressing essential hypertension and type 2 diabetes subjects were generated. A multidisciplinary panel categorized questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed. RESULTS No statistically significant differences were found between AI-generated and expert-created questions, with mean correct response rates surpassing 50% and discrimination indices consistently equal to or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revisions required approximately one minute per AI-generated question, representing a substantial efficiency gain over manual question preparation. Nonetheless, 19 out of 60 AI-generated questions were deemed unusable, highlighting the necessity of expert oversight. CONCLUSION Large language models can profoundly accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiencies with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.
Affiliation(s)
- Muhammed Cihan Güvel
- Department of Medical Pharmacology, Gazi University Faculty of Medicine, Ankara, Turkey
- Yavuz Selim Kıyak
- Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey
- Hacer Doğan Varan
- Department of Internal Medicine, Gazi University Faculty of Medicine, Ankara, Turkey
- Burak Sezenöz
- Department of Cardiology, Gazi University Faculty of Medicine, Ankara, Turkey
- Özlem Coşkun
- Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey
- Canan Uluoğlu
- Department of Medical Pharmacology, Gazi University Faculty of Medicine, Ankara, Turkey
3
Shi B, Chen L, Pang S, Wang Y, Wang S, Li F, Zhao W, Guo P, Zhang L, Fan C, Zou Y, Wu X. Large Language Models and Artificial Neural Networks for Assessing 1-Year Mortality in Patients With Myocardial Infarction: Analysis From the Medical Information Mart for Intensive Care IV (MIMIC-IV) Database. J Med Internet Res 2025; 27:e67253. [PMID: 40354652] [PMCID: PMC12107198] [DOI: 10.2196/67253]
Abstract
BACKGROUND Accurate mortality risk prediction is crucial for effective cardiovascular risk management. Recent advancements in artificial intelligence (AI) have demonstrated potential in this specific medical field. Qwen-2 and Llama-3 are high-performance, open-source large language models (LLMs) available online. An artificial neural network (ANN) algorithm derived from the SWEDEHEART (Swedish Web System for Enhancement and Development of Evidence-Based Care in Heart Disease Evaluated According to Recommended Therapies) registry, termed SWEDEHEART-AI, can predict patient prognosis following acute myocardial infarction (AMI). OBJECTIVE This study aims to evaluate the 3 models mentioned above in predicting 1-year all-cause mortality in critically ill patients with AMI. METHODS The Medical Information Mart for Intensive Care IV (MIMIC-IV) database is a publicly available data set in critical care medicine. We included 2758 patients who were first admitted for AMI and discharged alive. SWEDEHEART-AI calculated the mortality rate based on each patient's 21 clinical variables. Qwen-2 and Llama-3 analyzed the content of patients' discharge records and directly provided a 1-decimal value between 0 and 1 to represent 1-year death risk probabilities. The patients' actual mortality was verified using follow-up data. The predictive performance of the 3 models was assessed and compared using the Harrell C-statistic (C-index), the area under the receiver operating characteristic curve (AUROC), calibration plots, Kaplan-Meier curves, and decision curve analysis. RESULTS SWEDEHEART-AI demonstrated strong discrimination in predicting 1-year all-cause mortality in patients with AMI, with a higher C-index than Qwen-2 and Llama-3 (C-index 0.72, 95% CI 0.69-0.74 vs C-index 0.65, 95% CI 0.62-0.67 vs C-index 0.56, 95% CI 0.53-0.58, respectively; P<.001 for both comparisons). SWEDEHEART-AI also showed high and consistent AUROC in the time-dependent ROC curve. The death rates calculated by SWEDEHEART-AI were positively correlated with actual mortality, and the 3 risk classes derived from this model showed clear differentiation in the Kaplan-Meier curve (P<.001). Calibration plots indicated that SWEDEHEART-AI tended to overestimate mortality risk, with an observed-to-expected ratio of 0.478. Compared with the LLMs, SWEDEHEART-AI demonstrated positive and greater net benefits at risk thresholds below 19%. CONCLUSIONS SWEDEHEART-AI, a trained ANN model, demonstrated the best performance, with strong discrimination and clinical utility in predicting 1-year all-cause mortality in patients with AMI from an intensive care cohort. Among the LLMs, Qwen-2 outperformed Llama-3 and showed moderate predictive value. Qwen-2 and SWEDEHEART-AI exhibited comparable classification effectiveness. The future integration of LLMs into clinical decision support systems holds promise for accurate risk stratification in patients with AMI; however, further research is needed to optimize LLM performance and address calibration issues across diverse patient populations.
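As a hedged illustration of the discrimination comparison described above, the sketch below computes Harrell's C-index for three sets of predicted 1-year death risks against observed mortality using the lifelines package; the file name and column names are assumptions, not details taken from the study.

```python
# Illustrative sketch (not the authors' code): comparing model risk scores
# against observed 1-year mortality with Harrell's C-index (lifelines).
import pandas as pd
from lifelines.utils import concordance_index

# Hypothetical export of the cohort with follow-up time, event flag, and model risks
df = pd.read_csv("ami_cohort_predictions.csv")

for model in ["risk_swedeheart", "risk_qwen2", "risk_llama3"]:
    # concordance_index expects higher scores for longer survival,
    # so the predicted death risk is negated.
    c_index = concordance_index(
        event_times=df["time_days"],
        predicted_scores=-df[model],
        event_observed=df["died"],
    )
    print(f"{model}: C-index = {c_index:.2f}")
```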
Affiliation(s)
- Boqun Shi
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Liangguo Chen
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Shuo Pang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Yue Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Shen Wang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Fadong Li
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Wenxin Zhao
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Pengrong Guo
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Leli Zhang
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Chu Fan
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Yi Zou
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Xiaofan Wu
- Department of Cardiology, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
4
Garcia-Lopez A, Cuervo-Rojas J, Garcia-Lopez J, Giron-Luque F. Using Natural Language Processing and Machine Learning to classify the status of kidney allograft in Electronic Medical Records written in Spanish. PLoS One 2025; 20:e0322587. [PMID: 40338843] [PMCID: PMC12061128] [DOI: 10.1371/journal.pone.0322587]
Abstract
INTRODUCTION Accurate identification of graft loss in Electronic Medical Records of kidney transplant recipients is essential but challenging due to inconsistent and not mandatory International Classification of Diseases (ICD) codes. We developed and validated Natural Language Processing (NLP) and machine learning models to classify the status of kidney allografts in unstructured text in EMRs written in Spanish. METHODS We conducted a retrospective cohort study of 2712 patients transplanted between July 2008 and January 2023, analyzing 117,566 unstructured medical records. NLP involved text normalization, tokenization, stopword removal, spell-checking, elimination of low-frequency words, and stemming. Data were split into training, validation, and test sets. Class balancing was performed using an undersampling technique. Feature selection was performed using LASSO regression. We developed, validated, and tested Logistic Regression, Random Forest, and Neural Network models using 10-fold cross-validation. Performance metrics included area under the curve, F1 score, accuracy, sensitivity, specificity, Negative Predictive Value, and Positive Predictive Value. RESULTS The test performance results showed that the Random Forest model achieved the highest AUC (0.98) and F1 score (0.65). However, it had a modest sensitivity (0.76) and a relatively low PPV (0.56), implying a significant number of false positives. The Neural Network model also performed well with a high AUC (0.98) and reasonable F1 score (0.61), but its PPV (0.49) was lower, indicating more false positives. The Logistic Regression model, while having the lowest AUC (0.91) and F1 score (0.49), showed the highest sensitivity (0.83) with the lowest PPV (0.35). CONCLUSION We developed and validated three machine learning models combined with NLP techniques for unstructured texts written in Spanish. The models performed well on the validation set but showed modest performance on the test set due to data imbalance. These models could be adapted for clinical practice, though they may require additional manual work due to high false positive rates.
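The following sketch illustrates, under stated assumptions, the kind of Spanish-language preprocessing and classification pipeline the abstract describes (normalization, stopword removal, stemming, then a Random Forest with 10-fold cross-validation). The file name, column names, and TF-IDF vectorization are illustrative choices, and the study's spell-checking, LASSO selection, and undersampling steps are omitted for brevity.

```python
# Simplified sketch of an NLP + ML pipeline for Spanish clinical notes.
# Assumes nltk.download("stopwords") has been run; data file and labels are hypothetical.
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

stemmer = SnowballStemmer("spanish")
stop_es = set(stopwords.words("spanish"))

def preprocess(text: str) -> str:
    # normalize case, tokenize on Spanish word characters, drop stopwords, stem
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_es)

notes = pd.read_csv("notas_clinicas.csv")            # hypothetical: free-text notes
X = TfidfVectorizer(min_df=5).fit_transform(notes["texto"].map(preprocess))
y = notes["perdida_injerto"]                         # hypothetical: graft-loss label (0/1)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
print(cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean())
```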
Affiliation(s)
- Andrea Garcia-Lopez
- PhD Program in Clinical Epidemiology, Department of Clinical Epidemiology and Biostatistics, Faculty of Medicine, Pontificia Universidad Javeriana, Bogotá, Colombia
- Department of Transplant Research, Colombiana de Trasplantes, Bogotá, Colombia
- Juliana Cuervo-Rojas
- Department of Clinical Epidemiology and Biostatistics, Faculty of Medicine, Pontificia Universidad Javeriana, Bogotá, Colombia
- Juan Garcia-Lopez
- Department of Technology and Informatics, Colombiana de Trasplantes, Bogotá, Colombia
- Fernando Giron-Luque
- Department of Transplant Research, Colombiana de Trasplantes, Bogotá, Colombia
- Department of Transplant Surgery, Colombiana de Trasplantes, Bogotá, Colombia
5
Liu C, Zhang H, Zheng Z, Liu W, Gu C, Lan Q, Zhang W, Yang J. ChatOCT: Embedded Clinical Decision Support Systems for Optical Coherence Tomography in Offline and Resource-Limited Settings. J Med Syst 2025; 49:59. [PMID: 40332685] [DOI: 10.1007/s10916-025-02188-x]
Abstract
Optical Coherence Tomography (OCT) is a critical imaging modality for diagnosing ocular and systemic conditions, yet its accessibility is hindered by the need for specialized expertise and high computational demands. To address these challenges, we introduce ChatOCT, an offline-capable, domain-adaptive clinical decision support system (CDSS) that integrates structured expert Q&A generation, OCT-specific knowledge injection, and activation-aware model compression. Unlike existing systems, ChatOCT functions without internet access, making it suitable for low-resource environments. ChatOCT is built upon LLaMA-2-7B, incorporating domain-specific knowledge from PubMed and OCT News through a two-stage training process: (1) knowledge injection for OCT-specific expertise and (2) Q&A instruction tuning for structured, interactive diagnostic reasoning. To ensure feasibility in offline environments, we apply activation-aware weight quantization, reducing GPU memory usage to ~4.74 GB, enabling deployment on standard OCT hardware. A novel expert answer generation framework mitigates hallucinations by structuring responses in a multi-step process, ensuring accuracy and interpretability. ChatOCT outperforms state-of-the-art baselines such as LLaMA-2, PMC-LLaMA-13B, and ChatDoctor by 10-15 points in coherence, relevance, and clinical utility, while reducing GPU memory requirements by 79% and maintaining real-time responsiveness (~20 ms inference time). Expert ophthalmologists rated ChatOCT's outputs as clinically actionable and aligned with real-world decision-making needs, confirming its potential to assist frontline healthcare providers. ChatOCT represents an innovative offline clinical decision support system for optical coherence tomography (OCT) that runs entirely on local embedded hardware, enabling real-time analysis in resource-limited settings without internet connectivity. By offering a scalable, generalizable pipeline that integrates knowledge injection, instruction tuning, and model compression, ChatOCT provides a blueprint for next-generation, resource-efficient clinical AI solutions across multiple medical domains.
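ChatOCT's activation-aware weight quantization pipeline is not reproduced here; as a hedged stand-in, the sketch below loads LLaMA-2-7B with generic 4-bit NF4 quantization via Hugging Face transformers and bitsandbytes, showing how a ~7B-parameter model can fit within a few gigabytes of GPU memory. The model ID and prompt are assumptions, and this is a different quantization method than the one the paper describes.

```python
# Minimal sketch of low-memory 4-bit inference with a LLaMA-2-7B chat model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated model; access must be requested
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Example query (placeholder, not from the ChatOCT training data)
prompt = "Summarize the typical OCT findings in diabetic macular edema."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```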
Affiliation(s)
- Chang Liu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
- Haoran Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
- Zheng Zheng
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
- Wenjia Liu
- Department of Ophthalmology, Shanghai General Hospital, Shanghai, China
- National Clinical Research Center for Eye Diseases, Shanghai, China
- Chengfu Gu
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
- Qi Lan
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
- Weiyi Zhang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
- Jianlong Yang
- School of Biomedical Engineering, Shanghai Jiao Tong University, Xuhui District, No. 3 Teaching Building, 1954 Huashan RD, Shanghai, China
6
Nasirov R. The Role of Claude 3.5 Sonet and ChatGPT-4 in Posterior Cervical Fusion Patient Guidance. World Neurosurg 2025; 197:123889. [PMID: 40081488] [DOI: 10.1016/j.wneu.2025.123889]
Abstract
BACKGROUND This study evaluates the role of ChatGPT-4 and Claude 3.5 Sonet in postoperative management for patients undergoing posterior cervical fusion. It focuses on their ability to provide accurate, clear, and relevant responses to patient concerns, highlighting their potential as supplementary tools in surgical aftercare. METHODS Ten common postoperative questions were selected and posed to ChatGPT-4 and Claude 3.5 Sonet. Ten independent neurosurgeons evaluated responses using a structured framework that assessed accuracy, response time, clarity, and relevance. A 5-point Likert scale also measured satisfaction, quality, performance, and importance. Advanced statistical analyses were used to compare the 2 artificial intelligence platforms, including sensitivity, specificity, P values, confidence intervals, and Cohen's d. RESULTS Claude 3.5 Sonet outperformed ChatGPT-4 across all metrics, particularly in accuracy (96.5% vs. 80.6%), response time (92.9% vs. 76.4%), clarity (94.6% vs. 75.4%), and relevance (95.5% vs. 74.0%). Likert scale evaluations showed significant differences (P < 0.001) in satisfaction, quality, and performance, with Claude achieving higher ratings. Statistical analyses confirmed large effect sizes, high inter-rater reliability (kappa: 0.85-0.92 for Claude), and narrower confidence intervals, reinforcing Claude's consistency and superior performance. CONCLUSIONS Claude 3.5 Sonet demonstrated exceptional capability in addressing postoperative concerns for posterior cervical fusion patients, surpassing ChatGPT-4 in accuracy, clarity, and practical relevance. These findings underscore its potential as a reliable artificial intelligence tool for enhancing patient care and satisfaction in surgical aftercare.
Affiliation(s)
- Rauf Nasirov
- Department of Neurosurgery, Denver Health Medical Center, University of Colorado, Denver, Colorado, USA
7
Gunes YC, Cesur T, Camur E, Cifci BE, Kaya T, Colakoglu MN, Koc U, Okten RS. Textual Proficiency and Visual Deficiency: A Comparative Study of Large Language Models and Radiologists in MRI Artifact Detection and Correction. Acad Radiol 2025; 32:2411-2421. [PMID: 39939230] [DOI: 10.1016/j.acra.2025.01.004]
Abstract
RATIONALE AND OBJECTIVES To assess the performance of Large Language Models (LLMs) in detecting and correcting MRI artifacts compared to radiologists using text-based and visual questions. MATERIALS AND METHODS This cross-sectional observational study included three phases. Phase 1 involved six LLMs (ChatGPT o1-preview, ChatGPT-4o, ChatGPT-4V, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, Claude 3 Opus) and five radiologists (two residents, two junior radiologists, one senior radiologist) answering 42 text-based questions on MRI artifacts. In Phase 2, the same radiologists and five multimodal LLMs evaluated 100 MRI images, each containing a single artifact. Phase 3 reassessed the identical tasks 1.5 months later to evaluate temporal consistency. Responses were graded using 4-point Likert scales for "Management Score" (text-based) and "Correction Score" (visual). McNemar's test compared response accuracy, and the Wilcoxon test assessed score differences. RESULTS LLMs outperformed radiologists in text-based tasks, with ChatGPT o1-preview scoring the highest (3.71±0.60 in Round 1; 3.76±0.84 in Round 2) (p<0.05). In visual tasks, radiologists performed significantly better, with the Senior Radiologist achieving 92% and 94% accuracy in Rounds 1 and 2, respectively (p<0.05). The top-performing LLM (ChatGPT-4o) achieved only 20% and 18% accuracy. Correction Scores mirrored this difference, with radiologists consistently scoring higher than LLMs (p<0.05). CONCLUSION LLMs excel in text-based tasks but have notable limitations in visual artifact interpretation, making them unsuitable for independent diagnostics. They are promising as educational tools or adjuncts in "human-in-the-loop" systems, with multimodal AI improvements necessary to bridge these gaps.
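A minimal sketch of the two statistical comparisons named above (McNemar's test for paired response accuracy and the Wilcoxon signed-rank test for paired scores), using made-up per-question data rather than the study's results:

```python
# Illustrative sketch: paired comparisons between one LLM and one radiologist.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
llm_correct = rng.integers(0, 2, size=42)   # 1 = correct answer (made-up data)
rad_correct = rng.integers(0, 2, size=42)

# 2x2 agreement/disagreement table for McNemar's test
table = np.array([
    [np.sum((llm_correct == 1) & (rad_correct == 1)), np.sum((llm_correct == 1) & (rad_correct == 0))],
    [np.sum((llm_correct == 0) & (rad_correct == 1)), np.sum((llm_correct == 0) & (rad_correct == 0))],
])
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)

# Paired 4-point "Management Scores" compared with the Wilcoxon signed-rank test (made-up data)
llm_scores = rng.integers(1, 5, size=42)
rad_scores = rng.integers(1, 5, size=42)
print("Wilcoxon p-value:", wilcoxon(llm_scores, rad_scores).pvalue)
```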
Affiliation(s)
- Yasin Celal Gunes
- Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale, Turkey (Y.C.G.)
- Turay Cesur
- Department of Radiology, Mamak State Hospital, Ankara, Turkey (T.C.)
- Eren Camur
- Department of Radiology, Ankara 29 Mayıs State Hospital, Ankara, Turkey (E.C.)
- Bilal Egemen Cifci
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Turan Kaya
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Mehmet Numan Colakoglu
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Ural Koc
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Rıza Sarper Okten
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
8
Sumner J, Wang Y, Tan SY, Chew EHH, Wenjun Yip A. Perspectives and Experiences With Large Language Models in Health Care: Survey Study. J Med Internet Res 2025; 27:e67383. [PMID: 40310666] [PMCID: PMC12082058] [DOI: 10.2196/67383]
Abstract
BACKGROUND Large language models (LLMs) are transforming how data is used, including within the health care sector. However, frameworks including the Unified Theory of Acceptance and Use of Technology highlight the importance of understanding the factors that influence technology use for successful implementation. OBJECTIVE This study aimed to (1) investigate users' uptake, perceptions, and experiences regarding LLMs in health care and (2) contextualize survey responses by demographics and professional profiles. METHODS An electronic survey was administered to elicit stakeholder perspectives of LLMs (health care providers and support functions), their experiences with LLMs, and their potential impact on functional roles. Survey domains included: demographics (6 questions), user experiences of LLMs (8 questions), motivations for using LLMs (6 questions), and perceived impact on functional roles (4 questions). The survey was launched electronically, targeting health care providers or support staff, health care students, and academics in health-related fields. Respondents were adults (>18 years) aware of LLMs. RESULTS Responses were received from 1083 individuals, of which 845 were analyzable. Of the 845 respondents, 221 had yet to use an LLM. Nonusers were more likely to be health care workers (P<.001), older (P<.001), and female (P<.01). Users primarily adopted LLMs for speed, convenience, and productivity. While 75% (470/624) agreed that the user experience was positive, 46% (294/624) found the generated content unhelpful. Regression analysis showed that the experience with LLMs is more likely to be positive if the user is male (odds ratio [OR] 1.62, CI 1.06-2.48), and increasing age was associated with a reduced likelihood of reporting LLM output as useful (OR 0.98, CI 0.96-0.99). Nonusers compared to LLM users were less likely to report LLMs meeting unmet needs (45%, 99/221 vs 65%, 407/624; OR 0.48, CI 0.35-0.65), and males were more likely to report that LLMs do address unmet needs (OR 1.64, CI 1.18-2.28). Furthermore, nonusers compared to LLM users were less likely to agree that LLMs will improve functional roles (63%, 140/221 vs 75%, 469/624; OR 0.60, CI 0.43-0.85). Free-text opinions highlighted concerns regarding autonomy, outperformance, and reduced demand for care. Respondents also predicted changes to human interactions, including fewer but higher quality interactions and a change in consumer needs as LLMs become more common, which would require provider adaptation. CONCLUSIONS Despite the reported benefits of LLMs, nonusers-primarily health care workers, older individuals, and females-appeared more hesitant to adopt these tools. These findings underscore the need for targeted education and support to address adoption barriers and ensure the successful integration of LLMs in health care. Anticipated role changes, evolving human interactions, and the risk of the digital divide further emphasize the need for careful implementation and ongoing evaluation of LLMs in health care to ensure equity and sustainability.
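As a hedged illustration of how odds ratios like those reported above are typically derived, the sketch below fits a logistic regression of a positive-experience indicator on sex and age with statsmodels and exponentiates the coefficients; the data file and variable names are assumptions, not the study's actual dataset.

```python
# Sketch of odds-ratio estimation from survey responses (hypothetical variables).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

survey = pd.read_csv("llm_survey.csv")  # hypothetical: analyzable responses, one row per respondent

# positive_experience: 0/1; sex: "male"/"female"; age in years (assumed column names)
model = smf.logit("positive_experience ~ C(sex, Treatment('female')) + age", data=survey).fit()

odds_ratios = pd.DataFrame({
    "OR": np.exp(model.params),
    "CI_lower": np.exp(model.conf_int()[0]),
    "CI_upper": np.exp(model.conf_int()[1]),
})
print(odds_ratios)
```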
Affiliation(s)
- Jennifer Sumner
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Yuchen Wang
- School of Computing, National University of Singapore, Singapore, Singapore
- Si Ying Tan
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Emily Hwee Hoon Chew
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
- Alexander Wenjun Yip
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
9
Ballard DH, Antigua-Made A, Barre E, Edney E, Gordon EB, Kelahan L, Lodhi T, Martin JG, Ozkan M, Serdynski K, Spieler B, Zhu D, Adams SJ. Impact of ChatGPT and Large Language Models on Radiology Education: Association of Academic Radiology-Radiology Research Alliance Task Force White Paper. Acad Radiol 2025; 32:3039-3049. [PMID: 39616097] [DOI: 10.1016/j.acra.2024.10.023]
Abstract
Generative artificial intelligence, including large language models (LLMs), holds immense potential to enhance healthcare, medical education, and health research. Recognizing the transformative opportunities and potential risks afforded by LLMs, the Association of Academic Radiology-Radiology Research Alliance convened a task force to explore the promise and pitfalls of using LLMs such as ChatGPT in radiology. This white paper explores the impact of LLMs on radiology education, highlighting their potential to enrich curriculum development, teaching and learning, and learner assessment. Despite these advantages, the implementation of LLMs presents challenges, including limits on accuracy and transparency, the risk of misinformation, data privacy issues, and potential biases, which must be carefully considered. We provide recommendations for the successful integration of LLMs and LLM-based educational tools into radiology education programs, emphasizing assessment of the technological readiness of LLMs for specific use cases, structured planning, regular evaluation, faculty development, increased training opportunities, academic-industry collaboration, and research on best practices for employing LLMs in education.
Affiliation(s)
- David H Ballard
- Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, Missouri, USA
- Emily Barre
- Duke University School of Medicine, Durham, North Carolina, USA
- Elizabeth Edney
- Department of Radiology, University of Nebraska Medical Center, Omaha, Nebraska, USA
- Emile B Gordon
- Department of Radiology, University of California San Diego, San Diego, California, USA
- Linda Kelahan
- Department of Radiology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Taha Lodhi
- Brody School of Medicine at East Carolina University, Greenville, North Carolina, USA
- Melis Ozkan
- University of Michigan Medical School, Ann Arbor, Michigan, USA
- Bradley Spieler
- Department of Radiology, Louisiana State University School of Medicine, University Medical Center, New Orleans, Louisiana, USA
- Daphne Zhu
- Duke University School of Medicine, Durham, North Carolina, USA
- Scott J Adams
- Department of Medical Imaging, Royal University Hospital, College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
10
Araujo MLD, Winger T, Ghosn S, Saab C, Srivastava J, Kazaglis L, Mathur P, Mehra R. Status and opportunities of machine learning applications in obstructive sleep apnea: A narrative review. Comput Struct Biotechnol J 2025; 28:167-174. [PMID: 40421411] [PMCID: PMC12104685] [DOI: 10.1016/j.csbj.2025.04.033]
Abstract
Background Obstructive sleep apnea (OSA) is a prevalent and potentially severe sleep disorder characterized by repeated interruptions in breathing during sleep. Machine learning models have been increasingly applied in various aspects of OSA research, including diagnosis, treatment optimization, and developing biomarkers for endotypes and disease mechanisms. Objective This narrative review evaluates the application of machine learning in OSA research, focusing on model performance, dataset characteristics, demographic representation, and validation strategies. We aim to identify trends and gaps to guide future research and improve clinical decision-making that leverages machine learning. Methods This narrative review examines data extracted from 254 scientific publications published in the PubMed database between January 2018 and March 2023. Studies were categorized by machine learning applications, models, tasks, validation metrics, data sources, and demographics. Results Our analysis revealed that most machine learning applications focused on OSA classification and diagnosis, utilizing various data sources such as polysomnography, electrocardiogram data, and wearable devices. We also found that study cohorts were predominantly overweight males, with an underrepresentation of women, younger obese adults, individuals over 60 years old, and diverse racial groups. Many studies had small sample sizes and limited use of robust model validation. Conclusion Our findings highlight the need for more inclusive research approaches, starting with adequate data collection in terms of sample size and bias mitigation for better generalizability of machine learning models in OSA research. Addressing these demographic gaps and methodological opportunities is critical for ensuring more robust and equitable applications of artificial intelligence in healthcare.
Affiliation(s)
- Samer Ghosn
- Cleveland Clinic Foundation, Cleveland, OH, USA
- Carl Saab
- Cleveland Clinic Foundation, Cleveland, OH, USA
11
Raza M, Jahangir Z, Riaz MB, Saeed MJ, Sattar MA. Industrial applications of large language models. Sci Rep 2025; 15:13755. [PMID: 40258923] [PMCID: PMC12012124] [DOI: 10.1038/s41598-025-98483-1]
Abstract
Large language models (LLMs) are artificial intelligence (AI) based computational models designed to understand and generate human-like text. With billions of training parameters, LLMs excel in identifying intricate language patterns, enabling remarkable performance across a variety of natural language processing (NLP) tasks. Since the introduction of transformer architectures, they have been reshaping industry with their text generation capabilities. LLMs play an innovative role across various industries by automating NLP tasks. In healthcare, they assist in diagnosing diseases, personalizing treatment plans, and managing patient data. In the automotive industry, LLMs support predictive maintenance. They also power recommendation systems and consumer behavior analysis. In education, LLMs support researchers and offer personalized learning experiences. In finance and banking, LLMs are used for fraud detection, customer service automation, and risk management. LLMs are driving significant advancements across industries by automating tasks, improving accuracy, and providing deeper insights. Despite these advancements, LLMs face challenges such as ethical concerns, biases in training data, and significant computational resource requirements, which must be addressed to ensure impartial and sustainable deployment. This study provides a comprehensive analysis of LLMs, their evolution, and their diverse applications across industries, offering researchers valuable insights into their transformative potential and the accompanying limitations.
Affiliation(s)
- Mubashar Raza
- Department of Computer Science, COMSATS University, Sahiwal Campus, Islamabad, Pakistan
- Zarmina Jahangir
- Department of Computer Science, Riphah International University, Lahore Campus, Lahore, Pakistan
- Muhammad Bilal Riaz
- IT4Innovations, VSB - Technical University of Ostrava, Ostrava, Czech Republic
- Applied Science Research Center, Applied Science Private University, Amman, Jordan
- Muhammad Jasim Saeed
- Department of Computer Science, Riphah International University, Lahore Campus, Lahore, Pakistan
- Muhammad Awais Sattar
- Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, Luleå, Sweden
12
Wang TD, Murphy SN, Castro VM, Klann JG. From Spreadsheets and Bespoke Models to Enterprise Data Warehouses: GPT-enabled Clinical Data Ingestion into i2b2. medRxiv [Preprint] 2025:2025.04.17.25325962. [PMID: 40321272] [PMCID: PMC12047957] [DOI: 10.1101/2025.04.17.25325962]
Abstract
Objective Clinical and phenotypic data available to researchers are often found in spreadsheets or bespoke data models. Bridging these to enterprise data warehouses would enable sophisticated analytics and cohort discovery for users of platforms like NHGRI's Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL). We combine data mapping methodologies, biomedical ontologies, and large language models (LLMs) to load these data into Informatics for Integrating Biology and the Bedside (i2b2), making them available to AnVIL users. Materials and Methods We developed few-shot prompts for ChatGPT-4o to generate Python scripts that facilitate the extract, transform, and load (ETL) process into i2b2. The scripts first convert a designated data dictionary (in various formats) into an intermediate common format, and then into an i2b2 ontology. Finally, the original data file is converted into i2b2 facts, using standard ontologies hosted by the National Center for Biomedical Ontology (NCBO). Results ChatGPT-4o correctly produced Python code to facilitate ETL. We converted phenotype data from three synthetic datasets representing three disparate data models available in AnVIL. Our prompts generated scripts which successfully converted data on 3,458 synthetic patients, making it queryable in i2b2. Discussion For a few datasets, iterative prompt refinement might reduce ETL efficiency gains. However, prompt reuse significantly reduces incremental effort for additional data models. At scale, we anticipate our pipeline offers substantial time savings, which could transform future ETL workflows. Conclusion We developed an LLM-powered ETL pipeline to convert disparate datasets into i2b2 format, enabling advanced analytics and cohort discovery across heterogeneous data models.
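A hedged sketch of the few-shot code-generation step described above: prompting GPT-4o through the OpenAI API to emit a Python ETL script for a given data dictionary. The prompt wording, example, and file names are placeholders, not the authors' actual materials.

```python
# Illustrative sketch: asking GPT-4o to generate an i2b2-loading script from a data dictionary.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

few_shot_example = (
    "Data dictionary (CSV): variable,description,type,allowed_values\n"
    "sex,Participant sex,categorical,M|F\n"
    "Generated loader: <python script mapping each variable to an i2b2 ontology path>"
)

data_dictionary = open("anvil_phenotype_dictionary.csv").read()  # hypothetical input file

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You write Python ETL scripts that load phenotype "
                                      "data dictionaries into i2b2 ontology and fact tables."},
        {"role": "user", "content": f"Example:\n{few_shot_example}\n\n"
                                    f"Now generate a script for this data dictionary:\n{data_dictionary}"},
    ],
)

generated_script = response.choices[0].message.content
open("generated_etl.py", "w").write(generated_script)  # review before executing
```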
Affiliation(s)
- Taowei David Wang
- Harvard Medical School, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Boston, MA
- Shawn N Murphy
- Harvard Medical School, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Boston, MA
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA
- Victor M Castro
- Research Information Science and Computing, Mass General Brigham, Boston, MA
- Jeffrey G Klann
- Harvard Medical School, Boston, MA
- Research Information Science and Computing, Mass General Brigham, Boston, MA
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA
13
Lim B, Seth I, Maxwell M, Cuomo R, Ross RJ, Rozen WM. Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude. Aesthetic Plast Surg 2025. [PMID: 40229614] [DOI: 10.1007/s00266-025-04842-8]
Abstract
BACKGROUND Large language models (LLMs) have demonstrated transformative potential in health care. They can enhance clinical and academic medicine by facilitating accurate diagnoses, interpreting laboratory results, and automating documentation processes. This study evaluates the efficacy of LLMs in generating surgical operation reports and discharge summaries, focusing on accuracy, efficiency, and quality. METHODS This study assessed the effectiveness of three leading LLMs (ChatGPT-4, ChatGPT-4o, and Claude) using six prompts and analyzing their responses for readability and output quality, validated by plastic surgeons. Readability was measured with the Flesch-Kincaid, Flesch reading ease, and Coleman-Liau indices, while reliability was evaluated using the DISCERN score. A paired two-tailed t-test (p<0.05) was used to assess the statistical significance of differences in these metrics and in the time taken to generate operation reports and discharge summaries against the authors' results. RESULTS Table 3 shows statistically significant differences in readability between ChatGPT-4o and Claude across all metrics, while ChatGPT-4 and Claude differ significantly in the Flesch reading ease and Coleman-Liau indices. Table 6 reveals extremely low p-values across BL, IS, and MM for all models, with Claude consistently outperforming both ChatGPT-4 and ChatGPT-4o. Additionally, Claude generated documents the fastest, completing tasks in approximately 10 to 14 seconds. These results suggest that Claude not only excels in readability but also demonstrates superior reliability and speed, making it an efficient choice for practical applications. CONCLUSION The study highlights the importance of selecting appropriate LLMs for clinical use. Integrating these LLMs can streamline healthcare documentation, improve efficiency, and enhance patient outcomes through clearer communication and more accurate medical reports. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.
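The readability metrics used in this study can be computed with the textstat package, as in the hedged sketch below; the sample text is a placeholder rather than one of the generated documents.

```python
# Sketch of the three readability metrics named above, via textstat.
import textstat

operation_report = (
    "The patient underwent bilateral breast reduction under general anaesthesia. "
    "Haemostasis was achieved and the wounds were closed in layers."
)  # placeholder text, not study output

print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(operation_report))
print("Flesch reading ease:", textstat.flesch_reading_ease(operation_report))
print("Coleman-Liau index:", textstat.coleman_liau_index(operation_report))
```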
Affiliation(s)
- Bryan Lim
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia
- Ishith Seth
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia
- Molly Maxwell
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Roberto Cuomo
- Department of Plastic and Reconstructive Surgery, University of Siena, Siena, Italy
- Richard J Ross
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Warren M Rozen
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia
14
Wang Y, Zhu T, Zhou T, Wu B, Tan W, Ma K, Yao Z, Wang J, Li S, Qin F, Xu Y, Tan L, Liu J, Wang J. Hyper-DREAM, a Multimodal Digital Transformation Hypertension Management Platform Integrating Large Language Model and Digital Phenotyping: Multicenter Development and Initial Validation Study. J Med Syst 2025; 49:42. [PMID: 40172683] [DOI: 10.1007/s10916-025-02176-1]
Abstract
Within the mHealth framework, systematic research that collects and analyzes patient data to establish comprehensive digital health archives for hypertensive patients, and leverages large language models (LLMs) to assist clinicians in health management and blood pressure (BP) control, remains limited. In this study, we aim to describe the design, development, and usability evaluation process of a management platform (Hyper-DREAM) for hypertension. Our multidisciplinary team employed an iterative design approach over the course of a year to develop the Hyper-DREAM platform. The platform's primary functionalities encompass multimodal data collection (personal hypertensive digital phenotype archive), multimodal interventions (BP measurement, medication assistance, behavior modification, and hypertension education), and multimodal interactions (clinician-patient engagement and the BP Coach component). In August 2024, the mHealth App Usability Questionnaire (MAUQ) was administered to 51 hypertensive patients recruited from three distinct centers. In parallel, six clinicians engaged in management activities and contributed feedback via the Doctor's Software Satisfaction Questionnaire (DSSQ). Concurrently, a real-world comparative experiment was conducted to evaluate the usability of the BP Coach, ChatGPT-4o Mini, ChatGPT-4o, and clinicians. The comparative experiment demonstrated that the BP Coach achieved significantly higher scores in utility (mean score 4.05, SD 0.87) and completeness (mean score 4.12, SD 0.78) when compared to ChatGPT-4o Mini, ChatGPT-4o, and clinicians. In terms of clarity, the BP Coach was slightly lower than clinicians (mean score 4.03, SD 0.88). In addition, the BP Coach exhibited lower performance in conciseness (mean score 3.00, SD 0.96). Clinicians reported a marked improvement in work efficiency (2.67 vs. 4.17, P < .001) and experienced faster and more effective patient interactions (3.0 vs. 4.17, P = .004). Furthermore, the Hyper-DREAM platform significantly decreased work intensity (2.5 vs. 3.5, P = .01) and minimized disruptions to daily routines (2.33 vs. 3.55, P = .004). The Hyper-DREAM platform demonstrated significantly greater overall satisfaction compared to WeChat-based standard management (3.33 vs. 4.17, P = .01). Additionally, clinicians exhibited a markedly higher willingness to integrate the Hyper-DREAM platform into clinical practice (2.67 vs. 4.17, P < .001). Furthermore, patient management time decreased from 11.5 min (SD 1.87) with WeChat-based standard management to 7.5 min (SD 1.84, P = .01) with Hyper-DREAM. Hypertensive patients reported high satisfaction with the Hyper-DREAM platform, including ease of use (mean score 1.60, SD 0.69), system information arrangement (mean score 1.69, SD 0.71), and usefulness (mean score 1.57, SD 0.58). In conclusion, our study presents Hyper-DREAM, a novel artificial intelligence-driven platform for hypertension management, designed to alleviate clinician workload and exhibiting significant promise for clinical application. The Hyper-DREAM platform is distinguished by its user-friendliness, high satisfaction rates, utility, and effective organization of information. Furthermore, the BP Coach component underscores the potential of LLMs in advancing mHealth approaches to hypertension management.
Affiliation(s)
- Yijun Wang
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
- West China Hospital, Sichuan University, Chengdu, 610041, China
- Tongjian Zhu
- Department of Cardiology, Institute of Cardiovascular Diseases, Xiangyang Central Hospital, Affiliated Hospital of Hubei University of Arts and Science, Xiangyang, Hubei, China
- Tong Zhou
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
- Bing Wu
- Institute of Clinical Medicine and Department of Cardiology, Renmin Hospital, Hubei University of Medicine, Shiyan, 442000, Hubei, China
- Wuping Tan
- Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, China
- Kezhong Ma
- Department of Cardiology, Institute of Cardiovascular Diseases, Xiangyang Central Hospital, Affiliated Hospital of Hubei University of Arts and Science, Xiangyang, Hubei, China
- Zhuoya Yao
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
- Jian Wang
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
- Siyang Li
- Department of Cardiology, Institute of Cardiovascular Diseases, Xiangyang Central Hospital, Affiliated Hospital of Hubei University of Arts and Science, Xiangyang, Hubei, China
- Fanglin Qin
- Mental Health Center and Psychiatric Laboratory, West China Hospital, Sichuan University, Chengdu, 610041, China
- Yannan Xu
- Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Bengbu Medical University, Bengbu, Anhui, China
- Liguo Tan
- Institute of Clinical Medicine and Department of Cardiology, Renmin Hospital, Hubei University of Medicine, Shiyan, 442000, Hubei, China
- Jinjun Liu
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
- Jun Wang
- Department of Cardiology, The First Affiliated Hospital of Bengbu Medical University, 287 Changhuai Road, Longzihu District, Bengbu City, Anhui Province, 430060, P.R. China
15
Clay B, Bergman HI, Salim S, Pergola G, Shalhoub J, Davies AH. Natural language processing techniques applied to the electronic health record in clinical research and practice - an introduction to methodologies. Comput Biol Med 2025; 188:109808. [PMID: 39946783] [DOI: 10.1016/j.compbiomed.2025.109808]
Abstract
Natural Language Processing (NLP) has the potential to revolutionise clinical research utilising Electronic Health Records (EHR) through the automated analysis of unstructured free text. Despite this potential, relatively few applications have entered real-world clinical practice. This paper aims to introduce the whole pipeline of NLP methodologies for EHR analysis to the clinical researcher, with case studies to demonstrate the application of these methods in the existing literature. Essential pre-processing steps are introduced, followed by the two major classes of analytical frameworks: statistical methods and Artificial Neural Networks (ANNs). Case studies which apply statistical and ANN-based methods are then provided and discussed, illustrating information extraction tasks for objective and subjective information, and classification/prediction tasks using supervised and unsupervised approaches. State-of-the-art large language models and future directions for research are then discussed. This educational article aims to bridge the gap between the clinical researcher and the NLP expert, providing clinicians with a background understanding of the NLP techniques relevant to EHR analysis, allowing engagement with this rapidly evolving area of research, which is likely to have a major impact on clinical practice in coming years.
Affiliation(s)
- Benjamin Clay
- Department of Trauma and Orthopaedic Surgery, East Suffolk and North Essex NHS Foundation Trust, Ipswich Hospital, Heath Road, Ipswich, IP4 5PD, United Kingdom; Department of Public Health and Primary Care, University of Cambridge, Forvie Site, Robinson Way, Cambridge, CB2 0SR, United Kingdom
- Henry I Bergman
- Academic Section of Vascular Surgery, Department of Surgery and Cancer, Imperial College London, London, SW7 2AZ, United Kingdom
- Safa Salim
- Academic Section of Vascular Surgery, Department of Surgery and Cancer, Imperial College London, London, SW7 2AZ, United Kingdom
- Gabriele Pergola
- Department of Computer Science, University of Warwick, Coventry, CV4 7AL, United Kingdom
- Joseph Shalhoub
- Academic Section of Vascular Surgery, Department of Surgery and Cancer, Imperial College London, London, SW7 2AZ, United Kingdom
- Alun H Davies
- Academic Section of Vascular Surgery, Department of Surgery and Cancer, Imperial College London, London, SW7 2AZ, United Kingdom
16
Biesheuvel LA, Workum JD, Reuland M, van Genderen ME, Thoral P, Dongelmans D, Elbers P. Large language models in critical care. J Intensive Med 2025; 5:113-118. [PMID: 40241839] [PMCID: PMC11997603] [DOI: 10.1016/j.jointm.2024.12.001]
Abstract
The advent of chat generative pre-trained transformer (ChatGPT) and large language models (LLMs) has revolutionized natural language processing (NLP). These models possess unprecedented capabilities in understanding and generating human-like language. This breakthrough holds significant promise for critical care medicine, where unstructured data and complex clinical information are abundant. Key applications of LLMs in this field include administrative support through automated documentation and patient chart summarization; clinical decision support by assisting in diagnostics and treatment planning; personalized communication to enhance patient and family understanding; and improving data quality by extracting insights from unstructured clinical notes. Despite these opportunities, challenges such as the risk of generating inaccurate or biased information ("hallucinations"), ethical considerations, and the need for clinician artificial intelligence (AI) literacy must be addressed. Integrating LLMs with traditional machine learning models - an approach known as Hybrid AI - combines the strengths of both technologies while mitigating their limitations. Careful implementation, regulatory compliance, and ongoing validation are essential to ensure that LLMs enhance patient care rather than hinder it. LLMs have the potential to transform critical care practices, but integrating them requires caution. Responsible use and thorough clinician training are crucial to fully realize their benefits.
Affiliation(s)
- Laurens A. Biesheuvel
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Jessica D. Workum
- Department of Intensive Care, Elisabeth-TweeSteden Hospital, Tilburg, The Netherlands
- Department of Adult Intensive Care, Erasmus Medical Center, Rotterdam, The Netherlands
- Merijn Reuland
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Patrick Thoral
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Dave Dongelmans
- Department of Intensive Care Medicine, Amsterdam UMC, National Intensive Care Evaluation (NICE) Foundation, Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health, Amsterdam, The Netherlands
- Paul Elbers
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
17
|
Li J, Yang Y, Chen R, Zheng D, Pang PCI, Lam CK, Wong D, Wang Y. Identifying healthcare needs with patient experience reviews using ChatGPT. PLoS One 2025; 20:e0313442. [PMID: 40100826 PMCID: PMC11918364 DOI: 10.1371/journal.pone.0313442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2024] [Accepted: 10/23/2024] [Indexed: 03/20/2025] Open
Abstract
BACKGROUND Valuable findings can be obtained by mining patients' online reviews. Identifying healthcare needs from the patient's perspective can more accurately improve the quality of care and the visit experience, thereby avoiding unnecessary waste of healthcare resources. Large language models (LLMs) are a promising tool for this purpose, given research demonstrating their outstanding performance and potential in areas such as data mining and healthcare management. OBJECTIVE We propose a methodology that leverages recent breakthroughs in LLMs to effectively understand healthcare needs from patient experience reviews. METHODS We used 504,198 reviews collected from a large online medical platform, haodf.com. The reviews were used to create Aspect-Based Sentiment Analysis (ABSA) templates, which categorized patient reviews into three categories reflecting patients' areas of concern. With the introduction of chains of thought, we embedded the ABSA templates into prompts for ChatGPT, which was then used to identify patient needs. RESULTS Our method achieved a weighted total precision of 0.944, outperforming direct narrative prompting of ChatGPT-4o, which achieved a weighted total precision of 0.890. Weighted total recall and F1-scores reached 0.884 and 0.912, respectively, surpassing the 0.802 and 0.843 scores for direct narratives in ChatGPT. Finally, the accuracy of the three sampling methods was 91.8%, 91.7%, and 91.2%, with an average accuracy of over 91.5%. CONCLUSIONS Combining ChatGPT with ABSA templates can achieve satisfactory results in analyzing patient reviews. As our approach applies to other LLMs, we shed light on understanding the demands of patients and health consumers with novel models, which can contribute to enhancing patient experience and allocating healthcare resources more effectively.
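The abstract above describes embedding ABSA templates and chain-of-thought instructions into ChatGPT prompts. A minimal sketch of that pattern is shown below using the OpenAI Python SDK; the category names, prompt wording, and model choice are illustrative assumptions, not the authors' actual templates.

```python
# Illustrative sketch of embedding an aspect-based sentiment analysis (ABSA)
# template into a chat prompt. Category names and prompt text are assumptions,
# not the templates used in the study above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ABSA_TEMPLATE = """You are analyzing a patient experience review.
Classify the review into exactly one of these aspect categories:
1. medical_quality    (diagnosis, treatment effectiveness)
2. service_attitude   (communication, staff behaviour)
3. process_efficiency (waiting time, appointment logistics)

Think step by step about which aspects the patient mentions and their sentiment,
then output only the category name."""

def classify_review(review_text: str) -> str:
    """Return the predicted need category for one review."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ABSA_TEMPLATE},
            {"role": "user", "content": review_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(classify_review("The doctor explained everything patiently, but I waited three hours."))
```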
Collapse
Affiliation(s)
- Jiaxuan Li
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Yunchu Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Rong Chen
- Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-Sen University, Guangzhou, China
| | - Dashun Zheng
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | | | - Chi Kin Lam
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Dennis Wong
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- State University of New York, Songdo, Korea
| | - Yapeng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| |
Collapse
|
18
|
Menz BD, Modi ND, Abuhelwa AY, Ruanglertboon W, Vitry A, Gao Y, Li LX, Chhetri R, Chu B, Bacchi S, Kichenadasse G, Shahnam A, Rowland A, Sorich MJ, Hopkins AM. Generative AI chatbots for reliable cancer information: Evaluating web-search, multilingual, and reference capabilities of emerging large language models. Eur J Cancer 2025; 218:115274. [PMID: 39922126 DOI: 10.1016/j.ejca.2025.115274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2025] [Revised: 01/22/2025] [Accepted: 01/24/2025] [Indexed: 02/10/2025]
Abstract
Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Copilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. Valid references were included in 48% (162/336) of responses, but 39% of the English references were .com links, reflecting quality concerns. English responses frequently exceeded an eighth-grade reading level, and many non-English outputs were also complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in multilingual accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs safely support global health information needs and meet online information standards.
Collapse
Affiliation(s)
- Bradley D Menz
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Natansh D Modi
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Ahmad Y Abuhelwa
- Department of Pharmacy Practice and Pharmacotherapeutics, College of Pharmacy, University of Sharjah, Sharjah, United Arab Emirates
| | - Warit Ruanglertboon
- Division of Health and Applied Sciences, Prince of Songkla University, Songkhla, Thailand; Research Center in Mathematics and Statistics with Applications, Discipline of Statistics, Division of Computational Science, Faculty of Science, Prince of Songkla University, Songkhla, Thailand
| | - Agnes Vitry
- University of South Australia, Clinical and Health Sciences, Adelaide, Australia
| | - Yuan Gao
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Lee X Li
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Rakchha Chhetri
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Bianca Chu
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Stephen Bacchi
- Department of Neurology and the Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02138, USA
| | - Ganessan Kichenadasse
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia; Flinders Centre for Innovation in Cancer, Department of Medical Oncology, Flinders Medical Centre, Flinders University, Bedford Park, South Australia, Australia
| | - Adel Shahnam
- Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, Australia
| | - Andrew Rowland
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Michael J Sorich
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia
| | - Ashley M Hopkins
- College of Medicine and Public Health, Flinders Health and Medical Research Institute, Flinders University, Adelaide, Australia.
| |
Collapse
|
19
|
Omar M, Soffer S, Agbareia R, Bragazzi NL, Glicksberg BS, Hurd YL, Apakama DU, Charney AW, Reich DL, Nadkarni GN, Klang E. LLM-Guided Pain Management: Examining Socio-Demographic Gaps in Cancer vs non-Cancer cases. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2025:2025.03.04.25323396. [PMID: 40093243 PMCID: PMC11908302 DOI: 10.1101/2025.03.04.25323396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 03/19/2025]
Abstract
Large language models (LLMs) offer potential benefits in clinical care. However, concerns remain regarding socio-demographic biases embedded in their outputs. Opioid prescribing is one domain in which these biases can have serious implications, especially given the ongoing opioid epidemic and the need to balance effective pain management with addiction risk. We tested ten LLMs-both open-access and closed-source-on 1,000 acute-pain vignettes. Half of the vignettes were labeled as non-cancer and half as cancer. Each vignette was presented in 34 socio-demographic variations, including a control group without demographic identifiers. We analyzed the models' recommendations on opioids, anxiety treatment, perceived psychological stress, risk scores, and monitoring, yielding 3.4 million model-generated responses overall. Using logistic and linear mixed-effects models, we measured how these outputs varied by demographic group and whether a cancer diagnosis intensified or reduced observed disparities. Across both cancer and non-cancer cases, historically marginalized groups-especially cases labeled as individuals who are unhoused, Black, or identify as LGBTQIA+-often received more or stronger opioid recommendations, sometimes exceeding 90% in cancer settings, despite being labeled as high risk by the same models. Meanwhile, low-income or unemployed groups were assigned elevated risk scores yet fewer opioid recommendations, hinting at inconsistent rationales. Disparities in anxiety treatment and perceived psychological stress similarly clustered within marginalized populations, even when clinical details were identical. These patterns diverged from standard guidelines and point to model-driven bias rather than acceptable clinical variation. Our findings underscore the need for rigorous bias evaluation and the integration of guideline-based checks in LLMs to ensure equitable and evidence-based pain care.
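The abstract above reports logistic and linear mixed-effects models relating model outputs to demographic labels. The sketch below shows one plausible form of such an analysis with statsmodels; the column names, simulated data, and model specification are assumptions, not the authors' actual code.

```python
# Minimal sketch of a linear mixed-effects analysis of LLM outputs: risk score
# as a function of the socio-demographic label, with a random intercept per
# vignette. Column names and simulated data are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
groups = ["control", "unhoused", "low_income"]
df = pd.DataFrame(
    [
        {"vignette": v, "group": g,
         "risk_score": rng.normal(5 + (1.0 if g != "control" else 0.0), 1.0)}
        for v in range(50)
        for g in groups
    ]
)

# Fixed effect: demographic label; random intercept: vignette identity.
model = smf.mixedlm("risk_score ~ C(group, Treatment('control'))",
                    data=df, groups=df["vignette"])
result = model.fit()
print(result.summary())
```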
Collapse
Affiliation(s)
- Mahmud Omar
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Shelly Soffer
- Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center; Petah-Tikva, Israel
| | - Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
| | - Nicola Luigi Bragazzi
- Human Nutrition Unit (HNU), Department of Food and Drugs, Medical School, Parma, Italy
| | - Benjamin S Glicksberg
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Yasmin L Hurd
- Department of Psychiatry, Icahn School of Medicine at Mount Sinai, Addiction Institute of Mount Sinai, 1399 Park Ave, Room 3-330, New York, NY, 10029, USA
| | - Donald U. Apakama
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Health Equity Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Alexander W Charney
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - David L Reich
- Department of Anesthesiology, Perioperative, and Pain Medicine, Icahn School of Medicine at Mount Sinai, New York, NY
| | - Girish N Nadkarni
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Eyal Klang
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, NY, USA
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
Collapse
|
20
|
Lo Bianco G, Robinson CL, D’Angelo FP, Cascella M, Natoli S, Sinagra E, Mercadante S, Drago F. Effectiveness of Generative Artificial Intelligence-Driven Responses to Patient Concerns in Long-Term Opioid Therapy: Cross-Model Assessment. Biomedicines 2025; 13:636. [PMID: 40149612 PMCID: PMC11940240 DOI: 10.3390/biomedicines13030636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2025] [Revised: 02/28/2025] [Accepted: 03/04/2025] [Indexed: 03/29/2025] Open
Abstract
Background: While long-term opioid therapy is a widely utilized strategy for managing chronic pain, many patients have understandable questions and concerns regarding its safety, efficacy, and potential for dependency and addiction. Providing clear, accurate, and reliable information is essential for fostering patient understanding and acceptance. Generative artificial intelligence (AI) applications offer promising avenues for delivering patient education in healthcare. This study evaluates the reliability, accuracy, and comprehensibility of ChatGPT's responses to common patient inquiries about long-term opioid therapy. Methods: An expert panel selected thirteen frequently asked questions regarding long-term opioid therapy based on the authors' clinical experience in managing chronic pain patients and a targeted review of patient education materials. Questions were prioritized based on prevalence in patient consultations, relevance to treatment decision-making, and the complexity of information typically required to address them comprehensively. We assessed comprehensibility by implementing the multimodal generative AI Copilot (Microsoft 365 Copilot Chat). Spanning three domains-pre-therapy, during therapy, and post-therapy-each question was submitted to GPT-4.0 with the prompt "If you were a physician, how would you answer a patient asking…". Ten pain physicians and two non-healthcare professionals independently assessed the responses using a Likert scale to rate reliability (1-6 points), accuracy (1-3 points), and comprehensibility (1-3 points). Results: Overall, ChatGPT's responses demonstrated high reliability (5.2 ± 0.6) and good comprehensibility (2.8 ± 0.2), with most answers meeting or exceeding predefined thresholds. Accuracy was moderate (2.7 ± 0.3), with lower performance on more technical topics like opioid tolerance and dependency management. Conclusions: While AI applications exhibit significant potential as a supplementary tool for patient education on long-term opioid therapy, limitations in addressing highly technical or context-specific queries underscore the need for ongoing refinement and domain-specific training. Integrating AI systems into clinical practice should involve collaboration between healthcare professionals and AI developers to ensure safe, personalized, and up-to-date patient education in chronic pain management.
Collapse
Affiliation(s)
- Giuliano Lo Bianco
- Anesthesiology and Pain Department, Foundation G. Giglio Cefalù, 90015 Palermo, Italy
| | - Christopher L. Robinson
- Anesthesiology, Perioperative, and Pain Medicine, Brigham and Women’s Hospital, Harvard Medical School, Harvard University, Boston, MA 02115, USA;
| | - Francesco Paolo D’Angelo
- Department of Anaesthesia, Intensive Care and Emergency, University Hospital Policlinico Paolo Giaccone, 90127 Palermo, Italy;
| | - Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry “Scuola Medica Salernitana”, University of Salerno, 84081 Baronissi, Italy;
| | - Silvia Natoli
- Department of Clinical-Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy;
- Pain Unit, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
| | - Emanuele Sinagra
- Gastroenterology and Endoscopy Unit, Fondazione Istituto San Raffaele Giglio, 90015 Cefalù, Italy;
| | - Sebastiano Mercadante
- Main Regional Center for Pain Relief and Supportive/Palliative Care, La Maddalena Cancer Center, Via San Lorenzo 312, 90146 Palermo, Italy;
| | - Filippo Drago
- Department of Biomedical and Biotechnological Sciences, University of Catania, 95124 Catania, Italy;
| |
Collapse
|
21
|
Woo JJ, Yang AJ, Olsen RJ, Hasan SS, Nawabi DH, Nwachukwu BU, Williams RJ, Ramkumar PN. Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Noncustom Models for Evidence-Based Medicine. Arthroscopy 2025; 41:565-573.e6. [PMID: 39521391 DOI: 10.1016/j.arthro.2024.10.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/21/2024] [Revised: 10/27/2024] [Accepted: 10/27/2024] [Indexed: 11/16/2024]
Abstract
PURPOSE To show the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information using an anterior cruciate ligament (ACL) injury case. METHODS A set of 100 questions and answers based on the 2022 AAOS ACL guidelines were curated. Closed-source (OpenAI GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama 3 8B/70B and Mistral 8×7B) were asked questions in base form and again with AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with artificial intelligence (AI) agents and reevaluated. Two fellowship-trained surgeons blindly evaluated the accuracy of the responses of each cohort. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores were calculated to assess semantic similarity in the responses. RESULTS All noncustom LLM models started below 60% accuracy. Applying RAG improved the accuracy of every model by an average of 39.7%. The highest-performing model with RAG alone was Meta's open-source Llama 3 70B (94%). The highest-performing model with RAG and AI agents was OpenAI's GPT-4 (95%). CONCLUSIONS RAG improved accuracy by an average of 39.7%, with the highest accuracy rate of 94% for Llama 3 70B. Incorporating AI agents into a previously RAG-augmented LLM improved GPT-4's accuracy rate to 95%. Thus, agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis. CLINICAL RELEVANCE Despite literature surrounding the use of LLMs in medicine, there has been considerable and appropriate skepticism given the variably accurate response rates. This study establishes the groundwork to identify whether custom modifications to LLMs using RAG and agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, online medical information commonly sought in popular LLMs, such as ChatGPT, can be standardized and provide relevant online medical information to better support shared decision making between surgeon and patient.
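The abstract above hinges on retrieval-augmented generation: guideline passages are retrieved and prepended to the model's prompt. The sketch below shows that general pattern with a simple TF-IDF retriever; the guideline snippets and retrieval method are simplified stand-ins, not the pipeline used in the study.

```python
# Simplified sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant guideline passages for a question and prepend them to the prompt.
# The guideline snippets and TF-IDF retriever are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

guideline_chunks = [
    "Physical therapy is recommended before and after ACL reconstruction.",
    "Autograft is generally preferred over allograft in young, active patients.",
    "Functional bracing after ACL reconstruction shows limited added benefit.",
]

question = "Should my teenage athlete receive an autograft or allograft?"

vectorizer = TfidfVectorizer().fit(guideline_chunks + [question])
chunk_vecs = vectorizer.transform(guideline_chunks)
query_vec = vectorizer.transform([question])

# Keep the top-k most similar chunks as retrieved context.
scores = cosine_similarity(query_vec, chunk_vecs)[0]
top_k = [guideline_chunks[i] for i in scores.argsort()[::-1][:2]]

augmented_prompt = (
    "Answer using only the guideline excerpts below.\n\n"
    + "\n".join(f"- {c}" for c in top_k)
    + f"\n\nQuestion: {question}"
)
print(augmented_prompt)  # this prompt would then be sent to the LLM
```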
Collapse
Affiliation(s)
- Joshua J Woo
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
| | - Andrew J Yang
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
| | - Reena J Olsen
- Tufts University School of Medicine, Boston, Massachusetts, U.S.A
| | | | | | | | | | | |
Collapse
|
22
|
Giacobbe DR, Marelli C, La Manna B, Padua D, Malva A, Guastavino S, Signori A, Mora S, Rosso N, Campi C, Piana M, Murgia Y, Giacomini M, Bassetti M. Advantages and limitations of large language models for antibiotic prescribing and antimicrobial stewardship. NPJ ANTIMICROBIALS AND RESISTANCE 2025; 3:14. [PMID: 40016394 PMCID: PMC11868396 DOI: 10.1038/s44259-025-00084-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 02/06/2025] [Indexed: 03/01/2025]
Abstract
Antibiotic prescribing requires balancing optimal treatment for patients with reducing antimicrobial resistance. There is a lack of standardization in research on using large language models (LLMs) for supporting antibiotic prescribing, necessitating more efforts to identify biases and misinformation in their outputs. Educating future medical professionals on these aspects is crucial for ensuring the proper use of LLMs for supporting antibiotic prescribing, providing a deeper understanding of their strengths and limitations.
Collapse
Affiliation(s)
- Daniele Roberto Giacobbe
- Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy.
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy.
| | - Cristina Marelli
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Bianca La Manna
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Donatella Padua
- Departmental Faculty of Medicine, UniCamillus - International University of Health and Medical Science, Rome, Italy
| | - Alberto Malva
- Italian Interdisciplinary Society for Primary Care, Bari, Italy
| | | | - Alessio Signori
- Section of Biostatistics, Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy
- IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Sara Mora
- UO Information and Communication Technologies, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Nicola Rosso
- UO Information and Communication Technologies, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Cristina Campi
- Department of Mathematics (DIMA), University of Genoa, Genoa, Italy
- Life Science Computational Laboratory (LISCOMP), IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Michele Piana
- Department of Mathematics (DIMA), University of Genoa, Genoa, Italy
- Life Science Computational Laboratory (LISCOMP), IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| | - Ylenia Murgia
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Mauro Giacomini
- Department of Informatics, Bioengineering, Robotics and System Engineering (DIBRIS), University of Genoa, Genoa, Italy
| | - Matteo Bassetti
- Department of Health Sciences (DISSAL), University of Genoa, Genoa, Italy
- UO Clinica Malattie Infettive, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
| |
Collapse
|
23
|
Bignami EG, Russo M, Bellini V. Reclaiming Patient-Centered Care: How Intelligent Time is Redefining Healthcare Priorities. J Med Syst 2025; 49:30. [PMID: 39982622 DOI: 10.1007/s10916-025-02163-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2025] [Accepted: 02/12/2025] [Indexed: 02/22/2025]
Affiliation(s)
- Elena Giovanna Bignami
- Anesthesiology, Critical Care and Pain Medicine Division, Department of Medicine and Surgery, University of Parma, Viale Gramsci 14, 43126, Parma, Italy.
| | - Michele Russo
- Anesthesiology, Critical Care and Pain Medicine Division, Department of Medicine and Surgery, University of Parma, Viale Gramsci 14, 43126, Parma, Italy
| | - Valentina Bellini
- Anesthesiology, Critical Care and Pain Medicine Division, Department of Medicine and Surgery, University of Parma, Viale Gramsci 14, 43126, Parma, Italy
| |
Collapse
|
24
|
Lo Bianco G, Cascella M, Li S, Day M, Kapural L, Robinson CL, Sinagra E. Reliability, Accuracy, and Comprehensibility of AI-Based Responses to Common Patient Questions Regarding Spinal Cord Stimulation. J Clin Med 2025; 14:1453. [PMID: 40094896 PMCID: PMC11899866 DOI: 10.3390/jcm14051453] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2025] [Revised: 02/06/2025] [Accepted: 02/19/2025] [Indexed: 03/19/2025] Open
Abstract
Background: Although spinal cord stimulation (SCS) is an effective treatment for managing chronic pain, many patients have understandable questions and concerns regarding this therapy. Artificial intelligence (AI) has shown promise in delivering patient education in healthcare. This study evaluates the reliability, accuracy, and comprehensibility of ChatGPT's responses to common patient inquiries about SCS. Methods: Thirteen commonly asked questions regarding SCS were selected based on the authors' clinical experience managing chronic pain patients and a targeted review of patient education materials and relevant medical literature. The questions were prioritized based on their frequency in patient consultations, relevance to decision-making about SCS, and the complexity of the information typically required to comprehensively address the questions. These questions spanned three domains: pre-procedural, intra-procedural, and post-procedural concerns. Responses were generated using GPT-4.0 with the prompt "If you were a physician, how would you answer a patient asking…". Responses were independently assessed by 10 pain physicians and two non-healthcare professionals using a Likert scale for reliability (1-6 points), accuracy (1-3 points), and comprehensibility (1-3 points). Results: ChatGPT's responses demonstrated strong reliability (5.1 ± 0.7) and comprehensibility (2.8 ± 0.2), with 92% and 98% of responses, respectively, meeting or exceeding our predefined thresholds. Accuracy was 2.7 ± 0.3, with 95% of responses rated sufficiently accurate. General queries, such as "What is spinal cord stimulation?" and "What are the risks and benefits?", received higher scores compared to technical questions like "What are the different types of waveforms used in SCS?". Conclusions: ChatGPT can be implemented as a supplementary tool for patient education, particularly in addressing general and procedural queries about SCS. However, the AI's performance was less robust in addressing highly technical or nuanced questions.
Collapse
Affiliation(s)
- Giuliano Lo Bianco
- Anesthesiology and Pain Department, Foundation G. Giglio Cefalù, 90015 Palermo, Italy;
| | - Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry “Scuola Medica Salernitana”, University of Salerno, 84081 Baronissi, Italy
| | - Sean Li
- National Spine and Pain Centers, Shrewsbury, NJ 07702, USA;
| | - Miles Day
- Department of Anesthesiology, Texas Tech University Health Sciences Center, Lubbock, TX 79430, USA;
| | | | - Christopher L. Robinson
- Anesthesiology, Perioperative, and Pain Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA
| | - Emanuele Sinagra
- Gastroenterology and Endoscopy Unit, Fondazione Istituto San Raffaele Giglio, 90015 Cefalù, Italy;
| |
Collapse
|
25
|
Perogamvros L, Rochas V, Beau JB, Sterpenich V, Bayer L. The cathartic dream: Using a large language model to study a new type of functional dream in healthy and clinical populations. J Sleep Res 2025:e70001. [PMID: 39924340 DOI: 10.1111/jsr.70001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2024] [Revised: 01/12/2025] [Accepted: 01/28/2025] [Indexed: 02/11/2025]
Abstract
According to some theories of emotion regulation, dreams could modify negative emotions and ultimately reduce their intensity. We introduce here the idea of cathartic dream, a specific and separate type of emotional dream, which is characterized by a dynamic plot with emotional twists, and where negative emotions are expressed and ultimately decreased. This process would reflect psychological relief (catharsis according to the Aristotelian definition) and fulfil an emotion regulation function. We developed and validated a tool using a large language model to emotionally categorize the different dreams from dream diaries. Based on this tool, we were able to detect the prevalence of cathartic dreams in datasets of both healthy participants and patients with nightmares. Additionally, we observed the increase of cathartic dreams during 2 weeks of imagery rehearsal therapy and targeted memory reactivation during rapid eye movement sleep. We also demonstrate how the increase of cathartic dreams correlates significantly with the decrease of depression scores in patients with nightmares under therapy, thus supporting their likely functional role in well-being and their distinct nature among other emotional dreams.
Collapse
Affiliation(s)
- Lampros Perogamvros
- Center for Sleep Medicine, Department of Psychiatry, Geneva University Hospitals, Geneva, Switzerland
- Department of Basic Neurosciences, University of Geneva, Geneva, Switzerland
| | - Vincent Rochas
- M/EEG & Neuromod Platform, Fondation Campus Biotech Geneva, Geneva, Switzerland
| | | | - Virginie Sterpenich
- Department of Basic Neurosciences, University of Geneva, Geneva, Switzerland
| | - Laurence Bayer
- Center for Sleep Medicine, Department of Psychiatry, Geneva University Hospitals, Geneva, Switzerland
- Department of Basic Neurosciences, University of Geneva, Geneva, Switzerland
| |
Collapse
|
26
|
Alibudbud RC, Aruta JJBR, Sison KA, Guinto RR. Artificial intelligence in the era of planetary health: insights on its application for the climate change-mental health nexus in the Philippines. Int Rev Psychiatry 2025; 37:21-32. [PMID: 40035376 DOI: 10.1080/09540261.2024.2363373] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/22/2024] [Accepted: 05/29/2024] [Indexed: 03/05/2025]
Abstract
This review explores the transformative potential of Artificial Intelligence (AI) in light of evolving threats to planetary health, particularly the dangers posed by the climate crisis and its emerging mental health impacts, in the context of a climate-vulnerable country such as the Philippines. This paper describes the country's mental health system, outlines the chronic systemic challenges that it faces, and discusses the intensifying and widening impacts of climate change on mental health. Integrated mental healthcare must be part of the climate adaptation response, particularly for vulnerable populations. AI holds promise for mental healthcare in the Philippines as a tool that could help address the shortage of mental health professionals, improve service accessibility, and provide direct services in climate-affected communities. However, the incorporation of AI into mental healthcare also presents significant challenges, such as potentially worsening existing mental health inequities due to unequal access to resources and technologies, data privacy concerns, and potential AI algorithm biases. It is crucial to approach AI integration with ethical consideration and responsible implementation to harness its benefits, mitigate potential risks, and ensure inclusivity in mental healthcare delivery, especially in the era of a warming planet.
Collapse
Affiliation(s)
- Rowalt C Alibudbud
- Department of Sociology and Behavioral Sciences, De La Salle University, Manila, Philippines
| | | | - Kevin Anthony Sison
- St. Luke's Medical Center College of Medicine, William H. Quasha Memorial, Quezon City, Philippines
| | - Renzo R Guinto
- St. Luke's Medical Center College of Medicine, William H. Quasha Memorial, Quezon City, Philippines
- SingHealth Duke-NUS Global Health Institute, Duke-NUS Medical School, National University of Singapore, Singapore
| |
Collapse
|
27
|
Landau M, Kroumpouzos G, Goldust M. Large Language Models in Cosmetic Dermatology. J Cosmet Dermatol 2025; 24:e70044. [PMID: 39936220 DOI: 10.1111/jocd.70044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2024] [Revised: 01/03/2025] [Accepted: 01/31/2025] [Indexed: 02/13/2025]
Affiliation(s)
- Marina Landau
- Arena Dermatology and Department of Plastic Surgery, Shamir Medical Center, Be'er Ya'akov, Israel
| | - George Kroumpouzos
- Department of Dermatology, Warren Alpert Medical School at Brown University, Providence, Rhode Island, USA
- GK Dermatology PC, South Weymouth, Massachusetts, USA
| | - Mohamad Goldust
- Department of Dermatology, Yale University School of Medicine, New Haven, Connecticut, USA
| |
Collapse
|
28
|
Gencer G, Gencer K. Large Language Models in Healthcare: A Bibliometric Analysis and Examination of Research Trends. J Multidiscip Healthc 2025; 18:223-238. [PMID: 39844924 PMCID: PMC11750729 DOI: 10.2147/jmdh.s502351] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2024] [Accepted: 01/07/2025] [Indexed: 01/24/2025] Open
Abstract
Background The integration of large language models (LLMs) in healthcare has generated significant interest due to their potential to improve diagnostic accuracy, personalization of treatment, and patient care efficiency. Objective This study aims to conduct a comprehensive bibliometric analysis to identify current research trends, main themes, and future directions regarding LLM applications in the healthcare sector. Methods A systematic scan of publications up to 08.05.2024 was carried out in the Web of Science database. Using bibliometric tools such as VOSviewer and CiteSpace, we analyzed publication counts, citations, co-authorship, keyword co-occurrence, and thematic development to map the intellectual landscape and collaborative networks in this field. Results The analysis included more than 500 articles published between 2021 and 2024. The United States, Germany, and the United Kingdom were the top contributors to this field. The study highlights that neural network applications in diagnostic imaging, natural language processing for clinical documentation, and patient data applications in general internal medicine, radiology, medical informatics, healthcare services, surgery, oncology, ophthalmology, neurology, orthopedics, and psychiatry have seen significant growth in publications over the past two years. Keyword trend analysis revealed emerging sub-themes such as clinical research, artificial intelligence, ChatGPT, education, natural language processing, clinical management, virtual reality, and chatbots, indicating a shift toward addressing the broader implications of LLM applications in healthcare. Conclusion The use of LLMs in healthcare is an expanding field with significant academic and clinical interest. This bibliometric analysis not only maps the current state of the research but also identifies important areas that require further research and development. Continued advances in this field are expected to significantly impact future healthcare applications, with a focus on increasing the accuracy and personalization of patient care through advanced data analytics.
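The keyword co-occurrence mapping described above rests on a simple computation: counting how often two author keywords appear on the same article. A toy sketch of that counting step is shown below; the keyword lists are placeholders, not the study's data, and tools such as VOSviewer add clustering and visualization on top of such counts.

```python
# Toy sketch of keyword co-occurrence counting, the basis of co-occurrence maps
# built by tools such as VOSviewer. Keyword lists are placeholders.
from collections import Counter
from itertools import combinations

article_keywords = [
    ["large language models", "chatgpt", "medical education"],
    ["chatgpt", "natural language processing", "clinical documentation"],
    ["large language models", "chatgpt", "radiology"],
]

pair_counts = Counter()
for keywords in article_keywords:
    for a, b in combinations(sorted(set(keywords)), 2):
        pair_counts[(a, b)] += 1

for (a, b), n in pair_counts.most_common(3):
    print(f"{a} <-> {b}: {n}")
```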
Collapse
Affiliation(s)
- Gülcan Gencer
- Department of Biostatistics and Medical Informatics, Afyonkarahisar Health Sciences University, Faculty of Medicine, Afyonkarahisar, Turkey
| | - Kerem Gencer
- Department of Computer Engineering, Afyon Kocatepe University, Faculty of Engineering, Afyonkarahisar, Turkey
| |
Collapse
|
29
|
Li X, Shu Q, Kong C, Wang J, Li G, Fang X, Lou X, Yu G. An Intelligent System for Classifying Patient Complaints Using Machine Learning and Natural Language Processing: Development and Validation Study. J Med Internet Res 2025; 27:e55721. [PMID: 39778195 PMCID: PMC11754990 DOI: 10.2196/55721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/28/2024] [Accepted: 11/04/2024] [Indexed: 01/11/2025] Open
Abstract
BACKGROUND Accurate classification of patient complaints is crucial for enhancing patient satisfaction management in health care settings. Traditional manual methods for categorizing complaints often lack efficiency and precision. Thus, there is a growing demand for advanced and automated approaches to streamline the classification process. OBJECTIVE This study aimed to develop and validate an intelligent system for automatically classifying patient complaints using machine learning (ML) and natural language processing (NLP) techniques. METHODS An ML-based NLP approach was proposed to extract frequently occurring words expressing dissatisfaction related to departments, staff, and key treatment procedures. A dataset containing 1465 complaint records from 2019 to 2023 was used for training and validation, with an additional 376 complaints from Hangzhou Cancer Hospital serving as an external test set. Complaints were categorized into 4 types: communication problems, diagnosis and treatment issues, management problems, and sense of responsibility concerns. The imbalanced data were balanced using the Synthetic Minority Oversampling Technique (SMOTE) algorithm to ensure equal representation across all categories. A total of 3 ML algorithms (Multifactor Logistic Regression, Multinomial Naive Bayes, and Support Vector Machines [SVM]) were used for model training and validation. The best-performing model was tested using 5-fold cross-validation on external data. RESULTS The original dataset consisted of 719, 376, 260, and 86 records for communication problems, diagnosis and treatment issues, management problems, and sense of responsibility concerns, respectively. The Multifactor Logistic Regression and SVM models achieved weighted average accuracies of 0.89 and 0.93 in the training set, and 0.83 and 0.87 in the internal test set, respectively. N-gram-level term frequency-inverse document frequency (TF-IDF) did not significantly improve classification performance, yielding only a marginal 1% increase in precision, recall, and F1-score (from 0.91 to 0.92) when n=2 was used. The SVM algorithm performed best in prediction, achieving an average accuracy of 0.91 on the external test set with a 95% CI of 0.87-0.97. CONCLUSIONS The NLP-driven SVM algorithm demonstrates effective performance in automatically categorizing patient complaint texts. It showed superior performance in both internal and external test sets for communication and management problems. However, caution is advised when using it to classify sense of responsibility complaints. This approach holds promise for implementation in medical institutions with high complaint volumes and limited resources for addressing patient feedback.
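The pipeline described above (TF-IDF features, SMOTE balancing, and an SVM classifier with cross-validation) can be sketched compactly with scikit-learn and imbalanced-learn. The complaint texts, labels, and parameter choices below are toy placeholders, not the study's data or settings.

```python
# Sketch of the described approach: TF-IDF features, SMOTE to balance the four
# complaint categories, and a linear SVM, evaluated with 5-fold cross-validation.
# The example complaints and labels are toy placeholders.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

complaints = [
    "The doctor never explained the side effects to me",
    "I was given the wrong medication dose",
    "The billing office lost my paperwork twice",
    "Nobody followed up after my discharge",
] * 10  # repeated only so the toy dataset is large enough for CV

labels = (["communication", "diagnosis_treatment",
           "management", "responsibility"] * 10)

pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("smote", SMOTE(random_state=42, k_neighbors=3)),
    ("svm", LinearSVC()),
])

scores = cross_val_score(pipeline, complaints, labels, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```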
Collapse
Affiliation(s)
- Xiadong Li
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hang Zhou, China
| | - Qiang Shu
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hang Zhou, China
| | - Canhong Kong
- Patient Service Surveillance Office, Medical Information Department, Hangzhou Red Cross Hospital, Hang Zhou, China
| | - Jinhu Wang
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hang Zhou, China
| | - Gang Li
- Department of Radiation Oncology, Zhe Jiang Xiaoshan hospital, Hangzhou Normal University, Hang Zhou, China
| | - Xin Fang
- Hospital Management Office, Hangzhou Cancer Hospital, Hang Zhou, China
| | - Xiaomin Lou
- Patient Service Surveillance Office, Hangzhou Red Cross Hospital, Hang Zhou, China
| | - Gang Yu
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hang Zhou, China
| |
Collapse
|
30
|
Cascella M, Shariff MN, Viswanath O, Leoni MLG, Varrassi G. Ethical Considerations in the Use of Artificial Intelligence in Pain Medicine. Curr Pain Headache Rep 2025; 29:10. [PMID: 39760779 DOI: 10.1007/s11916-024-01330-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/09/2024] [Indexed: 01/07/2025]
Abstract
Although the integration of artificial intelligence (AI) into medicine and healthcare holds transformative potential, significant challenges must be addressed. This technological innovation requires a commitment to ethical principles. Key issues concern autonomy, reliability, and bias. Furthermore, AI development must guarantee rigorous data privacy and security standards. Effective AI implementation demands thorough validation, transparency, and the involvement of multidisciplinary teams to oversee ethical considerations. These issues also concern pain medicine, where careful assessment of subjective experiences and individualized care are crucial. Notably, in this rapidly evolving technological landscape, policy-making plays a pivotal role in establishing rules and regulations. Regulatory frameworks, such as the European Union's Artificial Intelligence Act and recent U.S. executive orders, provide essential guidelines for the responsible use of AI. This step is crucial for balancing innovation with rigorous ethical standards, ultimately leveraging AI's considerable benefits. As the field evolves rapidly and concepts like algorethics and data ethics become more widespread, the scientific community is increasingly recognizing the need for specialists in this area, such as AI Ethics Specialists.
Collapse
Affiliation(s)
- Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry "Scuola Medica Salernitana", University of Salerno, Via S. Allende, Baronissi, 84081, Italy.
| | | | - Omar Viswanath
- Department of Anesthesiology, Creighton University School of Medicine, Phoenix, AZ, USA
| | - Matteo Luigi Giuseppe Leoni
- Department of Medical and Surgical Sciences and Translational Medicine, Sapienza University of Roma, Roma, Italy
| | | |
Collapse
|
31
|
Agbareia R, Omar M, Zloto O, Glicksberg BS, Nadkarni GN, Klang E. Multimodal LLMs for retinal disease diagnosis via OCT: few-shot versus single-shot learning. Ther Adv Ophthalmol 2025; 17:25158414251340569. [PMID: 40400723 PMCID: PMC12093016 DOI: 10.1177/25158414251340569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2024] [Accepted: 04/15/2025] [Indexed: 05/23/2025] Open
Abstract
Background and aim Multimodal large language models (LLMs) have shown potential in processing both text and image data for clinical applications. This study evaluated their diagnostic performance in identifying retinal diseases from optical coherence tomography (OCT) images. Methods We assessed the diagnostic accuracy of GPT-4o and Claude Sonnet 3.5 using two public OCT datasets (OCTID, OCTDL) containing expert-labeled images of four pathological conditions and normal retinas. Both models were tested using single-shot and few-shot prompts, for a total of 3088 model API calls. Statistical analyses were performed to evaluate differences in overall and condition-specific performance. Results GPT-4o's accuracy improved from 56.29% with single-shot prompts to 73.08% with few-shot prompts (p < 0.001). Similarly, Claude Sonnet 3.5 increased from 40.03% to 70.98% using the same approach (p < 0.001). Condition-specific analyses revealed similar trends, with absolute improvements ranging from 2% to 64%. These findings were consistent across the validation dataset. Conclusion Few-shot-prompted multimodal LLMs show promise for clinical integration, particularly in identifying normal retinas, which could help streamline referral processes in primary care. While these models fall short of the diagnostic accuracy reported in established deep learning literature, they offer simple, effective tools for assisting in routine retinal disease diagnosis. Future research should focus on further validation and on integrating clinical text data with imaging.
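The few-shot prompting described above amounts to sending labeled example images before the query image in a single multimodal request. The sketch below shows that structure with the OpenAI Python SDK; the file paths, labels, and model name are placeholders, not the study's data.

```python
# Sketch of a few-shot multimodal prompt: labeled example OCT images are sent
# before the query image so the model can pattern-match against them.
# File paths, labels, and the model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def image_part(path: str) -> dict:
    """Encode a local image as a data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

few_shot = [
    ("examples/drusen.jpg", "Diagnosis: drusen"),
    ("examples/normal.jpg", "Diagnosis: normal retina"),
]

content = [{"type": "text", "text": "Classify the final OCT image. Answer with one diagnosis."}]
for path, label in few_shot:
    content += [image_part(path), {"type": "text", "text": label}]
content.append(image_part("query/unknown_oct.jpg"))

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```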
Collapse
Affiliation(s)
- Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
- Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
| | - Mahmud Omar
- Maccabi Healthcare Services, Tel Aviv, Israel
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY 10029-6574, USA
| | - Ofira Zloto
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Goldschleger Eye Institute, Sheba Medical Center, Tel Hashomer, Israel
| | - Benjamin S. Glicksberg
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Girish N. Nadkarni
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Eyal Klang
- The Windreich Department of Artificial Intelligence and Human Health, Mount Sinai Medical Center, New York, NY, USA
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
Collapse
|
32
|
Zhang L, Zhao Q, Zhang D, Song M, Zhang Y, Wang X. Application of large language models in healthcare: A bibliometric analysis. Digit Health 2025; 11:20552076251324444. [PMID: 40035041 PMCID: PMC11873863 DOI: 10.1177/20552076251324444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2024] [Accepted: 02/11/2025] [Indexed: 03/05/2025] Open
Abstract
Objective The objective is to provide an overview of the application of large language models (LLMs) in healthcare by employing a bibliometric analysis methodology. Method We performed a comprehensive search for peer-reviewed English-language articles using PubMed and Web of Science. The selected articles were subsequently clustered and analyzed textually, with a focus on lexical co-occurrences, country-level and inter-author collaborations, and other relevant factors. This textual analysis produced high-level concept maps that illustrate specific terms and their interconnections. Findings Our final sample comprised 371 English-language journal articles. The study revealed a sharp rise in the number of publications related to the application of LLMs in healthcare. However, the development is geographically imbalanced, with a higher concentration of articles originating from developed countries like the United States, Italy, and Germany, which also exhibit strong inter-country collaboration. LLMs are applied across various specialties, with researchers investigating their use in medical education, diagnosis, treatment, administrative reporting, and enhancing doctor-patient communication. Nonetheless, significant concerns persist regarding the risks and ethical implications of LLMs, including the potential for gender and racial bias, as well as the lack of transparency in the training datasets, which can lead to inaccurate or misleading responses. Conclusion While the application of LLMs in healthcare is promising, the widespread adoption of LLMs in practice requires further improvements in their standardization and accuracy. It is critical to establish clear accountability guidelines, develop a robust regulatory framework, and ensure that training datasets are based on evidence-based sources to minimize risk and ensure ethical and reliable use.
Collapse
Affiliation(s)
- Lanping Zhang
- Department of the Third Pulmonary Disease, Shenzhen Third People's Hospital, Shenzhen, Guangdong Province, China
- Shenzhen Clinical Research Center for Tuberculosis, Shenzhen, Guangdong Province, China
| | - Qing Zhao
- Acacia Lab for Implementation Science, School of Public Health Management, Southern Medical University, Guangzhou, Guangdong, China
| | - Dandan Zhang
- Department of the Third Pulmonary Disease, Shenzhen Third People's Hospital, Shenzhen, Guangdong Province, China
- Shenzhen Clinical Research Center for Tuberculosis, Shenzhen, Guangdong Province, China
| | - Meijuan Song
- Department of the Third Pulmonary Disease, Shenzhen Third People's Hospital, Shenzhen, Guangdong Province, China
- Shenzhen Clinical Research Center for Tuberculosis, Shenzhen, Guangdong Province, China
| | - Yu Zhang
- School of Humanities Changzhou Vocational Institute of Textile and Garment Changzhou, China
| | - Xiufen Wang
- Department of the Third Pulmonary Disease, Shenzhen Third People's Hospital, Shenzhen, Guangdong Province, China
- Shenzhen Clinical Research Center for Tuberculosis, Shenzhen, Guangdong Province, China
| |
Collapse
|
33
|
Mondal H, Tiu DN, Mondal S, Dutta R, Naskar A, Podder I. Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots. J Midlife Health 2025; 16:45-50. [PMID: 40330238 PMCID: PMC12052287 DOI: 10.4103/jmh.jmh_182_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Revised: 11/22/2024] [Accepted: 12/02/2024] [Indexed: 05/08/2025] Open
Abstract
Background The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of their information persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health queries instead of consulting a doctor. In this context, we evaluated the accuracy and readability of responses from six LLM chatbots to midlife health questions for men and women. Methods Twenty questions on midlife health were posed to six different LLM chatbots: ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and evaluated for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by averaging the four criteria. In addition, readability was analyzed using the Flesch-Kincaid Grade Level to determine how easily the information could be understood by the general population. Results Perplexity scored the highest in fluency (4.3 ± 1.78), while Meta AI scored the highest in coherence (4.26 ± 0.16), accuracy, and relevancy (4.35 ± 0.24). Overall, Meta AI scored the highest (4.28 ± 0.16), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest score (3.72 ± 0.19) (P < 0.0001). Perplexity showed the highest readability score (41.24 ± 10.57) and the lowest grade level (11.11 ± 1.93), meaning its text is the easiest to read and requires the lowest level of education. Conclusion LLM chatbots can answer midlife-related health questions with variable capabilities. Meta AI was the highest-scoring chatbot for addressing men's and women's midlife health questions, whereas Perplexity offered the highest readability for accessible information. Hence, LLM chatbots can be used as educational tools for midlife health by selecting an appropriate chatbot according to its capabilities.
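The readability analysis described above uses the Flesch-Kincaid Grade Level. A minimal sketch of that computation is shown below using the textstat package (an assumption; the authors may have computed the metric differently), with a placeholder response text rather than actual chatbot output.

```python
# Minimal sketch of a Flesch-Kincaid readability check on a chatbot response.
# Uses the textstat package (an assumption); the sample text is a placeholder.
import textstat

sample_response = (
    "During midlife, regular exercise, a balanced diet, and routine screening "
    "for blood pressure, cholesterol, and bone density help maintain health."
)

grade = textstat.flesch_kincaid_grade(sample_response)   # US school-grade level
ease = textstat.flesch_reading_ease(sample_response)     # higher = easier to read
print(f"Flesch-Kincaid grade: {grade:.1f}, reading ease: {ease:.1f}")
```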
Collapse
Affiliation(s)
- Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Devendra Nath Tiu
- Department of Physiology, Sheikh Bhikhari Medical College, Hazaribagh, Jharkhand, India
| | - Shaikat Mondal
- Department of Physiology, Raiganj Government Medical College and Hospital, Raiganj, West Bengal, India
| | - Rajib Dutta
- Department of Gynecology and Obstetrics, Diamond Harbour Government Medical College and Hospital, Diamond Harbour, West Bengal, India
| | - Avijit Naskar
- Department of General Medicine, Baruipur Sub-Divisional Hospital, Baruipur, West Bengal, India
| | - Indrashis Podder
- Department of Dermatology, College of Medicine and Sagore Dutta Hospital, Kolkata, West Bengal, India
| |
Collapse
|
34
|
Sprint G, Schmitter-Edgecombe M, Cook D. Building a Human Digital Twin (HDTwin) Using Large Language Models for Cognitive Diagnosis: Algorithm Development and Validation. JMIR Form Res 2024; 8:e63866. [PMID: 39715540 PMCID: PMC11704625 DOI: 10.2196/63866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2024] [Revised: 09/30/2024] [Accepted: 11/07/2024] [Indexed: 12/25/2024] Open
Abstract
BACKGROUND Human digital twins have the potential to change the practice of personalizing cognitive health diagnosis because these systems can integrate multiple sources of health information and influence into a unified model. Cognitive health is multifaceted, yet researchers and clinical professionals struggle to align diverse sources of information into a single model. OBJECTIVE This study aims to introduce a method called HDTwin, for unifying heterogeneous data using large language models. HDTwin is designed to predict cognitive diagnoses and offer explanations for its inferences. METHODS HDTwin integrates cognitive health data from multiple sources, including demographic, behavioral, ecological momentary assessment, n-back test, speech, and baseline experimenter testing session markers. Data are converted into text prompts for a large language model. The system then combines these inputs with relevant external knowledge from scientific literature to construct a predictive model. The model's performance is validated using data from 3 studies involving 124 participants, comparing its diagnostic accuracy with baseline machine learning classifiers. RESULTS HDTwin achieves a peak accuracy of 0.81 based on the automated selection of markers, significantly outperforming baseline classifiers. On average, HDTwin yielded accuracy=0.77, precision=0.88, recall=0.63, and Matthews correlation coefficient=0.57. In comparison, the baseline classifiers yielded average accuracy=0.65, precision=0.86, recall=0.35, and Matthews correlation coefficient=0.36. The experiments also reveal that HDTwin yields superior predictive accuracy when information sources are fused compared to single sources. HDTwin's chatbot interface provides interactive dialogues, aiding in diagnosis interpretation and allowing further exploration of patient data. CONCLUSIONS HDTwin integrates diverse cognitive health data, enhancing the accuracy and explainability of cognitive diagnoses. This approach outperforms traditional models and provides an interface for navigating patient information. The approach shows promise for improving early detection and intervention strategies in cognitive health.
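The abstract above describes serializing heterogeneous participant markers into text prompts for a large language model. The sketch below illustrates that prompt-fusion idea; the marker names, values, and prompt wording are illustrative assumptions, not the HDTwin system's actual code or data.

```python
# Illustrative sketch of prompt fusion: heterogeneous participant markers are
# serialized into one text prompt for an LLM to reason over. Marker names,
# values, and wording are assumptions, not the HDTwin implementation.
participant = {
    "age": 71,
    "n_back_accuracy": 0.62,
    "speech_pause_rate_per_min": 14.5,
    "ema_missed_prompts": 5,
    "sleep_hours_mean": 5.8,
}

prompt_lines = ["Participant markers:"]
prompt_lines += [f"- {name}: {value}" for name, value in participant.items()]
prompt_lines.append(
    "Based on these markers and established cognitive-aging literature, "
    "classify the participant as 'healthy' or 'cognitively impaired' and "
    "briefly justify the decision."
)
prompt = "\n".join(prompt_lines)
print(prompt)  # this prompt would be sent to the LLM backing the digital twin
```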
Collapse
Affiliation(s)
- Gina Sprint
- Department of Computer Science, Gonzaga University, Spokane, WA, United States
| | - Maureen Schmitter-Edgecombe
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
| | - Diane Cook
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, United States
| |
Collapse
|
35
|
Sabaner MC, Anguita R, Antaki F, Balas M, Boberg-Ans LC, Ferro Desideri L, Grauslund J, Hansen MS, Klefter ON, Potapenko I, Rasmussen MLR, Subhi Y. Opportunities and Challenges of Chatbots in Ophthalmology: A Narrative Review. J Pers Med 2024; 14:1165. [PMID: 39728077 DOI: 10.3390/jpm14121165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2024] [Revised: 12/18/2024] [Accepted: 12/19/2024] [Indexed: 12/28/2024] Open
Abstract
Artificial intelligence (AI) is becoming increasingly influential in ophthalmology, particularly through advancements in machine learning, deep learning, robotics, neural networks, and natural language processing (NLP). Among these, NLP-based chatbots are the most readily accessible and are driven by AI-based large language models (LLMs). These chatbots have facilitated new research avenues and have gained traction in both clinical and surgical applications in ophthalmology. They are also increasingly being utilized in studies on ophthalmology-related exams, particularly those containing multiple-choice questions (MCQs). This narrative review evaluates both the opportunities and the challenges of integrating chatbots into ophthalmology research, with separate assessments of studies involving open- and close-ended questions. While chatbots have demonstrated sufficient accuracy in handling MCQ-based studies, supporting their use in education, additional exam security measures are necessary. The research on open-ended question responses suggests that AI-based LLM chatbots could be applied across nearly all areas of ophthalmology. They have shown promise for addressing patient inquiries, offering medical advice, patient education, supporting triage, facilitating diagnosis and differential diagnosis, and aiding in surgical planning. However, the ethical implications, confidentiality concerns, physician liability, and issues surrounding patient privacy remain pressing challenges. Although AI has demonstrated significant promise in clinical patient care, it is currently most effective as a supportive tool rather than as a replacement for human physicians.
Collapse
Affiliation(s)
- Mehmet Cem Sabaner
- Department of Ophthalmology, Kastamonu University, Training and Research Hospital, 37150 Kastamonu, Türkiye
| | - Rodrigo Anguita
- Department of Ophthalmology, Inselspital, University Hospital Bern, University of Bern, 3010 Bern, Switzerland
- Moorfields Eye Hospital National Health Service Foundation Trust, London EC1V 2PD, UK
| | - Fares Antaki
- Moorfields Eye Hospital National Health Service Foundation Trust, London EC1V 2PD, UK
- The CHUM School of Artificial Intelligence in Healthcare, Montreal, QC H2X 0A9, Canada
- Cole Eye Institute, Cleveland Clinic, Cleveland, OH 44195, USA
| | - Michael Balas
- Department of Ophthalmology & Vision Sciences, University of Toronto, Toronto, ON M5T 2S8, Canada
| | | | - Lorenzo Ferro Desideri
- Department of Ophthalmology, Inselspital, University Hospital Bern, University of Bern, 3010 Bern, Switzerland
- Graduate School for Health Sciences, University of Bern, 3012 Bern, Switzerland
| | - Jakob Grauslund
- Department of Ophthalmology, Odense University Hospital, 5000 Odense, Denmark
- Department of Clinical Research, University of Southern Denmark, 5230 Odense, Denmark
- Department of Ophthalmology, Vestfold Hospital Trust, 3103 Tønsberg, Norway
| | | | - Oliver Niels Klefter
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
| | - Ivan Potapenko
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
| | - Marie Louise Roed Rasmussen
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
| | - Yousif Subhi
- Department of Clinical Research, University of Southern Denmark, 5230 Odense, Denmark
- Department of Ophthalmology, Rigshospitalet, 2100 Copenhagen, Denmark
- Department of Clinical Medicine, University of Copenhagen, 1172 Copenhagen, Denmark
| |
Collapse
|
36
|
Cascella M, Miranda B, Gagliardi C, Santaniello L, Mottola M, Mancusi A, Ferrara L, Monaco F, Gargano F, Perri F, Ottaiano A, Capuozzo M, Piazza O, Pepe S, Crispo A, Guida A, Salzano G, Varrassi G, Liguori L, Sabbatino F, The TRIAL Group. Dissecting the link between PD-1/PD-L1-based immunotherapy and cancer pain: mechanisms, research implications, and artificial intelligence perspectives. EXPLORATION OF IMMUNOLOGY 2024:802-821. [DOI: 10.37349/ei.2024.00174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/28/2024] [Accepted: 11/01/2024] [Indexed: 02/02/2025]
Abstract
Cancer-related pain is one of the most common complaints of cancer patients, especially those with advanced-stage disease and/or bone metastases. More effective therapeutic strategies are needed not only to improve the survival of cancer patients but also to relieve cancer-related pain. In the last decade, immune checkpoint inhibitor (ICI)-based immunotherapy targeting programmed cell death-1 (PD-1) and its ligand 1 (PD-L1) has revolutionized cancer care. Beyond its anticancer role, the PD-1/PD-L1 axis is involved in many other physiological processes. PD-L1 expression is found in both malignant and normal tissues, including the dorsal root ganglion and spinal cord. Through its interaction with PD-1, PD-L1 can modulate neuron excitability, leading to the suppression of inflammatory, neuropathic, and bone cancer pain. Because the intricate relationship between immunotherapy and pain remains to be fully dissected, this comprehensive review explores the complex relationship between PD-1/PD-L1-based immunotherapy and cancer-related pain. It delves into the potential mechanisms through which PD-1/PD-L1 immunotherapy might modulate pain pathways, including neuroinflammation, neuromodulation, opioid mechanisms, and bone processes. Understanding these mechanisms is crucial for developing future research directions to optimize pain management strategies in cancer patients. Finally, this article discusses the role of artificial intelligence (AI) in advancing research and clinical practice in this context. AI-based strategies, such as analyzing large datasets and creating predictive models, can identify patterns and correlations between PD-1/PD-L1 immunotherapy and pain. These tools can assist healthcare providers in tailoring treatment plans and pain management strategies to individual patients, ultimately improving outcomes and quality of life for those undergoing PD-1/PD-L1-based immunotherapy.
Collapse
Affiliation(s)
- Marco Cascella
- Anesthesia and Pain Management, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Brigida Miranda
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Carmen Gagliardi
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Lucia Santaniello
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Milena Mottola
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Alida Mancusi
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Laura Ferrara
- Anesthesia and Pain Management, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Federica Monaco
- Unit of Anesthesia, ASL Napoli 1 Centro, 80145 Naples, Italy
| | - Francesca Gargano
- Anesthesia and Intensive Care, U.O.C. Fondazione Policlinico Campus Bio-Medico, 00128 Roma, Italy
| | - Francesco Perri
- Medical and Experimental Head and Neck Oncology Unit, Istituto Nazionale Tumori Di Napoli, IRCCS “G. Pascale”, 80131 Naples, Italy
| | - Alessandro Ottaiano
- Unit of Innovative Therapies for Abdominal Metastases, Istituto Nazionale Tumori Di Napoli, IRCCS “G. Pascale”, 80131 Naples, Italy
| | | | - Ornella Piazza
- Anesthesia and Pain Management, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Stefano Pepe
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Anna Crispo
- Epidemiology and Biostatistics Unit, Istituto Nazionale Tumori Di Napoli, IRCCS “G. Pascale”, 80131 Naples, Italy
| | - Agostino Guida
- U.O.C. Odontostomatologia, A.O.R.N. A. Cardarelli, 80131 Naples, Italy
| | - Giovanni Salzano
- Maxillofacial Surgery Unit, Department of Neurosciences, Reproductive and Odontostomatological Sciences, University of Naples Federico II, 80138 Naples, Italy
| | - Giustino Varrassi
- Department of Research, Fondazione Paolo Procacci, 00193 Rome, Italy
| | - Luigi Liguori
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - Francesco Sabbatino
- Oncology Unit, Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
| | - The TRIAL Group
- The TRIAL (Try to Research and to Improve the Anticancer Links) Group, 82100 Benevento, Italy
| |
Collapse
|
37
|
Naz R, Akacı O, Erdoğan H, Açıkgöz A. Can large language models provide accurate and quality information to parents regarding chronic kidney diseases? J Eval Clin Pract 2024; 30:1556-1564. [PMID: 38959373 DOI: 10.1111/jep.14084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 06/24/2024] [Indexed: 07/05/2024]
Abstract
RATIONALE Artificial Intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. The use of these language models in various medical contexts is currently being studied. However, the performance and content quality of these language models have not been evaluated in specific medical fields. AIMS AND OBJECTIVES This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini, and Copilot in providing information to parents about chronic kidney diseases (CKD) and to compare the accuracy and quality of this information against a reference source. METHODS In this study, 40 frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated with reference to the Kidney Disease: Improving Global Outcomes guidelines. The accuracy of the responses generated by LLMs was assessed using F1, precision, and recall scores. The quality of the responses was evaluated using a five-point global quality score (GQS). RESULTS ChatGPT and Gemini achieved high F1 scores of 0.89 and 1, respectively, in the diagnosis and lifestyle categories, demonstrating significant success in generating accurate responses. Furthermore, ChatGPT and Gemini were successful in generating accurate responses with high precision values in the diagnosis and lifestyle categories. In terms of recall values, all LLMs exhibited strong performance in the diagnosis, treatment, and lifestyle categories. Average GQS scores for the responses generated were 3.46 ± 0.55, 1.93 ± 0.63, and 2.02 ± 0.69 for Gemini, ChatGPT 3.5, and Copilot, respectively. In all categories, Gemini performed better than ChatGPT and Copilot. CONCLUSION Although LLMs provide parents with high-accuracy information about CKD, their performance remains limited compared with that of a reference source. The limitations in the performance of LLMs can lead to misinformation and potential misinterpretations. Therefore, patients and parents should exercise caution when using these models.
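The F1, precision, and recall scores used here summarize how well an answer's statements overlap with the reference guideline. A minimal sketch of that arithmetic follows, assuming a hypothetical statement-level tally (true positives, false positives, false negatives) rather than the authors' actual scoring pipeline.

```python
# Hypothetical statement-level tally for one answer: tp = guideline facts the
# answer states correctly, fp = claims not supported by the guideline,
# fn = guideline facts the answer omits.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g., an answer covering 8 of 9 guideline facts while adding 1 unsupported claim
print(precision_recall_f1(tp=8, fp=1, fn=1))
```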
Collapse
Affiliation(s)
- Rüya Naz
- Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey
| | - Okan Akacı
- Clinic of Pediatric Nephrology, Bursa Yüksek Ihtisas Research and Training Hospital, University of Health Sciences, Bursa, Turkey
| | - Hakan Erdoğan
- Clinic of Pediatric Nephrology, Bursa City Hospital, Bursa, Turkey
| | - Ayfer Açıkgöz
- Department of Pediatric Nursing, Faculty of Health Sciences, Eskişehir Osmangazi University, Eskişehir, Turkey
| |
Collapse
|
38
|
Meyer A, Soleman A, Riese J, Streichert T. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum. Clin Chem Lab Med 2024; 62:2425-2434. [PMID: 38804035 DOI: 10.1515/cclm-2024-0246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2024] [Accepted: 05/13/2024] [Indexed: 05/29/2024]
Abstract
OBJECTIVES Laboratory medical reports are often not intuitively comprehensible to non-medical professionals. Given their recent advancements, easier accessibility, and remarkable performance on medical licensing exams, artificial intelligence-based chatbots are a likely resource for patients trying to understand their laboratory results. However, empirical studies assessing the efficacy of these chatbots in responding to real-life patient queries regarding laboratory medicine are scarce. METHODS This investigation therefore included 100 patient inquiries from an online health forum, specifically addressing complete blood count interpretation. The aim was to evaluate the proficiency of three artificial intelligence-based chatbots (ChatGPT, Gemini, and Le Chat) against the online responses from certified physicians. RESULTS The findings revealed that the chatbots' interpretations of laboratory results were inferior to those from online medical professionals. While the chatbots exhibited a higher degree of empathetic communication, they frequently produced erroneous or overly generalized responses to complex patient questions. The appropriateness of chatbot responses ranged from 51% to 64%, with 22% to 33% of responses overestimating patient conditions. A notable positive aspect was the chatbots' consistent inclusion of disclaimers regarding their non-medical nature and recommendations to seek professional medical advice. CONCLUSIONS The chatbots' interpretations of laboratory results from real patient queries highlight a dangerous dichotomy - a perceived trustworthiness potentially obscuring factual inaccuracies. Given the growing inclination towards self-diagnosis using AI platforms, further research on and improvement of these chatbots are imperative to increase patients' awareness and avoid future burdens on the healthcare system.
Collapse
Affiliation(s)
- Annika Meyer
- Institute of Clinical Chemistry, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany
| | - Ari Soleman
- Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany
| | - Janik Riese
- Institute of Pathology, Faculty of Medicine, RWTH Aachen University, Aachen, Germany
| | - Thomas Streichert
- Institute of Clinical Chemistry, Faculty of Medicine and University Hospital, University Hospital Cologne, Cologne, Germany
| |
Collapse
|
39
|
Graña-Castro O, Izquierdo E, Piñas-Mesa A, Menasalvas E, Chivato-Pérez T. Assessing the Impact of New Technologies on Managing Chronic Respiratory Diseases. J Clin Med 2024; 13:6913. [PMID: 39598056 PMCID: PMC11594345 DOI: 10.3390/jcm13226913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2024] [Revised: 11/11/2024] [Accepted: 11/14/2024] [Indexed: 11/29/2024] Open
Abstract
Chronic respiratory diseases (CRDs), including asthma and chronic obstructive pulmonary disease (COPD), represent significant global health challenges, contributing to substantial morbidity and mortality. As the prevalence of CRDs continues to rise, particularly in low-income countries, there is a pressing need for more efficient and personalized approaches to diagnosis and treatment. This article explores the impact of emerging technologies, particularly artificial intelligence (AI), on the management of CRDs. AI applications, including machine learning (ML), deep learning (DL), and large language models (LLMs), are transforming the landscape of CRD care, enabling earlier diagnosis, personalized treatment, and enhanced remote patient monitoring. The integration of AI with telehealth and wearable technologies further supports proactive interventions and improved patient outcomes. However, challenges remain, including issues related to data quality, algorithmic bias, and ethical concerns such as patient privacy and AI transparency. This paper evaluates the effectiveness, accessibility, and ethical implications of AI-driven tools in CRD management, offering insights into their potential to shape the future of respiratory healthcare. The integration of AI and advanced technologies in managing CRDs like COPD and asthma holds substantial potential for enhancing early diagnosis, personalized treatment, and remote monitoring, though challenges remain regarding data quality, ethical considerations, and regulatory oversight.
Collapse
Affiliation(s)
- Osvaldo Graña-Castro
- Departamento de Ciencias Médicas Básicas, Instituto de Medicina Molecular Aplicada (IMMA-Nemesio Díez), Facultad de Medicina, Universidad San Pablo-CEU, CEU Universities, 28925 Alcorcón, Spain; (O.G.-C.); (E.I.)
| | - Elena Izquierdo
- Departamento de Ciencias Médicas Básicas, Instituto de Medicina Molecular Aplicada (IMMA-Nemesio Díez), Facultad de Medicina, Universidad San Pablo-CEU, CEU Universities, 28925 Alcorcón, Spain; (O.G.-C.); (E.I.)
| | - Antonio Piñas-Mesa
- Departamento de Humanidades—Sección de Pensamiento Facultad de Humanidades y Ciencias de la Comunicación, Universidad San Pablo-CEU, CEU Universities, 28003 Madrid, Spain;
| | - Ernestina Menasalvas
- ETSI Informáticos, Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, 28223 Pozuelo, Spain;
| | - Tomás Chivato-Pérez
- Departamento de Ciencias Médicas Básicas, Instituto de Medicina Molecular Aplicada (IMMA-Nemesio Díez), Facultad de Medicina, Universidad San Pablo-CEU, CEU Universities, 28925 Alcorcón, Spain; (O.G.-C.); (E.I.)
| |
Collapse
|
40
|
Gill GS, Blair J, Litinsky S. Evaluating the Performance of ChatGPT 3.5 and 4.0 on StatPearls Oculoplastic Surgery Text- and Image-Based Exam Questions. Cureus 2024; 16:e73812. [PMID: 39691123 PMCID: PMC11650114 DOI: 10.7759/cureus.73812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Accepted: 10/27/2024] [Indexed: 12/19/2024] Open
Abstract
INTRODUCTION The emergence of large language models (LLMs) has led to significant interest in their potential use as medical assistive tools. Prior investigations have analyzed the overall comparative performance of LLM versions within different ophthalmology subspecialties. However, few investigations have characterized LLM performance on image-based questions, a recent advance in LLM capabilities. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) versions 3.5 and 4.0 on image-based and text-only questions using oculoplastic subspecialty questions from the StatPearls and OphthoQuestions question banks. METHODS This study utilized 343 text-only questions from StatPearls, 127 image-based questions from StatPearls, and 89 image-based questions from OphthoQuestions, all specific to oculoplastics. The information collected included correctness, the distribution of answers, and whether an additional prompt was necessary. Text-only performance was compared between ChatGPT-3.5 and ChatGPT-4.0, and text-only versus multimodal (image-based) performance was compared for ChatGPT-4.0. RESULTS ChatGPT-3.5 answered 56.85% (195/343) of text-only questions correctly, while ChatGPT-4.0 achieved 73.46% (252/343), a statistically significant difference in accuracy (p<0.05). The biserial correlation between ChatGPT-3.5 and human performance on the StatPearls question bank was 0.198, with a standard deviation of 0.195. When ChatGPT-3.5 was incorrect, average human correctness was 49.39% (SD 26.27%), and when it was correct, human correctness averaged 57.82% (SD 30.14%), with a t-statistic of 3.57 and a p-value of 0.0004. For ChatGPT-4.0, the biserial correlation was 0.226 (SD 0.213). When ChatGPT-4.0 was incorrect, human correctness averaged 45.49% (SD 24.85%), and when it was correct, human correctness was 57.02% (SD 29.75%), with a t-statistic of 4.28 and a p-value of 0.0006. On image-only questions, ChatGPT-4.0 correctly answered 56.94% (123/216), significantly lower than its performance on text-only questions (p<0.05). DISCUSSION AND CONCLUSION This study shows that ChatGPT-4.0 performs better on oculoplastic subspecialty questions than prior versions. However, significant challenges remain regarding accuracy, particularly when integrating image-based prompts. While these models show promise within medical education, further progress must be made regarding LLM reliability, and caution should be exercised until further advancement is achieved.
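The biserial correlation reported above relates a binary outcome (whether the chatbot answered a question correctly) to a continuous one (the percentage of human test-takers answering it correctly). A minimal sketch follows, assuming SciPy and fabricated per-question data; it is not the study's analysis code.

```python
# Sketch of the correctness-vs-difficulty analysis: a point-biserial correlation
# between a binary "chatbot correct" indicator and the percentage of human
# test-takers answering each question correctly. All values are fabricated.
import numpy as np
from scipy import stats

chatbot_correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])            # per question
human_pct_correct = np.array([62, 41, 58, 70, 48, 66, 39, 55, 73, 44])

r, p = stats.pointbiserialr(chatbot_correct, human_pct_correct)
print(f"point-biserial r = {r:.3f}, p = {p:.4f}")

# The abstract's t-test contrasts human correctness on questions the chatbot
# answered correctly versus incorrectly.
t, p_t = stats.ttest_ind(human_pct_correct[chatbot_correct == 1],
                         human_pct_correct[chatbot_correct == 0])
print(f"t = {t:.2f}, p = {p_t:.4f}")
```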
Collapse
Affiliation(s)
- Gurnoor S Gill
- Medical School, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
| | - Jacob Blair
- Ophthalmology, Larkin Community Hospital (LCH) Lake Erie College of Osteopathic Medicine (LECOM), Miami, USA
| | - Steven Litinsky
- Ophthalmology, Florida Atlantic University Charles E. Schmidt College of Medicine, Boca Raton, USA
| |
Collapse
|
41
|
Maroncelli R, Rizzo V, Pasculli M, Cicciarelli F, Macera M, Galati F, Catalano C, Pediconi F. Probing clarity: AI-generated simplified breast imaging reports for enhanced patient comprehension powered by ChatGPT-4o. Eur Radiol Exp 2024; 8:124. [PMID: 39477904 PMCID: PMC11525358 DOI: 10.1186/s41747-024-00526-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2024] [Accepted: 10/16/2024] [Indexed: 11/02/2024] Open
Abstract
BACKGROUND To assess the reliability and comprehensibility of breast radiology reports simplified by artificial intelligence using the large language model (LLM) ChatGPT-4o. METHODS A radiologist with 20 years' experience selected 21 anonymized breast radiology reports, 7 mammography, 7 breast ultrasound, and 7 breast magnetic resonance imaging (MRI), categorized according to breast imaging reporting and data system (BI-RADS). These reports underwent simplification by prompting ChatGPT-4o with "Explain this medical report to a patient using simple language". Five breast radiologists assessed the quality of these simplified reports for factual accuracy, completeness, and potential harm with a 5-point Likert scale from 1 (strongly agree) to 5 (strongly disagree). Another breast radiologist evaluated the text comprehension of five non-healthcare personnel readers using a 5-point Likert scale from 1 (excellent) to 5 (poor). Descriptive statistics, Cronbach's α, and the Kruskal-Wallis test were used. RESULTS Mammography, ultrasound, and MRI showed high factual accuracy (median 2) and completeness (median 2) across radiologists, with low potential harm scores (median 5); no significant group differences (p ≥ 0.780), and high internal consistency (α > 0.80) were observed. Non-healthcare readers showed high comprehension (median 2 for mammography and MRI and 1 for ultrasound); no significant group differences across modalities (p = 0.368), and high internal consistency (α > 0.85) were observed. BI-RADS 0, 1, and 2 reports were accurately explained, while BI-RADS 3-6 reports were challenging. CONCLUSION The model demonstrated reliability and clarity, offering promise for patients with diverse backgrounds. LLMs like ChatGPT-4o could simplify breast radiology reports, aid in communication, and enhance patient care. RELEVANCE STATEMENT Simplified breast radiology reports generated by ChatGPT-4o show potential in enhancing communication with patients, improving comprehension across varying educational backgrounds, and contributing to patient-centered care in radiology practice. KEY POINTS AI simplifies complex breast imaging reports, enhancing patient understanding. Simplified reports from AI maintain accuracy, improving patient comprehension significantly. Implementing AI reports enhances patient engagement and communication in breast imaging.
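The internal-consistency and group-comparison statistics cited here (Cronbach's α and the Kruskal-Wallis test) can be illustrated with a short sketch. The ratings below are fabricated, and the code is only a minimal example of the calculations, not the authors' analysis.

```python
# Fabricated Likert ratings (1 = strongly agree ... 5 = strongly disagree) used
# only to illustrate the two statistics named in the abstract.
import numpy as np
from scipy import stats

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Rows are rated reports, columns are raters."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

accuracy_ratings = np.array([   # 6 reports x 5 radiologists
    [2, 2, 1, 2, 2],
    [1, 2, 2, 1, 2],
    [2, 3, 2, 2, 2],
    [1, 1, 2, 1, 1],
    [3, 2, 3, 3, 2],
    [2, 2, 2, 3, 2],
])
print("Cronbach's alpha:", round(cronbach_alpha(accuracy_ratings), 2))

# Kruskal-Wallis comparison of scores across the three imaging modalities
mammo, us, mri = [2, 2, 1, 2, 3, 2, 2], [1, 2, 2, 1, 2, 2, 1], [2, 2, 3, 2, 2, 1, 2]
h, p = stats.kruskal(mammo, us, mri)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
```

By common convention, α values above roughly 0.8, as reported in the study, indicate high internal consistency among raters.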
Collapse
Affiliation(s)
- Roberto Maroncelli
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy.
| | - Veronica Rizzo
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| | - Marcella Pasculli
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| | - Federica Cicciarelli
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| | | | - Francesca Galati
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| | - Carlo Catalano
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| | - Federica Pediconi
- Department of Radiological, Oncological and Pathological Sciences, Sapienza-University of Rome, Rome, Roma, Italy
| |
Collapse
|
42
|
Touma NJ, Caterini J, Liblk K. Is ChatGPT ready for primetime? Performance of artificial intelligence on a simulated Canadian urology board exam. Can Urol Assoc J 2024; 18:329-332. [PMID: 38896484 PMCID: PMC11477513 DOI: 10.5489/cuaj.8800] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]
Abstract
INTRODUCTION Generative artificial intelligence (AI) has proven to be a powerful tool with increasing applications in clinical care and medical education. ChatGPT has performed adequately on many specialty certification and knowledge assessment exams. The objective of this study was to assess the performance of ChatGPT 4 on a multiple-choice exam meant to simulate the Canadian urology board exam. METHODS Graduating urology residents representing all Canadian training programs gather yearly for a mock exam that simulates their upcoming board-certifying exam. The exam consists of written multiple-choice questions (MCQs) and an oral objective structured clinical examination (OSCE). The 2022 exam was taken by 29 graduating residents and was administered to ChatGPT 4. RESULTS ChatGPT 4 scored 46% on the MCQ exam, whereas the mean and median scores of graduating urology residents were 62.6% and 62.7%, respectively. This places ChatGPT's score 1.8 standard deviations below the median, corresponding to the sixth percentile. ChatGPT scores on different topics of the exam were as follows: oncology 35%, andrology/benign prostatic hyperplasia 62%, physiology/anatomy 67%, incontinence/female urology 23%, infections 71%, urolithiasis 57%, and trauma/reconstruction 17%, with ChatGPT 4's oncology performance being significantly below that of postgraduate year 5 residents. CONCLUSIONS ChatGPT 4 underperforms on an MCQ exam meant to simulate the Canadian board exam. Ongoing assessment of the capabilities of generative AI is needed as these models evolve and are trained on additional urology content.
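The comparison above expresses ChatGPT's 46% as a distance from the resident median in standard deviations and as a percentile rank. A minimal sketch of that arithmetic follows; the resident score distribution is simulated, since only its mean, median, and the class size are reported.

```python
# Sketch of the ranking arithmetic. Only the class size (29), the cohort mean,
# and ChatGPT's 46% come from the abstract; the individual resident scores are
# simulated, so the printed numbers are illustrative only.
import numpy as np
from scipy import stats

resident_scores = np.random.default_rng(0).normal(loc=62.6, scale=9.3, size=29)
chatgpt_score = 46.0

sd_from_median = (chatgpt_score - np.median(resident_scores)) / resident_scores.std(ddof=1)
percentile_rank = stats.percentileofscore(resident_scores, chatgpt_score)

print(f"{sd_from_median:.1f} SD from the median; percentile rank = {percentile_rank:.0f}")
```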
Collapse
|
43
|
Walsh SR. Chatbots Best Left in the Vascular Clinic Waiting Room…for Now. EJVES Vasc Forum 2024; 62:91-92. [PMID: 39524097 PMCID: PMC11549997 DOI: 10.1016/j.ejvsvf.2024.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2024] [Revised: 08/15/2024] [Accepted: 09/11/2024] [Indexed: 11/16/2024] Open
Affiliation(s)
- Stewart R. Walsh
- University of Galway, Ireland
- National Surgical Research Support Centre, Royal College of Surgeons in Ireland, Ireland
| |
Collapse
|
44
|
Cascella M, Leoni MLG, Shariff MN, Varrassi G. Artificial Intelligence-Driven Diagnostic Processes and Comprehensive Multimodal Models in Pain Medicine. J Pers Med 2024; 14:983. [PMID: 39338237 PMCID: PMC11432921 DOI: 10.3390/jpm14090983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Revised: 09/04/2024] [Accepted: 09/12/2024] [Indexed: 09/30/2024] Open
Abstract
Pain diagnosis remains a challenging task due to its subjective nature, the variability in pain expression among individuals, and the difficult assessment of the underlying biopsychosocial factors. In this complex scenario, artificial intelligence (AI) can offer the potential to enhance diagnostic accuracy, predict treatment outcomes, and personalize pain management strategies. This review aims to dissect the current literature on computer-aided diagnosis methods. It also discusses how AI-driven diagnostic strategies can be integrated into multimodal models that combine various data sources, such as facial expression analysis, neuroimaging, and physiological signals, with advanced AI techniques. Despite the significant advancements in AI technology, its widespread adoption in clinical settings faces crucial challenges. The main issues are ethical considerations related to patient privacy, biases, and the lack of reliability and generalizability. Furthermore, there is a need for high-quality real-world validation and the development of standardized protocols and policies to guide the implementation of these technologies in diverse clinical settings.
Collapse
Affiliation(s)
- Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry “Scuola Medica Salernitana”, University of Salerno, 84081 Baronissi, Italy;
| | - Matteo L. G. Leoni
- Department of Medical and Surgical Sciences and Translational Medicine, Sapienza University of Roma, 00185 Rome, Italy
| | | | | |
Collapse
|
45
|
Lareyre F, Nasr B, Poggi E, Lorenzo GD, Ballaith A, Sliti I, Chaudhuri A, Raffort J. Large language models and artificial intelligence chatbots in vascular surgery. Semin Vasc Surg 2024; 37:314-320. [PMID: 39277347 DOI: 10.1053/j.semvascsurg.2024.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 06/12/2024] [Accepted: 06/14/2024] [Indexed: 09/17/2024]
Abstract
Natural language processing is a subfield of artificial intelligence that aims to analyze human oral or written language. The development of large language models has brought innovative perspectives in medicine, including the potential use of chatbots and virtual assistants. Nevertheless, the benefits and pitfalls of such technology need to be carefully evaluated before their use in health care. The aim of this narrative review was to provide an overview of potential applications of large language models and artificial intelligence chatbots in the field of vascular surgery, including clinical practice, research, and education. In light of the results, we discuss current limits and future directions.
Collapse
Affiliation(s)
- Fabien Lareyre
- Department of Vascular Surgery, Hospital of Antibes Juan-les-Pins, France; Université Côte d'Azur, Centre National de la Recherche Scientifique (CNRS), UMR7370, Laboratoire de Physiomédecine Moléculaire (LP2M), Nice, France; Fédération Hospitalo-Universitaire FHU Plan & Go, Nice, France
| | - Bahaa Nasr
- University of Brest, Institut National de la Santé et de la Recherche Médicale (INSERM), IMT-Atlantique, UMR 1011 LaTIM, Vascular and Endovascular Surgery Department, CHU Cavale Blanche, Brest, France
| | - Elise Poggi
- Department of Vascular Surgery, Hospital of Antibes Juan-les-Pins, France
| | - Gilles Di Lorenzo
- Department of Vascular Surgery, Hospital of Antibes Juan-les-Pins, France
| | - Ali Ballaith
- Department of Cardiovascular Surgery, Zayed Military Hospital, Abu Dhabi, United Arab Emirates
| | - Imen Sliti
- Department of Vascular Surgery, Hospital of Antibes Juan-les-Pins, France
| | - Arindam Chaudhuri
- Bedfordshire - Milton Keynes Vascular Centre, Bedfordshire Hospitals, National Health Service Foundation Trust, Bedford, UK
| | - Juliette Raffort
- Université Côte d'Azur, Centre National de la Recherche Scientifique (CNRS), UMR7370, Laboratoire de Physiomédecine Moléculaire (LP2M), Nice, France; Fédération Hospitalo-Universitaire FHU Plan & Go, Nice, France; Clinical Chemistry Laboratory, University Hospital of Nice, France; Institute 3IA Côte d'Azur, Université Côte d'Azur, France; Department of Clinical Biochemistry, Hôpital Pasteur, Pavillon J, 30, Avenue de la Voie Romaine, 06001 Nice cedex 1, France.
| |
Collapse
|
46
|
Lonsdale H, O'Reilly-Shah VN, Padiyath A, Simpao AF. Supercharge Your Academic Productivity with Generative Artificial Intelligence. J Med Syst 2024; 48:73. [PMID: 39115560 PMCID: PMC11457929 DOI: 10.1007/s10916-024-02093-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Accepted: 07/23/2024] [Indexed: 10/09/2024]
Affiliation(s)
- Hannah Lonsdale
- Department of Anesthesiology, Vanderbilt University School of Medicine, Monroe Carell Jr. Children's Hospital at Vanderbilt, Nashville, TN, 37232, USA.
| | - Vikas N O'Reilly-Shah
- Department of Anesthesiology & Pain Medicine, University of Washington School of Medicine, Seattle, WA, USA
| | - Asif Padiyath
- Department of Anesthesiology and Critical Care, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Department of Anesthesiology and Critical Care Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Allan F Simpao
- Department of Anesthesiology and Critical Care, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Department of Anesthesiology and Critical Care Medicine, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| |
Collapse
|
47
|
Sridharan K, Sivaramakrishnan G. Enhancing readability of USFDA patient communications through large language models: a proof-of-concept study. Expert Rev Clin Pharmacol 2024; 17:731-741. [PMID: 38823007 DOI: 10.1080/17512433.2024.2363840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Accepted: 05/31/2024] [Indexed: 06/03/2024]
Abstract
BACKGROUND The US Food and Drug Administration (USFDA) communicates new drug safety concerns through drug safety communications (DSCs) and medication guides (MGs), which often challenge patients with average reading abilities due to their complexity. This study assesses whether large language models (LLMs) can enhance the readability of these materials. METHODS We analyzed the latest DSCs and MGs, using ChatGPT 4.0© and Gemini© to simplify them to a sixth-grade reading level. Outputs were evaluated for readability, technical accuracy, and content inclusiveness. RESULTS Original materials were difficult to read (DSCs grade level 13, MGs 22). LLMs significantly improved readability, reducing the grade levels to more accessible reading levels (Single prompt - DSCs: ChatGPT 4.0© 10.1, Gemini© 8; MGs: ChatGPT 4.0© 7.1, Gemini© 6.5. Multiple prompts - DSCs: ChatGPT 4.0© 10.3, Gemini© 7.5; MGs: ChatGPT 4.0© 8, Gemini© 6.8). LLM outputs retained technical accuracy and key messages. CONCLUSION LLMs can significantly simplify complex health-related information, making it more accessible to patients. Future research should extend these findings to other languages and patient groups in real-world settings.
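The workflow described, prompting an LLM to rewrite a document at a sixth-grade level and then re-scoring its readability, can be sketched as follows. The simplify_with_llm function is a placeholder for whichever chatbot or API is used, and the textstat package is an assumed choice for grade-level scoring; this is not the authors' code.

```python
# Conceptual sketch of the simplify-then-rescore workflow. simplify_with_llm is
# a placeholder for whichever chatbot or API is used (the study used the
# ChatGPT 4.0 and Gemini interfaces); textstat supplies the grade-level estimate.
import textstat

PROMPT = "Rewrite the following drug safety communication at a sixth-grade reading level:\n\n"

def simplify_with_llm(text: str) -> str:
    # Placeholder: send PROMPT + text to the chosen LLM and return its reply.
    raise NotImplementedError

def readability_change(original: str, simplified: str) -> dict:
    return {
        "original_grade": textstat.flesch_kincaid_grade(original),
        "simplified_grade": textstat.flesch_kincaid_grade(simplified),
    }
```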
Collapse
Affiliation(s)
- Kannan Sridharan
- Department of Pharmacology & Therapeutics, College of Medicine & Medical Sciences, Arabian Gulf University, Manama, Kingdom of Bahrain
| | - Gowri Sivaramakrishnan
- Speciality Dental Residency Program, Primary Health Care Centers, Manama, Kingdom of Bahrain
| |
Collapse
|
48
|
Singh SP, Jamal A, Qureshi F, Zaidi R, Qureshi F. Leveraging Generative Artificial Intelligence Models in Patient Education on Inferior Vena Cava Filters. Clin Pract 2024; 14:1507-1514. [PMID: 39194925 DOI: 10.3390/clinpract14040121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2024] [Revised: 06/13/2024] [Accepted: 07/23/2024] [Indexed: 08/29/2024] Open
Abstract
Background: Inferior vena cava (IVC) filters have become an advantageous treatment modality for patients with venous thromboembolism. As the use of these filters continues to grow, it is imperative for providers to educate patients in a comprehensive yet understandable manner. Generative artificial intelligence models are a growing tool in patient education, but little is known about the readability of the IVC filter material these tools produce. Methods: This study aimed to determine the Flesch Reading Ease (FRE), Flesch-Kincaid, and Gunning Fog readability of IVC filter patient educational materials generated by these artificial intelligence models. Results: The ChatGPT cohort had the highest mean Gunning Fog score at 17.76 ± 1.62, and the Copilot cohort had the lowest at 11.58 ± 1.55. The difference between groups in Flesch Reading Ease scores (p = 8.70408 × 10⁻⁸) was statistically significant, albeit with a priori power found to be low at 0.392. Conclusions: The results of this study indicate that the answers generated in the Microsoft Copilot cohort offer a greater degree of readability than those in the ChatGPT cohort regarding IVC filters. Nevertheless, the mean Flesch-Kincaid readability for both cohorts does not meet the recommended U.S. grade reading levels.
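The Gunning Fog index used in this study has a simple closed form: 0.4 × (average sentence length + percentage of words with three or more syllables). The sketch below is a rough, self-contained approximation with a crude syllable heuristic, so its output will differ slightly from dedicated readability tools.

```python
# Rough, self-contained approximation of the Gunning Fog index:
# 0.4 * (words per sentence + 100 * complex words per word), where "complex"
# means three or more syllables. The syllable counter is a crude vowel-group
# heuristic, so results will differ slightly from dedicated readability tools.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

sample = ("An inferior vena cava filter is a small device that catches blood "
          "clots before they can travel to your lungs.")
print(round(gunning_fog(sample), 1))
```

Scores in the 17-18 range, like the ChatGPT cohort's mean, correspond roughly to graduate-level text, whereas patient materials are commonly recommended to sit around a grade 6-8 reading level.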
Collapse
Affiliation(s)
- Som P Singh
- Department of Internal Medicine, University of Missouri Kansas City School of Medicine, Kansas City, MO 64108, USA
| | - Aleena Jamal
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
| | - Farah Qureshi
- Lake Erie College of Osteopathic Medicine, Erie, PA 16509, USA
| | - Rohma Zaidi
- Department of Internal Medicine, University of Missouri Kansas City School of Medicine, Kansas City, MO 64108, USA
| | - Fawad Qureshi
- Department of Nephrology and Hypertension, Mayo Clinic Alix School of Medicine, Rochester, MN 55905, USA
| |
Collapse
|
49
|
Dursun D, Bilici Geçer R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med Inform Decis Mak 2024; 24:211. [PMID: 39075513 PMCID: PMC11285120 DOI: 10.1186/s12911-024-02619-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2024] [Accepted: 07/23/2024] [Indexed: 07/31/2024] Open
Abstract
BACKGROUND To evaluate the accuracy, reliability, quality, and readability of responses generated by ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot in relation to orthodontic clear aligners. METHODS Questions frequently asked by patients/laypersons about clear aligners were identified on websites using the Google search tool, and these questions were posed to the ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot AI models. Responses were assessed using a five-point Likert scale for accuracy, the modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and the Flesch Reading Ease Score (FRES) for readability. RESULTS ChatGPT-4 responses had the highest mean Likert score (4.5 ± 0.61), followed by Copilot (4.35 ± 0.81), ChatGPT-3.5 (4.15 ± 0.75), and Gemini (4.1 ± 0.72). The difference between the Likert scores of the chatbot models was not statistically significant (p > 0.05). Copilot had significantly higher modified DISCERN and GQS scores than Gemini, ChatGPT-4, and ChatGPT-3.5 (p < 0.05). Gemini's modified DISCERN and GQS scores were significantly higher than those of ChatGPT-3.5 (p < 0.05). Gemini also had a significantly higher FRES than ChatGPT-4, Copilot, and ChatGPT-3.5 (p < 0.05). The mean FRES was 38.39 ± 11.56 for ChatGPT-3.5, 43.88 ± 10.13 for ChatGPT-4, and 41.72 ± 10.74 for Copilot, indicating that these responses were difficult to read for the general population. The mean FRES for Gemini was 54.12 ± 10.27, indicating that Gemini's responses are more readable than those of the other chatbots. CONCLUSIONS All chatbot models provided generally accurate, moderately reliable, and moderate-to-good-quality answers to questions about clear aligners. However, the responses were difficult to read. ChatGPT, Gemini, and Copilot have significant potential as patient information tools in orthodontics; however, to be fully effective, they need to be supplemented with more evidence-based information and improved readability.
Collapse
Affiliation(s)
- Derya Dursun
- Department of Orthodontics, Hamidiye Faculty of Dentistry, University of Health Sciences, Istanbul, Turkey
| | - Rumeysa Bilici Geçer
- Department of Orthodontics, Faculty of Dentistry, Istanbul Aydin University, Istanbul, Turkey.
| |
Collapse
|
50
|
Parsa S, Somani S, Dudum R, Jain SS, Rodriguez F. Artificial Intelligence in Cardiovascular Disease Prevention: Is it Ready for Prime Time? Curr Atheroscler Rep 2024; 26:263-272. [PMID: 38780665 PMCID: PMC11457745 DOI: 10.1007/s11883-024-01210-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/08/2024] [Indexed: 05/25/2024]
Abstract
PURPOSE OF REVIEW This review evaluates how Artificial Intelligence (AI) enhances atherosclerotic cardiovascular disease (ASCVD) risk assessment, allows for opportunistic screening, and improves adherence to guidelines through the analysis of unstructured clinical data and patient-generated data. Additionally, it discusses strategies for integrating AI into clinical practice in preventive cardiology. RECENT FINDINGS AI models have shown superior performance in personalized ASCVD risk evaluations compared to traditional risk scores. These models now support automated detection of ASCVD risk markers, including coronary artery calcium (CAC), across various imaging modalities such as dedicated ECG-gated CT scans, chest X-rays, mammograms, coronary angiography, and non-gated chest CT scans. Moreover, large language model (LLM) pipelines are effective in identifying and addressing gaps and disparities in ASCVD preventive care, and can also enhance patient education. AI applications are proving invaluable in preventing and managing ASCVD and are primed for clinical use, provided they are implemented within well-regulated, iterative clinical pathways.
Collapse
Affiliation(s)
- Shyon Parsa
- Department of Medicine, Stanford University, Stanford, California, USA
| | - Sulaiman Somani
- Department of Medicine, Stanford University, Stanford, California, USA
| | - Ramzi Dudum
- Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University, Stanford, CA, USA
| | - Sneha S Jain
- Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University, Stanford, CA, USA
| | - Fatima Rodriguez
- Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University, Stanford, CA, USA.
- Center for Digital Health, Stanford University, Stanford, California, USA.
| |
Collapse
|