1. Meo SA, Al-Khlaiwi T, AbuKhalaf AA, Meo AS, Klonoff DC. The Scientific Knowledge of Bard and ChatGPT in Endocrinology, Diabetes, and Diabetes Technology: Multiple-Choice Questions Examination-Based Performance. J Diabetes Sci Technol 2025; 19:705-710. [PMID: 37798960 PMCID: PMC12035228 DOI: 10.1177/19322968231203987] [Citation(s) in RCA: 14] [research-article]
Abstract
BACKGROUND The present study aimed to investigate the knowledge level of Bard and ChatGPT in the areas of endocrinology, diabetes, and diabetes technology through a multiple-choice question (MCQ) examination format. METHODS Initially, a 100-MCQ bank was established based on MCQs in endocrinology, diabetes, and diabetes technology. The MCQs were created from physiology and medical textbooks and from academic examination pools in the areas of endocrinology, diabetes, and diabetes technology. The study team members analyzed the MCQ contents to ensure that they were related to endocrinology, diabetes, and diabetes technology. The number of MCQs from endocrinology was 50, and that from diabetes and diabetes technology was also 50. The knowledge level of Google's Bard and ChatGPT was assessed with an MCQ-based examination. RESULTS In the endocrinology examination section, ChatGPT obtained 29 marks (correct responses) of 50 (58%), and Bard obtained a similar score of 29 of 50 (58%). However, in the diabetes technology examination section, ChatGPT obtained 23 marks of 50 (46%), and Bard obtained 20 marks of 50 (40%). Overall, in the entire examination, ChatGPT obtained 52 marks of 100 (52%), and Bard obtained 49 marks of 100 (49%). ChatGPT obtained slightly more marks than Bard. However, neither ChatGPT nor Bard achieved a satisfactory score of at least 60% in endocrinology or diabetes/diabetes technology. CONCLUSIONS The overall MCQ-based performance of ChatGPT was slightly better than that of Google's Bard. However, neither ChatGPT nor Bard achieved appropriate scores in endocrinology or diabetes/diabetes technology. The study indicates that Bard and ChatGPT have the potential to support medical students and faculty in academic medical education settings, but both artificial intelligence tools need more updated information in the fields of endocrinology, diabetes, and diabetes technology.
2. Oeding JF, Lu AZ, Mazzucco M, Fu MC, Taylor SA, Dines DM, Warren RF, Gulotta LV, Dines JS, Kunze KN. ChatGPT-4 Performs Clinical Information Retrieval Tasks Using Consistently More Trustworthy Resources Than Does Google Search for Queries Concerning the Latarjet Procedure. Arthroscopy 2025; 41:588-597. [PMID: 38936557 DOI: 10.1016/j.arthro.2024.05.025] [Citation(s) in RCA: 10] [Comparative Study]
Abstract
PURPOSE To assess the ability of ChatGPT-4, an automated Chatbot powered by artificial intelligence, to answer common patient questions concerning the Latarjet procedure for patients with anterior shoulder instability and compare this performance with Google Search Engine. METHODS Using previously validated methods, a Google search was first performed using the query "Latarjet." Subsequently, the top 10 frequently asked questions (FAQs) and associated sources were extracted. ChatGPT-4 was then prompted to provide the top 10 FAQs and answers concerning the procedure. This process was repeated to identify additional FAQs requiring discrete-numeric answers to allow for a comparison between ChatGPT-4 and Google. Discrete, numeric answers were subsequently assessed for accuracy on the basis of the clinical judgment of 2 fellowship-trained sports medicine surgeons who were blinded to search platform. RESULTS Mean (± standard deviation) accuracy to numeric-based answers was 2.9 ± 0.9 for ChatGPT-4 versus 2.5 ± 1.4 for Google (P = .65). ChatGPT-4 derived information for answers only from academic sources, which was significantly different from Google Search Engine (P = .003), which used only 30% academic sources and websites from individual surgeons (50%) and larger medical practices (20%). For general FAQs, 40% of FAQs were found to be identical when comparing ChatGPT-4 and Google Search Engine. In terms of sources used to answer these questions, ChatGPT-4 again used 100% academic resources, whereas Google Search Engine used 60% academic resources, 20% surgeon personal websites, and 20% medical practices (P = .087). CONCLUSIONS ChatGPT-4 demonstrated the ability to provide accurate and reliable information about the Latarjet procedure in response to patient queries, using multiple academic sources in all cases. This was in contrast to Google Search Engine, which more frequently used single-surgeon and large medical practice websites. Despite differences in the resources accessed to perform information retrieval tasks, the clinical relevance and accuracy of information provided did not significantly differ between ChatGPT-4 and Google Search Engine. CLINICAL RELEVANCE Commercially available large language models (LLMs), such as ChatGPT-4, can perform diverse information retrieval tasks on-demand. An important medical information retrieval application for LLMs consists of the ability to provide comprehensive, relevant, and accurate information for various use cases such as investigation about a recently diagnosed medical condition or procedure. Understanding the performance and abilities of LLMs for use cases has important implications for deployment within health care settings.
3. Dashti M, Londono J, Ghasemi S, Moghaddasi N. How much can we rely on artificial intelligence chatbots such as the ChatGPT software program to assist with scientific writing? J Prosthet Dent 2025; 133:1082-1088. [PMID: 37438164 DOI: 10.1016/j.prosdent.2023.05.023] [Citation(s) in RCA: 7]
Abstract
STATEMENT OF PROBLEM: Use of the ChatGPT software program by authors raises many questions, primarily regarding egregious issues such as plagiarism. Nevertheless, little is known about the extent to which artificial intelligence (AI) models can produce high-quality research publications and advance and shape the direction of a research topic. PURPOSE The purpose of this study was to determine how well the ChatGPT software program, a writing tool powered by AI, could respond to questions about scientific or research writing and generate accurate references with academic examples. MATERIAL AND METHODS Questions were posed to the ChatGPT software program to locate an abstract containing a particular keyword in the Journal of Prosthetic Dentistry (JPD). Then, whether the resulting articles existed or were published was determined. Questions were posed to the algorithm 5 times to locate 5 JPD articles containing 2 specific keywords, bringing the total number of articles to 25. The process was repeated twice, each time with a different set of keywords, and the ChatGPT software program provided a total of 75 articles. The search was conducted at various times between April 1 and 4, 2023. Finally, 2 authors independently searched the JPD website and Google Scholar to determine whether the articles provided by the ChatGPT software program existed. RESULTS When the authors tested the ChatGPT software program's ability to locate articles in the JPD and Google Scholar using the sets of keywords, the results did not match the papers that the program had generated. Consequently, none of the 75 articles provided by the ChatGPT software program could be accurately located in the JPD or Google Scholar databases, and the relevant references had to be added manually to ensure their accuracy. CONCLUSIONS Researchers and academic scholars must be cautious when using the ChatGPT software program because AI-generated content cannot provide or analyze the same information as an author or researcher. In addition, the results indicated that crediting such content, or citing such references, in prestigious academic journals is not yet appropriate. At this time, scientific writing is only valid when performed manually by researchers.
4. Coşkun Ö, Kıyak YS, Budakoğlu Iİ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Med Teach 2025; 47:268-274. [PMID: 38478902 DOI: 10.1080/0142159x.2024.2327477] [Citation(s) in RCA: 7] [Randomized Controlled Trial]
Abstract
AIM This study aimed to evaluate the real-life performance of clinical vignettes and multiple-choice questions generated by using ChatGPT. METHODS This was a randomized controlled study in an evidence-based medicine training program. We randomly assigned seventy-four medical students to two groups. The ChatGPT group received ill-defined cases generated by ChatGPT, while the control group received human-written cases. At the end of the training, they evaluated the cases by rating 10 statements using a Likert scale. They also answered 15 multiple-choice questions (MCQs) generated by ChatGPT. The case evaluations of the two groups were compared. Some psychometric characteristics (item difficulty and point-biserial correlations) of the test were also reported. RESULTS None of the scores on the 10 statements regarding the cases showed a significant difference between the ChatGPT group and the control group (p > .05). In the test, only six MCQs had acceptable levels (higher than 0.30) of point-biserial correlation, and five items could be considered acceptable in classroom settings. CONCLUSIONS The results showed that the quality of the vignettes is comparable to that of vignettes created by human authors, and some multiple-choice questions have acceptable psychometric characteristics. ChatGPT has potential in generating clinical vignettes for teaching and MCQs for assessment in medical education.
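For orientation, the item statistics named in this abstract (item difficulty and point-biserial correlation, with 0.30 as the conventional acceptability threshold) can be computed along the lines of the sketch below; the response matrix and variable names are illustrative assumptions, not the study's data.

```python
# Illustrative sketch of classical MCQ item analysis (not the study's code).
# Rows = students, columns = items; 1 = correct, 0 = incorrect (hypothetical data).
import numpy as np
from scipy.stats import pointbiserialr

responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
])

total_scores = responses.sum(axis=1)   # each student's total test score
difficulty = responses.mean(axis=0)    # item difficulty = proportion answering correctly

for item in range(responses.shape[1]):
    # Point-biserial correlation between the item score and the total score;
    # values above 0.30 are conventionally treated as acceptable discrimination.
    r_pb, _ = pointbiserialr(responses[:, item], total_scores)
    print(f"Item {item + 1}: difficulty = {difficulty[item]:.2f}, r_pb = {r_pb:.2f}")
```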
5. Gan W, Ouyang J, Li H, Xue Z, Zhang Y, Dong Q, Huang J, Zheng X, Zhang Y. Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial. J Med Internet Res 2024; 26:e57037. [PMID: 39163598 PMCID: PMC11372336 DOI: 10.2196/57037] [Citation(s) in RCA: 7] [Randomized Controlled Trial]
Abstract
BACKGROUND ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. OBJECTIVE The study aimed to evaluate ChatGPT's accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. In addition, long-term effects on student performance in other subjects were measured using final examination results. METHODS We first evaluated ChatGPT's accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups' understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups' performance in other disciplines were noted through a follow-up at the end of the semester. RESULTS ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals accomplished the trial and completed all follow-up work. After we intervened in the learning style of the students in the short term, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. CONCLUSIONS ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT's integration into medical education, enhancing contemporary instructional methods. TRIAL REGISTRATION Chinese Clinical Trial Registry Chictr2300071774; https://www.chictr.org.cn/hvshowproject.html ?id=225740&v=1.0.
6. Quinn M, Milner JD, Schmitt P, Morrissey P, Lemme N, Marcaccio S, DeFroda S, Tabaddor R, Owens BD. Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines. Arthroscopy 2025; 41:2002-2008. [PMID: 39313138 DOI: 10.1016/j.arthro.2024.09.020] [Citation(s) in RCA: 6] [Comparative Study]
Abstract
PURPOSE To assess the ability of ChatGPT-4 and Gemini to generate accurate and relevant responses to the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPG) for anterior cruciate ligament reconstruction (ACLR). METHODS Responses from ChatGPT-4 and Gemini to prompts derived from all 15 AAOS guidelines were evaluated by 7 fellowship-trained orthopaedic sports medicine surgeons using a structured questionnaire assessing 5 key characteristics on a scale from 1 to 5. The prompts were categorized into 3 areas: diagnosis and preoperative management, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included mean scoring, standard deviation, and 2-sided t tests to compare the performance between the 2 large language models (LLMs). Scores were then evaluated for inter-rater reliability (IRR). RESULTS Overall, both LLMs performed well with mean scores >4 for the 5 key characteristics. Gemini demonstrated superior performance in overall clarity (4.848 ± 0.36 vs 4.743 ± 0.481, P = .034), but all other characteristics demonstrated nonsignificant differences (P > .05). Gemini also demonstrated superior clarity in the surgical timing and technique (P = .038) as well as the prevention and rehabilitation (P = .044) subcategories. Additionally, Gemini had superior performance completeness scores in the rehabilitation and prevention subcategory (P = .044), but no statistically significant differences were found amongst the other subcategories. The overall IRR was found to be 0.71 (moderate). CONCLUSIONS Both Gemini and ChatGPT-4 demonstrate an overall good ability to generate accurate and relevant responses to question prompts based on the 2022 AAOS CPG for ACLR. However, Gemini demonstrated superior clarity in multiple domains in addition to superior completeness for questions pertaining to rehabilitation and prevention. CLINICAL RELEVANCE The current study addresses a current gap in the LLM and ACLR literature by comparing the performance of ChatGPT-4 to Gemini, which is growing in popularity with more than 300 million individual uses in May 2024 alone. Moreover, the results demonstrated superior performance of Gemini in both clarity and completeness, which are critical elements of a tool being used by patients for educational purposes. Additionally, the current study uses question prompts based on the AAOS CPG, which may be used as a method of standardization for future investigations on performance of LLM platforms. Thus, the results of this study may be of interest to both the readership of Arthroscopy and patients.
7. Topaz M, Peltonen LM, Michalowski M, Stiglic G, Ronquillo C, Pruinelli L, Song J, O'Connor S, Miyagawa S, Fukahori H. The ChatGPT Effect: Nursing Education and Generative Artificial Intelligence. J Nurs Educ 2025; 64:e40-e43. [PMID: 38302101 DOI: 10.3928/01484834-20240126-01] [Citation(s) in RCA: 6]
Abstract
This article examines the potential of generative artificial intelligence (AI), such as ChatGPT (Chat Generative Pre-trained Transformer), in nursing education and the associated challenges and recommendations for their use. Generative AI offers potential benefits such as aiding students with assignments, providing realistic patient scenarios for practice, and enabling personalized, interactive learning experiences. However, integrating generative AI in nursing education also presents challenges, including academic integrity issues, the potential for plagiarism and copyright infringements, ethical implications, and the risk of producing misinformation. Clear institutional guidelines, comprehensive student education on generative AI, and tools to detect AI-generated content are recommended to navigate these challenges. The article concludes by urging nurse educators to harness generative AI's potential responsibly, highlighting the rewards of enhanced learning and increased efficiency. The careful navigation of these challenges and strategic implementation of AI is key to realizing the promise of AI in nursing education. [J Nurs Educ. 2025;64(6):e40-e43.].
8. Guven Y, Ozdemir OT, Kavan MY. Performance of Artificial Intelligence Chatbots in Responding to Patient Queries Related to Traumatic Dental Injuries: A Comparative Study. Dent Traumatol 2025; 41:338-347. [PMID: 39578674 DOI: 10.1111/edt.13020] [Citation(s) in RCA: 6] [Comparative Study]
Abstract
BACKGROUND/AIM Artificial intelligence (AI) chatbots have become increasingly prevalent in recent years as potential sources of online healthcare information for patients when making medical/dental decisions. This study assessed the readability, quality, and accuracy of responses provided by three AI chatbots to questions related to traumatic dental injuries (TDIs), either retrieved from popular question-answer sites or manually created based on the hypothetical case scenarios. MATERIALS AND METHODS A total of 59 traumatic injury queries were directed at ChatGPT 3.5, ChatGPT 4.0, and Google Gemini. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. To assess response quality and accuracy, the DISCERN tool, Global Quality Score (GQS), and misinformation scores were used. The understandability and actionability of the responses were analyzed using the Patient Education Materials Assessment Tool for Printed Materials (PEMAT-P) tool. Statistical analysis included Kruskal-Wallis with Dunn's post hoc test for non-normal variables, and one-way ANOVA with Tukey's post hoc test for normal variables (p < 0.05). RESULTS The mean FKGL and FRE scores for ChatGPT 3.5, ChatGPT 4.0, and Google Gemini were 11.2 and 49.25, 11.8 and 46.42, and 10.1 and 51.91, respectively, indicating that the responses were difficult to read and required a college-level reading ability. ChatGPT 3.5 had the lowest DISCERN and PEMAT-P understandability scores among the chatbots (p < 0.001). ChatGPT 4.0 and Google Gemini were rated higher for quality (GQS score of 5) compared to ChatGPT 3.5 (p < 0.001). CONCLUSIONS In this study, ChatGPT 3.5, although widely used, provided some misleading and inaccurate responses to questions about TDIs. In contrast, ChatGPT 4.0 and Google Gemini generated more accurate and comprehensive answers, making them more reliable as auxiliary information sources. However, for complex issues like TDIs, no chatbot can replace a dentist for diagnosis, treatment, and follow-up care.
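For readers interpreting the readability figures quoted in this entry (and in several others in this list), the standard formulas behind the two indices are reproduced below; this is general background, not material from the cited study.

```latex
% Standard definitions of the Flesch Reading Ease (FRE) and
% Flesch-Kincaid Grade Level (FKGL) indices.
\[
\mathrm{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}}
             - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
\[
\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}}
              + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]
```

Lower FRE and higher FKGL values indicate harder text; FRE scores in the 30-50 band and FKGL values around 10-12, as reported above, correspond to college-level reading difficulty.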
9. Mavrych V, Ganguly P, Bolgova O. Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis. Clin Anat 2025; 38:200-210. [PMID: 39573871 DOI: 10.1002/ca.24244] [Citation(s) in RCA: 5] [Comparative Study]
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of six different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. Selected chatbots were tested, answering 50 USMLE-style MCQs. The questions were randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts to answer each set of questions by the chatbots were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was provided by ChatGPT-4, which answered 60.5% ± 1.9% of questions accurately, then Copilot (42.0% ± 0.0%) and ChatGPT-3.5 (41.0% ± 5.3%), followed by ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-Turbo, PaLM2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of the three randomly selected topics-anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa-and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then for Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, Copilot, followed by Google PaLM 2; Copilot had the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
10. Arslan B, Nuhoglu C, Satici MO, Altinbilek E. Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses. Am J Emerg Med 2025; 89:174-181. [PMID: 39731895 DOI: 10.1016/j.ajem.2024.12.024] [Citation(s) in RCA: 5] [Observational Study]
Abstract
BACKGROUND The number of emergency department (ED) visits has been on steady increase globally. Artificial Intelligence (AI) technologies, including Large Language Model (LLMs)-based generative AI models, have shown promise in improving triage accuracy. This study evaluates the performance of ChatGPT and Copilot in triage at a high-volume urban hospital, hypothesizing that these tools can match trained physicians' accuracy and reduce human bias amidst ED crowding challenges. METHODS This single-center, prospective observational study was conducted in an urban ED over one week. Adult patients were enrolled through random 24-h intervals. Exclusions included minors, trauma cases, and incomplete data. Triage nurses assessed patients while an emergency medicine (EM) physician documented clinical vignettes and assigned emergency severity index (ESI) levels. These vignettes were then introduced to ChatGPT and Copilot for comparison with the triage nurse's decision. RESULTS The overall triage accuracy was 65.2 % for nurses, 66.5 % for ChatGPT, and 61.8 % for Copilot, with no significant difference (p = 0.000). Moderate agreement was observed between the EM physician and ChatGPT, triage nurses, and Copilot (Cohen's Kappa = 0.537, 0.477, and 0.472, respectively). In recognizing high-acuity patients, ChatGPT and Copilot outperformed triage nurses (87.8 % and 85.7 % versus 32.7 %, respectively). Compared to ChatGPT and Copilot, nurses significantly under-triaged patients (p < 0.05). The analysis of predictive performance for ChatGPT, Copilot, and triage nurses demonstrated varying discrimination abilities across ESI levels, all of which were statistically significant (p < 0.05). ChatGPT and Copilot exhibited consistent accuracy across age, gender, and admission time, whereas triage nurses were more likely to mistriage patients under 45 years old. CONCLUSION ChatGPT and Copilot outperform traditional nurse triage in identifying high-acuity patients, but real-time ED capacity data is crucial to prevent overcrowding and ensure high-quality of emergency care.
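As background for the agreement statistics reported above, here is a minimal sketch of computing Cohen's kappa between a reference physician's ESI assignments and another assessor's; the triage labels are hypothetical, not study data.

```python
# Illustrative sketch: Cohen's kappa for agreement on ESI triage levels (1-5).
# The assignments below are hypothetical examples, not the study's data.
from sklearn.metrics import cohen_kappa_score

physician_esi = [2, 3, 3, 4, 1, 2, 5, 3, 4, 2]  # reference (EM physician)
assessor_esi  = [2, 3, 4, 4, 1, 3, 5, 3, 4, 2]  # e.g., nurse or chatbot assignments

kappa = cohen_kappa_score(physician_esi, assessor_esi)
# By the common Landis-Koch convention, 0.41-0.60 is read as moderate agreement.
print(f"Cohen's kappa: {kappa:.3f}")
```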
11. Bhuyan SS, Sateesh V, Mukul N, Galvankar A, Mahmood A, Nauman M, Rai A, Bordoloi K, Basu U, Samuel J. Generative Artificial Intelligence Use in Healthcare: Opportunities for Clinical Excellence and Administrative Efficiency. J Med Syst 2025; 49:10. [PMID: 39820845 PMCID: PMC11739231 DOI: 10.1007/s10916-024-02136-1] [Citation(s) in RCA: 5] [research-article]
Abstract
Generative Artificial Intelligence (Gen AI) has transformative potential in healthcare to enhance patient care, personalize treatment options, train healthcare professionals, and advance medical research. This paper examines various clinical and non-clinical applications of Gen AI. In clinical settings, Gen AI supports the creation of customized treatment plans, generation of synthetic data, analysis of medical images, nursing workflow management, risk prediction, pandemic preparedness, and population health management. By automating administrative tasks such as medical documentations, Gen AI has the potential to reduce clinician burnout, freeing more time for direct patient care. Furthermore, application of Gen AI may enhance surgical outcomes by providing real-time feedback and automation of certain tasks in operating rooms. The generation of synthetic data opens new avenues for model training for diseases and simulation, enhancing research capabilities and improving predictive accuracy. In non-clinical contexts, Gen AI improves medical education, public relations, revenue cycle management, healthcare marketing etc. Its capacity for continuous learning and adaptation enables it to drive ongoing improvements in clinical and operational efficiencies, making healthcare delivery more proactive, predictive, and precise.
12. Villarreal-Espinosa JB, Berreta RS, Allende F, Garcia JR, Ayala S, Familiari F, Chahla J. Accuracy assessment of ChatGPT responses to frequently asked questions regarding anterior cruciate ligament surgery. Knee 2024; 51:84-92. [PMID: 39241674 DOI: 10.1016/j.knee.2024.08.014] [Citation(s) in RCA: 5]
Abstract
BACKGROUND The emergence of artificial intelligence (AI) has allowed users to have access to large sources of information in a chat-like manner. Therefore, we sought to evaluate the accuracy of ChatGPT-4's responses to the 10 most frequently asked patient questions (FAQs) regarding anterior cruciate ligament (ACL) surgery. METHODS A list of the top 10 FAQs pertaining to ACL surgery was created after conducting a search through all Sports Medicine Fellowship Institutions listed on the Arthroscopy Association of North America (AANA) and American Orthopaedic Society for Sports Medicine (AOSSM) websites. A Likert scale was used to grade response accuracy by two sports medicine fellowship-trained surgeons. Cohen's kappa was used to assess inter-rater agreement. Reproducibility of the responses over time was also assessed. RESULTS Five of the 10 responses received a 'completely accurate' grade from two fellowship-trained surgeons, with three additional replies receiving 'completely accurate' status from at least one. Moreover, inter-rater reliability accuracy assessment revealed a moderate agreement between fellowship-trained attending physicians (weighted kappa = 0.57, 95% confidence interval 0.15-0.99). Additionally, 80% of the responses were reproducible over time. CONCLUSION ChatGPT can be considered an accurate additional tool to answer general patient questions regarding ACL surgery. Nonetheless, patient-surgeon interaction should not be deferred and must continue to be the driving force for information retrieval. Thus, the general recommendation is to address any questions in the presence of a qualified specialist.
13. Duran GS, Yurdakurban E, Topsakal KG. The Quality of CLP-Related Information for Patients Provided by ChatGPT. Cleft Palate Craniofac J 2025; 62:588-595. [PMID: 38128909 DOI: 10.1177/10556656231222387] [Citation(s) in RCA: 4]
Abstract
OBJECTIVE To assess the quality, reliability, readability, and similarity of the data that a recently created NLP-based artificial intelligence model, ChatGPT 4, provides to users in Cleft Lip and Palate (CLP)-related information. DESIGN In the evaluation of the responses provided by the OpenAI ChatGPT to the 50 CLP-related questions, several tools were utilized, including the Ensuring Quality Information for Patients (EQIP) tool, Reliability Scoring System (adapted from DISCERN), Flesch Reading Ease (FRES) and Flesch-Kincaid Reading Grade Level (FKRGL) formulas, Global Quality Scale (GQS), and Similarity Index with a plagiarism-detection tool. Jamovi (The Jamovi Project, 2022, version 2.3; Sydney, Australia) software was used for all statistical analyses. RESULTS Based on the reliability and GQS values, ChatGPT demonstrated high reliability and good quality with respect to CLP. Furthermore, according to the FRES results, ChatGPT's readability is difficult, and the similarity index values of this software exhibit an acceptable similarity ratio. There is no significant difference in EQIP, Reliability Scoring System, FRES, FKRGL, GQS, and Similarity Index values between the two categories. CONCLUSION OpenAI ChatGPT provides highly reliable and high-quality, but challenging-to-read, information related to CLP, with an acceptable similarity rate. Ensuring that information obtained through these models is verified and assessed by a qualified medical expert is crucial.
14. Chen Z, Chambara N, Wu C, Lo X, Liu SYW, Gunda ST, Han X, Qu J, Chen F, Ying MTC. Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images. Endocrine 2025; 87:1041-1049. [PMID: 39394537 PMCID: PMC11845565 DOI: 10.1007/s12020-024-04066-x] [Citation(s) in RCA: 4] [research-article]
Abstract
PURPOSE Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images. METHODS This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to diagnose the benign or malignant nature of the nodules. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, referencing pathological diagnoses. RESULTS ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist. CONCLUSION While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.
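Below is a brief sketch of the discrimination metric used above (area under the ROC curve) for benign/malignant calls against the pathology reference; the labels and scores are invented placeholders, not the study's data.

```python
# Illustrative sketch: ROC AUC for benign (0) vs malignant (1) thyroid nodules.
# Ground-truth labels and malignancy scores below are invented placeholders.
from sklearn.metrics import roc_auc_score

pathology = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
malignancy_score = [0.20, 0.60, 0.70, 0.40, 0.10, 0.80, 0.55, 0.90, 0.30, 0.25]

auc = roc_auc_score(pathology, malignancy_score)
print(f"AUC: {auc:.3f}")  # 0.5 corresponds to chance-level discrimination
```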
15. Meyer MKR, Kandathil CK, Davis SJ, Durairaj KK, Patel PN, Pepper JP, Spataro EA, Most SP. Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy. Aesthetic Plast Surg 2025; 49:1868-1873. [PMID: 39285054 DOI: 10.1007/s00266-024-04343-0] [Citation(s) in RCA: 3]
Abstract
OBJECTIVE Assessment of the readability, accuracy, quality, and completeness of ChatGPT (OpenAI, San Francisco, CA), Gemini (Google, Mountain View, CA), and Claude (Anthropic, San Francisco, CA) responses to common questions about rhinoplasty. METHODS Ten questions commonly encountered in the senior author's (SPM) rhinoplasty practice were presented to ChatGPT-4, Gemini and Claude. Seven Facial Plastic and Reconstructive Surgeons with experience in rhinoplasty were asked to evaluate these responses for accuracy, quality, completeness, relevance, and use of medical jargon on a Likert scale. The responses were also evaluated using several readability indices. RESULTS ChatGPT achieved significantly higher evaluator scores for accuracy and overall quality but scored significantly lower on completeness compared with Gemini and Claude. All three chatbots' responses to the ten questions were rated as neutral to incomplete. All three chatbots were found to use medical jargon and scored at a college reading level on readability scores. CONCLUSIONS Rhinoplasty surgeons should be aware that the medical information found on chatbot platforms is incomplete and still needs to be scrutinized for accuracy. However, the technology does have potential for use in healthcare education by training it on evidence-based recommendations and improving readability. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.
16. You M, Chen X, Liu D, Lin Y, Chen G, Li J. ChatGPT-4 and wearable device assisted Intelligent Exercise Therapy for co-existing Sarcopenia and Osteoarthritis (GAISO): a feasibility study and design for a randomized controlled PROBE non-inferiority trial. J Orthop Surg Res 2024; 19:635. [PMID: 39380108 PMCID: PMC11463084 DOI: 10.1186/s13018-024-05134-8] [Citation(s) in RCA: 3] [Clinical Trial Protocol]
Abstract
BACKGROUND Sarcopenia and osteoarthritis are prevalent age-related diseases that mutually exacerbate each other, creating a vicious cycle that worsens both conditions. Exercise is key to breaking this detrimental cycle. Facing increasing demand for rehabilitation services within this patient demographic, ChatGPT-4 and wearable devices may increase the availability, efficiency, and personalization of such health care. AIM To evaluate the clinical efficacy and cost-effectiveness of a rehabilitation system implemented on mobile platforms, utilizing the integration of ChatGPT-4 and wearable devices. METHODS The study design is a prospective randomized open blinded end-point (PROBE) non-inferiority trial. A total of 278 patients diagnosed with osteoarthritis and sarcopenia will be recruited and randomly assigned to the intervention group and the control group. In the intervention group, patients receive a mobile phone-based rehabilitation service in which ChatGPT-4 generates personalized exercise therapy and a wearable device guides and monitors the patient in implementing the exercise therapy. Traditional clinic-based face-to-face exercise therapy will be prescribed and implemented in the control group. All patients will receive three months of exercise therapy following the frequency, intensity, type, time, volume, and progression (FITT-VP) principle. The patients will be assessed at baseline, one month, three months, and six months after initiation. Outcome measures will include range of motion (ROM), gait patterns, Visual Analogue Scale (VAS) for pain assessment, Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Knee Injury and Osteoarthritis Outcome Score (KOOS) for functional assessment, Short-Form Health Survey 12 (SF-12) for quality of life, and Minimal Clinically Important Difference (MCID), Patient Acceptable Symptom State (PASS), and Substantial Clinical Benefit (SCB) for clinically significant measures. DISCUSSION A rehabilitation system combining the capabilities of ChatGPT-4 and wearable devices could potentially enhance the availability and efficiency of professional rehabilitation services, thus improving therapeutic outcomes for a substantial population concurrently afflicted with sarcopenia and osteoarthritis.
17. Brodsky V, Ullah E, Bychkov A, Song AH, Walk EE, Louis P, Rasool G, Singh RS, Mahmood F, Bui MM, Parwani AV. Generative Artificial Intelligence in Anatomic Pathology. Arch Pathol Lab Med 2025; 149:298-318. [PMID: 39836377 DOI: 10.5858/arpa.2024-0215-ra] [Citation(s) in RCA: 3] [Review]
Abstract
CONTEXT.— Generative artificial intelligence (AI) has emerged as a transformative force in various fields, including anatomic pathology, where it offers the potential to significantly enhance diagnostic accuracy, workflow efficiency, and research capabilities. OBJECTIVE.— To explore the applications, benefits, and challenges of generative AI in anatomic pathology, with a focus on its impact on diagnostic processes, workflow efficiency, education, and research. DATA SOURCES.— A comprehensive review of current literature and recent advancements in the application of generative AI within anatomic pathology, categorized into unimodal and multimodal applications, and evaluated for clinical utility, ethical considerations, and future potential. CONCLUSIONS.— Generative AI demonstrates significant promise in various domains of anatomic pathology, including diagnostic accuracy enhanced through AI-driven image analysis, virtual staining, and synthetic data generation; workflow efficiency, with potential for improvement by automating routine tasks, quality control, and reflex testing; education and research, facilitated by AI-generated educational content, synthetic histology images, and advanced data analysis methods; and clinical integration, with preliminary surveys indicating cautious optimism for nondiagnostic AI tasks and growing engagement in academic settings. Ethical and practical challenges require rigorous validation, prompt engineering, federated learning, and synthetic data generation to help ensure trustworthy, reliable, and unbiased AI applications. Generative AI can potentially revolutionize anatomic pathology, enhancing diagnostic accuracy, improving workflow efficiency, and advancing education and research. Successful integration into clinical practice will require continued interdisciplinary collaboration, careful validation, and adherence to ethical standards to ensure the benefits of AI are realized while maintaining the highest standards of patient care.
18. Özbek EA, Ertan MB, Kından P, Karaca MO, Gürsoy S, Chahla J. ChatGPT Can Offer At Least Satisfactory Responses to Common Patient Questions Regarding Hip Arthroscopy. Arthroscopy 2025; 41:1806-1827. [PMID: 39242057 DOI: 10.1016/j.arthro.2024.08.036] [Citation(s) in RCA: 3]
Abstract
PURPOSE To assess the accuracy of answers provided by ChatGPT 4.0 (an advanced language model developed by OpenAI) regarding 25 common patient questions about hip arthroscopy. METHODS ChatGPT 4.0 was presented with 25 common patient questions regarding hip arthroscopy with no follow-up questions and repetition. Each response was evaluated by 2 board-certified orthopaedic sports medicine surgeons independently. Responses were rated, with scores of 1, 2, 3, and 4 corresponding to "excellent response not requiring clarification," "satisfactory requiring minimal clarification," "satisfactory requiring moderate clarification," and "unsatisfactory requiring substantial clarification," respectively. RESULTS Twenty responses were rated "excellent" and 2 responses were rated "satisfactory requiring minimal clarification" by both reviewers. Responses to the questions "What kind of anesthesia is used for hip arthroscopy?" and "What is the average age for hip arthroscopy?" were rated as "satisfactory requiring minimal clarification" by both reviewers. None of the responses were rated as "satisfactory requiring moderate clarification" or "unsatisfactory" by either reviewer. CONCLUSIONS ChatGPT 4.0 provides at least satisfactory responses to patient questions regarding hip arthroscopy. Under the supervision of an orthopaedic sports medicine surgeon, it could be used as a supplementary tool for patient education. CLINICAL RELEVANCE This study compared ChatGPT's answers to patient questions regarding hip arthroscopy with the current literature. As ChatGPT has gained popularity among patients, the study aimed to determine whether the responses that patients get from this chatbot are consistent with the up-to-date literature.
19. Helvacioglu-Yigit D, Demirturk H, Ali K, Tamimi D, Koenig L, Almashraqi A. Evaluating artificial intelligence chatbots for patient education in oral and maxillofacial radiology. Oral Surg Oral Med Oral Pathol Oral Radiol 2025; 139:750-759. [PMID: 40044548 DOI: 10.1016/j.oooo.2025.01.001] [Citation(s) in RCA: 3]
Abstract
OBJECTIVE This study aimed to compare the quality and readability of the responses generated by 3 publicly available artificial intelligence (AI) chatbots in answering frequently asked questions (FAQs) related to Oral and Maxillofacial Radiology (OMR) to assess their suitability for patient education. STUDY DESIGN Fifteen OMR-related questions were selected from professional patient information websites. These questions were posed to ChatGPT-3.5 by OpenAI, Gemini 1.5 Pro by Google, and Copilot by Microsoft to generate responses. Three board-certified OMR specialists evaluated the responses regarding scientific adequacy, ease of understanding, and overall reader satisfaction. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores. The Wilcoxon signed-rank test was conducted to compare the scores assigned by the evaluators to the responses from the chatbots and professional websites. Interevaluator agreement was examined by calculating the Fleiss kappa coefficient. RESULTS There were no significant differences between groups in terms of scientific adequacy. In terms of readability, chatbots had overall mean FKGL and FRE scores of 12.97 and 34.11, respectively. Interevaluator agreement level was generally high. CONCLUSIONS Although chatbots are relatively good at responding to FAQs, validating AI-generated information using input from healthcare professionals can enhance patient care and safety. Readability of the text content in the chatbots and websites requires high reading levels.
20. Ballard DH, Antigua-Made A, Barre E, Edney E, Gordon EB, Kelahan L, Lodhi T, Martin JG, Ozkan M, Serdynski K, Spieler B, Zhu D, Adams SJ. Impact of ChatGPT and Large Language Models on Radiology Education: Association of Academic Radiology-Radiology Research Alliance Task Force White Paper. Acad Radiol 2025; 32:3039-3049. [PMID: 39616097 DOI: 10.1016/j.acra.2024.10.023] [Citation(s) in RCA: 3]
Abstract
Generative artificial intelligence, including large language models (LLMs), holds immense potential to enhance healthcare, medical education, and health research. Recognizing the transformative opportunities and potential risks afforded by LLMs, the Association of Academic Radiology-Radiology Research Alliance convened a task force to explore the promise and pitfalls of using LLMs such as ChatGPT in radiology. This white paper explores the impact of LLMs on radiology education, highlighting their potential to enrich curriculum development, teaching and learning, and learner assessment. Despite these advantages, the implementation of LLMs presents challenges, including limits on accuracy and transparency, the risk of misinformation, data privacy issues, and potential biases, which must be carefully considered. We provide recommendations for the successful integration of LLMs and LLM-based educational tools into radiology education programs, emphasizing assessment of the technological readiness of LLMs for specific use cases, structured planning, regular evaluation, faculty development, increased training opportunities, academic-industry collaboration, and research on best practices for employing LLMs in education.
21. Kolding S, Lundin RM, Hansen L, Østergaard SD. Use of generative artificial intelligence (AI) in psychiatry and mental health care: a systematic review. Acta Neuropsychiatr 2024; 37:e37. [PMID: 39523628 DOI: 10.1017/neu.2024.50] [Citation(s) in RCA: 2] [Systematic Review]
Abstract
OBJECTIVES Tools based on generative artificial intelligence (AI) such as ChatGPT have the potential to transform modern society, including the field of medicine. Due to the prominent role of language in psychiatry, e.g., for diagnostic assessment and psychotherapy, these tools may be particularly useful within this medical field. Therefore, the aim of this study was to systematically review the literature on generative AI applications in psychiatry and mental health. METHODS We conducted a systematic review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. The search was conducted across three databases, and the resulting articles were screened independently by two researchers. The content, themes, and findings of the articles were qualitatively assessed. RESULTS The search and screening process resulted in the inclusion of 40 studies. The median year of publication was 2023. The themes covered in the articles were mainly mental health and well-being in general - with less emphasis on specific mental disorders (substance use disorder being the most prevalent). The majority of studies were conducted as prompt experiments, with the remaining studies comprising surveys, pilot studies, and case reports. Most studies focused on models that generate language, ChatGPT in particular. CONCLUSIONS Generative AI in psychiatry and mental health is a nascent but quickly expanding field. The literature mainly focuses on applications of ChatGPT, and finds that generative AI performs well, but notes that it is limited by significant safety and ethical concerns. Future research should strive to enhance transparency of methods, use experimental designs, ensure clinical relevance, and involve users/patients in the design phase.
22. Yang Z, Yao Z, Tasmin M, Vashisht P, Jang WS, Ouyang F, Wang B, McManus D, Berlowitz D, Yu H. Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study. J Med Internet Res 2025; 27:e65146. [PMID: 39919278 PMCID: PMC11845889 DOI: 10.2196/65146] [Citation(s) in RCA: 2] [Observational Study]
Abstract
BACKGROUND Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored. OBJECTIVE This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V's newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings. METHODS This cross-sectional study tested GPT-4V, GPT-4, and ChatGPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 clinical knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE) (n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V's accuracy was compared with 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. The quality of the explanations was evaluated by choosing human preference between an explanation by GPT-4V (without hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians. RESULTS For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in Step 1, Step 2 clinical knowledge, Step 3 of USMLE, and DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately. CONCLUSIONS GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.
23. Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports. Br J Radiol 2025; 98:368-374. [PMID: 39535870 PMCID: PMC11840166 DOI: 10.1093/bjr/tqae236] [Citation(s) in RCA: 2] [research-article]
Abstract
OBJECTIVES Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign Prostate Imaging-Reporting and Data System (PI-RADS) categories based on clinical text reports. METHODS One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by 2 uroradiologists, ChatGPT-3.5 (GPT-3.5), ChatGPT-4o mini (GPT-4), Bard, and Gemini. Original report classifications were considered definitive. RESULTS Of the 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 as PI-RADS 3, 19 as PI-RADS 4, and 20 as PI-RADS 5. The 2 radiologists demonstrated 95% and 90% accuracy, respectively, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated LLM versions increased to 83% (GPT-4) and 79% (Gemini), respectively. In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94, 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77, 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95, 0.98, respectively) outperformed Bard and Gemini (F1: 0.71, 0.87, respectively). Bard hallucinated a non-existent "PI-RADS 6" category for 2 patients. Inter-reader agreement (κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini was 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively. CONCLUSIONS Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors. ADVANCES IN KNOWLEDGE This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
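For orientation only, the agreement (κ) and F1 values reported above are standard classification metrics; a minimal sketch with invented labels (not the study data) shows how they are commonly computed with scikit-learn. Grouping PI-RADS 4 and 5 into a single high-suspicion comparison here is an assumption made for illustration.

# Minimal sketch with invented labels: Cohen's kappa against the original reports
# and F1 scores for low- and high-suspicion PI-RADS groupings.
from sklearn.metrics import cohen_kappa_score, f1_score

original = ["1-2", "1-2", "3", "4", "5", "4", "1-2", "5"]  # original report categories
llm      = ["1-2", "3",   "3", "4", "4", "4", "1-2", "5"]  # LLM-assigned categories

kappa = cohen_kappa_score(original, llm)
f1_low = f1_score(original, llm, labels=["1-2"], average="macro")      # PI-RADS 1-2
f1_high = f1_score(original, llm, labels=["4", "5"], average="macro")  # PI-RADS 4-5 (illustrative grouping)
print(f"kappa={kappa:.2f}, F1 low={f1_low:.2f}, F1 high={f1_high:.2f}")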
| research-article | 1 | 2 |
24
|
Ali R, Cui H. Leveraging ChatGPT for Enhanced Aesthetic Evaluations in Minimally Invasive Facial Procedures. Aesthetic Plast Surg 2025; 49:950-961. [PMID: 39578313 DOI: 10.1007/s00266-024-04524-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Accepted: 11/04/2024] [Indexed: 11/24/2024]
Abstract
BACKGROUND In recent years, the application of AI technologies like ChatGPT has gained traction in the field of plastic surgery. AI models can analyze pre- and post-treatment images to offer insights into the effectiveness of cosmetic procedures. This technological advancement enables rapid, objective evaluations that can complement traditional assessment methods, providing a more comprehensive understanding of treatment outcomes. OBJECTIVE The study aimed to comprehensively assess the effectiveness of the custom ChatGPT model "Face Rating and Review AI" for facial feature evaluation in minimally invasive aesthetic procedures, particularly before and after Botox treatments. METHODS An analysis was conducted on the Web of Science (WoS) database, identifying 79 articles published between 2023 and 2024 on ChatGPT in the field of plastic surgery from various countries. A dataset of 23 patients from Kaggle, including pre- and post-Botox images, was used. The custom ChatGPT model, "Face Rating & Review AI," was used to assess facial features based on objective parameters such as the golden ratio, symmetry, proportion, side angles, skin condition, and overall harmony, as well as subjective parameters like personality, temperament, and social attraction. RESULTS The WoS search found 79 articles on ChatGPT in plastic surgery from 27 countries, with most publications originating from the USA, Australia, and Italy. The objective and subjective parameters were analyzed using a paired t-test, and all facial features showed statistically significant differences (p < 0.05). Higher mean scores on features such as the golden ratio (mean = 5.86, SD = 0.69), skin condition (mean = 3.78, SD = 0.73), and personality (mean = 5.0, SD = 0.79) indicated positive shifts after treatment. CONCLUSION The custom ChatGPT model "Face Rating and Review AI" is a valuable tool for assessing facial features in Botox treatments. It effectively evaluates objective and subjective attributes, aiding clinical decision-making. However, ethical considerations highlight the need for diverse datasets in future research to improve accuracy and inclusivity. LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266.
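The paired t-test underlying these results can be reproduced in outline as below; the pre- and post-treatment ratings are invented for illustration and are not the study's data.

# Minimal sketch with invented ratings: paired t-test for one facial-feature score
# rated before and after treatment, as in the paired analysis described above.
from scipy import stats

pre_scores  = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7]  # hypothetical pre-Botox golden-ratio ratings
post_scores = [5.9, 6.0, 5.6, 6.2, 5.8, 5.7]  # hypothetical post-Botox ratings, same patients

t_stat, p_value = stats.ttest_rel(pre_scores, post_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant change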
| | 1 | 2 |
25
|
Nasef H, Patel H, Amin Q, Baum S, Ratnasekera A, Ang D, Havron WS, Nakayama D, Elkbuli A. Evaluating the Accuracy, Comprehensiveness, and Validity of ChatGPT Compared to Evidence-Based Sources Regarding Common Surgical Conditions: Surgeons' Perspectives. Am Surg 2025; 91:325-335. [PMID: 38794965 DOI: 10.1177/00031348241256075] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/27/2024]
Abstract
BACKGROUND This study aims to assess the accuracy, comprehensiveness, and validity of ChatGPT compared to evidence-based sources regarding the diagnosis and management of common surgical conditions by surveying the perceptions of U.S. board-certified practicing surgeons. METHODS An anonymous cross-sectional survey was distributed to U.S. practicing surgeons from June 2023 to March 2024. The survey comprised 94 multiple-choice questions evaluating diagnostic and management information for five common surgical conditions, drawn either from evidence-based sources or generated by ChatGPT. Statistical analysis included descriptive statistics and paired-sample t-tests. RESULTS Participating surgeons were primarily aged 40-50 years (43%), male (86%), White (57%), and had 5-10 years or >15 years of experience (86%). The majority of surgeons had no prior experience with ChatGPT in surgical practice (86%). For material discussing acute cholecystitis and upper gastrointestinal hemorrhage, evidence-based sources were rated as significantly more comprehensive (3.57 ± 0.535 vs 2.00 ± 1.16, P = .025; 4.14 ± 0.69 vs 2.43 ± 0.98, P < .001) and valid (3.71 ± 0.488 vs 2.86 ± 1.07, P = .045; 3.71 ± 0.76 vs 2.71 ± 0.95, P = .038) than ChatGPT. However, there was no significant difference in accuracy between the two sources (3.71 vs 3.29, P = .289; 3.57 vs 2.71, P = .111). CONCLUSION Surveyed U.S. board-certified practicing surgeons rated evidence-based sources as significantly more comprehensive and valid compared to ChatGPT across the majority of surveyed surgical conditions. However, there was no significant difference in accuracy between the sources across the majority of surveyed conditions. While ChatGPT may offer potential benefits in surgical practice, further refinement and validation are necessary to enhance its utility and acceptance among surgeons.
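As with the study above, the comparison rests on descriptive statistics and paired-sample t-tests over surgeons' ratings; a minimal sketch with invented 1-5 Likert ratings (not the survey data) is shown below.

# Minimal sketch with invented Likert ratings (1-5): mean +/- SD per source and a
# paired-sample t-test comparing the same surgeons' ratings of the two sources.
import numpy as np
from scipy import stats

evidence_based = np.array([4, 4, 3, 4, 4, 3, 4])  # hypothetical comprehensiveness ratings
chatgpt        = np.array([2, 3, 1, 2, 3, 2, 1])  # hypothetical ratings of ChatGPT material

for name, scores in (("evidence-based", evidence_based), ("ChatGPT", chatgpt)):
    print(f"{name}: mean={scores.mean():.2f}, SD={scores.std(ddof=1):.2f}")

t_stat, p_value = stats.ttest_rel(evidence_based, chatgpt)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.3f}")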
| Comparative Study | 1 | 2 |