1. Büker M, Mercan G. Readability, accuracy and appropriateness and quality of AI chatbot responses as a patient information source on root canal retreatment: A comparative assessment. Int J Med Inform 2025; 201:105948. PMID: 40288015; DOI: 10.1016/j.ijmedinf.2025.105948.
Abstract
AIM This study aimed to assess the readability, accuracy, appropriateness, and overall quality of responses generated by large language models (LLMs), including ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash), to frequently asked questions (FAQs) related to root canal retreatment. METHODS Three LLM chatbots-ChatGPT-3.5, Microsoft Copilot, and Gemini (Version 2.0 Flash)-were assessed based on their responses to 10 patient FAQs. Readability was analyzed using seven indices: the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), Gunning Fog (GFOG), Linsear Write (LW), Coleman-Liau (CL), and Automated Readability Index (ARI), and compared against the recommended sixth-grade reading level. Response quality was evaluated using the Global Quality Scale (GQS), while accuracy and appropriateness were rated on a five-point Likert scale by two independent reviewers. Statistical analyses were conducted using one-way ANOVA with Tukey or Games-Howell post-hoc tests for continuous variables. Spearman's correlation test was used to assess associations between categorical variables. RESULTS All chatbots generated responses exceeding the recommended readability level, making them suitable for readers at or above the 10th-grade level. No significant difference was found between ChatGPT-3.5 and Microsoft Copilot, while Gemini produced significantly more readable responses (p < 0.05). Gemini demonstrated the highest proportion of accurate (80%) and high-quality responses (80%) compared to ChatGPT-3.5 and Microsoft Copilot. CONCLUSIONS None of the chatbots met the recommended readability standards for patient education materials. While Gemini demonstrated better readability, accuracy, and quality, all three models require further optimization to enhance accessibility and reliability in patient communication.
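The readability indices named in this and several of the following abstracts are closed-form surface formulas over word, sentence, and syllable counts. As a rough illustration only (not the authors' pipeline, which may have used a dedicated calculator), FRES and FKGL can be computed as sketched below; the vowel-group syllable counter is a simplifying assumption:

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; dictionary-based counters are more accurate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)            # words per sentence
    spw = syllables / len(words)                 # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease Score
    fkgl = 0.39 * wps + 11.8 * spw - 15.59       # Flesch-Kincaid Grade Level
    return {"FRES": round(fres, 1), "FKGL": round(fkgl, 1)}

sample = ("Root canal retreatment removes the old root filling. "
          "The canals are cleaned, disinfected and sealed again.")
print(readability(sample))
```

Published calculators differ mainly in how they count syllables and split sentences, which is why reported scores for the same text can vary slightly between tools.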
Affiliation(s)
- Mine Büker: Department of Endodontics, Faculty of Dentistry, Mersin University, Mersin, Turkey.
- Gamze Mercan: Department of Endodontics, Faculty of Dentistry, Mersin University, Mersin, Turkey.
2. Solomon TPJ, Laye MJ. The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: An assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability. PLoS One 2025; 20:e0325982. PMID: 40512755; PMCID: PMC12165421; DOI: 10.1371/journal.pone.0325982.
Abstract
BACKGROUND Generative artificial intelligence (AI) chatbots are increasingly utilised in various domains, including sports nutrition. Despite their growing popularity, there is limited evidence on the accuracy, completeness, clarity, evidence quality, and test-retest reliability of AI-generated sports nutrition advice. This study evaluates the performance of ChatGPT, Gemini, and Claude's basic and advanced models across these metrics to determine their utility in providing sports nutrition information. MATERIALS AND METHODS Two experiments were conducted. In Experiment 1, chatbots were tested with simple and detailed prompts in two domains: Sports nutrition for training and Sports nutrition for racing. Intraclass correlation coefficient (ICC) was used to assess interrater agreement and chatbot performance was assessed by measuring accuracy, completeness, clarity, evidence quality, and test-retest reliability. In Experiment 2, chatbot performance was evaluated by measuring the accuracy and test-retest reliability of chatbots' answers to multiple-choice questions based on a sports nutrition certification exam. ANOVAs and logistic mixed models were used to analyse chatbot performance. RESULTS In Experiment 1, interrater agreement was good (ICC = 0.893) and accuracy varied from 74% (Gemini1.5pro) to 31% (ClaudePro). Detailed prompts improved Claude's accuracy but had little impact on ChatGPT or Gemini. Completeness scores were highest for ChatGPT-4o compared to other chatbots, which scored low to moderate. The quality of cited evidence was low for all chatbots when simple prompts were used but improved with detailed prompts. In Experiment 2, accuracy ranged from 89% (Claude3.5Sonnet) to 61% (ClaudePro). Test-retest reliability was acceptable across all metrics in both experiments. CONCLUSIONS While generative AI chatbots demonstrate potential in providing sports nutrition guidance, their accuracy is moderate at best and inconsistent between models. Until significant advancements are made, athletes and coaches should consult registered dietitians for tailored nutrition advice.
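The intraclass correlation coefficient used here for interrater agreement is derived from a two-way ANOVA decomposition of the item-by-rater score matrix. The sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) on invented ratings; it illustrates the statistic only and does not reproduce the study's analysis:

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` is an n_items x n_raters matrix of ratings."""
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()    # between items
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical 0-3 accuracy scores from two raters on six chatbot answers.
ratings = np.array([[3, 3], [2, 1], [3, 2], [0, 1], [2, 2], [1, 1]], dtype=float)
print(round(icc_2_1(ratings), 3))
```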
Affiliation(s)
- Matthew J. Laye: Idaho College of Osteopathic Medicine, Meridian, Idaho, United States of America
3. Lafourcade C, Kérourédan O, Ballester B, Richert R. Accuracy, consistency, and contextual understanding of large language models in restorative dentistry and endodontics. J Dent 2025; 157:105764. PMID: 40246058; DOI: 10.1016/j.jdent.2025.105764.
Abstract
OBJECTIVE This study aimed to evaluate and compare the performance of several large language models (LLMs) in the context of restorative dentistry and endodontics, focusing on their accuracy, consistency, and contextual understanding. METHODS The dataset was extracted from the national educational archives of the Collège National des Enseignants en Odontologie Conservatrice (CNEOC) and includes all chapters from the reference manual for dental residency applicants. Multiple-choice questions (MCQs) were selected following a review by three independent academic experts. Four LLMs were assessed: ChatGPT-3.5, ChatGPT-4 (OpenAI), Claude-3 (Anthropic), and Mistral 7B (Mistral AI). Model accuracy was determined by comparing responses with expert-provided answers. Consistency was measured through robustness (the ability to provide identical responses to paraphrased questions) and repeatability (the ability to provide identical responses to the same question). Contextual understanding was evaluated based on the model's ability to categorise questions correctly and infer terms from definitions. Additionally, accuracy was reassessed after providing the LLMs with the relevant full course chapter. RESULTS A total of 517 MCQs and 539 definitions were included. ChatGPT-4 and Claude-3 demonstrated significantly higher accuracy and repeatability than Mistral 7B, with ChatGPT-4 showing the greater robustness. Advanced LLMs displayed high accuracy in presenting dental content, although performance varied on closely related concepts. Supplying course chapters generally improved response accuracy, though inconsistently across topics. CONCLUSION Even the most advanced LLMs, such as ChatGPT-4 and Claude 3, achieve moderate performance and require cautious use due to inconsistencies in robustness. Future studies should focus on integrating validated content and refining prompt engineering to enhance the educational and clinical utility of LLMs. CLINICAL SIGNIFICANCE The findings underscore the potential of advanced LLMs and context-based prompting in restorative dentistry and endodontics.
Affiliation(s)
- Claire Lafourcade: UFR des Sciences Odontologiques, Université de Bordeaux, Bordeaux, France; CHU de Bordeaux, Pôle de Médecine et Chirurgie bucco-dentaire, Bordeaux, France
- Olivia Kérourédan: UFR des Sciences Odontologiques, Université de Bordeaux, Bordeaux, France; CHU de Bordeaux, Pôle de Médecine et Chirurgie bucco-dentaire, Bordeaux, France; UMR 1026 BioTis INSERM, Université de Bordeaux, Bordeaux, France
- Benoit Ballester: Assistance Publique des Hôpitaux de Marseille, Marseille, France; Aix Marseille Univ, Inserm, IRD, SESSTIM, Sciences Economiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France
- Raphael Richert: Faculté d'Odontologie, Université Lyon 1, Lyon, France; INSA Lyon, CNRS, LaMCoS, UMR5259, Villeurbanne, France; Hospices Civils de Lyon, PAM Odontologie, Lyon, France.
4. Cantao AB, Levin L. What's Next in Dental Trauma? Innovations, Preventive Strategies, and Future Treatment Paths. Dent Traumatol 2025; 41:241-245. PMID: 40329468; DOI: 10.1111/edt.13069.
Affiliation(s)
- Liran Levin: College of Dentistry, University of Saskatchewan, Saskatoon, Canada
5. Guven Y, Ozdemir OT, Kavan MY. Performance of Artificial Intelligence Chatbots in Responding to Patient Queries Related to Traumatic Dental Injuries: A Comparative Study. Dent Traumatol 2025; 41:338-347. PMID: 39578674; DOI: 10.1111/edt.13020.
Abstract
BACKGROUND/AIM Artificial intelligence (AI) chatbots have become increasingly prevalent in recent years as potential sources of online healthcare information for patients when making medical/dental decisions. This study assessed the readability, quality, and accuracy of responses provided by three AI chatbots to questions related to traumatic dental injuries (TDIs), either retrieved from popular question-answer sites or manually created based on the hypothetical case scenarios. MATERIALS AND METHODS A total of 59 traumatic injury queries were directed at ChatGPT 3.5, ChatGPT 4.0, and Google Gemini. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. To assess response quality and accuracy, the DISCERN tool, Global Quality Score (GQS), and misinformation scores were used. The understandability and actionability of the responses were analyzed using the Patient Education Materials Assessment Tool for Printed Materials (PEMAT-P) tool. Statistical analysis included Kruskal-Wallis with Dunn's post hoc test for non-normal variables, and one-way ANOVA with Tukey's post hoc test for normal variables (p < 0.05). RESULTS The mean FKGL and FRE scores for ChatGPT 3.5, ChatGPT 4.0, and Google Gemini were 11.2 and 49.25, 11.8 and 46.42, and 10.1 and 51.91, respectively, indicating that the responses were difficult to read and required a college-level reading ability. ChatGPT 3.5 had the lowest DISCERN and PEMAT-P understandability scores among the chatbots (p < 0.001). ChatGPT 4.0 and Google Gemini were rated higher for quality (GQS score of 5) compared to ChatGPT 3.5 (p < 0.001). CONCLUSIONS In this study, ChatGPT 3.5, although widely used, provided some misleading and inaccurate responses to questions about TDIs. In contrast, ChatGPT 4.0 and Google Gemini generated more accurate and comprehensive answers, making them more reliable as auxiliary information sources. However, for complex issues like TDIs, no chatbot can replace a dentist for diagnosis, treatment, and follow-up care.
Affiliation(s)
- Yeliz Guven: Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
- Omer Tarik Ozdemir: Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
- Melis Yazir Kavan: Istanbul University, Department of Pedodontics, Faculty of Dentistry, Istanbul, Turkey
6. Freire Y, Santamaría Laorden A, Orejas Pérez J, Ortiz Collado I, Gómez Sánchez M, Thuissard Vasallo IJ, Díaz-Flores García V, Suárez A. Evaluating the influence of prompt formulation on the reliability and repeatability of ChatGPT in implant-supported prostheses. PLoS One 2025; 20:e0323086. PMID: 40445924; PMCID: PMC12124515; DOI: 10.1371/journal.pone.0323086.
Abstract
Large language models (LLMs) such as ChatGPT are widely available to any dental professional. However, there is limited evidence on the reliability and reproducibility of ChatGPT-4 in relation to implant-supported prostheses, as well as the impact of prompt design on its responses. This constrains understanding of its application within this specific area of dentistry. The purpose of this study was to evaluate the performance of ChatGPT-4 in generating answers about implant-supported prostheses using different prompts. Thirty questions on implant-supported and implant-retained prostheses were posed, with 30 answers generated per question using general and specific prompts, totaling 1800 answers. Experts assessed reliability (agreement with expert grading) and repeatability (response consistency) using a 3-point Likert scale. General prompts achieved 70.89% reliability, with repeatability ranging from moderate to almost perfect. Specific prompts showed higher performance, with 78.8% reliability and substantial to almost perfect repeatability. The specific prompt significantly improved reliability compared to the general prompt. Despite these promising results, ChatGPT's ability to generate reliable answers on implant-supported prostheses remains limited, highlighting the need for professional oversight; using specific prompts, however, can enhance its answer-generation performance.
Affiliation(s)
- Yolanda Freire: Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Andrea Santamaría Laorden: Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Jaime Orejas Pérez: Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Ignacio Ortiz Collado: Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Margarita Gómez Sánchez: Department of Preclinical Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Israel J. Thuissard Vasallo: School for Doctoral Studies and Research, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Víctor Díaz-Flores García: Department of Preclinical Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
- Ana Suárez: Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Villaviciosa de Odón, Madrid, Spain
7. Kim J, Jeong A, Jin J, Lee S, Yoon DK, Kim S. Temporal Association Between ChatGPT-Generated Diarrhea Synonyms in Internet Search Queries and Emergency Department Visits for Diarrhea-Related Symptoms in South Korea: Exploratory Study. J Med Internet Res 2025; 27:e65101. PMID: 40403303; DOI: 10.2196/65101.
Abstract
BACKGROUND Diarrhea, a common symptom of gastrointestinal infections, can lead to severe complications and is a major cause of emergency department (ED) visits. OBJECTIVE This study explored the temporal association between internet search queries for diarrhea and its synonyms and ED visits for diarrhea-related symptoms. METHODS We used data from the National Emergency Department Information System (NEDIS) and NAVER (Naver Corporation), South Korea's leading search engine, from January 2017 to December 2021. After identifying diarrhea synonyms using ChatGPT, we compared weekly trends in relative search volumes (RSVs) for diarrhea, including its synonyms and weekly ED visits. Pearson correlation analysis and Granger causality tests were used to evaluate the relationship between RSVs and ED visits. We developed an Autoregressive Integrated Moving Average with Exogenous Variables (ARIMAX) model to further predict these associations. This study also examined the age-based distribution of search behaviors and ED visits. RESULTS A significant correlation was observed between the weekly RSV for diarrhea and its synonyms and weekly ED visits for diarrhea-related symptoms (ranging from 0.14 to 0.51, P<.05). Weekly RSVs for diarrhea synonyms, such as "upset stomach," "watery diarrhea," and "acute enteritis," showed stronger correlations with weekly ED visits than weekly RSVs for the general term "diarrhea" (ranging from 0.20 to 0.41, P<.05). This may be because these synonyms better reflect layperson terminology. Notably, weekly RSV for "upset stomach" was significantly correlated with weekly ED visits for diarrhea and acute diarrhea at 1 and 2 weeks before the visit (P<.05). An ARIMAX model was developed to predict weekly ED visits based on weekly RSVs for diarrhea synonyms with lagged effects to capture their temporal influence. The age group of <50 years showed the highest activity in both web-based searches and ED visits for diarrhea-related symptoms. CONCLUSIONS This study demonstrates that weekly RSVs for diarrhea synonyms are associated with weekly ED visits for diarrhea-related symptoms. By encompassing a nationwide scope, this study broadens the existing methodology for syndromic surveillance using ED data and provides valuable insights for clinicians.
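An ARIMAX model of this kind regresses the weekly ED-visit series on its own past values plus lagged search-volume terms as exogenous regressors. A minimal sketch with statsmodels on synthetic data is shown below; the one-week lag and the (1, 0, 1) order are illustrative assumptions, not the paper's specification:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
weeks = pd.date_range("2017-01-01", periods=260, freq="W")

# Synthetic weekly relative search volume (RSV) and ED visits driven by lagged RSV.
rsv = pd.Series(50 + 10 * rng.standard_normal(260), index=weeks)
visits = 200 + 2.5 * rsv.shift(1).bfill() + 15 * rng.standard_normal(260)

exog = rsv.shift(1).bfill().to_frame("rsv_lag1")      # 1-week lagged exogenous term
fit = SARIMAX(visits, exog=exog, order=(1, 0, 1)).fit(disp=False)
print(fit.params.filter(like="rsv_lag1"))             # estimated effect of lagged RSV

# Forecast 4 weeks ahead; real use would supply future RSV values here.
print(fit.forecast(steps=4, exog=exog.iloc[-4:]))
```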
Affiliation(s)
- Jinsoo Kim: Department of Emergency Medicine, Hanyang University College of Medicine, Seoul, Republic of Korea
- Ansun Jeong: Department of Preventive Medicine, Hanyang University College of Medicine, Seoul, Republic of Korea
- Juseong Jin: Department of Urology, Seoul National University Hospital, Seoul, Republic of Korea
- Sangjun Lee: Department of Preventive Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea; Integrated Major in Innovative Medical Science, Seoul National University Graduate School, Seoul, Republic of Korea; Cancer Research Institute, Seoul National University, Seoul, Republic of Korea
- Do Kyoon Yoon: Department of Data Science Research, Innovative Medical Technology Research Institute, Seoul National University Hospital, Seoul, Republic of Korea
- Soyeoun Kim: Biomedical Research Institute, Seoul National University Hospital, Seoul, Republic of Korea
8. Baris SD, Baris K. Assessment of various artificial intelligence applications in responding to technical questions in endodontic surgery. BMC Oral Health 2025; 25:763. PMID: 40405212; PMCID: PMC12096613; DOI: 10.1186/s12903-025-06149-1.
Abstract
BACKGROUND The objective of this study was to evaluate the performance of ScholarGPT, ChatGPT-4o and Google Gemini in responding to queries pertaining to endodontic apical surgery, a subject that demands advanced specialist knowledge in endodontics. METHODS A total of 30 questions, including 12 binary and 18 open-ended queries, were formulated based on information on endodontic apical surgery taken from the well-known endodontic textbook Cohen's Pathways of the Pulp (12th edition). The questions were posed by two different researchers using different accounts on the ScholarGPT, ChatGPT-4o and Gemini platforms. The responses were then coded by the researchers and categorised as 'correct', 'incorrect', or 'insufficient'. The Pearson chi-square test was used to assess the relationships between the platforms. RESULTS A total of 5,400 responses were evaluated. Chi-square analysis revealed statistically significant differences in the accuracy of the responses provided by the applications (χ² = 22.61; p < 0.05). ScholarGPT demonstrated the highest rate of correct responses (97.7%), followed by ChatGPT-4o with 90.1%. Conversely, Gemini exhibited the lowest correct response rate (59.5%) among the applications examined. CONCLUSIONS ScholarGPT performed better overall on questions about endodontic apical surgery than ChatGPT-4o and Gemini. GPT models based on academic databases, such as ScholarGPT, may provide more accurate information about dentistry. However, additional research should be conducted to develop a GPT model that is specifically tailored to the field of endodontics.
9. Büyüközer Özkan H, Doğan Çankaya T, Kölüş T. The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics. Healthcare (Basel) 2025; 13:1190. PMID: 40428026; PMCID: PMC12111750; DOI: 10.3390/healthcare13101190.
Abstract
BACKGROUND Regenerative endodontic procedures (REPs) are promising treatments for immature teeth with necrotic pulp. Artificial intelligence (AI) is increasingly used in dentistry; thus, this study evaluates the reliability of AI-generated information on REPs, comparing four AI models against clinical guidelines. METHODS ChatGPT-4o, Claude 3.5 Sonnet, Grok 2, and Gemini 2.0 Advanced were tested with 20 REP-related questions from the ESE/AAE guidelines and expert consensus. Questions were posed in Turkish and English, with or without prompts. Two specialists assessed 640 AI-generated answers via a four-point rubric. Inter-rater reliability and response accuracy were statistically analyzed. RESULTS Inter-rater reliability was high (0.85-0.97). ChatGPT-4o showed higher accuracy with English prompts (p < 0.05). Claude was more accurate than Grok in the Turkish (nonprompted) and English (prompted) conditions (p < 0.05). No model reached ≥80% accuracy. Claude (English, prompted) scored highest; Grok-Turkish (nonprompted) scored lowest. CONCLUSIONS The performance of AI models varies significantly across languages. English queries yield higher accuracy. While AI shows potential for REPs information, current models lack sufficient accuracy for clinical reliance. Cautious interpretation and validation against guidelines are essential. Further research is needed to enhance AI performance in specialized dental fields.
Affiliation(s)
- Hatice Büyüközer Özkan: Department of Endodontics, Faculty of Dentistry, Alanya Alaaddin Keykubat University, 07490 Alanya, Türkiye
- Tülin Doğan Çankaya: Department of Endodontics, Faculty of Dentistry, Alanya Alaaddin Keykubat University, 07490 Alanya, Türkiye
- Türkay Kölüş: Department of Restorative Dentistry, Faculty of Dentistry, Karamanoğlu Mehmetbey University, 70200 Karaman, Türkiye
10. Binaljadm TM, Alqutaibi AY, Halboub E, Zafar MS, Saker S. Artificial Intelligence Chatbots as Sources of Implant Dentistry Information for the Public: Validity and Reliability Assessment. Eur J Dent 2025. PMID: 40393663; DOI: 10.1055/s-0045-1809155.
Abstract
This study assessed the reliability and validity of responses from three chatbot systems-OpenAI's GPT-3.5, Gemini, and Copilot-concerning frequently asked questions (FAQs) in implant dentistry posed by patients. Twenty FAQs were submitted to the three chatbots at three different time points using their respective application programming interfaces. The responses were assessed for validity (low and high threshold) and reliability by two prosthodontic consultants using a five-point Likert scale. Normality was tested using the Shapiro-Wilk test. Differences between chatbots regarding the quantitative variables at a given (fixed) time point, and between the same chatbot at different time points, were assessed using Friedman's two-way analysis of variance by ranks, followed by pairwise comparisons. All statistical analyses were conducted using the SPSS (Statistical Package for Social Sciences) Version 26.0 software program. GPT-3.5 provided the longest responses, while Gemini was the most concise. All chatbots advised consulting dental professionals more frequently. Validity was high under the low-threshold test but low under the high-threshold test, with Copilot scoring the highest. Reliability was high for all, with Gemini achieving perfect consistency. Chatbots showed consistent and generally valid responses with some variability in accuracy and detail. While the chatbots demonstrated a high degree of reliability, their validity-especially under the high-threshold criterion-remains limited. Improvements in accuracy and comprehensiveness are necessary for more effective use in providing information about dental implants.
Affiliation(s)
- Tahani Mohammed Binaljadm: Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia
- Ahmed Yaseen Alqutaibi: Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia; Department of Prosthodontics, College of Dentistry, Ibb University, Ibb, Yemen
- Esam Halboub: Department of Maxillofacial Surgery and Diagnostic Science, College of Dentistry, Jazan University, Jazan, Saudi Arabia
- Muhammad Sohail Zafar: Department of Clinical Sciences, College of Dentistry, Ajman University, Ajman, United Arab Emirates; Centre of Medical and Bio-allied Health Sciences Research, Ajman University, Ajman, United Arab Emirates; School of Dentistry, Jordan University, Amman, Jordan
- Samah Saker: Department of Substitutive Dental Sciences (Prosthodontics), College of Dentistry, Taibah University, Al Madinah, Saudi Arabia
11. Gökcek Taraç M, Nale T. Artificial intelligence in pediatric dental trauma: do artificial intelligence chatbots address parental concerns effectively? BMC Oral Health 2025; 25:736. PMID: 40382588; PMCID: PMC12085849; DOI: 10.1186/s12903-025-06105-z.
Abstract
BACKGROUND This study focused on two Artificial Intelligence chatbots, ChatGPT 3.5 and Google Gemini, as the primary tools for answering questions related to traumatic dental injuries. The aim of this study was to evaluate the reliability, understandability, and applicability of the responses provided by these chatbots to commonly asked questions from parents of children with dental trauma. METHODS The case scenarios were developed based on frequently asked questions that parents commonly ask their dentists or Artificial Intelligence chatbots regarding dental trauma in children. The quality and accuracy of the information obtained from the chatbots were assessed using the DISCERN Instrument. The understandability and actionability of the responses obtained from the Artificial Intelligence chatbots were assessed using the Patient Education Materials Assessment Tool for Printed Materials. In the statistical analysis, categorical variables were analyzed in terms of frequency and percentage. For numerical variables, skewness and kurtosis values were calculated to assess normal distribution. RESULTS Both Artificial Intelligence chatbots performed similarly, although Google Gemini provided higher-quality and more reliable responses. Based on the mean scores, ChatGPT 3.5 had higher understandability. Both chatbots demonstrated similar levels of performance in terms of actionability. CONCLUSION Artificial Intelligence applications can serve as a helpful starting point for parents seeking information and reassurance after dental trauma. However, they should not replace professional dental consultations, as their reliability is not absolute. Parents should use Artificial Intelligence applications as complementary resources and seek timely professional advice for accurate diagnosis and treatment.
Affiliation(s)
- Mihriban Gökcek Taraç: Department of Pediatric Dentistry, Karabuk University School of Dentistry, Karabük, Turkey.
- Tuğba Nale: Antalya Oral and Dental Health Hospital, Antalya, Turkey
12. Metin U, Goymen M. Information from digital and human sources: A comparison of chatbot and clinician responses to orthodontic questions. Am J Orthod Dentofacial Orthop 2025:S0889-5406(25)00156-8. PMID: 40327024; DOI: 10.1016/j.ajodo.2025.04.008.
Abstract
INTRODUCTION This study aimed to investigate whether artificial intelligence (AI)-based chatbots can be used as reliable adjunct tools in orthodontic practice by evaluating chatbot responses and comparing them to those of clinicians with varying levels of knowledge. METHODS Large language model-based chatbots (ChatGPT-4, ChatGPT-4o, Microsoft Copilot, Google Gemini 1.5 Pro, and Claude 3.5 Sonnet) and clinicians (dental students, general dentists, and orthodontists; n = 30) were included. The groups were asked 40 true and false questions, and the accuracy rate for each question was assessed by comparing it to the predetermined answer key. The total score was converted into a percentage. The Kruskal-Wallis test and Dunn's multiple comparison tests were used to compare accuracy rates. The consistency of the answers given by chatbots at 3 different times was assessed by Cronbach α. RESULTS The accuracy ratio scores for students were significantly lower than Microsoft Copilot (P = 0.029), Claude 3.5 Sonnet (P = 0.023), ChatGPT-4o (P = 0.005), and orthodontists (P = 0.001). For dentists, the accuracy ratio scores were found to be significantly lower than ChatGPT-4o (P = 0.019) and orthodontists (P = 0.001). The accuracy rate of ChatGPT-4o was closest to that of orthodontists, whereas the accuracy rates of ChatGPT-4, Microsoft Copilot, Claude 3.5 Sonnet, and Google Gemini 1.5 Pro were lower than orthodontists but higher than general dentists. Although ChatGPT-4 demonstrated a high degree of consistency in its responses, evidenced by a high Cronbach α value (α = 0.867), ChatGPT-4o (α = 0.256) and Claude 3.5 Sonnet (α = 0.256) were the least consistent chatbots. CONCLUSIONS The study found that orthodontists had the highest accuracy rate, whereas AI-based chatbots had a higher accuracy rate compared with dental students and general dentists. However, ChatGPT-4 gave the most consistent answers, whereas ChatGPT-4o and Claude 3.5 Sonnet showed the least consistency. AI-based chatbots can be useful for patient education and general orthodontic guidance, but a lack of consistency in responses can lead to the risk of misinformation.
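Cronbach's α for response consistency, as used in this and several neighbouring studies, treats each repeated query as an "item" and each question as a "case". A minimal sketch with invented correct/incorrect scores (not the study's data):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: n_questions x n_repetitions matrix (one column per repetition)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-repetition variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of per-question totals
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical correct(1)/incorrect(0) scores for 8 questions asked 3 times each.
answers = np.array([[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1],
                    [0, 1, 0], [1, 1, 1], [1, 1, 0], [0, 0, 0]], dtype=float)
print(round(cronbach_alpha(answers), 3))
```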
Affiliation(s)
- Ufuk Metin: Department of Orthodontics, Dentistry Faculty, Gaziantep University, Gaziantep, Turkey
- Merve Goymen: Department of Orthodontics, Dentistry Faculty, Gaziantep University, Gaziantep, Turkey.
13. Aljamani S, Hassona Y, Fansa HA, M Saadeh H, Dafi Jamani K. Evaluating Large Language Models in Addressing Patient Questions on Endodontic Pain: A Comparative Analysis of Accessible Chatbots. J Endod 2025:S0099-2399(25)00212-2. PMID: 40334976; DOI: 10.1016/j.joen.2025.04.015.
Abstract
INTRODUCTION Patients increasingly use large language models for health-related information, but their reliability and usefulness remain controversial. Continuous assessment is essential to evaluate their role in patient education. This study evaluates the performance of ChatGPT-3.5 and Gemini in answering patient inquiries about endodontic pain. METHODS A total of 62 frequently asked questions on endodontic pain were categorized into etiology, symptoms, management, and incidence. Responses from ChatGPT 3.5 and Gemini were assessed using standardized tools, including the Global Quality Score (GQS), Completeness, Lack of false information, Evidence supported, Appropriateness and Relevance reliability tool, and readability indices (Flesch-Kincaid and Simple Measure of Gobbledygook). RESULTS Compared to Gemini, ChatGPT 3.5 responses scored significantly higher in terms of overall quality (GQS: 4.67-4.9 vs 2.5-4, P < .001) and reliability (Completeness, Lack of false information, Evidence supported, Appropriateness and Relevance: 23.5-23.6 vs 19.35-22.7, P < .05). However, it required a higher reading level (Simple Measure of Gobbledygook: 14-17.6) compared to Gemini (8.7-11.3, P < .001). Gemini's responses were more readable (6th-7th grade level) but lacked depth and completeness. CONCLUSION While ChatGPT 3.5 outperformed Gemini in quality and reliability, its complex language reduced accessibility. In contrast, Gemini's simpler language enhanced readability but sacrificed comprehensiveness. These findings highlight the need for professional oversight in integrating artificial intelligence-driven tools into healthcare communication to ensure accurate, accessible, and empathetic patient education.
Affiliation(s)
- Sanaa Aljamani: Department of Restorative Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan; Jordan University Hospital, Amman, Jordan.
- Yazan Hassona: Jordan University Hospital, Amman, Jordan; Department of Oral and Maxillofacial Surgery, School of Dentistry, The University of Jordan, Amman, Jordan
- Hoda A Fansa: Faculty of Dentistry, Al-Ahliyya Amman University, Amman, Jordan; Faculty of Dentistry, Alexandria University, Alexandria, Egypt
- Hiba M Saadeh: Department of Restorative Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan
- Kifah Dafi Jamani: Jordan University Hospital, Amman, Jordan; Department of Prosthetic Dentistry, School of Dentistry, The University of Jordan, Amman, Jordan
14. Abdulrab S, Abada H, Mashyakhy M, Mostafa N, Alhadainy H, Halboub E. Performance of 4 Artificial Intelligence Chatbots in Answering Endodontic Questions. J Endod 2025; 51:602-608. PMID: 39814135; DOI: 10.1016/j.joen.2025.01.002.
Abstract
INTRODUCTION Artificial intelligence models have shown potential as educational tools in healthcare, such as answering exam questions. This study aimed to assess the performance of 4 prominent chatbots: ChatGPT-4o, MedGebra GPT-4o, Meta Llama 3, and Gemini Advanced in answering multiple-choice questions (MCQs) in endodontics. METHODS The study utilized 100 MCQs, each with 4 potential answers. These MCQs were obtained from 2 well-known endodontic textbooks. The performance of the above chatbots in choosing the correct answers was assessed twice with a 1-week interval. RESULTS The stability of the performance across the 2 rounds was highest for ChatGPT-4o, followed by Gemini Advanced and Meta Llama 3. MedGebra GPT-4o provided the highest percentage of true answers in the first round (93%), followed by ChatGPT-4o in the second round (90%). Meta Llama 3 provided the lowest percentages in the first (73%) and second rounds (75%). Although the performance of MedGebra GPT-4o was the best in the first round, it was less stable in the second round (McNemar P > .05; Kappa = 0.725, P < .001). CONCLUSIONS ChatGPT-4o and MedGebra GPT-4o correctly answered a high fraction of endodontic MCQs, while Meta Llama 3 and Gemini Advanced showed lower performance. Further training and development are required to improve their accuracy and reliability in endodontics.
Affiliation(s)
- Saleem Abdulrab: Al Khor Health Centre, Primary Health Care Corporation, Doha, Qatar
- Hisham Abada: Department of Endodontics, Faculty of Dentistry, Kafrelsheikh University, Kafrelsheikh, Egypt.
- Mohammed Mashyakhy: Department of Restorative Dental Sciences, College of Dentistry, Jazan University, Jazan, Saudi Arabia
- Nawras Mostafa: Al Saad Health Centre, Primary Health Care Corporation, Doha, Qatar
- Hatem Alhadainy: Department of Endodontics, Faculty of Dentistry, Tanta University, Tanta, Egypt
- Esam Halboub: Department of Maxillofacial Surgery and Diagnostic Sciences, College of Dentistry, Jazan University, Jazan, Saudi Arabia
15. Esmailpour H, Rasaie V, Babaee Hemmati Y, Falahchai M. Performance of artificial intelligence chatbots in responding to the frequently asked questions of patients regarding dental prostheses. BMC Oral Health 2025; 25:574. PMID: 40234820; PMCID: PMC11998412; DOI: 10.1186/s12903-025-05965-9.
Abstract
BACKGROUND Artificial intelligence (AI) chatbots are increasingly used in healthcare to address patient questions by providing personalized responses. Evaluating their performance is essential to ensure their reliability. This study aimed to assess the performance of three AI chatbots in responding to the frequently asked questions (FAQs) of patients regarding dental prostheses. METHODS Thirty-one frequently asked questions (FAQs) were collected from accredited organizations' websites and the "People Also Ask" feature of Google, focusing on removable and fixed prosthodontics. Two board-certified prosthodontists evaluated response quality using the modified Global Quality Score (GQS) on a 5-point Likert scale. Inter-examiner agreement was assessed using weighted kappa. Readability was measured using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) indices. Statistical analyses were performed using repeated measures ANOVA and Friedman test, with Bonferroni correction for pairwise comparisons (α = 0.05). RESULTS The inter-examiner agreement was good. Among the chatbots, Google Gemini had the highest quality score (4.58 ± 0.50), significantly outperforming Microsoft Copilot (3.87 ± 0.89) (P =.004). Readability analysis showed ChatGPT (10.45 ± 1.26) produced significantly more complex responses compared to Gemini (7.82 ± 1.19) and Copilot (8.38 ± 1.59) (P <.001). FRE scores indicated that ChatGPT's responses were categorized as fairly difficult (53.05 ± 7.16), while Gemini's responses were in plain English (64.94 ± 7.29), with a significant difference between them (P <.001). CONCLUSIONS AI chatbots show great potential in answering patient inquiries about dental prostheses. However, improvements are needed to enhance their effectiveness as patient education tools.
Affiliation(s)
- Vanya Rasaie: Research Affiliate at Sydney Dental School, Faculty of Medicine and Health, Sydney, Australia
- Yasamin Babaee Hemmati: Department of Orthodontics, Dental Sciences Research Center, School of Dentistry, Guilan University of Medical Sciences, Rasht, Iran
- Mehran Falahchai: Department of Prosthodontics, Dental Sciences Research Center, School of Dentistry, Guilan University of Medical Sciences, Rasht, Iran.
16. Gheisarifar M, Shembesh M, Koseoglu M, Fang Q, Afshari FS, Yuan JCC, Sukotjo C. Evaluating the validity and consistency of artificial intelligence chatbots in responding to patients' frequently asked questions in prosthodontics. J Prosthet Dent 2025:S0022-3913(25)00243-4. PMID: 40199631; DOI: 10.1016/j.prosdent.2025.03.009.
Abstract
STATEMENT OF PROBLEM Healthcare-related information provided by artificial intelligence (AI) chatbots may pose challenges such as inaccuracies, lack of empathy, biases, over-reliance, limited scope, and ethical concerns. PURPOSE The purpose of this study was to evaluate and compare the validity and consistency of responses to prosthodontics-related frequently asked questions (FAQ) generated by 4 different chatbot systems. MATERIAL AND METHODS Four prosthodontics domains were evaluated: implant, fixed prosthodontics, complete denture (CD), and removable partial denture (RPD). Within each domain, 10 questions were prepared by full-time prosthodontic faculty members, and 10 questions were generated by GPT-3.5, representing its top frequently asked questions in each domain. The validity and consistency of responses provided by 4 chatbots: GPT-3.5, GPT-4, Gemini, and Bing were evaluated. The chi-squared test with the Yates correction was used to compare the validity of responses between different chatbots (α=.05). The Cronbach alpha was calculated for 3 sets of responses collected in the morning, afternoon, and evening to evaluate the consistency of the responses. RESULTS According to the low threshold validity test, the chatbots' answers to ChatGPT's implant-related, ChatGPT's RPD-related, and prosthodontists' CD-related FAQs were statistically different (P<.001, P<.001, and P=.004, respectively), with Bing being the lowest. At the high threshold validity test, the chatbots' answers to ChatGPT's implant-related and RPD-related FAQs and ChatGPT's and prosthodontists' fixed prosthetics-related and CD-related FAQs were statistically different (P<.001, P<.001, P=.004, P=.002, and P=.003, respectively), with Bing being the lowest. Overall, all 4 chatbots demonstrated lower validity at the high threshold than the low threshold. Bing, Gemini, and ChatGPT-4 chatbots displayed an acceptable level of consistency, while ChatGPT-3.5 did not. CONCLUSIONS Currently, AI chatbots show limitations in delivering answers to patients' prosthodontic-related FAQs with high validity and consistency.
Affiliation(s)
- Maryam Gheisarifar: Clinical Assistant Professor, Department of Restorative Dentistry, College of Dentistry, University of Illinois Chicago, Chicago, Ill
- Marwa Shembesh: Clinical Assistant Professor, Department of Restorative Dentistry, College of Dentistry, University of Illinois Chicago, Chicago, Ill
- Merve Koseoglu: Associate Professor, Department of Prosthodontics, Faculty of Dentistry, University of Sakarya, Sakarya, Turkey; and PhD student, Department of Prosthodontics, Faculty of Dentistry, University of Ataturk, Erzurum, Turkey
- Qiao Fang: Clinical Assistant Professor, Department of Restorative Dentistry, College of Dentistry, University of Illinois Chicago, Chicago, Ill
- Fatemeh Solmaz Afshari: Associate Professor, Department of Restorative Dentistry, College of Dentistry, University of Illinois Chicago, Chicago, Ill
- Judy Chia-Chun Yuan: Professor and Associate Dean for Clinical Affairs, Department of Restorative Dentistry, College of Dentistry, University of Illinois Chicago, Chicago, Ill
- Cortino Sukotjo: Professor and Chair, Department of Prosthodontics, University of Pittsburgh, School of Dental Medicine, Pittsburgh, PA.
17. Portilla ND, Garcia-Font M, Nagendrababu V, Abbott PV, Sanchez JAG, Abella F. Accuracy and Consistency of Gemini Responses Regarding the Management of Traumatized Permanent Teeth. Dent Traumatol 2025; 41:171-177. PMID: 39460511; DOI: 10.1111/edt.13004.
Abstract
BACKGROUND The aim of this cross-sectional observational analytical study was to assess the accuracy and consistency of responses provided by Google Gemini (GG), a free-access high-performance multimodal large language model, to questions related to the European Society of Endodontology position statement on the management of traumatized permanent teeth (MTPT). MATERIALS AND METHODS Three academic endodontists developed a set of 99 yes/no questions covering all areas of the MTPT. Nine general dentists and 22 endodontic specialists evaluated these questions for clarity and comprehension through an iterative process. Two academic dental trauma experts categorized the knowledge required to answer each question into three levels. The three academic endodontists submitted the 99 questions to the GG, resulting in 297 responses, which were then assessed for accuracy and consistency. Accuracy was evaluated using the Wald binomial method, while the consistency of GG responses was assessed using the kappa-Fleiss coefficient with a confidence interval of 95%. A 5% significance level chi-squared test was used to evaluate the influence of question level of knowledge on accuracy and consistency. RESULTS The responses generated by Gemini showed an overall moderate accuracy of 80.81%, with no significant differences found between the responses of the academic endodontists. Overall, high consistency (95.96%) was demonstrated, with no significant differences between GG responses across the three accounts. The analysis also revealed no correlation between question level of knowledge and accuracy or consistency, with no significant differences. CONCLUSIONS The results of this study could significantly impact the potential use of Gemini as a free-access source of information for clinicians in the MTPT.
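The Wald interval used here for accuracy is a normal approximation to a binomial proportion. The toy calculation below uses 240 of 297 correct responses, which is consistent with the reported 80.81%, and the usual z = 1.96 for 95% confidence; it illustrates the method only:

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)     # normal-approximation half-width
    return p, max(0.0, p - half), min(1.0, p + half)

p, lo, hi = wald_ci(successes=240, n=297)     # 240/297 is roughly 80.8% accurate responses
print(f"accuracy {p:.4f}, 95% Wald CI [{lo:.4f}, {hi:.4f}]")
```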
Affiliation(s)
- Nicolas Dufey Portilla: Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain; Department of Endodontics, School of Dentistry, Universidad Andres Bello, Viña del Mar, Chile
- Marc Garcia-Font: Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
- Venkateshbabu Nagendrababu: Department of Preventive and Restorative Dentistry, College of Dental Medicine, University of Sharjah, Sharjah, UAE
- Paul V Abbott: UWA Dental School, The University of Western Australia, Perth, Western Australia, Australia
- Francesc Abella: Department of Endodontics, School of Dentistry, Universitat International de Catalunya, Barcelona, Spain
18. Johnson AJ, Singh TK, Gupta A, Sankar H, Gill I, Shalini M, Mohan N. Evaluation of validity and reliability of AI Chatbots as public sources of information on dental trauma. Dent Traumatol 2025; 41:187-193. PMID: 39417352; DOI: 10.1111/edt.13000.
Abstract
AIM This study aimed to assess the validity and reliability of AI chatbots, including Bing, ChatGPT 3.5, Google Gemini, and Claude AI, in addressing frequently asked questions (FAQs) related to dental trauma. METHODOLOGY A set of 30 FAQs was initially formulated by collecting responses from four AI chatbots. A panel comprising expert endodontists and maxillofacial surgeons then refined these to a final selection of 20 questions. Each question was entered into each chatbot three times, generating a total of 240 responses. These responses were evaluated using the Global Quality Score (GQS) on a 5-point Likert scale (5: strongly agree; 4: agree; 3: neutral; 2: disagree; 1: strongly disagree). Any disagreements in scoring were resolved through evidence-based discussions. The validity of the responses was determined by categorizing them as valid or invalid based on two thresholds: a low threshold (scores of ≥ 4 for all three responses) and a high threshold (scores of 5 for all three responses). A chi-squared test was used to compare the validity of the responses between the chatbots. Cronbach's alpha was calculated to assess the reliability by evaluating the consistency of repeated responses from each chatbot. CONCLUSION The results indicate that the Claude AI chatbot demonstrated superior validity and reliability compared to ChatGPT and Google Gemini, whereas Bing was found to be less reliable. These findings underscore the need for authorities to establish strict guidelines to ensure the accuracy of medical information provided by AI chatbots.
Affiliation(s)
- Ashish J Johnson: All India Institute of Medical Sciences (AIIMS), Bathinda, India
- Aakash Gupta: All India Institute of Medical Sciences (AIIMS), Bathinda, India
- Hariram Sankar: All India Institute of Medical Sciences (AIIMS), Bathinda, India
- Ikroop Gill: All India Institute of Medical Sciences (AIIMS), Bathinda, India
- Madhav Shalini: All India Institute of Medical Sciences (AIIMS), Bathinda, India
- Neeraj Mohan: Maulana Azad Institute of Dental Science, New Delhi, India
19. Wu W, Guo Y, Li Q, Jia C. Exploring the potential of large language models in identifying metabolic dysfunction-associated steatotic liver disease: A comparative study of non-invasive tests and artificial intelligence-generated responses. Liver Int 2025; 45:e16112. PMID: 39526465; DOI: 10.1111/liv.16112.
Abstract
BACKGROUND AND AIMS This study sought to assess the capabilities of large language models (LLMs) in identifying clinically significant metabolic dysfunction-associated steatotic liver disease (MASLD). METHODS We included individuals from NHANES 2017-2018. The validity and reliability of MASLD diagnosis by GPT-3.5 and GPT-4 were quantitatively examined and compared with those of the Fatty Liver Index (FLI) and United States FLI (USFLI). A receiver operating characteristic curve was conducted to assess the accuracy of MASLD diagnosis via different scoring systems. Additionally, GPT-4V's potential in clinical diagnosis using ultrasound images from MASLD patients was evaluated to provide assessments of LLM capabilities in both textual and visual data interpretation. RESULTS GPT-4 demonstrated comparable performance in MASLD diagnosis to FLI and USFLI with the AUROC values of .831 (95% CI .796-.867), .817 (95% CI .797-.837) and .827 (95% CI .807-.848), respectively. GPT-4 exhibited a trend of enhanced accuracy, clinical relevance and efficiency compared to GPT-3.5 based on clinician evaluation. Additionally, Pearson's r values between GPT-4 and FLI, as well as USFLI, were .718 and .695, respectively, indicating robust and moderate correlations. Moreover, GPT-4V showed potential in understanding characteristics from hepatic ultrasound imaging but exhibited limited interpretive accuracy in diagnosing MASLD compared to skilled radiologists. CONCLUSIONS GPT-4 achieved performance comparable to traditional risk scores in diagnosing MASLD and exhibited improved convenience, versatility and the capacity to offer user-friendly outputs. The integration of GPT-4V highlights the capacities of LLMs in handling both textual and visual medical data, reinforcing their expansive utility in healthcare practice.
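Comparing GPT-4 against index-based scores such as FLI by AUROC reduces to scoring each participant and computing the area under the ROC curve against the reference diagnosis. A toy sketch with scikit-learn on synthetic labels and scores (not NHANES data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)               # 1 = MASLD by the reference standard
fli_like = 1.2 * y_true + rng.standard_normal(200)  # synthetic index-based risk score
llm_score = 1.0 * y_true + rng.standard_normal(200) # synthetic LLM-derived risk score

print("index AUROC:", round(roc_auc_score(y_true, fli_like), 3))
print("LLM AUROC:  ", round(roc_auc_score(y_true, llm_score), 3))
```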
Affiliation(s)
- Wanying Wu: Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China; Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Yuhu Guo: Faculty of Science and Engineering, The University of Manchester, Manchester, UK
- Qi Li: Department of Neurology, The First Affiliated Hospital of Hebei North University, Zhangjiakou, China
- Congzhuo Jia: Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China; Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
20. Cantao AB, Levin L. Emerging Insights in Dental Trauma: Exploring Potential Risk Factors, Innovations, and Preventive Strategies. Dent Traumatol 2025; 41:129-132. PMID: 40083261; DOI: 10.1111/edt.13053.
Affiliation(s)
- Liran Levin: University of Saskatchewan, Saskatoon, Canada
21. Mohammad-Rahimi H, Setzer FC, Aminoshariae A, Dummer PMH, Duncan HF, Nosrat A. Artificial intelligence chatbots in endodontic education-Concepts and potential applications. Int Endod J 2025. PMID: 40164964; DOI: 10.1111/iej.14231.
Abstract
The integration of artificial intelligence (AI) into education is transforming learning across various domains, including dentistry. Endodontic education can significantly benefit from AI chatbots; however, knowledge gaps regarding their potential and limitations hinder their effective utilization. This narrative review aims to: (A) explain the core functionalities of AI chatbots, including their reliance on natural language processing (NLP), machine learning (ML), and deep learning (DL); (B) explore their applications in endodontic education for personalized learning, interactive training, and clinical decision support; (C) discuss the challenges posed by technical limitations, ethical considerations, and the potential for misinformation. The review highlights that AI chatbots provide learners with immediate access to knowledge, personalized educational experiences, and tools for developing clinical reasoning through case-based learning. Educators benefit from streamlined curriculum development, automated assessment creation, and evidence-based resource integration. Despite these advantages, concerns such as chatbot hallucinations, algorithmic biases, potential for plagiarism, and the spread of misinformation require careful consideration. Analysis of current research reveals limited endodontic-specific studies, emphasizing the need for tailored chatbot solutions validated for accuracy and relevance. Successful integration will require collaborative efforts among educators, developers, and professional organizations to address challenges, ensure ethical use, and establish evaluation frameworks.
Affiliation(s)
- Hossein Mohammad-Rahimi
- Department of Dentistry and Oral Health, Aarhus University, Aarhus, Denmark
- Conservative Dentistry and Periodontology, LMU Klinikum, LMU, Munich, Germany
| | - Frank C Setzer
- Department of Endodontics, School of Dental Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Anita Aminoshariae
- Department of Endodontics, School of Dental Medicine, Case Western Reserve University, Cleveland, Ohio, USA
| | | | - Henry F Duncan
- Division of Restorative Dentistry, Dublin Dental University Hospital, Trinity College Dublin, Dublin, Ireland
| | - Ali Nosrat
- Department of Advanced Oral Sciences and Therapeutics, School of Dentistry, University of Maryland Baltimore, Baltimore, Maryland, USA
- Private Practice, Centreville Endodontics, Centreville, Virginia, USA
22
Wu Y, Zhang Y, Xu M, Jinzhi C, Xue Y, Zheng Y. Effectiveness of various general large language models in clinical consensus and case analysis in dental implantology: a comparative study. BMC Med Inform Decis Mak 2025; 25:147. [PMID: 40140812 PMCID: PMC11938642 DOI: 10.1186/s12911-025-02972-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Accepted: 03/12/2025] [Indexed: 03/28/2025] Open
Abstract
BACKGROUND This study evaluates and compares ChatGPT-4.0, Gemini Pro 1.5 (0801), Claude 3 Opus, and Qwen 2.0 72B in answering dental implant questions. The aim is to help doctors in underserved areas choose the best LLM (large language model) for their procedures, improving dental care accessibility and clinical decision-making. METHODS Two dental implant specialists with over twenty years of clinical experience evaluated the models. Questions were categorized into simple true/false, complex short-answer, and real-life case analyses. Performance was measured using precision, recall, and Bayesian inference-based evaluation metrics. RESULTS ChatGPT-4 exhibited the most stable and consistent performance on both simple and complex questions. Gemini Pro 1.5 (0801) performed well on simple questions but was less stable on complex tasks. Qwen 2.0 72B provided high-quality answers for specific cases but showed variability. Claude 3 Opus had the lowest performance across various metrics. Statistical analysis indicated significant differences between models in diagnostic performance but not in treatment planning. CONCLUSIONS ChatGPT-4 is the most reliable model for handling medical questions, followed by Gemini Pro 1.5 (0801). Qwen 2.0 72B shows potential but lacks consistency, and Claude 3 Opus performs poorly overall. Combining multiple models is recommended for comprehensive medical decision-making.
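As a brief illustration of the precision and recall metrics named in the methods, the minimal Python sketch below scores a set of hypothetical true/false implantology answers; the answer vectors are invented for demonstration and do not reproduce the study's data or its Bayesian evaluation.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical grading of 10 true/false implantology items:
# 1 = "true", 0 = "false"; ground truth from specialist consensus.
ground_truth  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_answers = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(ground_truth, model_answers)  # TP / (TP + FP)
recall = recall_score(ground_truth, model_answers)        # TP / (TP + FN)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```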
Affiliation(s)
- Yuepeng Wu
- Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Yukang Zhang
- Xianju Traditional Chinese Medicine Hospital, Taizhou, Zhejiang, China.
| | - Mei Xu
- Hangzhou Dental Hospital, West Branch, Hangzhou, Zhejiang, China
| | - Chen Jinzhi
- College of Oceanography, Hohai University, Nanjing, Jiangsu, China
| | - Yican Xue
- Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Yuchen Zheng
- Center for Plastic & Reconstructive Surgery, Department of Stomatology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China.
23
Busch F, Kaibel L, Nguyen H, Lemke T, Ziegelmayer S, Graf M, Marka AW, Endrös L, Prucker P, Spitzl D, Mergen M, Makowski MR, Bressem KK, Petzoldt S, Adams LC, Landgraf T. Evaluation of a Retrieval-Augmented Generation-Powered Chatbot for Pre-CT Informed Consent: a Prospective Comparative Study. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025:10.1007/s10278-025-01483-w. [PMID: 40119020 DOI: 10.1007/s10278-025-01483-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/06/2025] [Revised: 02/23/2025] [Accepted: 03/11/2025] [Indexed: 03/24/2025]
Abstract
This study aims to investigate the feasibility, usability, and effectiveness of a Retrieval-Augmented Generation (RAG)-powered Patient Information Assistant (PIA) chatbot for pre-CT information counseling compared to the standard physician consultation and informed consent process. This prospective comparative study included 86 patients scheduled for CT imaging between November and December 2024. Patients were randomly assigned to either the PIA group (n = 43), who received pre-CT information via the PIA chat app, or the control group (n = 43), with standard doctor-led consultation. Patient satisfaction, information clarity and comprehension, and concerns were assessed using six ten-point Likert-scale questions after information counseling with PIA or the doctor's consultation. Additionally, consultation duration was measured, and PIA group patients were asked about their preference for pre-CT consultation, while two radiologists rated each PIA chat in five categories. Both groups reported similarly high ratings for information clarity (PIA: 8.64 ± 1.69; control: 8.86 ± 1.28; p = 0.82) and overall comprehension (PIA: 8.81 ± 1.40; control: 8.93 ± 1.61; p = 0.35). However, the doctor consultation group showed greater effectiveness in alleviating patient concerns (8.30 ± 2.63 versus 6.46 ± 3.29; p = 0.003). The PIA group demonstrated significantly shorter subsequent consultation times (median: 120 s [interquartile range (IQR): 100-140] versus 195 s [IQR: 170-220]; p = 0.04). Both radiologists rated the PIA chats highly on overall quality, scientific and clinical evidence, clinical usefulness and relevance, consistency, and up-to-dateness. The RAG-powered PIA effectively provided pre-CT information while significantly reducing physician consultation time. While both methods achieved comparable patient satisfaction and comprehension, physicians were more effective at addressing worries or concerns regarding the examination.
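For readers unfamiliar with retrieval-augmented generation, the following minimal Python sketch shows the retrieval step in principle: candidate passages are ranked against the patient's question and the best matches are packed into a prompt for the language model. The passages, question, and TF-IDF retriever are illustrative assumptions only and are not the PIA system's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge base: passages from a local pre-CT patient information leaflet.
passages = [
    "Iodinated contrast can cause a brief warm sensation after injection.",
    "Tell the staff about any previous allergic reaction to contrast media.",
    "Metformin may need to be paused around a contrast-enhanced CT scan.",
]

def retrieve(question, k=2):
    """Return the k passages most similar to the question (TF-IDF cosine similarity)."""
    vectorizer = TfidfVectorizer().fit(passages + [question])
    scores = cosine_similarity(vectorizer.transform([question]),
                               vectorizer.transform(passages))[0]
    return [passages[i] for i in scores.argsort()[::-1][:k]]

question = "Do I need to stop any medication before my contrast CT?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the language model powering the assistant.
print(prompt)
```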
Affiliation(s)
- Felix Busch
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.
| | - Lukas Kaibel
- Institute for Computer Science, Free University of Berlin, Berlin, Germany
| | - Hai Nguyen
- Institute for Computer Science, Free University of Berlin, Berlin, Germany
| | - Tristan Lemke
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Sebastian Ziegelmayer
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Markus Graf
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Alexander W Marka
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Lukas Endrös
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Philipp Prucker
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Daniel Spitzl
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Markus Mergen
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Marcus R Makowski
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Keno K Bressem
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Sebastian Petzoldt
- Clinic for General, Visceral and Minimally Invasive Surgery, DRK Kliniken Berlin Köpenick, Berlin, Germany
| | - Lisa C Adams
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum Rechts Der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Tim Landgraf
- Institute for Computer Science, Free University of Berlin, Berlin, Germany.
24
Arılı Öztürk E, Turan Gökduman C, Çanakçi BC. Evaluation of the performance of ChatGPT-4 and ChatGPT-4o as a learning tool in endodontics. Int Endod J 2025. [PMID: 40025853 DOI: 10.1111/iej.14217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2024] [Revised: 02/06/2025] [Accepted: 02/18/2025] [Indexed: 03/04/2025]
Abstract
AIMS The aim of this study was to evaluate the accuracy and consistency of responses given by two different versions of Chat Generative Pre-trained Transformer (ChatGPT), ChatGPT-4, and ChatGPT-4o, to multiple-choice questions prepared from undergraduate endodontic education topics at different times of the day and on different days. METHODOLOGY In total, 60 multiple-choice, text-based questions from 6 topics of undergraduate endodontic education were prepared. Each question was asked to ChatGPT-4 and ChatGPT-4o 3 times a day (morning, noon, and evening) on 3 consecutive days. The accuracy and consistency of AIs were compared using SPSS and R programs (p < .05, 95% confidence interval). RESULTS The accuracy rate of ChatGPT-4o (92.8%) was significantly higher than that of ChatGPT-4 (81.7%; p < .001). The question groups affected the accuracy rates of both AIs (p < .001). The times at which the questions were asked did not affect the accuracy of either AI (p > .05). There was no statistically significant difference in the consistency rate between ChatGPT-4 and ChatGPT-4o (p = .123). The question groups likewise did not affect the consistency of either AI (p > .05). CONCLUSIONS According to the results of this study, the accuracy of ChatGPT-4o was better than that of ChatGPT-4. These findings demonstrate that AI chatbots can be used in dental education. However, it is also necessary to consider the limitations and potential risks associated with AI.
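The accuracy and repeat-consistency measures described above can be reproduced in outline with a few lines of Python; the repeated answers and answer key below are hypothetical stand-ins for the 60-question, nine-repetition design.

```python
# Hypothetical answers: 9 repetitions (3 times a day x 3 days) per multiple-choice question.
runs = {
    "Q1": ["A"] * 9,
    "Q2": ["B", "B", "B", "C", "B", "B", "B", "B", "B"],
    "Q3": ["D"] * 9,
}
answer_key = {"Q1": "A", "Q2": "B", "Q3": "C"}

total = sum(len(answers) for answers in runs.values())
correct = sum(ans == answer_key[q] for q, answers in runs.items() for ans in answers)
accuracy = correct / total

# A question counts as "consistent" when every repetition returned the same option.
consistency = sum(len(set(answers)) == 1 for answers in runs.values()) / len(runs)
print(f"accuracy = {accuracy:.1%}, consistency = {consistency:.1%}")
```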
Affiliation(s)
- Esra Arılı Öztürk
- Department of Endodontics, Faculty of Dentistry, Trakya University, Edirne, Turkey
| | - Ceren Turan Gökduman
- Department of Endodontics, Faculty of Dentistry, Trakya University, Edirne, Turkey
| | - Burhan Can Çanakçi
- Department of Endodontics, Faculty of Dentistry, Trakya University, Edirne, Turkey
25
Mahmoud R, Shuster A, Kleinman S, Arbel S, Ianculovici C, Peleg O. Evaluating Artificial Intelligence Chatbots in Oral and Maxillofacial Surgery Board Exams: Performance and Potential. J Oral Maxillofac Surg 2025; 83:382-389. [PMID: 39642920 DOI: 10.1016/j.joms.2024.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2024] [Revised: 11/12/2024] [Accepted: 11/12/2024] [Indexed: 12/09/2024]
Abstract
BACKGROUND While artificial intelligence has significantly impacted medicine, the application of large language models (LLMs) in oral and maxillofacial surgery (OMS) remains underexplored. PURPOSE This study aimed to measure and compare the accuracy of 4 leading LLMs on OMS board examination questions and to identify specific areas for improvement. STUDY DESIGN, SETTING, AND SAMPLE An in-silico cross-sectional study was conducted to evaluate 4 artificial intelligence chatbots on 714 OMS board examination questions. PREDICTOR VARIABLE The predictor variable was the LLM used - LLM 1 (Generative Pretrained Transformer 4o [GPT-4o], OpenAI, San Francisco, CA), LLM 2 (Generative Pretrained Transformer 3.5 [GPT-3.5], OpenAI, San Francisco, CA), LLM 3 (Gemini, Google, Mountain View, CA), and LLM 4 (Copilot, Microsoft, Redmond, WA). MAIN OUTCOME VARIABLES The primary outcome variable was accuracy, defined as the percentage of correct answers provided by each LLM. Secondary outcomes included the LLMs' ability to correct errors on subsequent attempts and their performance across 11 specific OMS subject domains: medicine and anesthesia, dentoalveolar and implant surgery, maxillofacial trauma, maxillofacial infections, maxillofacial pathology, salivary glands, oncology, maxillofacial reconstruction, temporomandibular joint anatomy and pathology, craniofacial and clefts, and orthognathic surgery. COVARIATES No additional covariates were considered. ANALYSES Statistical analysis included one-way ANOVA and post hoc Tukey honest significant difference (HSD) to compare performance across chatbots. χ2 tests were used to assess response consistency and error correction, with statistical significance set at P < .05. RESULTS LLM 1 achieved the highest accuracy with an average score of 83.69%, statistically significantly outperforming LLM 3 (66.85%, P = .002), LLM 2 (64.83%, P = .001), and LLM 4 (62.18%, P < .001). Across the 11 OMS subject domains, LLM 1 consistently had the highest accuracy rates. LLM 1 also corrected 98.2% of errors, while LLM 2 corrected 93.44%, both statistically significantly higher than LLM 4 (29.26%) and LLM 3 (70.71%) (P < .001). CONCLUSION AND RELEVANCE LLM 1 (GPT-4o) significantly outperformed other models in both accuracy and error correction, indicating its strong potential as a tool for enhancing OMS education. However, the variability in performance across different domains highlights the need for ongoing refinement and continued evaluation to integrate these LLMs more effectively into the OMS field.
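A hedged sketch of the reported statistical workflow (one-way ANOVA followed by Tukey HSD) is shown below using SciPy and statsmodels; the per-domain accuracy scores are simulated placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated per-domain accuracy (%) for four models across the 11 OMS subject domains.
rng = np.random.default_rng(0)
scores = {
    "GPT-4o":  rng.normal(84, 5, 11),
    "GPT-3.5": rng.normal(65, 5, 11),
    "Gemini":  rng.normal(67, 5, 11),
    "Copilot": rng.normal(62, 5, 11),
}

f_stat, p_value = f_oneway(*scores.values())       # one-way ANOVA across the four models
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), 11)        # group label for every observation
print(pairwise_tukeyhsd(values, groups))           # post hoc pairwise comparisons
```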
Affiliation(s)
- Reema Mahmoud
- Resident, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel.
| | - Amir Shuster
- Senior Surgeon, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel; Senior Surgeon, Department of Oral and Maxillofacial Surgery, Goldschleger School of Dental Medicine, Tel-Aviv University, Tel-Aviv, Israel
| | - Shlomi Kleinman
- Department Head, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel
| | - Shimrit Arbel
- Senior Surgeon, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel
| | - Clariel Ianculovici
- Senior Surgeon, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel
| | - Oren Peleg
- Senior Surgeon, Department of Oral and Maxillofacial Surgery, Tel-Aviv Sourasky Medical Center, Tel Aviv, Israel; Senior Surgeon, Department of Oral and Maxillofacial Surgery, Goldschleger School of Dental Medicine, Tel-Aviv University, Tel-Aviv, Israel
26
Sohrabniya F, Hassanzadeh-Samani S, Ourang SA, Jafari B, Farzinnia G, Gorjinejad F, Ghalyanchi-Langeroudi A, Mohammad-Rahimi H, Tichy A, Motamedian SR, Schwendicke F. Exploring a decade of deep learning in dentistry: A comprehensive mapping review. Clin Oral Investig 2025; 29:143. [PMID: 39969623 DOI: 10.1007/s00784-025-06216-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2024] [Accepted: 02/08/2025] [Indexed: 02/20/2025]
Abstract
OBJECTIVES Artificial Intelligence (AI), particularly deep learning, has significantly impacted healthcare, including dentistry, by improving diagnostics, treatment planning, and prognosis prediction. This systematic mapping review explores the current applications of deep learning in dentistry, offering a comprehensive overview of trends, models, and their clinical significance. MATERIALS AND METHODS Following a structured methodology, relevant studies published from January 2012 to September 2023 were identified through database searches in PubMed, Scopus, and Embase. Key data, including clinical purpose, deep learning tasks, model architectures, and data modalities, were extracted for qualitative synthesis. RESULTS From 21,242 screened studies, 1,007 were included. Of these, 63.5% targeted diagnostic tasks, primarily with convolutional neural networks (CNNs). Classification (43.7%) and segmentation (22.9%) were the main methods, and imaging data, such as cone-beam computed tomography and orthopantomograms, were used in 84.4% of cases. Most studies (95.2%) applied fully supervised learning, emphasizing the need for annotated data. Pathology (21.5%), radiology (17.5%), and orthodontics (10.2%) were prominent fields, with 24.9% of studies relating to more than one specialty. CONCLUSION This review explores the advancements in deep learning in dentistry, particularly for diagnostics, and identifies areas for further improvement. While CNNs have been used successfully, it is essential to explore emerging model architectures, learning approaches, and ways to obtain diverse and reliable data. Furthermore, fostering trust among all stakeholders by advancing explainable AI and addressing ethical considerations is crucial for transitioning AI from research to clinical practice. CLINICAL RELEVANCE This review offers a comprehensive overview of a decade of deep learning in dentistry, showcasing its significant growth in recent years. By mapping its key applications and identifying research trends, it provides a valuable guide for future studies and highlights emerging opportunities for advancing AI-driven dental care.
Affiliation(s)
- Fatemeh Sohrabniya
- ITU/WHO/WIPO Global Initiative on Artificial Intelligence for Health - Dental Diagnostics and Digital Dentistry, Geneva, Switzerland
| | - Sahel Hassanzadeh-Samani
- ITU/WHO/WIPO Global Initiative on Artificial Intelligence for Health - Dental Diagnostics and Digital Dentistry, Geneva, Switzerland
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Seyed AmirHossein Ourang
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Bahare Jafari
- Division of Orthodontics, The Ohio State University, Columbus, OH, 43210, USA
| | | | - Fatemeh Gorjinejad
- ITU/WHO/WIPO Global Initiative on Artificial Intelligence for Health - Dental Diagnostics and Digital Dentistry, Geneva, Switzerland
| | - Azadeh Ghalyanchi-Langeroudi
- Medical Physics & Biomedical Engineering Department, School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
- Research Center for Biomedical Technologies and Robotics (RCBTR), Advanced Medical Technology and Equipment Institute (AMTEI), Tehran University of Medical Science (TUMS), Tehran, Iran
| | - Hossein Mohammad-Rahimi
- Department of Dentistry and Oral Health, Aarhus University, Vennelyst Boulevard 9, Aarhus C, 8000, Aarhus, Denmark
- Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Munich, Germany
| | - Antonin Tichy
- Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Munich, Germany
- Institute of Dental Medicine, First Faculty of Medicine of the Charles University and General University Hospital, Prague, Czech Republic
| | - Saeed Reza Motamedian
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
- Department of Orthodontics, School of Dentistry, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
| | - Falk Schwendicke
- Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Munich, Germany
27
Mustuloğlu Ş, Deniz BP. Evaluation of Chatbots in the Emergency Management of Avulsion Injuries. Dent Traumatol 2025. [PMID: 39865377 DOI: 10.1111/edt.13041] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2024] [Revised: 01/06/2025] [Accepted: 01/14/2025] [Indexed: 01/28/2025]
Abstract
BACKGROUND This study assessed the accuracy and consistency of responses provided by six Artificial Intelligence (AI) applications, ChatGPT version 3.5 (OpenAI), ChatGPT version 4 (OpenAI), ChatGPT version 4.0 (OpenAI), Perplexity (Perplexity.AI), Gemini (Google), and Copilot (Bing), to questions related to emergency management of avulsed teeth. MATERIALS AND METHODS Two pediatric dentists developed 18 true/false questions regarding dental avulsion and posed them to the publicly available chatbots over 3 days. The responses were recorded and compared with the correct answers. The SPSS program was used to calculate the obtained accuracies and their consistency. RESULTS ChatGPT 4.0 achieved the highest accuracy rate of 95.6% over the entire time frame, while Perplexity (Perplexity.AI) had the lowest accuracy rate of 67.2%. ChatGPT version 4.0 (OpenAI) was the only AI that achieved perfect agreement with the correct answers, except at noon on day 1. ChatGPT version 3.5 (OpenAI) was the AI that showed the weakest agreement (6 times). CONCLUSIONS With the exception of ChatGPT's paid version, 4.0, AI chatbots do not seem ready for use as the main resource in managing avulsed teeth during emergencies. It might prove beneficial to incorporate the International Association of Dental Traumatology (IADT) guidelines in chatbot databases, enhancing their accuracy and consistency.
Affiliation(s)
- Şeyma Mustuloğlu
- Department of Paediatric Dentistry, Faculty of Dentistry, Mersin University, Mersin, Turkey
| | - Büşra Pınar Deniz
- Department of Paediatric Dentistry, Faculty of Dentistry, Mersin University, Mersin, Turkey
28
Busch F, Hoffmann L, Rueger C, van Dijk EH, Kader R, Ortiz-Prado E, Makowski MR, Saba L, Hadamitzky M, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Current applications and challenges in large language models for patient care: a systematic review. COMMUNICATIONS MEDICINE 2025; 5:26. [PMID: 39838160 PMCID: PMC11751060 DOI: 10.1038/s43856-024-00717-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2024] [Accepted: 12/17/2024] [Indexed: 01/23/2025] Open
Abstract
BACKGROUND The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care. METHODS We systematically searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4349 initial records, 89 studies across 29 medical specialties were included. Quality assessment was performed using the Mixed Methods Appraisal Tool 2018. A data-driven convergent synthesis approach was applied for thematic syntheses of LLM applications and limitations using free line-by-line coding in Dedoose. RESULTS We show that most studies investigate Generative Pre-trained Transformers (GPT)-3.5 (53.2%, n = 66 of 124 different LLMs examined) and GPT-4 (26.6%, n = 33/124) in answering medical questions, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations include 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations include 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. CONCLUSIONS This review systematically maps LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
Affiliation(s)
- Felix Busch
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.
| | - Lena Hoffmann
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
| | - Christopher Rueger
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
| | - Elon Hc van Dijk
- Department of Ophthalmology, Leiden University Medical Center, Leiden, The Netherlands
- Department of Ophthalmology, Sir Charles Gairdner Hospital, Perth, Australia
| | - Rawen Kader
- Division of Surgery and Interventional Sciences, University College London, London, United Kingdom
| | - Esteban Ortiz-Prado
- One Health Research Group, Faculty of Health Science, Universidad de Las Américas, Quito, Ecuador
| | - Marcus R Makowski
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Luca Saba
- Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy
| | - Martin Hadamitzky
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Jakob Nikolas Kather
- Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
| | - Renato Cuocolo
- Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy
| | - Lisa C Adams
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Keno K Bressem
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
29
Edalati S, Sharma S, Guda R, Vasan V, Mohamed S, Gidumal S, Govindaraj S, Iloreta AM. Assessing adult sinusitis guidelines: A comparative analysis of AAO-HNS and AI Chatbots. Am J Otolaryngol 2025; 46:104563. [PMID: 39884919 DOI: 10.1016/j.amjoto.2024.104563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2024] [Accepted: 11/28/2024] [Indexed: 02/01/2025]
Abstract
OBJECTIVE To compare chatbot responses on adult sinusitis with the guidelines offered by the American Academy of Otolaryngology-Head and Neck Surgery Foundation (AAO-HNS). METHODS ChatGPT-3.5, ChatGPT-4.0, Bard, and Llama 2 represent openly accessible large language model-based chatbots. Accuracy, over-conclusiveness, inclusion of supplemental information, and incompleteness of chatbot responses were compared against the AAO-HNS adult sinusitis clinical guidelines. RESULTS 12 guidelines comprising 30 questions from the AAO-HNS were used to assess 4 different chatbots. Adherence to AAO-HNS guidelines varied, with Llama 2 providing 80 % accurate responses, BARD 83.3 %, ChatGPT-4.0 80 %, and ChatGPT-3.5 73.3 %. Over-conclusive responses were minimal, with only one instance each from Llama 2 and ChatGPT-4.0. However, rates of incomplete responses varied, with Llama 2 exhibiting the highest at 40 %, followed by ChatGPT-3.5 at 36.7 %, ChatGPT-4.0 at 33.3 %, and BARD at 23.3 %. Fisher's Exact Test analysis revealed significant deviations from the guideline standard, with less accuracy (p = 0.012 for Llama 2, p = 0.026 for BARD, p = 0.012 for ChatGPT-4.0, p = 0.002 for ChatGPT-3.5), inclusion of supplemental data (p < 0.001 for all), and less completeness (p < 0.01 for all) across all chatbots, indicating potential areas for enhancement in their performance. CONCLUSION Although AI chatbots like Llama 2, Bard, and ChatGPT exhibit potential in sharing health-related information, their present performance in responding to clinical questions concerning adult rhinosinusitis is not up to par with recognized clinical criteria. Future revisions should focus on addressing these shortcomings and placing an emphasis on accuracy, completeness, and conformity with evidence-based practices.
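The Fisher's exact test used above can be illustrated with SciPy; the 2x2 table below (guideline-accurate versus inaccurate responses for one chatbot against the reference standard) uses invented counts for demonstration only.

```python
from scipy.stats import fisher_exact

# Invented 2x2 table: responses judged guideline-accurate vs. not, for one chatbot
# compared against the AAO-HNS reference standard over 30 questions.
#                      accurate  inaccurate
chatbot_counts   = [24, 6]
guideline_counts = [30, 0]

odds_ratio, p_value = fisher_exact([chatbot_counts, guideline_counts])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```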
Affiliation(s)
- Shaun Edalati
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| | - Shiven Sharma
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Rahul Guda
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Vikram Vasan
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Shahed Mohamed
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Sunder Gidumal
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Satish Govindaraj
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Alfred Marc Iloreta
- Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
30
Umer F, Batool I, Naved N. Innovation and application of Large Language Models (LLMs) in dentistry - a scoping review. BDJ Open 2024; 10:90. [PMID: 39617779 PMCID: PMC11609263 DOI: 10.1038/s41405-024-00277-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Revised: 11/03/2024] [Accepted: 11/04/2024] [Indexed: 01/31/2025] Open
Abstract
OBJECTIVE Large Language Models (LLMs) have revolutionized healthcare, yet their integration in dentistry remains underexplored. Therefore, this scoping review aims to systematically evaluate current literature on LLMs in dentistry. DATA SOURCES The search covered PubMed, Scopus, IEEE Xplore, and Google Scholar, with studies selected based on predefined criteria. Data were extracted to identify applications, evaluation metrics, prompting strategies, and deployment levels of LLMs in dental practice. RESULTS From 4079 records, 17 studies met the inclusion criteria. ChatGPT was the predominant model, mainly used for post-operative patient queries. Likert scale was the most reported evaluation metric, and only two studies employed advanced prompting strategies. Most studies were at level 3 of deployment, indicating practical application but requiring refinement. CONCLUSION LLMs showed extensive applicability in dental specialties; however, reliance on ChatGPT necessitates diversified assessments across multiple LLMs. Standardizing reporting practices and employing advanced prompting techniques are crucial for transparency and reproducibility, necessitating continuous efforts to optimize LLM utility and address existing challenges.
Affiliation(s)
- Fahad Umer
- Associate Professor, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
| | - Itrat Batool
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan
| | - Nighat Naved
- Resident, Operative Dentistry & Endodontics, Aga Khan University Hospital, Karachi, Pakistan.
31
de Araujo BMDM, de Jesus Freitas PF, Deliga Schroder AG, Küchler EC, Baratto-Filho F, Ditzel Westphalen VP, Carneiro E, Xavier da Silva-Neto U, de Araujo CM. PAINe: An Artificial Intelligence-based Virtual Assistant to Aid in the Differentiation of Pain of Odontogenic versus Temporomandibular Origin. J Endod 2024; 50:1761-1765.e2. [PMID: 39342988 DOI: 10.1016/j.joen.2024.09.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Revised: 09/23/2024] [Accepted: 09/23/2024] [Indexed: 10/01/2024]
Abstract
INTRODUCTION Pain associated with temporomandibular dysfunction (TMD) is often confused with odontogenic pain, which is a challenge in endodontic diagnosis. Validated screening questionnaires can aid in the identification and differentiation of the source of pain. Therefore, this study aimed to develop a virtual assistant based on artificial intelligence using natural language processing techniques to automate the initial screening of patients with tooth pain. METHODS The PAINe chatbot was developed in Python (Python Software Foundation, Beaverton, OR) language using the PyCharm (JetBrains, Prague, Czech Republic) environment and the openai library to integrate the ChatGPT 4 API (OpenAI, San Francisco, CA) and the Streamlit library (Snowflake Inc, San Francisco, CA) for interface construction. The validated TMD Pain Screener questionnaire and 1 question regarding the current pain intensity were integrated into the chatbot to perform the differential diagnosis of TMD in patients with tooth pain. The accuracy of the responses was evaluated in 50 random scenarios to compare the chatbot with the validated questionnaire. The kappa coefficient was calculated to assess the agreement level between the chatbot responses and the validated questionnaire. RESULTS The chatbot achieved an accuracy rate of 86% and a substantial level of agreement (κ = 0.70). Most responses were clear and provided adequate information about the diagnosis. CONCLUSIONS The implementation of a virtual assistant using natural language processing based on large language models for initial differential diagnosis screening of patients with tooth pain demonstrated substantial agreement between validated questionnaires and the chatbot. This approach emerges as a practical and efficient option for screening these patients.
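Agreement between the chatbot and the validated questionnaire, as quantified by the kappa coefficient, can be computed along these lines; the screening labels below are hypothetical and do not come from the study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical screening outcomes for 10 patients with tooth pain:
# 1 = screens positive for TMD-related pain, 0 = screens negative.
questionnaire = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
chatbot       = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

raw_agreement = sum(q == c for q, c in zip(questionnaire, chatbot)) / len(chatbot)
kappa = cohen_kappa_score(questionnaire, chatbot)   # chance-corrected agreement
print(f"raw agreement = {raw_agreement:.0%}, kappa = {kappa:.2f}")
```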
Affiliation(s)
| | | | | | - Erika Calvano Küchler
- Department of Orthodontics, University Hospital Bonn, Medical Faculty, Bonn, Germany
| | - Flares Baratto-Filho
- School of Dentistry, Department of Endodontics, Tuiuti University of Paraná, Curitiba, Paraná, Brazil; University of the Region of Joinville (Univille), Joinville, Santa Catarina, Brazil
| | | | - Everdan Carneiro
- School of Dentistry, Department of Endodontics, Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
| | - Ulisses Xavier da Silva-Neto
- School of Dentistry, Department of Endodontics, Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
32
Chatzopoulos GS, Koidou VP, Tsalikis L, Kaklamanos EG. Large language models in periodontology: Assessing their performance in clinically relevant questions. J Prosthet Dent 2024:S0022-3913(24)00714-5. [PMID: 39562221 DOI: 10.1016/j.prosdent.2024.10.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Revised: 10/16/2024] [Accepted: 10/18/2024] [Indexed: 11/21/2024]
Abstract
STATEMENT OF PROBLEM Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably. PURPOSE The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology. MATERIAL AND METHODS A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs. RESULTS The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, while significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance. CONCLUSIONS Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals as improper use may negatively impact patient care. ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot performed relatively well, with ChatGPT 4.0 demonstrating the highest performance.
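The Kruskal-Wallis comparison of rubric scores can be illustrated as follows; the 0-10 scores below are invented placeholders rather than the evaluators' actual ratings.

```python
from scipy.stats import kruskal

# Invented 0-10 rubric scores for each model's answers to the 10 periodontology questions.
chatgpt_40      = [9, 8, 9, 10, 8, 9, 9, 8, 10, 9]
gemini          = [6, 7, 5, 7, 6, 8, 6, 7, 6, 5]
gemini_advanced = [7, 8, 7, 8, 7, 8, 7, 7, 8, 7]
copilot         = [8, 7, 8, 8, 7, 9, 8, 7, 8, 8]

h_stat, p_value = kruskal(chatgpt_40, gemini, gemini_advanced, copilot)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```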
Affiliation(s)
- Georgios S Chatzopoulos
- PhD candidate, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; and Visiting Research Assistant Professor, Division of Periodontology, Department of Developmental and Surgical Sciences, School of Dentistry, University of Minnesota, Minneapolis, Minn.
| | - Vasiliki P Koidou
- Research Assistant, Centre for Oral Immunobiology and Regenerative Medicine and Centre for Oral Clinical Research, Institute of Dentistry, Queen Mary University London (QMUL), London, England, UK
| | - Lazaros Tsalikis
- Professor, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
| | - Eleftherios G Kaklamanos
- Associate Professor, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; Associate Professor, School of Dentistry, European University Cyprus, Nicosia, Cyprus; and Adjunct Associate Professor, Hamdan bin Mohammed College of Dental Medicine, Mohammed bin Rashid University of Medicine and Health Sciences (MBRU), Dubai, United Arab Emirates
33
Ourang SA, Sohrabniya F, Mohammad-Rahimi H, Dianat O, Aminoshariae A, Nagendrababu V, Dummer PMH, Duncan HF, Nosrat A. Artificial intelligence in endodontics: Fundamental principles, workflow, and tasks. Int Endod J 2024; 57:1546-1565. [PMID: 39056554 DOI: 10.1111/iej.14127] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2024] [Revised: 06/25/2024] [Accepted: 07/13/2024] [Indexed: 07/28/2024]
Abstract
The integration of artificial intelligence (AI) in healthcare has seen significant advancements, particularly in areas requiring image interpretation. Endodontics, a specialty within dentistry, stands to benefit immensely from AI applications, especially in interpreting radiographic images. However, there is a knowledge gap among endodontists regarding the fundamentals of machine learning and deep learning, hindering the full utilization of AI in this field. This narrative review aims to: (A) elaborate on the basic principles of machine learning and deep learning and present the basics of neural network architectures; (B) explain the workflow for developing AI solutions, from data collection through clinical integration; (C) discuss specific AI tasks and applications relevant to endodontic diagnosis and treatment. The article shows that AI offers diverse practical applications in endodontics. Computer vision methods help analyse images while natural language processing extracts insights from text. With robust validation, these techniques can enhance diagnosis, treatment planning, education, and patient care. In conclusion, AI holds significant potential to benefit endodontic research, practice, and education. Successful integration requires an evolving partnership between clinicians, computer scientists, and industry.
Affiliation(s)
- Seyed AmirHossein Ourang
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Fatemeh Sohrabniya
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
| | - Hossein Mohammad-Rahimi
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
| | - Omid Dianat
- Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, University of Maryland School of Dentistry, Baltimore, Maryland, USA
- Private Practice, Irvine Endodontics, Irvine, California, USA
| | - Anita Aminoshariae
- Department of Endodontics, School of Dental Medicine, Case Western Reserve University, Cleveland, Ohio, USA
| | | | | | - Henry F Duncan
- Division of Restorative Dentistry, Dublin Dental University Hospital, Trinity College Dublin, Dublin, Ireland
| | - Ali Nosrat
- Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, University of Maryland School of Dentistry, Baltimore, Maryland, USA
- Private Practice, Centreville Endodontics, Centreville, Virginia, USA
34
Mohammad-Rahimi H, Sohrabniya F, Ourang SA, Dianat O, Aminoshariae A, Nagendrababu V, Dummer PMH, Duncan HF, Nosrat A. Artificial intelligence in endodontics: Data preparation, clinical applications, ethical considerations, limitations, and future directions. Int Endod J 2024; 57:1566-1595. [PMID: 39075670 DOI: 10.1111/iej.14128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2024] [Revised: 07/03/2024] [Accepted: 07/16/2024] [Indexed: 07/31/2024]
Abstract
Artificial intelligence (AI) is emerging as a transformative technology in healthcare, including endodontics. A gap in knowledge exists in understanding AI's applications and limitations among endodontic experts. This comprehensive review aims to (A) elaborate on technical and ethical aspects of using data to implement AI models in endodontics; (B) elaborate on evaluation metrics; (C) review the current applications of AI in endodontics; and (D) review the limitations and barriers to real-world implementation of AI in the field of endodontics and its future potentials/directions. The article shows that AI techniques have been applied in endodontics for critical tasks such as detection of radiolucent lesions, analysis of root canal morphology, prediction of treatment outcome and post-operative pain and more. Deep learning models like convolutional neural networks demonstrate high accuracy in these applications. However, challenges remain regarding model interpretability, generalizability, and adoption into clinical practice. When thoughtfully implemented, AI has great potential to aid with diagnostics, treatment planning, clinical interventions, and education in the field of endodontics. However, concerted efforts are still needed to address limitations and to facilitate integration into clinical workflows.
Affiliation(s)
- Hossein Mohammad-Rahimi
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
| | - Fatemeh Sohrabniya
- Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany
| | - Seyed AmirHossein Ourang
- Dentofacial Deformities Research Center, Research Institute of Dental Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Omid Dianat
- Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, School of Dentistry, University of Maryland, Baltimore, Maryland, USA
- Private Practice, Irvine Endodontics, Irvine, California, USA
| | - Anita Aminoshariae
- Department of Endodontics, School of Dental Medicine, Case Western Reserve University, Cleveland, Ohio, USA
| | | | | | - Henry F Duncan
- Division of Restorative Dentistry, Dublin Dental University Hospital, Trinity College Dublin, Dublin, Ireland
| | - Ali Nosrat
- Division of Endodontics, Department of Advanced Oral Sciences and Therapeutics, School of Dentistry, University of Maryland, Baltimore, Maryland, USA
- Private Practice, Centreville Endodontics, Centreville, Virginia, USA
35
Quah B, Yong CW, Lai CWM, Islam I. Performance of large language models in oral and maxillofacial surgery examinations. Int J Oral Maxillofac Surg 2024; 53:881-886. [PMID: 38926015 DOI: 10.1016/j.ijom.2024.06.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2024] [Revised: 05/12/2024] [Accepted: 06/11/2024] [Indexed: 06/28/2024]
Abstract
This study aimed to determine the accuracy of large language models (LLMs) in answering oral and maxillofacial surgery (OMS) multiple choice questions. A total of 259 questions from the university's question bank were answered by the LLMs (GPT-3.5, GPT-4, Llama 2, Gemini, and Copilot). The scores per category as well as the total score out of 259 were recorded and evaluated, with the passing score set at 50%. The mean overall score amongst all LLMs was 62.5%. GPT-4 performed the best (76.8%, 95% confidence interval (CI) 71.4-82.2%), followed by Copilot (72.6%, 95% CI 67.2-78.0%), GPT-3.5 (62.2%, 95% CI 56.4-68.0%), Gemini (58.7%, 95% CI 52.9-64.5%), and Llama 2 (42.5%, 95% CI 37.1-48.6%). There was a statistically significant difference between the scores of the five LLMs overall (χ2 = 79.9, df = 4, P < 0.001) and within all categories except 'basic sciences' (P = 0.129), 'dentoalveolar and implant surgery' (P = 0.052), and 'oral medicine/pathology/radiology' (P = 0.801). The LLMs performed best in 'basic sciences' (68.9%) and poorest in 'pharmacology' (45.9%). The LLMs can be used as adjuncts in teaching, but should not be used for clinical decision-making until the models are further developed and validated.
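The 95% confidence intervals quoted above can be reproduced approximately from the reported percentages (back-calculated to counts out of 259 items); the Wilson interval shown here is an assumption, since the abstract does not state which interval method was used.

```python
from statsmodels.stats.proportion import proportion_confint

# Correct-answer counts back-calculated from the reported percentages (out of 259 items);
# treat them as approximations for illustration only.
results = {"GPT-4": 199, "Copilot": 188, "GPT-3.5": 161, "Gemini": 152, "Llama 2": 110}

for model, correct in results.items():
    low, high = proportion_confint(correct, 259, alpha=0.05, method="wilson")
    print(f"{model}: {correct / 259:.1%} (95% CI {low:.1%}-{high:.1%})")
```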
Affiliation(s)
- B Quah
- Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore
| | - C W Yong
- Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore
| | - C W M Lai
- Faculty of Dentistry, National University of Singapore, Singapore
| | - I Islam
- Faculty of Dentistry, National University of Singapore, Singapore; Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, Singapore.
36
Quah B, Zheng L, Sng TJH, Yong CW, Islam I. Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations. BMC MEDICAL EDUCATION 2024; 24:962. [PMID: 39227811 PMCID: PMC11373238 DOI: 10.1186/s12909-024-05881-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/04/2024] [Accepted: 08/09/2024] [Indexed: 09/05/2024]
Abstract
BACKGROUND This study aimed to answer the research question: How reliable is ChatGPT in automated essay scoring (AES) for oral and maxillofacial surgery (OMS) examinations for dental undergraduate students compared to human assessors? METHODS Sixty-nine undergraduate dental students participated in a closed-book examination comprising two essays at the National University of Singapore. Using pre-created assessment rubrics, three assessors independently performed manual essay scoring, while one separate assessor performed AES using ChatGPT (GPT-4). Data analyses were performed using the intraclass correlation coefficient and Cronbach's α to evaluate the reliability and inter-rater agreement of the test scores among all assessors. The mean scores of manual versus automated scoring were evaluated for similarity and correlations. RESULTS A strong correlation was observed for Question 1 (r = 0.752-0.848, p < 0.001) and a moderate correlation was observed between AES and all manual scorers for Question 2 (r = 0.527-0.571, p < 0.001). Intraclass correlation coefficients of 0.794-0.858 indicated excellent inter-rater agreement, and Cronbach's α of 0.881-0.932 indicated high reliability. For Question 1, the mean AES scores were similar to those for manual scoring (p > 0.05), and there was a strong correlation between AES and manual scores (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), and there was a moderate correlation between AES and manual scores (r = 0.599, p < 0.001). CONCLUSION This study shows the potential of ChatGPT for essay marking. However, an appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT has the potential to aid students in self-assessment or in large-scale automated marking processes.
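Cronbach's α for a set of essay scores can be computed directly from an essays x raters matrix, as in the sketch below; the five-essay score matrix is hypothetical and not drawn from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an essays x raters matrix of scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of raters
    rater_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of each essay's summed score
    return k / (k - 1) * (1 - rater_variances / total_variance)

# Hypothetical marks from three human assessors and ChatGPT for five essays.
essay_scores = [
    [14, 15, 13, 14],
    [10, 11, 10, 9],
    [18, 17, 18, 17],
    [12, 13, 12, 11],
    [16, 15, 16, 15],
]
print(f"Cronbach's alpha = {cronbach_alpha(essay_scores):.3f}")
```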
Affiliation(s)
- Bernadette Quah
- Faculty of Dentistry, National University of Singapore, Singapore, Singapore
- Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore
| | - Lei Zheng
- Faculty of Dentistry, National University of Singapore, Singapore, Singapore
- Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore
| | - Timothy Jie Han Sng
- Faculty of Dentistry, National University of Singapore, Singapore, Singapore
- Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore
| | - Chee Weng Yong
- Faculty of Dentistry, National University of Singapore, Singapore, Singapore
- Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore
| | - Intekhab Islam
- Faculty of Dentistry, National University of Singapore, Singapore, Singapore.
- Discipline of Oral and Maxillofacial Surgery, National University Centre for Oral Health, 9 Lower Kent Ridge Road, Singapore, Singapore.
37
Dursun D, Bilici Geçer R. Can artificial intelligence models serve as patient information consultants in orthodontics? BMC Med Inform Decis Mak 2024; 24:211. [PMID: 39075513 PMCID: PMC11285120 DOI: 10.1186/s12911-024-02619-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2024] [Accepted: 07/23/2024] [Indexed: 07/31/2024] Open
Abstract
BACKGROUND To evaluate the accuracy, reliability, quality, and readability of responses generated by ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot in relation to orthodontic clear aligners. METHODS Questions frequently asked by patients/laypersons about clear aligners on websites were identified using the Google search tool, and these questions were posed to the ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot AI models. Responses were assessed using a five-point Likert scale for accuracy, the modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and the Flesch Reading Ease Score (FRES) for readability. RESULTS ChatGPT-4 responses had the highest mean Likert score (4.5 ± 0.61), followed by Copilot (4.35 ± 0.81), ChatGPT-3.5 (4.15 ± 0.75) and Gemini (4.1 ± 0.72). The difference between the Likert scores of the chatbot models was not statistically significant (p > 0.05). Copilot had a significantly higher modified DISCERN and GQS score compared to Gemini, ChatGPT-4, and ChatGPT-3.5 (p < 0.05). Gemini's modified DISCERN and GQS score was statistically higher than ChatGPT-3.5 (p < 0.05). Gemini also had a significantly higher FRES compared to ChatGPT-4, Copilot, and ChatGPT-3.5 (p < 0.05). The mean FRES was 38.39 ± 11.56 for ChatGPT-3.5, 43.88 ± 10.13 for ChatGPT-4 and 41.72 ± 10.74 for Copilot, indicating that the responses were difficult to read according to the reading level. The mean FRES for Gemini was 54.12 ± 10.27, indicating that Gemini's responses were more readable than those of the other chatbots. CONCLUSIONS All chatbot models provided generally accurate, moderately reliable, and moderate-to-good quality answers to questions about clear aligners. However, the responses were generally difficult to read. ChatGPT, Gemini, and Copilot have significant potential as patient information tools in orthodontics; however, to be fully effective, they need to be supplemented with more evidence-based information and improved readability.
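The Flesch Reading Ease Score used above is a simple formula, 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). The sketch below applies it with a naive vowel-group syllable counter; a production analysis would typically rely on a dedicated readability library with a proper syllable dictionary, and the sample answer is invented.

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease with a naive vowel-group syllable counter."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouyAEIOUY]+", w))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

sample_answer = ("Clear aligners are removable trays that slowly move your teeth. "
                 "Wear them for about twenty-two hours a day and clean them after meals.")
print(f"FRES = {flesch_reading_ease(sample_answer):.1f}")
```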
Affiliation(s)
- Derya Dursun
- Department of Orthodontics, Hamidiye Faculty of Dentistry, University of Health Sciences, Istanbul, Turkey
| | - Rumeysa Bilici Geçer
- Department of Orthodontics, Faculty of Dentistry, Istanbul Aydin University, Istanbul, Turkey.
38
Gumilar KE, Indraprasta BR, Hsu YC, Yu ZY, Chen H, Irawan B, Tambunan Z, Wibowo BM, Nugroho H, Tjokroprawiro BA, Dachlan EG, Mulawardhana P, Rahestyningtyas E, Pramuditya H, Putra VGE, Waluyo ST, Tan NR, Folarin R, Ibrahim IH, Lin CH, Hung TY, Lu TF, Chen YF, Shih YH, Wang SJ, Huang J, Yates CC, Lu CH, Liao LN, Tan M. Disparities in medical recommendations from AI-based chatbots across different countries/regions. Sci Rep 2024; 14:17052. [PMID: 39048640 PMCID: PMC11269683 DOI: 10.1038/s41598-024-67689-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2024] [Accepted: 07/15/2024] [Indexed: 07/27/2024] Open
Abstract
This study explores disparities and opportunities in healthcare information provided by AI chatbots. We focused on recommendations for adjuvant therapy in endometrial cancer, analyzing responses across four countries/regions (Indonesia, Nigeria, Taiwan, USA) and three platforms (Bard, Bing, ChatGPT-3.5). Using previously published cases, we asked identical questions of the chatbots from each location within a 24-hour window. Responses were evaluated in a double-blinded manner for relevance, clarity, depth, focus, and coherence by ten experts in endometrial cancer. Our analysis revealed significant variation across countries/regions (p < 0.001). Interestingly, Bing's responses in Nigeria consistently outperformed those in the other regions (p < 0.05), excelling in all evaluation criteria (p < 0.001). Bard also performed better in Nigeria than in the other regions (p < 0.05), consistently surpassing them across all categories (p < 0.001, with relevance reaching p < 0.01). Notably, Bard's overall scores were significantly higher than those of ChatGPT-3.5 and Bing in all locations (p < 0.001). These findings highlight disparities and opportunities in the quality of AI-powered healthcare information depending on user location and platform, and they underscore the need for further research and development to ensure equitable access to trustworthy medical information through AI technologies.
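As an illustrative sketch only (not a reproduction of the paper's analysis), a region-level comparison of expert ratings like the one described above can be framed as a one-way ANOVA over per-response scores; the region names match the study, but the scores below are placeholder values:

```python
from scipy import stats

# Hypothetical mean expert ratings (1-5) of chatbot responses, grouped by region.
ratings = {
    "Indonesia": [3.8, 4.0, 3.6, 3.9],
    "Nigeria":   [4.6, 4.4, 4.7, 4.5],
    "Taiwan":    [4.0, 3.9, 4.1, 3.8],
    "USA":       [4.1, 4.0, 3.9, 4.2],
}

# One-way ANOVA: do mean ratings differ significantly across regions?
f_stat, p_value = stats.f_oneway(*ratings.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```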
Collapse
Affiliation(s)
- Khanisyah E Gumilar
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan.
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga-Faculty of Medicine, Universitas Airlangga, Jl. Dharmahusada Permai, Mulyorejo, Kec. Mulyorejo, Surabaya, Jawa Timur, 60115, Indonesia.
| | - Birama R Indraprasta
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Yu-Cheng Hsu
- Department of Public Health, China Medical University, No. 100, Sec. 1, Jingmao Rd, Beitun Dist, Taichung, 406040, Taiwan, ROC
- School of Chinese Medicine, China Medical University, Taichung, Taiwan
| | - Zih-Ying Yu
- Department of Public Health, China Medical University, No. 100, Sec. 1, Jingmao Rd, Beitun Dist, Taichung, 406040, Taiwan, ROC
| | - Hong Chen
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Budi Irawan
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Zulkarnain Tambunan
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Bagus M Wibowo
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Hari Nugroho
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Brahmana A Tjokroprawiro
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Erry G Dachlan
- Department of Obstetrics and Gynecology, Dr. Soetomo General Hospital-Faculty of Medicine, Universitas Airlangga, Surabaya, Indonesia
| | - Pungky Mulawardhana
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga-Faculty of Medicine, Universitas Airlangga, Jl. Dharmahusada Permai, Mulyorejo, Kec. Mulyorejo, Surabaya, Jawa Timur, 60115, Indonesia
| | - Eccita Rahestyningtyas
- Department of Obstetrics and Gynecology, Hospital of Universitas Airlangga-Faculty of Medicine, Universitas Airlangga, Jl. Dharmahusada Permai, Mulyorejo, Kec. Mulyorejo, Surabaya, Jawa Timur, 60115, Indonesia
| | - Herlangga Pramuditya
- Department of Obstetrics and Gynecology, Dr. Ramelan Naval Hospital, Surabaya, Indonesia
| | - Very Great E Putra
- Department of Obstetrics and Gynecology, Dr. Kariadi Central General Hospital, Semarang, Indonesia
| | - Setyo T Waluyo
- Department of Obstetrics and Gynecology, Ulin General Hospital, Banjarmasin, Indonesia
| | - Nathan R Tan
- Department of Modern and Classical Languages and Literature, University of South Alabama, Mobile, AL, USA
| | - Royhaan Folarin
- Department of Anatomy, Faculty of Basic Medical Sciences, Olabisi Onabanjo University, Sagamu, Nigeria
| | - Ibrahim H Ibrahim
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Cheng-Han Lin
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Tai-Yu Hung
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan
| | - Ting-Fang Lu
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, 1650 Taiwan Boulevard Sector. 4, Taichung, 40705, Taiwan, ROC
| | - Yen-Fu Chen
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, 1650 Taiwan Boulevard Sector. 4, Taichung, 40705, Taiwan, ROC
| | - Yu-Hsiang Shih
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, 1650 Taiwan Boulevard Sector. 4, Taichung, 40705, Taiwan, ROC
| | - Shao-Jing Wang
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, 1650 Taiwan Boulevard Sector. 4, Taichung, 40705, Taiwan, ROC
| | - Jingshan Huang
- School of Computing and College of Medicine, University of South Alabama, Mobile, AL, USA
| | - Clayton C Yates
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
| | - Chien-Hsing Lu
- Department of Obstetrics and Gynecology, Taichung Veteran General Hospital, 1650 Taiwan Boulevard Sector. 4, Taichung, 40705, Taiwan, ROC.
| | - Li-Na Liao
- Department of Public Health, China Medical University, No. 100, Sec. 1, Jingmao Rd, Beitun Dist, Taichung, 406040, Taiwan, ROC.
| | - Ming Tan
- Graduate Institute of Biomedical Science, China Medical University, Taichung, Taiwan.
- Institute of Biochemistry and Molecular Biology, Graduate Institute of Biomedical Sciences, China Medical University (Taiwan), No. 100, Sec. 1, Jingmao Rd, Beitun Dist, Taichung, 406040, Taiwan, ROC.
| |
Collapse
|
39
|
Batool I, Naved N, Kazmi SMR, Umer F. Leveraging Large Language Models in the delivery of post-operative dental care: a comparison between an embedded GPT model and ChatGPT. BDJ Open 2024; 10:48. [PMID: 38866751 PMCID: PMC11169374 DOI: 10.1038/s41405-024-00226-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 05/01/2024] [Accepted: 05/07/2024] [Indexed: 06/14/2024] Open
Abstract
OBJECTIVE This study underscores the transformative role of Artificial Intelligence (AI) in healthcare, particularly the promising applications of Large Language Models (LLMs) in the delivery of post-operative dental care. The aim was to evaluate the performance of an embedded GPT model and compare it with ChatGPT-3.5 Turbo. The assessment focused on response accuracy, clarity, relevance, and up-to-date knowledge in addressing patient concerns and facilitating informed decision-making. MATERIAL AND METHODS An embedded GPT model, employing GPT-3.5-16k, was built via GPT-trainer to answer post-operative questions in four dental specialties: Operative Dentistry & Endodontics, Periodontics, Oral & Maxillofacial Surgery, and Prosthodontics. The generated responses were validated by thirty-six dental experts (nine from each specialty) using a Likert scale, providing comprehensive insight into the embedded GPT model's performance relative to GPT-3.5 Turbo. For content validation, a quantitative Content Validity Index (CVI) was used; the CVI was calculated both at the item level (I-CVI) and at the scale level (S-CVI/Ave). To adjust the I-CVI for chance agreement, a modified kappa statistic (K*) was computed. RESULTS The overall content validity of responses generated by the embedded GPT model and ChatGPT was 65.62% and 61.87%, respectively. Moreover, the embedded GPT model outperformed ChatGPT, with an accuracy of 62.5% and clarity of 72.5%, whereas the ChatGPT responses achieved slightly lower scores, with an accuracy of 52.5% and clarity of 67.5%. Both models performed equally well in terms of relevance and up-to-date knowledge. CONCLUSION The embedded GPT model showed better results than ChatGPT in providing post-operative dental care, emphasizing the benefits of embedding and prompt engineering and paving the way for future advancements in healthcare applications.
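The I-CVI, S-CVI/Ave, and modified kappa mentioned above are simple ratios. The sketch below is an illustration under the common convention that an expert rating at or above a threshold counts as "relevant" (the threshold of 4 and the example ratings are assumptions, not the paper's data); K* follows the usual adjustment (I-CVI − Pc)/(1 − Pc), where Pc is the binomial probability of the observed agreement occurring by chance:

```python
from math import comb

RELEVANT_THRESHOLD = 4  # assumption: Likert ratings >= 4 count as expert agreement

def i_cvi(ratings: list[int]) -> float:
    """Item-level CVI: proportion of experts rating the item as relevant."""
    agreeing = sum(1 for r in ratings if r >= RELEVANT_THRESHOLD)
    return agreeing / len(ratings)

def modified_kappa(ratings: list[int]) -> float:
    """Modified kappa K*: I-CVI adjusted for the probability of chance agreement."""
    n = len(ratings)
    a = sum(1 for r in ratings if r >= RELEVANT_THRESHOLD)
    pc = comb(n, a) * 0.5 ** n  # chance probability of exactly 'a' agreements
    return (i_cvi(ratings) - pc) / (1 - pc)

def s_cvi_ave(all_items: list[list[int]]) -> float:
    """Scale-level CVI (averaging method): mean of the item-level CVIs."""
    return sum(i_cvi(item) for item in all_items) / len(all_items)

# Example: three hypothetical items, each rated by nine experts on a 5-point scale.
items = [
    [5, 4, 4, 5, 3, 4, 5, 4, 4],
    [4, 4, 5, 5, 4, 4, 4, 3, 5],
    [3, 3, 4, 4, 5, 4, 2, 4, 4],
]
for idx, item in enumerate(items, 1):
    print(f"Item {idx}: I-CVI = {i_cvi(item):.2f}, K* = {modified_kappa(item):.2f}")
print(f"S-CVI/Ave = {s_cvi_ave(items):.2f}")
```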
Collapse
Affiliation(s)
- Itrat Batool
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Nighat Naved
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Syed Murtaza Raza Kazmi
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Fahad Umer
- Section of Dentistry, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan.
| |
Collapse
|