1
Finch L, Broach V, Feinberg J, Al-Niaimi A, Abu-Rustum NR, Zhou Q, Iasonos A, Chi DS. ChatGPT compared to national guidelines for management of ovarian cancer: Did ChatGPT get it right? - A Memorial Sloan Kettering Cancer Center Team Ovary study. Gynecol Oncol 2024; 189:75-79. [PMID: 39042956] [PMCID: PMC11402584] [DOI: 10.1016/j.ygyno.2024.07.007]
Abstract
OBJECTIVES We evaluated the performance of a chatbot compared to the National Comprehensive Cancer Network (NCCN) Guidelines for the management of ovarian cancer. METHODS Using NCCN Guidelines, we generated 10 questions and answers regarding management of ovarian cancer at a single point in time. Questions were thematically divided into risk factors, surgical management, medical management, and surveillance. We asked ChatGPT (GPT-4) to provide responses without prompting (unprompted GPT) and with prompt engineering (prompted GPT). Responses were blinded and evaluated for accuracy and completeness by 5 gynecologic oncologists. A score of 0 was defined as inaccurate, 1 as accurate and incomplete, and 2 as accurate and complete. Evaluations were compared among NCCN, unprompted GPT, and prompted GPT answers. RESULTS Overall, 48% of responses from NCCN, 64% from unprompted GPT, and 66% from prompted GPT were accurate and complete. The percentage of accurate but incomplete responses was higher for NCCN vs GPT-4. The percentage of accurate and complete scores for questions regarding risk factors, surgical management, and surveillance was higher for GPT-4 vs NCCN; however, for questions regarding medical management, the percentage was lower for GPT-4 vs NCCN. Overall, 14% of responses from unprompted GPT, 12% from prompted GPT, and 10% from NCCN were inaccurate. CONCLUSIONS GPT-4 provided accurate and complete responses at a single point in time to a limited set of questions regarding ovarian cancer, with best performance in areas of risk factors, surgical management, and surveillance. Occasional inaccuracies, however, should limit unsupervised use of chatbots at this time.
Affiliation(s)
- Lindsey Finch
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Vance Broach
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA
- Jacqueline Feinberg
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA
- Ahmed Al-Niaimi
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA
- Nadeem R Abu-Rustum
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA
- Qin Zhou
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Alexia Iasonos
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Dennis S Chi
- Gynecology Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY, USA; Department of Obstetrics and Gynecology, Weill Cornell Medical College, New York, NY, USA.
2
Washington CJ, Abouyared M, Karanth S, Braithwaite D, Birkeland A, Silverman DA, Chen S. The Use of Chatbots in Head and Neck Mucosal Malignancy Treatment Recommendations. Otolaryngol Head Neck Surg 2024; 171:1062-1068. [PMID: 38769872] [DOI: 10.1002/ohn.818]
Abstract
OBJECTIVE As cancer patients increasingly use chatbots, it is crucial to recognize ChatGPT's potential in enhancing health literacy while ensuring validation to prevent misinformation. This study aims to assess ChatGPT-3.5's capability to provide appropriate staging and treatment recommendations for head and neck mucosal malignancies for vulnerable populations. STUDY DESIGN AND SETTING Forty distinct clinical vignettes were introduced into ChatGPT to inquire about staging and treatment recommendations for head and neck mucosal malignancies. METHODS Prompts were created based on head and neck cancer (HNC) disease descriptions (cancer location, tumor size, lymph node involvement, and symptoms). Staging and treatment recommendations according to the 2021 National Comprehensive Cancer Network (NCCN) guidelines were scored by three fellowship-trained HNC surgeons from two separate tertiary care institutions. HNC surgeons assessed the accuracy of staging and treatment recommendations, such as the completeness of surgery and the appropriateness of treatment modality. RESULTS Whereas ChatGPT's responses were 95% accurate at recommending the correct first-line treatment based on the 2021 NCCN guidelines, 55% of the responses contained inaccurate staging. Neck dissection was incorrectly omitted from treatment recommendations in 50% of the cases. Moreover, 40% of ChatGPT's treatment recommendations were deemed unnecessary. CONCLUSION This study emphasizes ChatGPT's potential in HNC patient education, aligning with NCCN guidelines for mucosal malignancies, but highlights the importance of ongoing refinement and scrutiny due to observed inaccuracies in tumor, nodal, metastasis staging, incomplete surgery options, and inappropriate treatment recommendations. Otolaryngologists can use this information to caution patients, families, and trainees regarding the use of ChatGPT for HNC education without expert guidance.
Affiliation(s)
- Caretia J Washington
- Department of Epidemiology, University of Florida College of Public Health and Health Professions and College of Medicine, Gainesville, Florida, USA
- Marianne Abouyared
- Department of Otolaryngology-Head and Neck Surgery, University of California Davis Medical Center, Sacramento, California, USA
- Shama Karanth
- Division of Cancer Control and Population Sciences, Gainesville, Florida, USA
- Department of Surgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Dejana Braithwaite
- Department of Epidemiology, University of Florida College of Public Health and Health Professions and College of Medicine, Gainesville, Florida, USA
- Division of Cancer Control and Population Sciences, Gainesville, Florida, USA
- Department of Surgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Andrew Birkeland
- Department of Otolaryngology-Head and Neck Surgery, University of California Davis Medical Center, Sacramento, California, USA
- Dustin A Silverman
- Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati, Cincinnati, Ohio, USA
- Si Chen
- Department of Otolaryngology-Head and Neck Surgery, University of Florida College of Medicine, Gainesville, Florida, USA
3
Marques de Mattos de Araujo B, Jesus Freitas PF, Deliga Schroder AG, Küchler EC, Baratto-Filho F, Ditzel Westphalen VP, Carneiro E, Xavier da Silva-Neto U, Miranda de Araujo C. PAINe - An Artificial Intelligence Based Virtual Assistant to Aid in the Differentiation of Pain of Odontogenic versus Temporomandibular Origin. J Endod 2024:S0099-2399(24)00524-7. [PMID: 39342988] [DOI: 10.1016/j.joen.2024.09.008]
Abstract
INTRODUCTION Pain associated with temporomandibular dysfunction (TMD) is often confused with odontogenic pain, which poses a challenge in endodontic diagnosis. Validated screening questionnaires can aid the identification and differentiation of the source of pain. Therefore, this study aimed to develop a virtual assistant based on artificial intelligence, using natural language processing techniques to automate the initial screening of patients with tooth pain. METHODS The PAINe chatbot was developed in Python, using the PyCharm environment, the 'openai' library to integrate the ChatGPT-4 API, and the 'streamlit' library for interface construction. The validated TMD Pain Screener questionnaire and one question about current pain intensity were integrated into the chatbot to perform the differential diagnosis of TMD in patients with tooth pain. The accuracy of the responses was evaluated in 50 random scenarios to compare the chatbot with the validated questionnaire. The Kappa coefficient was calculated to assess the agreement level between the chatbot responses and the validated questionnaire. RESULTS The chatbot achieved an accuracy rate of 86% and a substantial level of agreement (Kappa = 0.70). Most responses were clear and provided adequate information about the diagnosis. CONCLUSIONS The implementation of a virtual assistant using natural language processing, based on large language models, for the initial differential diagnosis screening of patients with tooth pain demonstrated substantial agreement between validated questionnaires and the chatbot. This approach emerges as a practical and efficient option for screening these patients.
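The abstract reports agreement between the chatbot and the validated questionnaire as a Kappa coefficient (0.70, substantial). As a minimal sketch of how such a statistic is computed for two label sequences (this is generic Cohen's kappa, not the authors' code; the labels and function name are illustrative):

```python
# Illustrative Cohen's kappa: chance-corrected agreement between two raters.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa for two equal-length label sequences (e.g., chatbot vs. questionnaire)."""
    n = len(labels_a)
    # Observed agreement: fraction of identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if both raters use a single label

# Perfect agreement yields 1.0; chance-level agreement yields 0.0.
print(cohens_kappa(["TMD", "TMD", "odontogenic", "odontogenic"],
                   ["TMD", "TMD", "odontogenic", "odontogenic"]))  # 1.0
```

Values of 0.61-0.80 are conventionally read as "substantial" agreement, which matches the reported 0.70.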
Affiliation(s)
- Erika Calvano Küchler
- Department of Orthodontics, University Hospital Bonn, Medical Faculty, Welschnonnenstr. 17, 53111, Bonn, Germany
- Flares Baratto-Filho
- School of Dentistry, Department of Endodontics, Tuiuti University of Paraná, Curitiba, PR, Brazil
- Everdan Carneiro
- School of Dentistry, Department of Endodontics, Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Ulisses Xavier da Silva-Neto
- School of Dentistry, Department of Endodontics, Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
4
Tong L, Zhang C, Liu R, Yang J, Sun Z. Comparative performance analysis of large language models: ChatGPT-3.5, ChatGPT-4 and Google Gemini in glucocorticoid-induced osteoporosis. J Orthop Surg Res 2024; 19:574. [PMID: 39289734] [PMCID: PMC11409482] [DOI: 10.1186/s13018-024-04996-2]
Abstract
BACKGROUND The use of large language models (LLMs) in medicine can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical decision-making. METHODS We collected 34 frequently asked questions about glucocorticoid-induced osteoporosis (GIOP), covering the disease's clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also generated 25 questions based on the 2022 American College of Rheumatology Guideline for the Prevention and Treatment of Glucocorticoid-Induced Osteoporosis (2022 ACR-GIOP Guideline). Each question was posed to the LLMs (ChatGPT-3.5, ChatGPT-4, and Google Gemini), and three senior orthopedic surgeons independently rated each response on a scale of 1 to 4 points. A total score (TS) > 9 indicated 'good' responses, 6 ≤ TS ≤ 9 indicated 'moderate' responses, and TS < 6 indicated 'poor' responses. RESULTS In response to the general questions related to GIOP and the 2022 ACR-GIOP Guideline, Google Gemini provided more concise answers than the other LLMs. In terms of pathogenesis, ChatGPT-4 had significantly higher total scores (TSs) than ChatGPT-3.5. The TSs for answering questions related to the 2022 ACR-GIOP Guideline were significantly higher for ChatGPT-4 than for Google Gemini. ChatGPT-3.5 and ChatGPT-4 had significantly higher self-corrected TSs than pre-corrected TSs, while Google Gemini's self-corrected responses did not differ significantly from its initial ones. CONCLUSIONS Our study showed that Google Gemini provides more concise and intuitive responses than ChatGPT-3.5 and ChatGPT-4. ChatGPT-4 performed significantly better than ChatGPT-3.5 and Google Gemini in answering general questions about GIOP and the 2022 ACR-GIOP Guideline. ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini.
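The total-score thresholds in this abstract (three raters at 1-4 points each, so totals range from 3 to 12) map directly onto a small helper. This is an illustrative sketch of the scoring rule only; the function name is ours, not the authors':

```python
def grade_response(rater_scores):
    """Classify an LLM answer from three raters' 1-4 point scores.

    Total score (TS) > 9 -> 'good'; 6 <= TS <= 9 -> 'moderate'; TS < 6 -> 'poor'.
    """
    ts = sum(rater_scores)
    if ts > 9:
        return "good"
    if ts >= 6:
        return "moderate"
    return "poor"

print(grade_response([4, 3, 4]))  # 'good' (TS = 11)
```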
Affiliation(s)
- Linjian Tong
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Chaoyang Zhang
- Department of Orthopedics, Tianjin Medical University Baodi Hospital, Tianjin, 301800, China
- Rui Liu
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Jia Yang
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China
- Zhiming Sun
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, 300070, China.
5
Ruiz Sarrias O, Martínez del Prado MP, Sala Gonzalez MÁ, Azcuna Sagarduy J, Casado Cuesta P, Figaredo Berjano C, Galve-Calvo E, López de San Vicente Hernández B, López-Santillán M, Nuño Escolástico M, Sánchez Togneri L, Sande Sardina L, Pérez Hoyos MT, Abad Villar MT, Zabalza Zudaire M, Sayar Beristain O. Leveraging Large Language Models for Precision Monitoring of Chemotherapy-Induced Toxicities: A Pilot Study with Expert Comparisons and Future Directions. Cancers (Basel) 2024; 16:2830. [PMID: 39199603] [PMCID: PMC11352281] [DOI: 10.3390/cancers16162830]
Abstract
INTRODUCTION Large Language Models (LLMs), such as the GPT model family from OpenAI, have demonstrated transformative potential across various fields, especially in medicine. These models can understand and generate contextual text, adapting to new tasks without specific training. This versatility can revolutionize clinical practices by enhancing documentation, patient interaction, and decision-making processes. In oncology, LLMs offer the potential to significantly improve patient care through the continuous monitoring of chemotherapy-induced toxicities, a task that is often unmanageable for human resources alone. However, existing research has not sufficiently explored the accuracy of LLMs in identifying and assessing subjective toxicities based on patient descriptions. This study aims to fill this gap by evaluating the ability of LLMs to accurately classify these toxicities, facilitating personalized and continuous patient care. METHODS This comparative pilot study assessed the ability of an LLM to classify subjective toxicities from chemotherapy. Thirteen oncologists evaluated 30 fictitious cases created using expert knowledge and OpenAI's GPT-4. These evaluations, based on the CTCAE v.5 criteria, were compared to those of a contextualized LLM model. Metrics such as the mode and mean of responses were used to gauge consensus. The accuracy of the LLM was analyzed in both general and specific toxicity categories, considering types of errors and false alarms. The study's results are intended to justify further research involving real patients. RESULTS The study revealed significant variability in the oncologists' evaluations due to the lack of interaction with fictitious patients. Using mean evaluations, the LLM achieved an accuracy of 85.7% in general categories and 64.6% in specific categories; of its errors, 96.4% were mild and 3.6% were severe. False alarms occurred in 3% of cases. When comparing the LLM's performance to that of the expert oncologists, individual accuracy ranged from 66.7% to 89.2% for general categories and from 57.0% to 76.0% for specific categories. The 95% confidence intervals for the median accuracy of the oncologists were 81.9% to 86.9% for general categories and 67.6% to 75.6% for specific categories. These benchmarks highlight the LLM's potential to achieve expert-level performance in classifying chemotherapy-induced toxicities. DISCUSSION The findings demonstrate that LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to that of expert oncologists. While the model's general-category performance falls within expert ranges, its specific-category accuracy requires improvement. The study's limitations include the use of fictitious cases, the lack of patient interaction, and the reliance on audio transcriptions. Nevertheless, LLMs show significant potential for enhancing patient monitoring and reducing oncologists' workload. Future research should focus on specific training of LLMs for medical tasks, studies with real patients, interactive evaluations, larger sample sizes, and robustness and generalization in diverse clinical settings. CONCLUSIONS LLMs can classify subjective toxicities from chemotherapy with accuracy comparable to that of expert oncologists. The LLM's performance in general toxicity categories is within the expert range, but there is room for improvement in specific categories. LLMs have the potential to enhance patient monitoring, enable early interventions, and reduce severe complications, improving care quality and efficiency. Future research should involve specific training of LLMs, validation with real patients, and the incorporation of interactive capabilities for real-time patient interactions. Ethical considerations, including data accuracy, transparency, and privacy, are crucial for the safe integration of LLMs into clinical practice.
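The methods mention using the mode and mean of the raters' responses to gauge consensus. A minimal sketch of that aggregation for CTCAE-style integer grades (our naming and framing, not the study's code) could look like:

```python
# Illustrative consensus summary over several raters' CTCAE grades (0-5 integers).
from statistics import mean, mode

def consensus_grade(grades):
    """Return (most common grade, rounded mean grade) across raters."""
    return mode(grades), round(mean(grades))

print(consensus_grade([2, 2, 3, 2, 1]))  # (2, 2)
```

With many raters, comparing the mode and the rounded mean is a quick check on how concentrated the panel's judgments are: when they diverge, the ratings are spread out.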
Affiliation(s)
- Oskitz Ruiz Sarrias
- Department of Mathematics and Statistics, NNBi 2020 SL, 31110 Noain, Navarra, Spain
- María Purificación Martínez del Prado
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- María Ángeles Sala Gonzalez
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Josune Azcuna Sagarduy
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Pablo Casado Cuesta
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Covadonga Figaredo Berjano
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Elena Galve-Calvo
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Borja López de San Vicente Hernández
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- María López-Santillán
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Maitane Nuño Escolástico
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Laura Sánchez Togneri
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- Laura Sande Sardina
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- María Teresa Pérez Hoyos
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
- María Teresa Abad Villar
- Medical Oncology Service, Basurto University Hospital, OSI Bilbao-Basurto, Osakidetza, 48013 Bilbao, Biscay, Spain
6
Geantă M, Bădescu D, Chirca N, Nechita OC, Radu CG, Rascu S, Rădăvoi D, Sima C, Toma C, Jinga V. The Potential Impact of Large Language Models on Doctor-Patient Communication: A Case Study in Prostate Cancer. Healthcare (Basel) 2024; 12:1548. [PMID: 39120251] [PMCID: PMC11311818] [DOI: 10.3390/healthcare12151548]
Abstract
BACKGROUND In recent years, the integration of large language models (LLMs) into healthcare has emerged as a revolutionary approach to enhancing doctor-patient communication, particularly in the management of diseases such as prostate cancer. METHODS Our paper evaluated the effectiveness of three prominent LLMs-ChatGPT (3.5), Gemini (Pro), and Co-Pilot (the free version)-against the official Romanian Patient's Guide on prostate cancer. Employing a randomized and blinded method, our study engaged eight medical professionals to assess the responses of these models based on accuracy, timeliness, comprehensiveness, and user-friendliness. RESULTS The primary objective was to explore whether LLMs, when operating in Romanian, offer comparable or superior performance to the Patient's Guide, considering their potential to personalize communication and enhance the informational accessibility for patients. Results indicated that LLMs, particularly ChatGPT, generally provided more accurate and user-friendly information compared to the Guide. CONCLUSIONS The findings suggest a significant potential for LLMs to enhance healthcare communication by providing accurate and accessible information. However, variability in performance across different models underscores the need for tailored implementation strategies. We highlight the importance of integrating LLMs with a nuanced understanding of their capabilities and limitations to optimize their use in clinical settings.
Affiliation(s)
- Marius Geantă
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Center for Innovation in Medicine, 42J Theodor Pallady Bvd., 032266 Bucharest, Romania
- United Nations University—Maastricht Economic and Social Research Institute on Innovation and Technology, Boschstraat 24, 6211 AX Maastricht, The Netherlands
- Daniel Bădescu
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Narcis Chirca
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Ovidiu Cătălin Nechita
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cosmin George Radu
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Stefan Rascu
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Daniel Rădăvoi
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cristian Sima
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cristian Toma
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Viorel Jinga
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Academy of Romanian Scientists, 3 Ilfov, 050085 Bucharest, Romania
7
Pavlovic ZJ, Jiang VS, Hariton E. Current applications of artificial intelligence in assisted reproductive technologies through the perspective of a patient's journey. Curr Opin Obstet Gynecol 2024; 36:211-217. [PMID: 38597425] [DOI: 10.1097/gco.0000000000000951]
Abstract
PURPOSE OF REVIEW This review highlights the timely relevance of artificial intelligence in enhancing assisted reproductive technologies (ARTs), particularly in-vitro fertilization (IVF). It underscores artificial intelligence's potential in revolutionizing patient outcomes and operational efficiency by addressing challenges in fertility diagnoses and procedures. RECENT FINDINGS Recent advancements in artificial intelligence, including machine learning and predictive modeling, are making significant strides in optimizing IVF processes such as medication dosing, scheduling, and embryological assessments. Innovations include artificial intelligence augmented diagnostic testing, predictive modeling for treatment outcomes, scheduling optimization, dosing and protocol selection, follicular and hormone monitoring, trigger timing, and improved embryo selection. These developments promise to refine treatment approaches, enhance patient engagement, and increase the accuracy and scalability of fertility treatments. SUMMARY The integration of artificial intelligence into reproductive medicine offers profound implications for clinical practice and research. By facilitating personalized treatment plans, standardizing procedures, and improving the efficiency of fertility clinics, artificial intelligence technologies pave the way for value-based, accessible, and efficient fertility services. Despite the promise, the full potential of artificial intelligence in ART will require ongoing validation and ethical considerations to ensure equitable and effective implementation.
Affiliation(s)
- Zoran J Pavlovic
- Department of Obstetrics and Gynecology/Reproductive Endocrinology and Infertility, University of South Florida, Morsani College of Medicine, Tampa, Florida
- Victoria S Jiang
- Division of Reproductive Endocrinology & Infertility, Vincent Department of Obstetrics and Gynecology, Massachusetts General Hospital/Harvard Medical School, Boston, Massachusetts
- Eduardo Hariton
- Reproductive Science Center of the San Francisco Bay Area, San Ramon, California, USA
8
Kneifel F, Becker F, Knipping A, Katou S, Andreou A, Juratli M, Houben P, Morgul H, Pascher A, Strücker B. ChatGPT as a Source of Information on Pancreatic Cancer. Dtsch Arztebl Int 2024; 121:505-506. [PMID: 39356560] [DOI: 10.3238/arztebl.m2024.0081]
9
Ray PP. Letter to the editor regarding "Application of the convolution neural network in determining the depth of invasion of gastrointestinal cancer: a systematic review and meta-analysis". J Gastrointest Surg 2024; 28:1218-1219. [PMID: 38703989] [DOI: 10.1016/j.gassur.2024.04.029]
Affiliation(s)
- Partha Pratim Ray
- Department of Computer Applications, Sikkim University, Gangtok, India.
10
Geantă M, Bădescu D, Chirca N, Nechita OC, Radu CG, Rascu Ș, Rădăvoi D, Sima C, Toma C, Jinga V. The Emerging Role of Large Language Models in Improving Prostate Cancer Literacy. Bioengineering (Basel) 2024; 11:654. [PMID: 39061736] [PMCID: PMC11274300] [DOI: 10.3390/bioengineering11070654]
Abstract
This study assesses the effectiveness of chatbots powered by Large Language Models (LLMs), namely ChatGPT 3.5, CoPilot, and Gemini, in delivering prostate cancer information, compared to the official Patient's Guide. Using 25 expert-validated questions, we conducted a comparative analysis to evaluate accuracy, timeliness, completeness, and understandability through a Likert scale. Statistical analyses were used to quantify the performance of each model. Results indicate that ChatGPT 3.5 consistently outperformed the other models, establishing itself as a robust and reliable source of information. CoPilot also performed effectively, albeit slightly less so than ChatGPT 3.5. Despite the strengths of the Patient's Guide, the advanced capabilities of LLMs like ChatGPT significantly enhance educational tools in healthcare. The findings underscore the need for ongoing innovation and improvement in AI applications within health sectors, especially considering the ethical implications of the forthcoming EU AI Act. Future research should focus on investigating potential biases in AI-generated responses and their impact on patient outcomes.
Affiliation(s)
- Marius Geantă
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Center for Innovation in Medicine, 42J Theodor Pallady Blvd., 032266 Bucharest, Romania
- Daniel Bădescu
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Narcis Chirca
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Ovidiu Cătălin Nechita
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cosmin George Radu
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Ștefan Rascu
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Daniel Rădăvoi
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cristian Sima
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Cristian Toma
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Department of Urology, “Prof. Dr. Th. Burghele” Clinical Hospital, 20 Panduri Str., 050659 Bucharest, Romania
- Viorel Jinga
- Department of Urology, “Carol Davila” University of Medicine and Pharmacy, 8 Eroii Sanitari Blvd., 050474 Bucharest, Romania (V.J.)
- Academy of Romanian Scientists, 3 Ilfov, 050085 Bucharest, Romania
11
Borna S, Gomez-Cabello CA, Pressman SM, Haider SA, Forte AJ. Comparative Analysis of Large Language Models in Emergency Plastic Surgery Decision-Making: The Role of Physical Exam Data. J Pers Med 2024; 14:612. [PMID: 38929832 PMCID: PMC11204584 DOI: 10.3390/jpm14060612] [Received: 05/21/2024] [Revised: 06/04/2024] [Accepted: 06/06/2024] [Indexed: 06/28/2024]
Abstract
In the U.S., diagnostic errors are common across various healthcare settings due to factors like complex procedures and multiple healthcare providers, often exacerbated by inadequate initial evaluations. This study explores the role of Large Language Models (LLMs), specifically OpenAI's ChatGPT-4 and Google Gemini, in improving emergency decision-making in plastic and reconstructive surgery by evaluating their effectiveness both with and without physical examination data. Thirty medical vignettes covering emergency conditions such as fractures and nerve injuries were used to assess the diagnostic and management responses of the models. These responses were evaluated by medical professionals against established clinical guidelines, using statistical analyses including the Wilcoxon rank-sum test. Results showed that ChatGPT-4 consistently outperformed Gemini in both diagnosis and management, irrespective of the presence of physical examination data, though no significant differences were noted within each model's performance across different data scenarios. Conclusively, while ChatGPT-4 demonstrates superior accuracy and management capabilities, the addition of physical examination data, though enhancing response detail, did not significantly surpass traditional medical resources. This underscores the utility of AI in supporting clinical decision-making, particularly in scenarios with limited data, suggesting its role as a complement to, rather than a replacement for, comprehensive clinical evaluation and expertise.
Affiliation(s)
- Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
12
Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024; 7:e2417641. [PMID: 38888919 PMCID: PMC11185976 DOI: 10.1001/jamanetworkopen.2024.17641] [Received: 02/01/2024] [Accepted: 04/18/2024] [Indexed: 06/20/2024]
Abstract
Importance Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. Objective To evaluate the accuracy and safety of LLM answers on medical oncology examination questions. Design, Setting, and Participants This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. Main Outcomes and Measures The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Results Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm. 
Conclusions and Relevance In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
Affiliation(s)
- Jack B. Longwell
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Ian Hirsch
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Fernando Binder
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Daniel Mau
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
- Raymond Jang
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Rahul G. Krishnan
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Robert C. Grant
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
13
Bajčetić M, Mirčić A, Rakočević J, Đoković D, Milutinović K, Zaletel I. Comparing the performance of artificial intelligence learning models to medical students in solving histology and embryology multiple choice questions. Ann Anat 2024; 254:152261. [PMID: 38521363 DOI: 10.1016/j.aanat.2024.152261] [Received: 12/22/2023] [Revised: 02/06/2024] [Accepted: 03/19/2024] [Indexed: 03/25/2024]
Abstract
INTRODUCTION The appearance of artificial intelligence language models (AI LMs) in the form of chatbots has gained a lot of popularity worldwide, potentially interfering with different aspects of education, including medical education. The present study aims to assess the accuracy and consistency of different AI LMs regarding the histology and embryology knowledge obtained during the 1st year of medical studies. METHODS Five different chatbots (ChatGPT, Bing AI, Bard AI, Perplexity AI, and ChatSonic) were given two sets of multiple-choice questions (MCQs). The AI LMs' test results were compared to the same test results obtained from 1st year medical students. Chatbots were instructed to use revised Bloom's taxonomy when classifying questions depending on hierarchical cognitive domains. Simultaneously, two histology teachers independently rated the questions applying the same criteria, followed by the comparison between chatbots' and teachers' question classification. The consistency of chatbots' answers was explored by giving the chatbots the same tests two months apart. RESULTS AI LMs successfully and correctly solved MCQs regarding histology and embryology material. All five chatbots showed better results than the 1st year medical students on both histology and embryology tests. Chatbots showed poor results when asked to classify the questions according to revised Bloom's cognitive taxonomy compared to teachers. There was an inverse correlation between the difficulty of questions and their correct classification by the chatbots. Retesting the chatbots after two months showed a lack of consistency concerning both MCQ answers and question classification according to revised Bloom's taxonomy learning stage. CONCLUSION Despite the ability of certain chatbots to provide correct answers to the majority of diverse and heterogeneous questions, a lack of consistency in answers over time warrants their careful use as a medical education tool.
Affiliation(s)
- Miloš Bajčetić
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Aleksandar Mirčić
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Jelena Rakočević
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Danilo Đoković
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Katarina Milutinović
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia
- Ivan Zaletel
- Institute of Histology and Embryology "Aleksandar Đ. Kostić", Faculty of Medicine, University of Belgrade, Belgrade, Serbia.
14
Xue E, Bracken-Clarke D, Iannantuono GM, Choo-Wosoba H, Gulley JL, Floudas CS. Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard. J Med Internet Res 2024; 26:e54758. [PMID: 38758582 PMCID: PMC11143389 DOI: 10.2196/54758] [Received: 11/21/2023] [Revised: 03/22/2024] [Accepted: 03/22/2024] [Indexed: 05/18/2024]
Abstract
BACKGROUND Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is of particular interest to health care providers and patients. Hematopoietic stem cell transplantation (HSCT) is a complex medical field requiring extensive knowledge, background, and training to practice successfully and can be challenging for the nonspecialist audience to comprehend. OBJECTIVE We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT. METHODS We submitted 72 open-ended HSCT-related questions of variable difficulty to the LLMs and rated their responses based on consistency (defined as replicability of the response), response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best-performing chatbots by resubmitting the most difficult questions, prompting them to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience. RESULTS ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002).
All displayed episodes of hallucinations. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide sources of information, and their responses were rated. ChatGPT-3.5 showed a better ability to adapt its language to a nonmedical audience than ChatGPT-4 (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting either out-of-date materials, incorrect URLs, or unfocused references, making their output not verifiable by the reader. CONCLUSIONS Despite LLMs' potential capability in confronting challenging medical topics such as HSCT, the presence of mistakes and the lack of clear references make them not yet appropriate for routine, unsupervised clinical use or patient counseling. Implementation of LLMs' ability to access and reference current and updated websites and research papers, as well as development of LLMs trained on specialized domain knowledge data sets, may offer potential solutions for their future clinical application.
Affiliation(s)
- Elisabetta Xue
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Hyoyoung Choo-Wosoba
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- James L Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Charalampos S Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
15
Borna S, Gomez-Cabello CA, Pressman SM, Haider SA, Sehgal A, Leibovich BC, Cole D, Forte AJ. Comparative Analysis of Artificial Intelligence Virtual Assistant and Large Language Models in Post-Operative Care. Eur J Investig Health Psychol Educ 2024; 14:1413-1424. [PMID: 38785591 PMCID: PMC11119735 DOI: 10.3390/ejihpe14050093] [Received: 04/12/2024] [Revised: 05/11/2024] [Accepted: 05/14/2024] [Indexed: 05/25/2024]
Abstract
In postoperative care, patient education and follow-up are pivotal for enhancing the quality of care and satisfaction. Artificial intelligence virtual assistants (AIVA) and large language models (LLMs) like Google BARD and ChatGPT-4 offer avenues for addressing patient queries using natural language processing (NLP) techniques. However, the accuracy and appropriateness of the information vary across these platforms, necessitating a comparative study to evaluate their efficacy in this domain. We conducted a study comparing AIVA (using Google Dialogflow) with ChatGPT-4 and Google BARD, assessing the accuracy, knowledge gap, and response appropriateness. AIVA demonstrated superior performance, with significantly higher accuracy (mean: 0.9) and lower knowledge gap (mean: 0.1) compared to BARD and ChatGPT-4. Additionally, AIVA's responses received higher Likert scores for appropriateness. Our findings suggest that specialized AI tools like AIVA are more effective in delivering precise and contextually relevant information for postoperative care compared to general-purpose LLMs. While ChatGPT-4 shows promise, its performance varies, particularly in verbal interactions. This underscores the importance of tailored AI solutions in healthcare, where accuracy and clarity are paramount. Our study highlights the necessity for further research and the development of customized AI solutions to address specific medical contexts and improve patient outcomes.
Affiliation(s)
- Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Ajai Sehgal
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Bradley C. Leibovich
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Department of Urology, Mayo Clinic, Rochester, MN 55905, USA
- Dave Cole
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
- Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
16
Saner FH, Saner YM, Abufarhaneh E, Broering DC, Raptis DA. Comparative Analysis of Artificial Intelligence (AI) Languages in Predicting Sequential Organ Failure Assessment (SOFA) Scores. Cureus 2024; 16:e59662. [PMID: 38836141 PMCID: PMC11148682 DOI: 10.7759/cureus.59662] [Accepted: 05/04/2024] [Indexed: 06/06/2024]
Abstract
PURPOSE The Sequential Organ Failure Assessment (SOFA) score plays a crucial role in intensive care units (ICUs) by providing a reliable measure of a patient's organ function or extent of failure. However, the precise assessment is time-consuming, and daily assessment in clinical practice in the ICU can be challenging. METHODS Realistic scenarios in an ICU setting were created, and the data mining precision of ChatGPT 4.0 Plus, Bard, and Perplexity AI was assessed using Spearman's correlation coefficient as well as the intraclass correlation coefficient (ICC) regarding accuracy in determining the SOFA score. RESULTS The strongest correlation was observed between the actual SOFA score and the score calculated by ChatGPT 4.0 Plus (r=0.92, p<0.001). In contrast, the correlation between the actual SOFA score and that calculated by Bard was moderate (r=0.59, p=0.070), while the correlation with Perplexity AI was substantial (r=0.89, p<0.001). The intraclass correlation coefficient of the actual SOFA scores with those of ChatGPT 4.0 Plus, Bard, and Perplexity AI was ICC=0.94. CONCLUSION Artificial intelligence (AI) tools, particularly ChatGPT 4.0 Plus, show significant promise in assisting with automated SOFA score calculations via AI data mining in ICU settings. They offer a pathway to reduce the manual workload and increase the efficiency of continuous patient monitoring and assessment. However, further development and validation are necessary to ensure accuracy and reliability in a critical care environment.
Affiliation(s)
- Fuat H Saner
- Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Centre, Riyadh, SAU
- Yasemin M Saner
- Department of Urology, Medical Center University Duisburg-Essen, Essen, DEU
- Ehab Abufarhaneh
- Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Centre, Riyadh, SAU
- Dieter C Broering
- Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Centre, Riyadh, SAU
- Dimitri A Raptis
- Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Centre, Riyadh, SAU
17
He W, Zhang W, Jin Y, Zhou Q, Zhang H, Xia Q. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res 2024; 26:e54706. [PMID: 38687566 PMCID: PMC11094593 DOI: 10.2196/54706] [Received: 11/20/2023] [Revised: 03/20/2024] [Accepted: 04/02/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND There is a dearth of feasibility assessments regarding the use of large language models (LLMs) for responding to inquiries from autistic patients within a Chinese-language context. Despite Chinese being one of the most widely spoken languages globally, the predominant research focus on applying these models in the medical field has been on English-speaking populations. OBJECTIVE This study aims to assess the effectiveness of LLM chatbots, specifically ChatGPT-4 (OpenAI) and ERNIE Bot (version 2.2.3; Baidu, Inc), one of the most advanced LLMs in China, in addressing inquiries from autistic individuals in a Chinese setting. METHODS For this study, we gathered data from DXY, a widely acknowledged web-based medical consultation platform in China with a user base of over 100 million individuals. A total of 100 patient consultation samples were rigorously selected from January 2018 to August 2023, amounting to 239 questions extracted from publicly available autism-related documents on the platform. To maintain objectivity, both the original questions and responses were anonymized and randomized. An evaluation team of 3 chief physicians assessed the responses across 4 dimensions: relevance, accuracy, usefulness, and empathy. The team completed 717 evaluations. The team initially identified the best response and then used a Likert scale with 5 response categories, each representing a distinct level of quality, to gauge the responses. Finally, we compared the responses collected from different sources. RESULTS Among the 717 evaluations conducted, 46.86% (95% CI 43.21%-50.51%) of assessors displayed varying preferences for responses from physicians, with 34.87% (95% CI 31.38%-38.36%) of assessors favoring ChatGPT and 18.27% (95% CI 15.44%-21.10%) of assessors favoring ERNIE Bot. The average relevance scores for physicians, ChatGPT, and ERNIE Bot were 3.75 (95% CI 3.69-3.82), 3.69 (95% CI 3.63-3.74), and 3.41 (95% CI 3.35-3.46), respectively.
Physicians (3.66, 95% CI 3.60-3.73) and ChatGPT (3.73, 95% CI 3.69-3.77) demonstrated higher accuracy ratings compared to ERNIE Bot (3.52, 95% CI 3.47-3.57). In terms of usefulness scores, physicians (3.54, 95% CI 3.47-3.62) received higher ratings than ChatGPT (3.40, 95% CI 3.34-3.47) and ERNIE Bot (3.05, 95% CI 2.99-3.12). Finally, concerning the empathy dimension, ChatGPT (3.64, 95% CI 3.57-3.71) outperformed physicians (3.13, 95% CI 3.04-3.21) and ERNIE Bot (3.11, 95% CI 3.04-3.18). CONCLUSIONS In this cross-sectional study, physicians' responses exhibited superiority in the present Chinese-language context. Nonetheless, LLMs can provide valuable medical guidance to autistic patients and may even surpass physicians in demonstrating empathy. However, it is crucial to acknowledge that further optimization and research are imperative prerequisites before the effective integration of LLMs in clinical settings across diverse linguistic environments can be realized. TRIAL REGISTRATION Chinese Clinical Trial Registry ChiCTR2300074655; https://www.chictr.org.cn/bin/project/edit?pid=199432.
Affiliation(s)
- Wenjie He
- Tianjin University of Traditional Chinese Medicine, Tianjin, China
- Dongguan Rehabilitation Experimental School, Dongguan, China
- Wenyan Zhang
- Lanzhou University Second Hospital, Lanzhou University, Lanzhou, China
- Ya Jin
- Dongguan Songshan Lake Central Hospital, Guangdong Medical University, Dongguan, China
- Qiang Zhou
- Dongguan Rehabilitation Experimental School, Dongguan, China
- Huadan Zhang
- Dongguan Rehabilitation Experimental School, Dongguan, China
- Qing Xia
- Tianjin University of Traditional Chinese Medicine, Tianjin, China
18
Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider C, Forte AJ. AI and Ethics: A Systematic Review of the Ethical Considerations of Large Language Model Use in Surgery Research. Healthcare (Basel) 2024; 12:825. [PMID: 38667587 PMCID: PMC11050155 DOI: 10.3390/healthcare12080825] [Received: 03/01/2024] [Revised: 04/02/2024] [Accepted: 04/09/2024] [Indexed: 04/28/2024]
Abstract
INTRODUCTION As large language models receive greater attention in medical research, the investigation of ethical considerations is warranted. This review aims to explore surgery literature to identify ethical concerns surrounding these artificial intelligence models and evaluate how autonomy, beneficence, nonmaleficence, and justice are represented within these ethical discussions to provide insights in order to guide further research and practice. METHODS A systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Five electronic databases were searched in October 2023. Eligible studies included surgery-related articles that focused on large language models and contained adequate ethical discussion. Study details, including specialty and ethical concerns, were collected. RESULTS The literature search yielded 1179 articles, with 53 meeting the inclusion criteria. Plastic surgery, orthopedic surgery, and neurosurgery were the most represented surgical specialties. Autonomy was the most explicitly cited ethical principle. The most frequently discussed ethical concern was accuracy (n = 45, 84.9%), followed by bias, patient confidentiality, and responsibility. CONCLUSION The ethical implications of using large language models in surgery are complex and evolving. The integration of these models into surgery necessitates continuous ethical discourse to ensure responsible and ethical use, balancing technological advancement with human dignity and safety.
Affiliation(s)
- Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Syed A. Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Clifton Haider
- Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
- Antonio J. Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
19
Alanezi F. Examining the role of ChatGPT in promoting health behaviors and lifestyle changes among cancer patients. Nutr Health 2024:2601060241244563. [PMID: 38567408 DOI: 10.1177/02601060241244563] [Indexed: 04/04/2024]
Abstract
Purpose: This study aims to investigate the role of ChatGPT in promoting health behavioral changes among cancer patients. Methods: A quasi-experimental design with a qualitative approach was adopted in this study, as the ChatGPT technology is novel and many people are unaware of it. The participants were outpatients at a public hospital. An experiment was carried out in which the participants used ChatGPT to seek cancer-related information for two weeks, followed by focus group (FG) discussions. A total of 72 outpatients participated in ten focus groups. Results: Three main themes with 14 sub-themes were identified, reflecting the role of ChatGPT in promoting health behavior changes. Its prominent role was observed in developing health literacy and promoting self-management of conditions through emotional, informational, and motivational support. Three challenges were identified: privacy, lack of personalization, and reliability issues. Conclusion: Although ChatGPT has huge potential in promoting health behavior changes among cancer patients, its ability is limited by several factors such as regulatory, reliability, and privacy issues. There is a need for further evidence to generalize the results across regions.
Affiliation(s)
- Fahad Alanezi
- College of Business Administration, Department of Management Information Systems, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
20
Freire Y, Santamaría Laorden A, Orejas Pérez J, Gómez Sánchez M, Díaz-Flores García V, Suárez A. ChatGPT performance in prosthodontics: Assessment of accuracy and repeatability in answer generation. J Prosthet Dent 2024; 131:659.e1-659.e6. [PMID: 38310063 DOI: 10.1016/j.prosdent.2024.01.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 01/17/2024] [Accepted: 01/18/2024] [Indexed: 02/05/2024]
Abstract
STATEMENT OF PROBLEM The artificial intelligence (AI) software program ChatGPT is based on large language models (LLMs) and is widely accessible. However, in prosthodontics, little is known about its performance in generating answers. PURPOSE The purpose of this study was to determine the performance of ChatGPT in generating answers about removable dental prostheses (RDPs) and tooth-supported fixed dental prostheses (FDPs). MATERIAL AND METHODS Thirty short questions were designed about RDPs and tooth-supported FDPs, and 30 answers were generated for each question using ChatGPT-4 in October 2023. The 900 generated answers were independently graded by experts using a 3-point Likert scale. The relative frequency and absolute percentage of answers were described. Accuracy was assessed using the Wald binomial method, while repeatability was evaluated using percentage agreement, the Brennan and Prediger coefficient, Conger's generalized Cohen kappa, Fleiss kappa, Gwet's AC, and Krippendorff's alpha. Confidence intervals were set at 95%. Statistical analysis was performed using the STATA software program. RESULTS The performance of ChatGPT in generating answers related to RDPs and tooth-supported FDPs was limited. The answers showed an accuracy of 25.6%, with a confidence interval between 22.9% and 28.6%. Repeatability ranged from substantial to moderate. CONCLUSIONS The results show that ChatGPT currently has limited ability to generate answers related to RDPs and tooth-supported FDPs. Therefore, ChatGPT cannot replace a dentist, and professionals who use it should be aware of its limitations.
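The Wald binomial method named in the abstract is a simple normal-approximation interval for a proportion. A minimal sketch follows; the counts used (230 correct of 900 graded answers) are hypothetical, since the abstract reports only percentages, so the result approximates but need not exactly match the published figures.

```python
from math import sqrt

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald (normal-approximation) confidence interval for a binomial proportion."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)  # z = 1.96 for a 95% interval
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical counts: 230 of 900 answers rated correct (about 25.6%)
low, high = wald_ci(230, 900)
```

With these assumed counts the interval comes out near 22.7% to 28.4%, close to the reported range; the study's exact counts and software may differ slightly.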
Affiliation(s)
- Yolanda Freire
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
- Andrea Santamaría Laorden
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
- Jaime Orejas Pérez
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
- Margarita Gómez Sánchez
- Assistant Professor, Vice Dean of Dentistry, Department of Pre-Clinic Dentistry and Clinical Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
- Víctor Díaz-Flores García
- Assistant Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
- Ana Suárez
- Associate Professor, Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, European University of Madrid (UEM), Madrid, Spain
21
Caglayan A, Slusarczyk W, Rabbani RD, Ghose A, Papadopoulos V, Boussios S. Large Language Models in Oncology: Revolution or Cause for Concern? Curr Oncol 2024; 31:1817-1830. [PMID: 38668040 PMCID: PMC11049602 DOI: 10.3390/curroncol31040137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 03/13/2024] [Accepted: 03/29/2024] [Indexed: 04/28/2024] Open
Abstract
The technological capability of artificial intelligence (AI) continues to advance rapidly. Recently, the release of large language models has taken the world by storm, generating both excitement and concern. As a consequence of their impressive ability and versatility, they present a potential opportunity for implementation in oncology. Areas of possible application include supporting clinical decision making, education, and cancer research. Despite the promise these novel systems offer, several limitations and barriers challenge their implementation. It is imperative that concerns such as accountability, data inaccuracy, and data protection are addressed prior to their integration in oncology. As artificial intelligence systems continue to progress, new ethical and practical dilemmas will also arise; thus, the evaluation of these limitations and concerns will be dynamic in nature. This review offers a comprehensive overview of the potential applications of large language models in oncology, as well as concerns surrounding their implementation in cancer care.
Affiliation(s)
- Aydin Caglayan
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Rukhshana Dina Rabbani
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Aruni Ghose
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Department of Medical Oncology, Barts Cancer Centre, St Bartholomew’s Hospital, Barts Health NHS Trust, London EC1A 7BE, UK
- Department of Medical Oncology, Mount Vernon Cancer Centre, East and North Hertfordshire Trust, London HA6 2RN, UK
- Health Systems and Treatment Optimisation Network, European Cancer Organisation, 1040 Brussels, Belgium
- Oncology Council, Royal Society of Medicine, London W1G 0AE, UK
- Stergios Boussios
- Department of Medical Oncology, Medway NHS Foundation Trust, Gillingham ME7 5NY, UK; (A.C.); (R.D.R.); (A.G.)
- Kent Medway Medical School, University of Kent, Canterbury CT2 7LX, UK
- Faculty of Life Sciences & Medicine, School of Cancer & Pharmaceutical Sciences, King’s College London, Strand Campus, London WC2R 2LS, UK
- Faculty of Medicine, Health, and Social Care, Canterbury Christ Church University, Canterbury CT2 7PB, UK
- AELIA Organization, 9th Km Thessaloniki—Thermi, 57001 Thessaloniki, Greece
22
Ocakoglu SR, Coskun B. The Emerging Role of AI in Patient Education: A Comparative Analysis of LLM Accuracy for Pelvic Organ Prolapse. Med Princ Pract 2024; 33:000538538. [PMID: 38527444 PMCID: PMC11324208 DOI: 10.1159/000538538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Accepted: 03/21/2024] [Indexed: 03/27/2024] Open
Abstract
OBJECTIVE This study aimed to evaluate the accuracy, completeness, precision, and readability of outputs generated by three Large Language Models (LLMs): ChatGPT by OpenAI, BARD by Google, and Bing by Microsoft, in comparison with patient education material on Pelvic Organ Prolapse (POP) provided by the Royal College of Obstetricians and Gynaecologists (RCOG). METHODS A total of 15 questions were retrieved from the RCOG website and input into the three LLMs. Two independent reviewers evaluated the outputs for accuracy, completeness, and precision. Readability was assessed using the Simplified Measure of Gobbledygook (SMOG) score and the Flesch-Kincaid Grade Level (FKGL) score. RESULTS Significant differences were observed in completeness and precision metrics. ChatGPT ranked highest in completeness (66.7%), while Bing led in precision (100%). No significant differences in accuracy were observed across the models. In terms of readability, ChatGPT's outputs were more difficult to read than those of BARD, Bing, and the original RCOG answers. CONCLUSION While all models displayed a variable degree of correctness, ChatGPT excelled in completeness, significantly surpassing BARD and Bing, although its answers were the hardest to read; Bing led in precision, providing the most relevant and concise answers. The findings highlight the potential of LLMs in health information dissemination and the need for careful interpretation of their outputs.
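The SMOG and FKGL measures used above are published readability formulas. A rough sketch follows, using a naive vowel-group syllable counter; real readability tools use pronunciation dictionaries, so scores will differ slightly from those any given software reports.

```python
import re

def syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels (minimum 1)
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)
    n_syllables = sum(syllables(w) for w in words)
    n_poly = sum(1 for w in words if syllables(w) >= 3)
    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * n_words / n_sentences + 11.8 * n_syllables / n_words - 15.59
    # Simplified Measure of Gobbledygook (SMOG)
    smog = 1.0430 * (n_poly * 30 / n_sentences) ** 0.5 + 3.1291
    return {"fkgl": round(fkgl, 2), "smog": round(smog, 2)}
```

Higher scores on either scale mean the text requires more years of schooling to read, which is how the study ranked the models' outputs against the RCOG material.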
Affiliation(s)
- Burhan Coskun
- Department of Urology, Bursa Uludag University, Bursa, Turkey
23
Weidener L, Fischer M. Artificial Intelligence in Medicine: Cross-Sectional Study Among Medical Students on Application, Education, and Ethical Aspects. JMIR MEDICAL EDUCATION 2024; 10:e51247. [PMID: 38180787 PMCID: PMC10799276 DOI: 10.2196/51247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Revised: 10/26/2023] [Accepted: 12/02/2023] [Indexed: 01/06/2024]
Abstract
BACKGROUND The use of artificial intelligence (AI) in medicine not only directly impacts the medical profession but is also increasingly associated with various potential ethical aspects. In addition, the expanding use of AI and AI-based applications such as ChatGPT demands a corresponding shift in medical education to adequately prepare future practitioners for the effective use of these tools and address the associated ethical challenges they present. OBJECTIVE This study aims to explore how medical students from Germany, Austria, and Switzerland perceive the use of AI in medicine and the teaching of AI and AI ethics in medical education in accordance with their use of AI-based chat applications, such as ChatGPT. METHODS This cross-sectional study, conducted from June 15 to July 15, 2023, surveyed medical students across Germany, Austria, and Switzerland using a web-based survey. This study aimed to assess students' perceptions of AI in medicine and the integration of AI and AI ethics into medical education. The survey, which included 53 items across 6 sections, was developed and pretested. Data analysis used descriptive statistics (median, mode, IQR, total number, and percentages) and either the chi-square or Mann-Whitney U tests, as appropriate. RESULTS Surveying 487 medical students across Germany, Austria, and Switzerland revealed limited formal education on AI or AI ethics within medical curricula, although 38.8% (189/487) had prior experience with AI-based chat applications, such as ChatGPT. Despite varied prior exposures, 71.7% (349/487) anticipated a positive impact of AI on medicine. There was widespread consensus (385/487, 74.9%) on the need for AI and AI ethics instruction in medical education, although the current offerings were deemed inadequate. Regarding the AI ethics education content, all proposed topics were rated as highly relevant. CONCLUSIONS This study revealed a pronounced discrepancy between the use of AI-based (chat) applications, such as ChatGPT, among medical students in Germany, Austria, and Switzerland and the teaching of AI in medical education. To adequately prepare future medical professionals, there is an urgent need to integrate the teaching of AI and AI ethics into the medical curricula.
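The chi-square test named in the methods can be sketched for a 2x2 table. Only the 189 vs. 298 split between students with and without chatbot experience comes from the abstract; the column breakdown below (anticipating a positive impact or not) is hypothetical, for illustration only.

```python
from math import erfc, sqrt

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1dof(stat: float) -> float:
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

# Rows: prior chatbot experience yes (189 total) / no (298 total);
# columns: anticipate positive impact yes / no (hypothetical split)
stat = chi2_2x2(150, 39, 199, 99)
p = p_value_1dof(stat)
```

A p-value below the chosen significance level would indicate that the two subgroups differ; the study used this kind of test, or the Mann-Whitney U test for ordinal responses, as appropriate to each item.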
Affiliation(s)
- Lukas Weidener
- Research Unit for Quality and Ethics in Health Care, UMIT TIROL - Private University for Health Sciences and Health Technology, Hall in Tirol, Austria
- Michael Fischer
- Research Unit for Quality and Ethics in Health Care, UMIT TIROL - Private University for Health Sciences and Health Technology, Hall in Tirol, Austria
24
Piao Y, Chen H, Wu S, Li X, Li Z, Yang D. Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context. Digit Health 2024; 10:20552076241284771. [PMID: 39386109 PMCID: PMC11462564 DOI: 10.1177/20552076241284771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2024] [Accepted: 09/03/2024] [Indexed: 10/12/2024] Open
Abstract
Purpose Large language models (LLMs) are deep learning models designed to comprehend and generate meaningful responses, and they have gained public attention in recent years. The purpose of this study is to evaluate and compare the performance of LLMs in answering questions regarding breast cancer in the Chinese context. Material and Methods ChatGPT, ERNIE Bot, and ChatGLM were chosen to answer 60 questions related to breast cancer posed by two oncologists. Responses were scored as comprehensive, correct but inadequate, mixed with correct and incorrect data, completely incorrect, or unanswered. The accuracy, length, and readability of answers from the different models were evaluated using statistical software. Results ChatGPT answered 60 questions, with 40 (66.7%) comprehensive answers and six (10.0%) correct but inadequate answers. ERNIE Bot answered 60 questions, with 34 (56.7%) comprehensive answers and seven (11.7%) correct but inadequate answers. ChatGLM generated 60 answers, with 35 (58.3%) comprehensive answers and six (10.0%) correct but inadequate answers. The differences in the chosen accuracy metrics among the three LLMs did not reach statistical significance, but only ChatGPT demonstrated a sense of human compassion. The accuracy of the three models in answering questions regarding breast cancer treatment was the lowest, with an average of 44.4%. ERNIE Bot's responses were significantly shorter than those of ChatGPT and ChatGLM (p < .001 for both). The readability scores of the three models showed no statistically significant differences. Conclusions In the Chinese context, the capabilities of ChatGPT, ERNIE Bot, and ChatGLM in answering breast cancer-related questions are currently similar. These three LLMs may serve as adjunct informational tools for breast cancer patients in the Chinese context, offering guidance for general inquiries. However, for highly specialized issues, particularly in the realm of breast cancer treatment, LLMs cannot deliver reliable performance, and it is necessary to use them under the supervision of healthcare professionals.
Affiliation(s)
- Ying Piao
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
- Hongtao Chen
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
- Shihai Wu
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
- Xianming Li
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
- Zihuang Li
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
- Dong Yang
- Department of Radiation Oncology, Shenzhen People’s Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People’s Republic of China
25
Iannantuono GM, Bracken-Clarke D, Karzai F, Choo-Wosoba H, Gulley JL, Floudas CS. Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.10.31.23297825. [PMID: 38076813 PMCID: PMC10705618 DOI: 10.1101/2023.10.31.23297825] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Background The capability of large language models (LLMs) to understand and generate human-readable text has prompted investigation of their potential as educational and management tools for cancer patients and healthcare providers. Materials and Methods We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to four domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each section). Questions were manually submitted to the LLMs, and responses were collected on June 30th, 2023. Two reviewers evaluated the answers independently. Results ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (p < 0.0001). The number of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (p < 0.0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.04). Regarding readability, the proportion of answers deemed highly readable was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (p = 0.02). Conclusion ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness in the responses was evident in all three LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
Affiliation(s)
- Giovanni Maria Iannantuono
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Dara Bracken-Clarke
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Fatima Karzai
- Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Hyoyoung Choo-Wosoba
- Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- James L. Gulley
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Charalampos S. Floudas
- Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States