1
Shabani F, Jodeiri A, Mohammad-Alizadeh-Charandabi S, Abbasalizadeh F, Tanha J, Mirghafourvand M. Developing and validating an artificial intelligence-based application for predicting some pregnancy outcomes: a multi-phase study protocol. Reprod Health 2025; 22:99. [PMID: 40481447 PMCID: PMC12144753 DOI: 10.1186/s12978-025-02048-4]
Abstract
Background Pregnancy complications such as preterm birth, low birth weight, gestational diabetes mellitus, preeclampsia, and intrauterine growth restriction significantly affect both maternal and neonatal health outcomes. Early identification of high-risk pregnancies is essential for timely interventions; however, traditional predictive models often lack accuracy. This study aims to develop and validate an AI-based application to improve risk assessment and clinical decision-making regarding pregnancy outcomes through a multi-phase approach. Methods This study comprises three phases. In Phase 1, retrospective case-control data will be collected from medical records, including Mother and Infant System (IMaN), Hospital Information System (HIS), and archived records of women who gave birth at Al-Zahra and Taleghani Educational and Medical Centers in Tabriz between 2022 and 2024. In Phase 2, an artificial intelligence model will be developed using machine learning algorithms such as Random Forest, XGBoost, Support Vector Machines (SVM), and neural networks, followed by model training, validation, and integration into a user-friendly application. Phase 3 will focus on a prospective cohort study of pregnant women attending clinics after 22 weeks of gestation, evaluating the AI model’s predictive performance through metrics like AUROC (area under the receiver operating characteristic curve), sensitivity, specificity, and predictive values, along with real-time data collection. Content validity will be determined through expert reviews. Discussion This study protocol presents a multi-phase approach to developing and validating an AI-based application for predicting pregnancy outcomes. By integrating retrospective data analysis, machine learning, and prospective validation, the study aims to improve early risk detection and maternal care. If successful, this application could support personalized obstetric decision-making. This study aims to develop and validate an artificial intelligence (AI)-based tool to predict pregnancy complications, including preterm birth, low birth weight, gestational diabetes, intrauterine growth restriction, and preeclampsia. The research will be conducted in three phases. First, past medical records from two hospitals will be analysed to identify key risk factors. Next, a machine learning model will be developed and integrated into a user-friendly application. Finally, the tool will be tested on a group of pregnant women to assess its accuracy in predicting adverse pregnancy outcomes. By leveraging AI, this study seeks to enhance early risk detection, enabling healthcare providers to implement timely preventive measures and improve maternal and neonatal health outcomes. If successful, this AI-based application could serve as a valuable resource in maternity care, assisting midwives and doctors in delivering personalized care and reducing complications. The findings could also advance the use of AI technology in obstetric practice, improving decision-making and optimizing healthcare resources.
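The protocol itself reports no analysis code, but the Phase 3 metrics it names (AUROC, sensitivity, specificity, and predictive values) follow directly from a classifier's predictions on a validation cohort. A minimal sketch with scikit-learn is shown below; `y_true` and `y_prob` are made-up placeholders, not study data, and the 0.5 threshold is illustrative only.

```python
# Minimal sketch of the Phase 3 evaluation metrics named in the protocol
# (AUROC, sensitivity, specificity, PPV, NPV), assuming a fitted binary
# classifier has produced predicted probabilities for a validation cohort.
# y_true and y_prob are hypothetical placeholders, not study data.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])                    # observed outcomes (e.g., preterm birth)
y_prob = np.array([0.1, 0.8, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7])    # model-predicted risk

auroc = roc_auc_score(y_true, y_prob)

y_pred = (y_prob >= 0.5).astype(int)                            # threshold chosen for illustration only
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
ppv = tp / (tp + fp)                # positive predictive value
npv = tn / (tn + fn)                # negative predictive value

print(f"AUROC={auroc:.2f} Se={sensitivity:.2f} Sp={specificity:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```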
Affiliation(s)
- Fatemeh Shabani
- Midwifery Department, Faculty of Nursing and Midwifery, Tabriz University of Medical Sciences, Tabriz, Iran
- Ata Jodeiri
- Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Fatemeh Abbasalizadeh
- Department of Obstetrics and Gynecology, Tabriz University of Medical Sciences, Tabriz, Iran
- Jafar Tanha
- Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
- Mojgan Mirghafourvand
- Social Determinants of Health Research Center, Faculty of Nursing and Midwifery, Tabriz University of Medical Sciences, Tabriz, Iran.
2
Birol NY, Çiftci HB, Yılmaz A, Çağlayan A, Alkan F. Is there any room for ChatGPT AI bot in speech-language pathology? Eur Arch Otorhinolaryngol 2025; 282:3267-3280. [PMID: 40025183 PMCID: PMC12122639 DOI: 10.1007/s00405-025-09295-y]
Abstract
PURPOSE This study investigates the potential of the ChatGPT-4.0 artificial intelligence bot to assist speech-language pathologists (SLPs) by assessing its accuracy, comprehensiveness, and relevance in various tasks related to speech, language, and swallowing disorders. METHOD In this cross-sectional descriptive study, 15 practicing SLPs evaluated ChatGPT-4.0's responses to task-specific queries across six core areas: report writing, assessment material generation, clinical decision support, therapy stimulus generation, therapy planning, and client/family training material generation. English prompts were created in seven areas: speech sound disorders, motor speech disorders, aphasia, stuttering, childhood language disorders, voice disorders, and swallowing disorders. These prompts were entered into ChatGPT-4.0, and its responses were evaluated. Using a three-point Likert-type scale, participants rated each response for accuracy, relevance, and comprehensiveness based on clinical expectations and their professional judgment. RESULTS The study revealed that ChatGPT-4.0 performed with predominantly high accuracy, comprehensiveness, and relevance in tasks related to speech and language disorders. High accuracy, comprehensiveness, and relevance levels were observed in report writing, clinical decision support, and creating education material. However, tasks such as creating therapy stimuli and therapy planning showed more variation with medium and high accuracy levels. CONCLUSIONS ChatGPT-4.0 shows promise in assisting SLPs with various professional tasks, particularly report writing, clinical decision support, and education material creation. However, further research is needed to address its limitations in therapy stimulus generation and therapy planning to improve its usability in clinical practice. Integrating AI technologies such as ChatGPT could improve the efficiency and effectiveness of therapeutic processes in speech-language pathology.
Affiliation(s)
- Namık Yücel Birol
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye.
- Hilal Berber Çiftci
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye
- Ayşegül Yılmaz
- Department of Speech and Language Therapy, Graduate School of Health Sciences, İstanbul Medipol University, İstanbul, Türkiye
- Ayhan Çağlayan
- Çağlayan Speech and Language Therapy Center, İzmir, Türkiye
- Ferhat Alkan
- Department of Speech and Language Therapy, Institute of Graduate Education, İstinye University, İstanbul, Türkiye
3
Du Y, Ji C, Xu J, Wei M, Ren Y, Xia S, Zhou J. Performance of ChatGPT and Microsoft Copilot in Bing in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports. Sci Rep 2025; 15:14627. [PMID: 40287483 PMCID: PMC12033324 DOI: 10.1038/s41598-025-99268-2]
Abstract
To evaluate and compare the performance of publicly available ChatGPT-3.5, ChatGPT-4.0 and Microsoft Copilot in Bing (Copilot) in answering obstetric ultrasound questions and analyzing obstetric ultrasound reports. Twenty questions related to obstetric ultrasound were answered and 110 obstetric ultrasound reports were analyzed by ChatGPT-3.5, ChatGPT-4.0 and Copilot, with each question and report posed to each model three times at different times. The accuracy and consistency of each response to the twenty questions and of each report analysis were evaluated and compared. In answering the twenty questions, both ChatGPT-3.5 and ChatGPT-4.0 outperformed Copilot in accuracy (95.0% vs. 80.0%) and consistency (90.0% and 85.0% vs. 75.0%), although no statistically significant difference was found among them. When analyzing obstetric ultrasound reports, ChatGPT-3.5 and ChatGPT-4.0 demonstrated superior accuracy compared with Copilot (P < 0.05), and all three showed high consistency and the ability to provide recommendations. Overall, ChatGPT-3.5, ChatGPT-4.0, and Copilot achieved accuracies of 83.86%, 84.13%, and 77.51%, and consistencies of 87.30%, 93.65%, and 90.48%, respectively. These large language models (ChatGPT-3.5, ChatGPT-4.0 and Copilot) have the potential to assist clinical workflows by enhancing patient education and patient-clinician communication around common obstetric ultrasound issues. Given their inconsistent and sometimes inaccurate responses, along with cybersecurity concerns, physician supervision remains crucial when using these models.
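The abstract does not state exactly how accuracy and consistency were scored across the three repeated prompts; one plausible reading, sketched below on invented data, counts a question as consistent when all three attempts receive the same judgment and as accurate by majority vote, with a chi-square test for the between-model accuracy comparison.

```python
# Illustrative scoring of repeated LLM responses: a question is "consistent"
# if all three ratings agree, and "accurate" if the majority rating is correct.
# The 2x2 accuracy comparison between two models uses a chi-square test.
# All data below are made up for demonstration.
import numpy as np
from scipy.stats import chi2_contingency

# ratings[model][question] = list of 3 booleans (correct / incorrect per attempt)
ratings = {
    "ChatGPT-4.0": [[True, True, True], [True, False, True], [True, True, True]],
    "Copilot":     [[True, True, False], [False, False, False], [True, True, True]],
}

def summarize(per_question):
    consistent = sum(len(set(r)) == 1 for r in per_question)
    accurate = sum(sum(r) >= 2 for r in per_question)          # majority vote over 3 attempts
    return accurate, consistent, len(per_question)

acc_a, _, n_a = summarize(ratings["ChatGPT-4.0"])
acc_b, _, n_b = summarize(ratings["Copilot"])

table = np.array([[acc_a, n_a - acc_a], [acc_b, n_b - acc_b]])
chi2, p, _, _ = chi2_contingency(table)
print(f"accuracy {acc_a}/{n_a} vs {acc_b}/{n_b}, chi2={chi2:.2f}, p={p:.3f}")
```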
Affiliation(s)
- Yanran Du
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Chao Ji
- Department of Pediatrics, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Jiale Xu
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Minyan Wei
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China
- Yunyun Ren
- Obstetrics and Gynecology Hospital of Fudan University, No.128, Shenyang Road, Shanghai, 200090, China.
- Shujun Xia
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China.
- JianQiao Zhou
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197, Rui Jin 2nd Road, Shanghai, 200025, China.
4
Liu R, Liu J, Yang J, Sun Z, Yan H. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet Disord 2025; 26:369. [PMID: 40241048 PMCID: PMC12001388 DOI: 10.1186/s12891-025-08601-3]
Abstract
BACKGROUND Osteoporosis is a sex-specific disease. Postmenopausal osteoporosis (PMOP) has been the focus of public health research worldwide. The purpose of this study was to evaluate the quality and readability of the responses generated by the artificial intelligence large language models (AI-LLMs) ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced to questions related to PMOP. METHODS We collected 48 PMOP frequently asked questions (FAQs) through offline counseling and online medical community forums. We also prepared 24 specific questions about PMOP based on the Management of Postmenopausal Osteoporosis: 2022 ACOG Clinical Practice Guideline No. 2 (2022 ACOG-PMOP Guideline). The FAQs were imported into the AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, Gemini Advanced), and the responses were randomly assigned to four professional orthopedic surgeons, who independently rated their satisfaction with each response on a 5-point Likert scale. Furthermore, a Flesch Reading Ease (FRE) score was calculated for each of the LLMs' responses to assess the readability of the text generated by each LLM. RESULTS When addressing questions related to PMOP and the 2022 ACOG-PMOP guidelines, ChatGPT-4o and Gemini Advanced provided more concise answers than ChatGPT-4o mini. For the overall PMOP FAQs, ChatGPT-4o had a significantly higher accuracy rate than ChatGPT-4o mini and Gemini Advanced. When answering questions related to the 2022 ACOG-PMOP guidelines, ChatGPT-4o mini and ChatGPT-4o had significantly higher response accuracy than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all showed good levels of self-correction. CONCLUSIONS Our research shows that Gemini Advanced and ChatGPT-4o provide more concise and intuitive answers. ChatGPT-4o performed better in answering frequently asked questions related to PMOP. When answering questions related to the 2022 ACOG-PMOP guidelines, ChatGPT-4o mini and ChatGPT-4o responded significantly better than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced demonstrated a strong ability to self-correct. CLINICAL TRIAL NUMBER Not applicable.
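The Flesch Reading Ease metric used in this study is a fixed formula, FRE = 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/words). The sketch below computes it with a crude vowel-group syllable heuristic purely to illustrate the metric; it is not the authors' scoring pipeline, and dedicated libraries implement more careful syllable counting.

```python
# Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
# The vowel-group syllable counter below is a rough heuristic for illustration only.
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables (crude approximation).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

sample = "Bisphosphonates are first-line therapy for postmenopausal osteoporosis. Take them on an empty stomach."
print(round(flesch_reading_ease(sample), 1))
```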
Affiliation(s)
- Rui Liu
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Jian Liu
- College of Computer Science, Nankai University, Tianjin, 300350, China
- Jia Yang
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Zhiming Sun
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
- Hua Yan
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
5
Zhou X, Chen Y, Abdulghani EA, Zhang X, Zheng W, Li Y. Performance in answering orthodontic patients' frequently asked questions: Conversational artificial intelligence versus orthodontists. J World Fed Orthod 2025:S2212-4438(25)00012-8. [PMID: 40140287 DOI: 10.1016/j.ejwf.2025.02.001]
Abstract
OBJECTIVES Can conversational artificial intelligence (AI) help alleviate orthodontic patients' general doubts? This study aimed to investigate the performance of conversational AI in answering frequently asked questions (FAQs) from orthodontic patients, with comparison to orthodontists. MATERIALS AND METHODS Thirty FAQs were selected covering the pre-, during-, and postorthodontic treatment stages. Each question was answered separately by AI (Chat Generative Pretrained Transformer [ChatGPT]-4) and by two orthodontists (Ortho. A and Ortho. B) randomly drawn from a panel. Their responses to the 30 FAQs were ranked by four raters, randomly selected from another panel of orthodontists, resulting in 120 rankings. All the participants were Chinese, and all the questions and answers were conducted in Chinese. RESULTS Among the 120 rankings, ChatGPT was ranked first in 61 instances (50.8%), second in 35 instances (29.2%), and third in 24 instances (20.0%). Furthermore, the mean rank of ChatGPT was 1.69 ± 0.79, significantly better than that of Ortho. A (2.23 ± 0.79, P < 0.001) and Ortho. B (2.08 ± 0.79, P < 0.05). No significant difference was found between the two orthodontist groups. Additionally, the Spearman correlation coefficient between the average ranking of ChatGPT and the inter-rater agreement was 0.69 (P < 0.001), indicating a strong positive correlation between the two variables. CONCLUSIONS Overall, the conversational AI ChatGPT-4 may outperform orthodontists in addressing orthodontic patients' FAQs, even in a non-English language. In addition, ChatGPT tends to perform better when responding to questions with answers widely accepted among orthodontic professionals, and vice versa.
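As a rough illustration of the reported statistics, the sketch below computes the mean rank assigned to ChatGPT and a Spearman correlation between its per-question average rank and a simple inter-rater agreement measure; the ranking matrix is fabricated and the agreement definition is an assumption, not the study's.

```python
# rank_matrix[question][rater] = rank (1-3) assigned to ChatGPT's answer.
# Mean rank per question is correlated (Spearman) with a simple agreement
# measure: the share of raters giving the modal rank. Data are fabricated.
import numpy as np
from scipy.stats import spearmanr

rank_matrix = np.array([
    [1, 1, 1, 2],
    [1, 2, 1, 1],
    [3, 2, 3, 3],
    [2, 2, 2, 2],
    [1, 3, 2, 1],
])

mean_rank = rank_matrix.mean(axis=1)
agreement = np.array([np.max(np.bincount(row)) / len(row) for row in rank_matrix])

rho, p = spearmanr(mean_rank, agreement)
print(f"overall mean rank={rank_matrix.mean():.2f}, Spearman rho={rho:.2f} (p={p:.3f})")
```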
Affiliation(s)
- Xinlianyi Zhou
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Yao Chen
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Ehab A Abdulghani
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China; Department of Orthodontics and Dentofacial Orthopedics, College of Dentistry, Thamar University, Dhamar, Yemen
- Xu Zhang
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China
- Wei Zheng
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China.
- Yu Li
- State Key Laboratory of Oral Diseases, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Department of Orthodontics, West China Hospital of Stomatology, Sichuan University, Chengdu, China.
6
Chen R, Zeng D, Li Y, Huang R, Sun D, Li T. Evaluating the performance and clinical decision-making impact of ChatGPT-4 in reproductive medicine. Int J Gynaecol Obstet 2025; 168:1285-1291. [PMID: 39526823 DOI: 10.1002/ijgo.15959]
Abstract
BACKGROUND ChatGPT, a sophisticated language model developed by OpenAI, has the potential to offer professional and patient-friendly support. We aimed to assess the accuracy and reproducibility of ChatGPT-4 in answering questions related to knowledge, management, and support within the field of reproductive medicine. METHODS ChatGPT-4 was used to respond to queries sourced from a domestic attending physician examination database, as well as to address both local and international treatment guidelines within the field of reproductive medicine. Each response generated by ChatGPT-4 was independently evaluated by a trio of experts specializing in reproductive medicine. The experts used four qualitative measures-relevance, accuracy, completeness, and understandability-to assess each response. RESULTS We found that ChatGPT-4 demonstrated extensive knowledge in reproductive medicine, with median scores for relevance, accuracy, completeness, and comprehensibility of objective questions being 4, 3.5, 3, and 3, respectively. However, the composite accuracy rate for multiple-choice questions was 63.38%. Significant discrepancies were observed among the three experts' scores across all four measures. Expert 1 generally provided higher and more consistent scores, while Expert 3 awarded lower scores for accuracy. ChatGPT-4's responses to both domestic and international guidelines showed varying levels of understanding, with a lack of knowledge on regional guideline variations. However, it offered practical and multifaceted advice regarding next steps and adjusting to new guidelines. CONCLUSIONS We analyzed the strengths and limitations of ChatGPT-4's responses on the management of reproductive medicine and relevant support. ChatGPT-4 might serve as a supplementary informational tool for patients and physicians to improve outcomes in the field of reproductive medicine.
Affiliation(s)
- Rouzhu Chen
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Danling Zeng
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Yi Li
- Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
- Rui Huang
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Dejuan Sun
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
- Tingting Li
- The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
7
Nieves-Lopez B, Bechtle AR, Traverse J, Klifto C, Schoch BS, Aziz KT. Evaluating the Evolution of ChatGPT as an Information Resource in Shoulder and Elbow Surgery. Orthopedics 2025; 48:e69-e74. [PMID: 39879624 DOI: 10.3928/01477447-20250123-03]
Abstract
BACKGROUND The purpose of this study was to evaluate the performance and evolution of Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI) as a resource for shoulder and elbow surgery information by assessing its accuracy on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. We hypothesized that both ChatGPT models would demonstrate proficiency and that there would be significant improvement with progressive iterations. MATERIALS AND METHODS A total of 200 questions were selected from the 2019 and 2021 American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. ChatGPT 3.5 and 4 were used to evaluate all questions. Questions with non-text data were excluded (114 questions). Remaining questions were input into ChatGPT and categorized as follows: anatomy, arthroplasty, basic science, instability, miscellaneous, nonoperative, and trauma. ChatGPT's performances were quantified and compared across categories with chi-square tests. The continuing medical education credit threshold of 50% was used to determine proficiency. Statistical significance was set at P<.05. RESULTS ChatGPT 3.5 and 4 answered 52.3% and 73.3% of the questions correctly, respectively (P=.003). ChatGPT 3.5 performed significantly better in the instability category (P=.037). ChatGPT 4's performance did not significantly differ across categories (P=.841). ChatGPT 4 performed significantly better than ChatGPT 3.5 in all categories except instability and miscellaneous. CONCLUSION ChatGPT 3.5 and 4 exceeded the proficiency threshold. ChatGPT 4 performed better than ChatGPT 3.5, showing an increased capability to correctly answer shoulder and elbow-focused questions. Further refinement of ChatGPT's training may improve its performance and utility as a resource. Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making. [Orthopedics. 2025;48(2):e69-e74.].
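The category-level analysis described here (correct versus incorrect counts per question category compared with chi-square tests) can be reproduced on a contingency table, as in the hedged sketch below; the counts and category labels are invented for illustration.

```python
# Chi-square test of independence between question category and correctness,
# mirroring the kind of comparison described in the abstract. Counts are invented.
import numpy as np
from scipy.stats import chi2_contingency

categories = ["anatomy", "arthroplasty", "instability", "trauma"]
correct   = np.array([10, 18, 12, 9])
incorrect = np.array([5, 9, 4, 8])

chi2, p, dof, expected = chi2_contingency(np.vstack([correct, incorrect]))
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")

# Proficiency check against the 50% continuing-education threshold:
overall = correct.sum() / (correct.sum() + incorrect.sum())
print(f"overall accuracy={overall:.1%}, proficient={overall >= 0.5}")
```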
8
Guo S, Li R, Li G, Chen W, Huang J, He L, Ma Y, Wang L, Zheng H, Tian C, Zhao Y, Pan X, Wan H, Liu D, Li Z, Lei J. Comparing ChatGPT's and Surgeon's Responses to Thyroid-related Questions From Patients. J Clin Endocrinol Metab 2025; 110:e841-e850. [PMID: 38597169 DOI: 10.1210/clinem/dgae235]
Abstract
CONTEXT For some common thyroid-related conditions with high prevalence and long follow-up times, ChatGPT can be used to respond to common thyroid-related questions. OBJECTIVE In this cross-sectional study, we assessed the ability of ChatGPT (version GPT-4.0) to provide accurate, comprehensive, compassionate, and satisfactory responses to common thyroid-related questions. METHODS First, we obtained 28 thyroid-related questions from the Huayitong app, which, together with 2 interfering questions, formed a final set of 30 questions. These questions were then answered separately by ChatGPT (on July 19, 2023) and by a junior specialist and a senior specialist (on July 20, 2023). Finally, 26 patients and 11 thyroid surgeons evaluated those responses on 4 dimensions: accuracy, comprehensiveness, compassion, and satisfaction. RESULTS Among the 30 questions and responses, ChatGPT's speed of response was faster than that of the junior specialist (8.69 [7.53-9.48] vs 4.33 [4.05-4.60]; P < .001) and the senior specialist (8.69 [7.53-9.48] vs 4.22 [3.36-4.76]; P < .001). The word count of ChatGPT's responses was greater than that of both the junior specialist (341.50 [301.00-384.25] vs 74.50 [51.75-84.75]; P < .001) and the senior specialist (341.50 [301.00-384.25] vs 104.00 [63.75-177.75]; P < .001). ChatGPT received higher scores than the junior specialist and senior specialist in terms of accuracy, comprehensiveness, compassion, and satisfaction in responding to common thyroid-related questions. CONCLUSION ChatGPT performed better than a junior specialist and a senior specialist in answering common thyroid-related questions, but further research is needed to validate its logical ability on complex thyroid questions.
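The abstract reports response speed and word count as medians with interquartile ranges; such pairwise comparisons are commonly made with a nonparametric test. The sketch below applies a Mann-Whitney U test to fabricated word counts as one plausible approach, not the authors' actual analysis.

```python
# Nonparametric comparison of response word counts between ChatGPT and a
# specialist, reported as median [IQR]. The numbers are fabricated for illustration.
import numpy as np
from scipy.stats import mannwhitneyu

chatgpt_words    = np.array([342, 310, 298, 401, 365, 330, 288, 377])
specialist_words = np.array([75, 60, 52, 90, 81, 66, 70, 85])

def median_iqr(x):
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return f"{med:.1f} [{q1:.1f}-{q3:.1f}]"

stat, p = mannwhitneyu(chatgpt_words, specialist_words, alternative="two-sided")
print(f"ChatGPT {median_iqr(chatgpt_words)} vs specialist {median_iqr(specialist_words)}, p={p:.4f}")
```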
Affiliation(s)
- Siyin Guo
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ruicen Li
- Health Management Center, General Practice Medical Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Genpeng Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Wenjie Chen
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Jing Huang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Linye He
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yu Ma
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Liying Wang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Hongping Zheng
- Department of Thyroid Surgery, General Surgery Ward 7, The First Hospital of Lanzhou University, Lanzhou, Gansu 730000, China
- Chunxiang Tian
- Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan 610031, China
- Yatong Zhao
- Thyroid Surgery, Zhengzhou Central Hospital Affiliated of Zhengzhou University, Zhengzhou, Henan 450007, China
- Xinmin Pan
- Department of Thyroid Surgery, General Surgery III, Gansu Provincial Hospital, Lanzhou, Gansu 730000, China
- Hongxing Wan
- Department of Oncology, Sanya People's Hospital, Sanya, Hainan 572000, China
- Dasheng Liu
- Department of Vascular Thyroid Surgery, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510120, China
- Zhihui Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Jianyong Lei
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
9
Wang X, Ye H, Zhang S, Yang M, Wang X. Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases. J Med Syst 2025; 49:23. [PMID: 39948214 DOI: 10.1007/s10916-025-02152-9]
Abstract
BACKGROUND Generative large language models (LLMs) are increasingly integrated into the medical field. However, their actual efficacy in clinical decision-making remains partially unexplored. This study aimed to assess the performance of the three LLMs, ChatGPT-4, Gemini, and Med-Go, in the domain of professional medicine when confronted with actual clinical cases. METHODS This study involved 134 clinical cases spanning nine medical disciplines. Each LLM was required to provide suggestions for diagnosis, diagnostic criteria, differential diagnosis, examination and treatment for every case. Responses were scored by two experts using a predefined rubric. RESULTS In overall performance among the models, Med-Go achieved the highest median score (37.5, IQR 31.9-41.5), while Gemini recorded the lowest (33.0, IQR 25.5-36.6), showing significant statistical difference among the three LLMs (p < 0.001). Analysis revealed that responses related to differential diagnosis were the weakest, while those pertaining to treatment recommendations were the strongest. Med-Go displayed notable performance advantages in gastroenterology, nephrology, and neurology. CONCLUSIONS The findings show that all three LLMs achieved over 60% of the maximum possible score, indicating their potential applicability in clinical practice. However, inaccuracies that could lead to adverse decisions underscore the need for caution in their application. Med-Go's superior performance highlights the benefits of incorporating specialized medical knowledge into LLMs training. It is anticipated that further development and refinement of medical LLMs will enhance their precision and safety in clinical use.
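A difference among three models in per-case total scores summarized as median (IQR) is typically tested with a Kruskal-Wallis test; the sketch below shows that calculation on invented scores and is offered only as an illustration of the kind of comparison reported, not the study's code.

```python
# Kruskal-Wallis comparison of per-case total scores across three LLMs,
# with median and IQR summaries. Scores are invented for illustration.
import numpy as np
from scipy.stats import kruskal

scores = {
    "Med-Go":    np.array([37.5, 40.0, 33.0, 41.5, 38.0, 31.0]),
    "ChatGPT-4": np.array([35.0, 36.5, 30.0, 39.0, 34.0, 29.5]),
    "Gemini":    np.array([33.0, 28.0, 25.5, 36.5, 30.0, 26.0]),
}

for name, s in scores.items():
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    print(f"{name}: median {med:.1f} (IQR {q1:.1f}-{q3:.1f})")

h, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H={h:.2f}, p={p:.4f}")
```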
Affiliation(s)
- Xueqi Wang
- Department of Critical Care Medicine, Shanghai East Hospital, Tongji University School of Medicine, No.150, Jimo Road, Pudong New Area, Shanghai, China
- Haiyan Ye
- Department of Critical Care Medicine, Shanghai East Hospital, Tongji University School of Medicine, No.150, Jimo Road, Pudong New Area, Shanghai, China
- Sumian Zhang
- Department of Critical Care Medicine, Shanghai East Hospital, Tongji University School of Medicine, No.150, Jimo Road, Pudong New Area, Shanghai, China
- Mei Yang
- Department of Critical Care Medicine, Shanghai East Hospital, Tongji University School of Medicine, No.150, Jimo Road, Pudong New Area, Shanghai, China
- Xuebin Wang
- Department of Critical Care Medicine, Shanghai East Hospital, Tongji University School of Medicine, No.150, Jimo Road, Pudong New Area, Shanghai, China.
10
Zapata-Caballero CA, Galindo-Rodriguez NA, Rodriguez-Lane R, Cueto-Cámara JF, Gorbea-Chávez V, Granados-Martínez V. Evaluating language processing artificial intelligence answers to patient-generated queries on chronic pelvic pain. Pain Med 2025; 26:114-116. [PMID: 39404826 DOI: 10.1093/pm/pnae104]
Affiliation(s)
- Rebeca Rodriguez-Lane
- Department of Urogynecology, National Institute of Perinatology, 11000 Mexico City, Mexico
- Jonathan Fidel Cueto-Cámara
- Department of Minimally Invasive Gynecologic Surgery, National Institute of Perinatology, 11000 Mexico City, Mexico
- Viridiana Gorbea-Chávez
- Department of Medical Education, National Institute of Perinatology, 11000 Mexico City, Mexico
11
He N, Yan Y, Wu Z, Cheng Y, Liu F, Li X, Zhai S. Chat GPT-4 significantly surpasses GPT-3.5 in drug information queries. J Telemed Telecare 2025; 31:306-308. [PMID: 37350055 DOI: 10.1177/1357633x231181922]
Affiliation(s)
- Na He
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
- Yingying Yan
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
- Ziyang Wu
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
- Yinchu Cheng
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
- Fang Liu
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
- Xiaotong Li
- School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, USA
- Suodi Zhai
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Institute for Drug Evaluation, Peking University Health Science Center, Beijing, China
12
Cohen A, Burns J, Gabra M, Gordon A, Deebel N, Terlecki R, Woodburn KL. Performance of Chat Generative Pre-Trained Transformer on Personal Review of Learning in Obstetrics and Gynecology. South Med J 2025; 118:102-105. [PMID: 39883147 DOI: 10.14423/smj.0000000000001783]
Abstract
OBJECTIVES Chat Generative Pre-Trained Transformer (ChatGPT) is a popular natural-language processor that is able to analyze and respond to a variety of prompts, providing eloquent answers based on a collection of Internet data. ChatGPT has been considered an avenue for the education of resident physicians in the form of board preparation in the contemporary literature, where it has been applied against board study material across multiple medical specialties. The purpose of our study was to evaluate the performance of ChatGPT on the Personal Review of Learning in Obstetrics and Gynecology (PROLOG) assessments and gauge its specialty specific knowledge for educational applications. METHODS PROLOG assessments were administered to ChatGPT version 3.5, and the percentage of correct responses was recorded. Questions were categorized by question stem order and used to measure ChatGPT performance. Performance was compared using descriptive statistics. RESULTS There were 848 questions without visual components; ChatGPT answered 57.8% correct (N = 490). ChatGPT performed worse on higher-order questions compared with first-order questions, 56.8% vs 60.5%, respectively. There were 65 questions containing visual data, and ChatGPT answered 16.9% correctly. CONCLUSIONS The passing score for the PROLOG assessments is 80%; therefore ChatGPT 3.5 did not perform satisfactorily. Given this, it is unlikely that the tested version of ChatGPT has sufficient specialty-specific knowledge or logical capability to serve as a reliable tool for trainee education.
Affiliation(s)
- Jersey Burns
- Obstetrics and Gynecology, Atrium Health Wake Forest Baptist, Winston-Salem, North Carolina
- Alex Gordon
- The Edward Via College of Osteopathic Medicine, Blacksburg, Virginia
13
Cohen ND, Ho M, McIntire D, Smith K, Kho KA. A comparative analysis of generative artificial intelligence responses from leading chatbots to questions about endometriosis. AJOG Glob Rep 2025; 5:100405. [PMID: 39810943 PMCID: PMC11730533 DOI: 10.1016/j.xagr.2024.100405]
Abstract
Introduction The use of generative artificial intelligence (AI) has begun to permeate most industries, including medicine, and patients will inevitably start using these large language model (LLM) chatbots as a modality for education. As healthcare information technology evolves, it is imperative to evaluate chatbots and the accuracy of the information they provide to patients and to determine if there is variability between them. Objective This study aimed to evaluate the accuracy and comprehensiveness of three chatbots in addressing questions related to endometriosis and determine the level of variability between them. Study Design Three LLMs, including Chat GPT-4 (Open AI), Claude (Anthropic), and Bard (Google) were asked to generate answers to 10 commonly asked questions about endometriosis. The responses were qualitatively compared to current guidelines and expert opinion on endometriosis and rated on a scale by nine gynecologists. The grading scale included the following: (1) Completely incorrect, (2) mostly incorrect and some correct, (3) mostly correct and some incorrect, (4) correct but inadequate, (5) correct and comprehensive. Final scores were averaged between the nine reviewers. Kendall's W and the related chi-square test were used to evaluate the reviewers' strength of agreement in ranking the LLMs' responses for each item. Results Average scores for the 10 answers amongst Bard, Chat GPT, and Claude were 3.69, 4.24, and 3.7, respectively. Two questions showed significant disagreement between the nine reviewers. There were no questions the models could answer comprehensively or correctly across the reviewers. The model most associated with comprehensive and correct responses was ChatGPT. Chatbots showed an improved ability to accurately answer questions about symptoms and pathophysiology over treatment and risk of recurrence. Conclusion The analysis of LLMs revealed that, on average, they mainly provided correct but inadequate responses to commonly asked patient questions about endometriosis. While chatbot responses can serve as valuable supplements to information provided by licensed medical professionals, it is crucial to maintain a thorough ongoing evaluation process for outputs to provide the most comprehensive and accurate information to patients. Further research into this technology and its role in patient education and treatment is crucial as generative AI becomes more embedded in the medical field.
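Kendall's W (coefficient of concordance) quantifies how consistently the nine reviewers ranked the three chatbots. It is not provided directly by scipy, so the sketch below computes it from its definition, W = 12S / (m^2 (n^3 - n)), together with the related chi-square statistic, on fabricated rankings.

```python
# Kendall's W for m raters ranking n items (no ties): W = 12*S / (m^2 * (n^3 - n)),
# where S is the sum of squared deviations of the rank sums from their mean.
# Rankings below (rows = reviewers, columns = chatbots) are fabricated.
import numpy as np

ranks = np.array([
    [1, 3, 2],   # each row: ranks given by one reviewer to (ChatGPT, Bard, Claude)
    [1, 2, 3],
    [2, 3, 1],
    [1, 3, 2],
    [1, 2, 3],
])
m, n = ranks.shape                       # raters, items
rank_sums = ranks.sum(axis=0)
s = ((rank_sums - rank_sums.mean()) ** 2).sum()
w = 12 * s / (m ** 2 * (n ** 3 - n))
chi2 = m * (n - 1) * w                   # related chi-square statistic (df = n - 1)
print(f"Kendall's W = {w:.2f}, chi2 = {chi2:.2f}")
```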
Affiliation(s)
- Natalie D. Cohen
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
- Milan Ho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
- Donald McIntire
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
- Katherine Smith
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
- Kimberly A. Kho
- University of Texas Southwestern, Dallas, TX (Cohen, Ho, McIntire, Smith, and Kho)
14
Abstract
Preeclampsia is a multisystem hypertensive disorder that manifests itself after 20 weeks of pregnancy, along with proteinuria. The pathophysiology of preeclampsia is incompletely understood. Artificial intelligence, especially machine learning with its capability to identify patterns in complex data, has the potential to revolutionize preeclampsia research. These data-driven techniques can improve early diagnosis, personalize risk assessment, uncover the disease's molecular basis, optimize treatments, and enable remote monitoring. This brief review discusses the recent applications of artificial intelligence and machine learning in preeclampsia management and research, including the improvements these approaches have brought, along with their challenges and limitations.
Affiliation(s)
- Anita T Layton
- Department of Applied Mathematics, Department of Biology, Cheriton School of Computer Science, and School of Pharmacology, University of Waterloo, ON, Canada
15
Yuan XT, Shao CY, Zhang ZZ, Qian D. Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study. Digit Health 2025; 11:20552076251315511. [PMID: 39850627 PMCID: PMC11755525 DOI: 10.1177/20552076251315511]
Abstract
Introduction This study aims to critically assess the appropriateness and limitations of two prominent large language models (LLMs), enhanced representation through knowledge integration (ERNIE Bot) and chat generative pre-trained transformer (ChatGPT), in answering questions about liver cancer interventional radiology. Through a comparative analysis, the performance of these models will be evaluated based on their responses to questions about transarterial chemoembolization and hepatic arterial infusion chemotherapy in both English and Chinese contexts. Methods A total of 38 questions were developed to cover a range of topics related to transarterial chemoembolization (TACE) and hepatic arterial infusion chemotherapy (HAIC), including foundational knowledge, patient education, and treatment and care. The responses generated by ERNIE Bot and ChatGPT were rigorously evaluated by 10 professionals in liver cancer interventional radiology. The final score was determined by one seasoned clinical expert. Each response was rated on a five-point Likert scale, facilitating a quantitative analysis of the accuracy and comprehensiveness of the information provided by each language model. Results ERNIE Bot is superior to ChatGPT in the Chinese context (ERNIE Bot: 5, 89.47%; 4, 10.53%; 3, 0%; 2, 0%; 1, 0% vs ChatGPT: 5, 57.89%; 4, 5.27%; 3, 34.21%; 2, 2.63%; 1, 0%; P = 0.001). However, ChatGPT outperformed ERNIE Bot in the English context (ERNIE Bot: 5, 73.68%; 4, 2.63%; 3, 13.16; 2, 10.53%;1, 0% vs ChatGPT: 5, 92.11%; 4, 2.63%; 3, 5.26%; 2, 0%; 1, 0%; P = 0.026). Conclusions This study preliminarily demonstrated that ERNIE Bot and ChatGPT effectively address questions related to liver cancer interventional radiology. However, their performance varied by language: ChatGPT excelled in English contexts, while ERNIE Bot performed better in Chinese. We found that choosing the appropriate LLMs is beneficial for patients in obtaining more accurate treatment information. Both models require manual review to ensure accuracy and reliability in practical use.
Affiliation(s)
- Xue-ting Yuan
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
- Chen-ye Shao
- School of Nursing, Department of Thoracic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, China
- Zhen-zhen Zhang
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
- Duo Qian
- Department of Interventional Radiology, The First Affiliated Hospital of Soochow University, Suzhou, China
16
Barbosa-Silva J, Driusso P, Ferreira EA, de Abreu RM. Exploring the Efficacy of Artificial Intelligence: A Comprehensive Analysis of CHAT-GPT's Accuracy and Completeness in Addressing Urinary Incontinence Queries. Neurourol Urodyn 2025; 44:153-164. [PMID: 39390731 DOI: 10.1002/nau.25603]
Abstract
BACKGROUND Artificial intelligence models are increasingly gaining popularity among patients and healthcare professionals. While it is impossible to restrict patients' access to different sources of information on the Internet, healthcare professionals need to be aware of the quality of the content available across different platforms. OBJECTIVE To investigate the accuracy and completeness of Chat Generative Pretrained Transformer (ChatGPT) in addressing frequently asked questions related to the management and treatment of female urinary incontinence (UI), compared with recommendations from guidelines. METHODS This is a cross-sectional study. Two researchers developed 14 frequently asked questions related to UI. Then, they were inserted into the ChatGPT platform on September 16, 2023. The accuracy (scores from 1 to 5) and completeness (scores from 1 to 3) of ChatGPT's answers were assessed individually by two experienced researchers in the Women's Health field, following the recommendations proposed by the guidelines for UI. RESULTS Most of the answers were classified as "more correct than incorrect" (n = 6), followed by "more incorrect than correct" (n = 3), "approximately equal correct and incorrect" (n = 2), "nearly all correct" (n = 2), and "correct" (n = 1). Regarding completeness, most of the answers were classified as adequate, as they provided the minimum information expected to be classified as correct. CONCLUSION These results showed inconsistency in the accuracy of the answers generated by ChatGPT when compared with scientific guidelines. Almost none of the answers provided the complete content expected or reported in previous guidelines, which highlights, for healthcare professionals and the scientific community, a concern about the use of artificial intelligence in patient counseling.
Affiliation(s)
- Jordana Barbosa-Silva
- Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
- Patricia Driusso
- Women's Health Research Laboratory, Physical Therapy Department, Federal University of São Carlos, São Carlos, Brazil
- Elizabeth A Ferreira
- Department of Obstetrics and Gynecology, FMUSP School of Medicine, University of São Paulo, São Paulo, Brazil
- Department of Physiotherapy, Speech Therapy and Occupational Therapy, School of Medicine, University of São Paulo, São Paulo, Brazil
- Raphael M de Abreu
- Department of Physiotherapy, LUNEX University, International University of Health, Exercise & Sports S.A., Differdange, Luxembourg
- LUNEX ASBL Luxembourg Health & Sport Sciences Research Institute, Differdange, Luxembourg
17
Zitu MM, Le TD, Duong T, Haddadan S, Garcia M, Amorrortu R, Zhao Y, Rollison DE, Thieu T. Large language models in cancer: potentials, risks, and safeguards. BJR Artif Intell 2025; 2:ubae019. [PMID: 39777117 PMCID: PMC11703354 DOI: 10.1093/bjrai/ubae019]
Abstract
This review examines the use of large language models (LLMs) in cancer, analysing articles sourced from PubMed, Embase, and Ovid Medline, published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. 59 articles were included in the review, categorized into 3 segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions on LLMs on cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs in cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
Affiliation(s)
- Md Muntasir Zitu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Tuan Dung Le
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Thanh Duong
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Shohreh Haddadan
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Melany Garcia
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Rossybelle Amorrortu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Yayi Zhao
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Dana E Rollison
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Thanh Thieu
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
18
Huang AE, Chang MT, Khanwalkar A, Yan CH, Phillips KM, Yong MJ, Nayak JV, Hwang PH, Patel ZM. Utilization of ChatGPT for Rhinology Patient Education: Limitations in a Surgical Sub-Specialty. OTO Open 2025; 9:e70065. [PMID: 39776758 PMCID: PMC11705442 DOI: 10.1002/oto2.70065]
Abstract
Objective To analyze the accuracy of ChatGPT-generated responses to common rhinologic patient questions. Methods Ten common questions from rhinology patients were compiled by a panel of 4 rhinology fellowship-trained surgeons based on clinical patient experience. This panel (Panel 1) developed consensus "expert" responses to each question. Questions were individually posed to ChatGPT (version 3.5) and its responses recorded. ChatGPT-generated responses were individually graded by Panel 1 on a scale of 0 (incorrect) to 3 (correct and exceeding the quality of expert responses). A second panel was given the consensus and ChatGPT responses to each question and asked to guess which response corresponded to which source. They then graded ChatGPT responses using the same criteria as Panel 1. Question-specific and overall mean grades for ChatGPT responses, as well as the intraclass correlation coefficient (ICC) as a measure of interrater reliability, were calculated. Results The overall mean grade for ChatGPT responses was 1.65/3. For 2 out of 10 questions, ChatGPT responses were equal to or better than expert responses. However, for 8 out of 10 questions, ChatGPT provided responses that were incorrect, false, or incomplete based on mean rater grades. Overall ICC was 0.526, indicating moderate reliability among raters of ChatGPT responses. Reviewers were able to discern ChatGPT from human responses with 97.5% accuracy. Conclusion This preliminary study demonstrates that ChatGPT provided responses to common rhinologic questions that were overall nearly complete but variably accurate, revealing important limitations in nuanced subspecialty fields.
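The intraclass correlation reported as the measure of interrater reliability can be obtained from a long-format table of grades; the sketch below uses the pingouin package (assumed to be available) on fabricated grades as one plausible way to compute such a value, not the authors' code.

```python
# Intraclass correlation (ICC) for multiple raters grading the same ChatGPT
# responses, using a long-format table: one row per (question, rater) pair.
# Grades are fabricated; pingouin reports several ICC variants.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":    ["A", "B", "C"] * 4,
    "grade":    [2, 2, 1, 3, 2, 3, 1, 1, 2, 0, 1, 1],   # 0-3 scale as in the study
})

icc = pg.intraclass_corr(data=data, targets="question", raters="rater", ratings="grade")
print(icc[["Type", "ICC", "CI95%"]])
```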
Affiliation(s)
- Alice E. Huang
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Michael T. Chang
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Ashoke Khanwalkar
- Department of Otolaryngology–Head and Neck Surgery, University of Colorado Anschutz School of Medicine, Aurora, Colorado, USA
- Carol H. Yan
- Department of Otolaryngology–Head and Neck Surgery, University of California-San Diego School of Medicine, San Diego, California, USA
- Katie M. Phillips
- Department of Otolaryngology–Head and Neck Surgery, University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
- Michael J. Yong
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Jayakar V. Nayak
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Peter H. Hwang
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
- Zara M. Patel
- Department of Otolaryngology–Head and Neck Surgery, Stanford University School of Medicine, Stanford, California, USA
19
Gungor ND, Esen FS, Tasci T, Gungor K, Cil K. Navigating Gynecological Oncology with Different Versions of ChatGPT: A Transformative Breakthrough or the Next Black Box Challenge? Oncol Res Treat 2024; 48:102-111. [PMID: 39689699 DOI: 10.1159/000543173]
Abstract
INTRODUCTION The study evaluates the performance of large language model versions of ChatGPT - ChatGPT-3.5, ChatGPT-4, and ChatGPT-Omni - in addressing inquiries related to the diagnosis and treatment of gynecological cancers, including ovarian, endometrial, and cervical cancers. METHODS A total of 804 questions were equally distributed across four categories: true/false, multiple-choice, open-ended, and case-scenario, with each question type representing varying levels of complexity. Performance was assessed using a six-point Likert scale, focusing on accuracy, completeness, and alignment with established clinical guidelines. RESULTS For true/false queries, ChatGPT-Omni achieved accuracy rates of 100% for easy, 98% for medium, and 97% for complicated questions, higher than ChatGPT-4 (94%, 90%, 85%) and ChatGPT-3.5 (90%, 85%, 80%) (p = 0.041, 0.023, 0.014, respectively). In multiple-choice, ChatGPT-Omni maintained superior accuracy with 100% for easy, 98% for medium, and 93% for complicated queries, compared to ChatGPT-4 (92%, 88%, 80%) and ChatGPT-3.5 (85%, 80%, 70%) (p = 0.035, 0.028, 0.011). For open-ended questions, ChatGPT-Omni had mean Likert scores of 5.8 for easy, 5.5 for medium, and 5.2 for complex levels, outperforming ChatGPT-4 (5.4, 5.0, 4.5) and ChatGPT-3.5 (5.0, 4.5, 4.0) (p = 0.037, 0.026, 0.015). Similar trends were observed in case-scenario questions, where ChatGPT-Omni achieved scores of 5.6, 5.3, and 4.9 for easy, medium, and hard levels, respectively (p = 0.017, 0.008, 0.012). CONCLUSIONS ChatGPT-Omni exhibited superior performance in responding to clinical queries related to gynecological cancers, underscoring its potential utility as a decision support tool and an educational resource in clinical practice.
Affiliation(s)
- Nur Dokuzeylul Gungor
- Department of Reproductive Endocrinology and IVF Center BAU, Goztepe Medical Park Hospital, Istanbul, Turkey
- Fatih Sinan Esen
- Department of Computer Engineering, Ankara University, Ankara, Turkey
- Tolga Tasci
- Medicalpark Göztepe Hospital, Department of Obstetrics and Gynecology, Bahçeşehir University, Istanbul, Turkey
- Kagan Gungor
- Süleyman Yalçın City Hospital, Department of Endocrinology and Metabolic Diseases, Medeniyet University, Istanbul, Turkey
- Kaan Cil
- Otto-von-Guericke-University Magdeburg Class 6 Student, Magdeburg, Germany
20
Zhang S, Chu Q, Li Y, Liu J, Wang J, Yan C, Liu W, Wang Y, Zhao C, Zhang X, Chen Y. Evaluation of large language models under different training background in Chinese medical examination: a comparative study. Front Artif Intell 2024; 7:1442975. [PMID: 39697797 PMCID: PMC11652508 DOI: 10.3389/frai.2024.1442975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2024] [Accepted: 11/06/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Large language models (LLMs) have recently shown impressive potential in medical services. However, existing research primarily discusses the performance of LLMs developed in English within English-speaking medical contexts, overlooking how LLMs developed in different linguistic environments perform in Chinese clinical medicine. OBJECTIVE Through a comparative analysis of three LLMs developed under different training backgrounds, we first evaluate their potential as medical service tools in a Chinese-language environment and then point out the limitations of their application in Chinese medical practice. METHODS Using the APIs provided by the three LLMs, we conducted an automated assessment of their performance on the 2023 Chinese National Medical Licensing Examination (CMLE). We examined the accuracy of the three LLMs across various question types and categorized the reasons for their errors. We also performed repeated experiments on selected questions to evaluate the stability of the outputs generated by the LLMs. RESULTS The accuracies of GPT-4, ERNIE Bot, and DISC-MedLLM on the CMLE were 65.2%, 61.7%, and 25.3%, respectively. Among error types, knowledge errors accounted for 52.2% and 51.7% of the errors made by GPT-4 and ERNIE Bot, while hallucination errors accounted for 36.4% and 52.6%, respectively. In the Chinese text generation experiment, the general-purpose LLMs demonstrated strong natural language understanding and were able to generate clear and standardized Chinese text. In the repeated experiments, the LLMs showed an output stability of about 70%, but there were still cases of inconsistent outputs. CONCLUSION General-purpose LLMs, represented by GPT-4 and ERNIE Bot, are capable of meeting the standards of the CMLE. Despite being developed and trained in different linguistic contexts, they excel at understanding Chinese natural language and Chinese clinical knowledge, highlighting their broad potential application in Chinese medical practice. However, these models still show deficiencies in mastering specialized knowledge, addressing ethical issues, and maintaining output stability, and they tend to avoid risk when providing medical advice.
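The repeated-experiment stability reported above can be summarized as the share of runs that agree with each question's most frequent answer; the sketch below illustrates one such calculation on invented outputs, not the study's data.

```python
# Illustrative sketch only: quantifying output stability as the share of
# repeated runs that agree with the most frequent answer per question.
# The repeated_answers dict is a hypothetical stand-in for model outputs.
from collections import Counter

repeated_answers = {
    "Q1": ["A", "A", "A", "B", "A"],
    "Q2": ["C", "C", "C", "C", "C"],
    "Q3": ["B", "D", "B", "B", "D"],
}

def stability(runs):
    """Fraction of runs matching the modal (most common) answer."""
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)

per_question = {q: stability(r) for q, r in repeated_answers.items()}
overall = sum(per_question.values()) / len(per_question)
print(per_question, f"overall stability={overall:.0%}")
```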
Affiliation(s)
- Siwen Zhang
- School of Medical Device, Shenyang Pharmaceutical University, Shenyang, China
- Qi Chu
- Department of Clinical Laboratory, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing, China
- Yujun Li
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Jialu Liu
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Jiayi Wang
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Chi Yan
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Wenxi Liu
- School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang, China
- Yizhen Wang
- School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang, China
- Chengcheng Zhao
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Xinyue Zhang
- School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, China
- Yuwen Chen
- School of Business Administration, Shenyang Pharmaceutical University, Shenyang, China
21
Rotem R, Zamstein O, Rottenstreich M, O'Sullivan OE, O'reilly BA, Weintraub AY. The future of patient education: A study on AI-driven responses to urinary incontinence inquiries. Int J Gynaecol Obstet 2024; 167:1004-1009. [PMID: 38944693 DOI: 10.1002/ijgo.15751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 05/30/2024] [Accepted: 06/14/2024] [Indexed: 07/01/2024]
Abstract
OBJECTIVE To evaluate the effectiveness of ChatGPT in providing insights into common urinary incontinence concerns within urogynecology. By analyzing the model's responses against established benchmarks of accuracy, completeness, and safety, the study aimed to quantify its usefulness for informing patients and aiding healthcare providers. METHODS An expert-driven questionnaire was developed, inviting urogynecologists worldwide to assess ChatGPT's answers to 10 carefully selected questions on urinary incontinence (UI). These assessments focused on the accuracy of the responses, their comprehensiveness, and whether they raised any safety issues. Subsequent statistical analyses determined the average consensus among experts and identified the proportion of responses receiving favorable evaluations (a score of 4 or higher). RESULTS Of 50 urogynecologists that were approached worldwide, 37 responded, offering insights into ChatGPT's responses on UI. The overall feedback averaged a score of 4.0, indicating a positive acceptance. Accuracy scores averaged 3.9 with 71% rated favorably, whereas comprehensiveness scored slightly higher at 4 with 74% favorable ratings. Safety assessments also averaged 4 with 74% favorable responses. CONCLUSION This investigation underlines ChatGPT's favorable performance across the evaluated domains of accuracy, comprehensiveness, and safety within the context of UI queries. However, despite this broadly positive reception, the study also signals a clear avenue for improvement, particularly in the precision of the provided information. Refining ChatGPT's accuracy and ensuring the delivery of more pinpointed responses are essential steps forward, aiming to bolster its utility as a comprehensive educational resource for patients and a supportive tool for healthcare practitioners.
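The summary statistics described above (mean expert rating and the proportion of favorable ratings, defined as a score of 4 or higher) can be computed as in the brief sketch below; the ratings are invented for illustration.

```python
# Hedged illustration of the summary statistics described above: mean expert
# rating and the share of ratings that are "favorable" (>= 4 on a 1-5 scale).
# The ratings list is invented for demonstration.
ratings = [5, 4, 4, 3, 5, 4, 2, 4, 5, 3, 4, 4]

mean_rating = sum(ratings) / len(ratings)
favorable = sum(r >= 4 for r in ratings) / len(ratings)

print(f"mean={mean_rating:.1f}, favorable={favorable:.0%}")
```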
Affiliation(s)
- Reut Rotem
- Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Omri Zamstein
- Department of Obstetrics and Gynecology, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Misgav Rottenstreich
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Barry A O'Reilly
- Department of Urogynaecology, Cork University Maternity Hospital, Cork, Ireland
- Adi Y Weintraub
- Department of Obstetrics and Gynecology, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
22
Gurbuz T, Gokmen O, Devranoglu B, Yurci A, Madenli AA. Artificial intelligence in reproductive endocrinology: an in-depth longitudinal analysis of ChatGPTv4's month-by-month interpretation and adherence to clinical guidelines for diminished ovarian reserve. Endocrine 2024; 86:1171-1177. [PMID: 39341951 DOI: 10.1007/s12020-024-04031-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 09/03/2024] [Indexed: 10/01/2024]
Abstract
OBJECTIVE To quantitatively assess the performance of ChatGPTv4, an Artificial Intelligence Language Model, in adhering to clinical guidelines for Diminished Ovarian Reserve (DOR) over two months, evaluating the model's consistency in providing guideline-based responses. DESIGN A longitudinal study design was employed to evaluate ChatGPTv4's response accuracy and completeness using a structured questionnaire at baseline and at a two-month follow-up. SETTING ChatGPTv4 was tasked with interpreting DOR questionnaires based on standardized clinical guidelines. PARTICIPANTS The study did not involve human participants; the questionnaire was exclusively administered to the ChatGPT model to generate responses about DOR. METHODS A guideline-based questionnaire with 176 open-ended, 166 multiple-choice, and 153 true/false questions was deployed to rigorously assess ChatGPTv4's ability to provide accurate medical advice aligned with current DOR clinical guidelines. AI-generated responses were rated on a 6-point Likert scale for accuracy and a 3-point scale for completeness. The two-phase design assessed the stability and consistency of AI-generated answers over two months. RESULTS ChatGPTv4 achieved near-perfect scores across all question types, with true/false questions consistently answered with 100% accuracy. In multiple-choice queries, accuracy improved from 98.2 to 100% at the two-month follow-up. Open-ended question responses exhibited significant positive enhancements, with accuracy scores increasing from an average of 5.38 ± 0.71 to 5.74 ± 0.51 (max: 6.0) and completeness scores from 2.57 ± 0.52 to 2.85 ± 0.36 (max: 3.0). The improvements were statistically significant (p < 0.001), with positive correlations between initial and follow-up accuracy (r = 0.597) and completeness (r = 0.381) scores. LIMITATIONS The study was limited by the reliance on a controlled, albeit simulated, setting that may not perfectly mirror real-world clinical interactions. CONCLUSION ChatGPTv4 demonstrated exceptional and improving accuracy and completeness in handling DOR-related guideline queries over the studied period. These findings highlight ChatGPTv4's potential as a reliable, adaptable AI tool in reproductive endocrinology, capable of augmenting clinical decision-making and guideline development.
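A hedged sketch of the longitudinal analysis described above follows: correlating baseline and follow-up scores and testing the paired change. The scores are hypothetical stand-ins, and the specific tests (Pearson correlation and a paired t-test) are assumptions rather than the study's documented methods.

```python
# Sketch under assumed data: correlate baseline vs. follow-up accuracy scores
# and test the paired change. Values are hypothetical 6-point Likert ratings.
from scipy.stats import pearsonr, ttest_rel

baseline  = [5.0, 5.5, 4.5, 6.0, 5.0, 5.5, 5.0, 6.0]
follow_up = [5.5, 6.0, 5.0, 6.0, 5.5, 6.0, 5.5, 6.0]

r, r_p = pearsonr(baseline, follow_up)    # consistency across time points
t, t_p = ttest_rel(follow_up, baseline)   # paired test of improvement

print(f"r={r:.3f} (p={r_p:.3f}); paired t={t:.2f} (p={t_p:.3f})")
```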
Affiliation(s)
- Tugba Gurbuz
- Department of Gynecology and Obstetrics Clinic, Vocational School of Health Services, Operating Room Services (Turkish-English) Medical Imaging Techniques (Turkish-English), Medistate Hospital, Istanbul Nişantaşı University, Istanbul, Turkey.
- Oya Gokmen
- Department of Gynecology, Obstetrics and In Vitro Fertilization Clinic, Medistate Hospital, Istanbul, Turkey
- Belgin Devranoglu
- Department of Obstetrics and Gynecology, Zeynep Kamil Maternity/Children, Education and Training Hospital, Istanbul, Turkey
- Arzu Yurci
- IVF Department, Department of Gynecology and Obstetrics, Memorial Bahçelievler Hospital, Istanbul Arel University, Istanbul, Turkey
- Asena Ayar Madenli
- Department of Obstetrics and Gynecology, Liv Hospital Vadistanbul, Istanbul, Turkey
- Department of Obstetrics and Gynecology, Faculty of Medicine, Istinye University, Istanbul, Turkey
23
Cheng T, Li Y, Gu J, He Y, He G, Zhou P, Li S, Xu H, Bao Y, Wang X. The performance of ChatGPT in day surgery and pre-anesthesia risk assessment: a case-control study of 150 simulated patient presentations. Perioper Med (Lond) 2024; 13:111. [PMID: 39574189 PMCID: PMC11580513 DOI: 10.1186/s13741-024-00469-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 11/09/2024] [Indexed: 11/25/2024] Open
Abstract
BACKGROUND Day surgery has developed rapidly in China in recent years, although it still faces a shortage of anesthesiologists to handle pre-anesthesia routine before surgery. We hypothesized that ChatGPT may assist anesthesia practitioners in preoperative assessment and answer questions on the concerns of patients. The aims of this study were to examine the ability of ChatGPT to assess preoperative risk and determine its accuracy in answering questions regarding knowledge and management of day surgery anesthesia. METHODS One-hundred fifty patient profiles were generated to simulate day surgery patient presentations that involved complications of varying acuity and severity. The ChatGPT group and the expert group were both required to evaluate the profiles of 150 simulated patients to determine their ASA-PS classification and whether day surgery was recommended. ChatGPT was then asked to answer 131 questions about day surgery anesthesia that represented the most common issues encountered in clinical practice. The performance of ChatGPT was assessed and graded independently by two experienced anesthesiologists. RESULTS A total of 150 patient profiles were included in the study (75 males [50.0%] and 75 females [50.0%]). There was no difference between the ChatGPT group and the expert group for the ASA-PS classification and assessment of anesthesia risk in the patient profiles (P > 0.05). Regarding recommendation for day surgery in patients with certain comorbidities (ASA ≥ II), the expert group was inclined to require further examination or treatment. In addition, the proportion of conclusions made by ChatGPT was smaller than that of the experts (i.e., ChatGPT n (%) vs. expert n (%): day surgery can be performed, 67 (47.9) vs. 31 (25.4); needs further treatment and evaluation, 56 (37.3) vs. 66 (44.0); and day surgery is not recommended, 18 (12.9) vs. 29 (9.3), P < 0.05). We showed that ChatGPT had extensive knowledge related to day surgery anesthesia (94.0% correct), with most of the points (70%) considered comprehensive. The performance of ChatGPT was also better in the domains of peri-anesthesia concerns, lifestyle, and emotional support. CONCLUSIONS ChatGPT can assist anesthesia practitioners and surgeons by alerting them to the ASA-PS classification and assessing perioperative risk in day surgery patients. ChatGPT can also be trusted to answer questions and concerns related to pre-anesthesia and therefore has the potential to provide important assistance in clinical work.
Affiliation(s)
- Tingting Cheng
- Department of Anesthesiology, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Yu Li
- School of Clinical Medicine, Qinghai University, Xining, Qinghai, 810000, China
- Jiaqiu Gu
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China
- Yibo He
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China
- Guangbao He
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China
- Peipei Zhou
- Department of Anesthesiology, Children's Hospital of Shanghai, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, China
- Shuyun Li
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China
- Hang Xu
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China
- Yang Bao
- Department of Anesthesiology, Jiading District Central Hospital, Shanghai University of Medicine & Health Sciences, Shanghai, 201800, China.
- Xuejun Wang
- Department of Anesthesiology, Qinghai Red Cross Hospital, Xining, Qinghai, 810000, China.
24
Desseauve D, Lescar R, de la Fourniere B, Ceccaldi PF, Dziadzko M. AI in obstetrics: Evaluating residents' capabilities and interaction strategies with ChatGPT. Eur J Obstet Gynecol Reprod Biol 2024; 302:238-241. [PMID: 39326228 DOI: 10.1016/j.ejogrb.2024.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 06/01/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
In line with the digital transformation trend in medical training, students may resort to artificial intelligence (AI) for learning. This study assessed the interaction between obstetrics residents and ChatGPT during clinically oriented summative evaluations related to acute hepatic steatosis of pregnancy, and their self-reported competencies in information technology (IT) and AI. The participants in this semi-qualitative observational study were 14 obstetrics residents from two university hospitals. Students' queries were categorized into three distinct types: third-party enquiries; search-engine-style queries; and GPT-centric prompts. Responses were compared against a standardized answer produced by ChatGPT with a Delphi-developed expert prompt. Data analysis employed descriptive statistics and correlation analysis to explore the relationship between AI/IT skills and response accuracy. The study participants showed moderate IT proficiency but low AI proficiency. Interaction with ChatGPT regarding clinical signs of acute hepatic steatosis gravidarum revealed a preference for third-party questioning, resulting in only 21% accurate responses due to misinterpretation of medical acronyms. No correlation was found between AI response accuracy and the residents' self-assessed IT or AI skills, with most expressing dissatisfaction with their AI training. This study underlines the discrepancy between perceived and actual AI proficiency, highlighted by clinically inaccurate yet plausible AI responses - a manifestation of the 'stochastic parrot' phenomenon. These findings advocate for the inclusion of structured AI literacy programmes in medical education, focusing on prompt engineering. These academic skills are essential to exploit AI's potential in obstetrics and gynaecology. The ultimate aim is to optimize patient care in AI-augmented health care, and prevent misleading and unsafe knowledge acquisition.
Affiliation(s)
- David Desseauve
- Department of Women-Mother-Child, Gynaecology and Obstetrics Unit, Lausanne University Hospital, Lausanne, Switzerland; Department of Women-Mother-Child, Gynaecology and Obstetrics Unit, Grenoble Alpes, University Hospital, Grenoble, France.
- Raphael Lescar
- Department of Obstetrics and Gynaecology, Hôpital de la Croix-Rousse, Hospices civils de Lyon, Lyon, France
- Benoit de la Fourniere
- Department of Obstetrics and Gynaecology, Hôpital de la Croix-Rousse, Hospices civils de Lyon, Lyon, France
- Pierre-François Ceccaldi
- Department of Obstetrics, Gynaecology and Reproductive Medicine, Foch Hospital, Suresnes, France; Innovative Dental Materials and Interfaces Research Unit (UR 4462), Faculty of Health, University of Paris, Paris, France
- Mikhail Dziadzko
- Department of Anaesthesiology, Hôpital de la Croix-Rousse, Hospices civils de Lyon, Lyon, France; RESHAPE UMR 1290 INSERM, Université Lyon 1, Lyon, France
25
Graf EM, McKinney JA, Dye AB, Lin L, Sanchez-Ramos L. Exploring the Limits of Artificial Intelligence for Referencing Scientific Articles. Am J Perinatol 2024; 41:2072-2081. [PMID: 38653452 DOI: 10.1055/s-0044-1786033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 04/25/2024]
Abstract
OBJECTIVE To evaluate the reliability of three artificial intelligence (AI) chatbots (ChatGPT, Google Bard, and Chatsonic) in generating accurate references from existing obstetric literature. STUDY DESIGN Between mid-March and late April 2023, ChatGPT, Google Bard, and Chatsonic were prompted to provide references for specific obstetrical randomized controlled trials (RCTs) published in 2020. RCTs were considered for inclusion if they were mentioned in a previous article that primarily evaluated RCTs published by the top medical and obstetrics and gynecology journals with the highest impact factors in 2020 as well as RCTs published in a new journal focused on publishing obstetric RCTs. The selection of the three AI models was based on their popularity, performance in natural language processing, and public availability. Data collection involved prompting the AI chatbots to provide references according to a standardized protocol. The primary evaluation metric was the accuracy of each AI model in correctly citing references, including authors, publication title, journal name, and digital object identifier (DOI). Statistical analysis was performed using a permutation test to compare the performance of the AI models. RESULTS Among the 44 RCTs analyzed, Google Bard demonstrated the highest accuracy, correctly citing 13.6% of the requested RCTs, whereas ChatGPT and Chatsonic exhibited lower accuracy rates of 2.4 and 0%, respectively. Google Bard often substantially outperformed Chatsonic and ChatGPT in correctly citing the studied reference components. The majority of references from all AI models studied were noted to provide DOIs for unrelated studies or DOIs that do not exist. CONCLUSION To ensure the reliability of scientific information being disseminated, authors must exercise caution when utilizing AI for scientific writing and literature search. However, despite their limitations, collaborative partnerships between AI systems and researchers have the potential to drive synergistic advancements, leading to improved patient care and outcomes. KEY POINTS · AI chatbots often cite scientific articles incorrectly. · AI chatbots can create false references. · Responsible AI use in research is vital.
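The permutation test mentioned above can be illustrated as follows; the sketch compares two chatbots' per-reference citation accuracy with randomly permuted group labels, using hypothetical 0/1 outcomes rather than the study's data.

```python
# A minimal permutation-test sketch (not the authors' analysis code) comparing
# citation accuracy of two chatbots over the same 44 requested references.
# 1 = correctly cited, 0 = not; the arrays below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
bard    = np.array([1] * 6 + [0] * 38)   # roughly 13.6% correct
chatgpt = np.array([1] * 1 + [0] * 43)   # roughly 2.4% correct

observed = bard.mean() - chatgpt.mean()
pooled = np.concatenate([bard, chatgpt])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                   # permute group labels
    diff = pooled[:44].mean() - pooled[44:].mean()
    if abs(diff) >= abs(observed):
        count += 1

print(f"observed diff={observed:.3f}, permutation p={count / n_perm:.4f}")
```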
Affiliation(s)
- Emily M Graf
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Jordan A McKinney
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Alexander B Dye
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Lifeng Lin
- Department of Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona
- Luis Sanchez-Ramos
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
26
Grünebaum A, Dudenhausen J, Chervenak FA. Enhancing patient understanding in obstetrics: The role of generative AI in simplifying informed consent for labor induction with oxytocin. J Perinat Med 2024:jpm-2024-0428. [PMID: 39470098 DOI: 10.1515/jpm-2024-0428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/16/2024] [Accepted: 10/12/2024] [Indexed: 10/30/2024]
Abstract
Informed consent is a cornerstone of ethical medical practice, particularly in obstetrics where procedures like labor induction carry significant risks and require clear patient understanding. Despite legal mandates for patient materials to be accessible, many consent forms remain too complex, resulting in patient confusion and dissatisfaction. This study explores the use of Generative Artificial Intelligence (GAI) to simplify informed consent for labor induction with oxytocin, ensuring content is both medically accurate and comprehensible at an 8th-grade readability level. GAI-generated consent forms streamline the process, automatically tailoring content to meet readability standards while retaining essential details such as the procedure's nature, risks, benefits, and alternatives. Through iterative prompts and expert refinement, the AI produces clear, patient-friendly language that bridges the gap between medical jargon and patient comprehension. Flesch Reading Ease scores show improved readability, meeting recommended levels for health literacy. GAI has the potential to revolutionize healthcare communication by enhancing patient understanding, promoting shared decision-making, and improving satisfaction with the consent process. However, human oversight remains critical to ensure that AI-generated content adheres to legal and ethical standards. This case study demonstrates that GAI can be an effective tool in creating accessible, standardized, yet personalized consent documents, contributing to better-informed patients and potentially reducing malpractice claims.
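For readers unfamiliar with the readability metric cited above, the sketch below computes a Flesch Reading Ease score with a crude vowel-group syllable counter; it is an illustrative approximation, not the scoring tool used in the study.

```python
# Illustrative sketch of the Flesch Reading Ease score used to gauge
# readability; the syllable counter is a rough heuristic, not a linguistic one.
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as a crude syllable estimate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

consent_text = "Oxytocin helps start labor. Your doctor will watch you closely."
print(f"FRE = {flesch_reading_ease(consent_text):.1f}")  # higher = easier to read
```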
Affiliation(s)
- Amos Grünebaum
- Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New Hyde Park, NY, USA
- Frank A Chervenak
- Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New Hyde Park, NY, USA
27
Anastasio MK, Peters P, Foote J, Melamed A, Modesitt SC, Musa F, Rossi E, Albright BB, Havrilesky LJ, Moss HA. The doc versus the bot: A pilot study to assess the quality and accuracy of physician and chatbot responses to clinical questions in gynecologic oncology. Gynecol Oncol Rep 2024; 55:101477. [PMID: 39224817 PMCID: PMC11367046 DOI: 10.1016/j.gore.2024.101477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/03/2024] [Revised: 08/03/2024] [Accepted: 08/06/2024] [Indexed: 09/04/2024] Open
Abstract
Artificial intelligence (AI) applications to medical care are currently under investigation. We aimed to evaluate and compare the quality and accuracy of physician and chatbot responses to common clinical questions in gynecologic oncology. In this cross-sectional pilot study, ten questions about the knowledge and management of gynecologic cancers were selected. Each question was answered by a recruited gynecologic oncologist, ChatGPT (Generative Pretrained Transformer) AI platform, and Bard by Google AI platform. Five recruited gynecologic oncologists who were blinded to the study design were allowed 15 min to respond to each of two questions. Chatbot responses were generated by inserting the question into a fresh session in September 2023. Qualifiers and language identifying the response source were removed. Three gynecologic oncology providers who were blinded to the response source independently reviewed and rated response quality using a 5-point Likert scale, evaluated each response for accuracy, and selected the best response for each question. Overall, physician responses were judged to be best in 76.7 % of evaluations versus ChatGPT (10.0 %) and Bard (13.3 %; p < 0.001). The average quality of responses was 4.2/5.0 for physicians, 3.0/5.0 for ChatGPT and 2.8/5.0 for Bard (t-test for both and ANOVA p < 0.001). Physicians provided a higher proportion of accurate responses (86.7 %) compared to ChatGPT (60 %) and Bard (43 %; p < 0.001 for both). Physicians provided higher quality responses to gynecologic oncology clinical questions compared to chatbots. Patients should be cautioned against non-validated AI platforms for medical advice; larger studies on the use of AI for medical advice are needed.
Affiliation(s)
- Mary Katherine Anastasio
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
- Pamela Peters
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
- Jonathan Foote
- Commonwealth Gynecologic Oncology, Bon Secours Health, Richmond, VA, USA
- Alexander Melamed
- Division of Gynecologic Oncology, Vincent Department of Obstetrics & Gynecology, Massachusetts General Hospital, Boston, MA, USA
- Susan C. Modesitt
- Division of Gynecologic Oncology, Department of Gynecology and Obstetrics, Emory University School of Medicine, Atlanta, GA, USA
- Emma Rossi
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
- Benjamin B. Albright
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, University of North Carolina Chapel Hill, Chapel Hill, NC, USA
- Laura J. Havrilesky
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
- Haley A. Moss
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Duke University Medical Center, Durham, NC, USA
28
Grossman S, Zerilli T, Nathan JP. Appropriateness of ChatGPT as a resource for medication-related questions. Br J Clin Pharmacol 2024; 90:2691-2695. [PMID: 39096130 DOI: 10.1111/bcp.16212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2024] [Revised: 07/04/2024] [Accepted: 07/22/2024] [Indexed: 08/04/2024] Open
Abstract
With its increasing popularity, healthcare professionals and patients may use ChatGPT to obtain medication-related information. This study was conducted to assess ChatGPT's ability to provide satisfactory responses (i.e., directly answers the question, accurate, complete and relevant) to medication-related questions posed to an academic drug information service. ChatGPT responses were compared to responses generated by the investigators through the use of traditional resources, and references were evaluated. Thirty-nine questions were entered into ChatGPT; the three most common categories were therapeutics (8; 21%), compounding/formulation (6; 15%) and dosage (5; 13%). Ten (26%) questions were answered satisfactorily by ChatGPT. Of the 29 (74%) questions that were not answered satisfactorily, deficiencies included lack of a direct response (11; 38%), lack of accuracy (11; 38%) and/or lack of completeness (12; 41%). References were included with eight (29%) responses; each included fabricated references. Presently, healthcare professionals and consumers should be cautioned against using ChatGPT for medication-related information.
Affiliation(s)
- Sara Grossman
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
- Tina Zerilli
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
- Joseph P Nathan
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
29
Peng L, Liang R, Zhao A, Sun R, Yi F, Zhong J, Li R, Zhu S, Zhang S, Wu S. Amplifying Chinese physicians' emphasis on patients' psychological states beyond urologic diagnoses with ChatGPT - a multicenter cross-sectional study. Int J Surg 2024; 110:6501-6508. [PMID: 38954666 PMCID: PMC11487044 DOI: 10.1097/js9.0000000000001775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Accepted: 05/29/2024] [Indexed: 07/04/2024]
Abstract
BACKGROUND Artificial intelligence (AI) technologies, particularly large language models (LLMs), have been widely employed by the medical community. In addressing the intricacies of urology, ChatGPT offers a novel possibility to aid in clinical decision-making. This study aimed to investigate the decision-making ability of LLMs in solving complex urology-related problems and assess their effectiveness in providing psychological support to patients with urological disorders. MATERIALS AND METHODS This study evaluated the clinical and psychological support capabilities of ChatGPT 3.5 and 4.0 in the field of urology. A total of 69 clinical and 30 psychological questions were posed to the AI models, and both urologists and psychologists evaluated their responses. As a control, clinicians from Chinese medical institutions responded under closed-book conditions. Statistical analyses were conducted separately for each subgroup. RESULTS In multiple-choice tests covering diverse urological topics, ChatGPT 4.0 performed comparably to the physician group, with no significant overall score difference. Subgroup analyses revealed variable performance based on disease type and physician experience, with ChatGPT 4.0 generally outperforming ChatGPT 3.5 and exhibiting competitive results against physicians. When assessing the psychological support capabilities of AI, it is evident that ChatGPT 4.0 outperforms ChatGPT 3.5 across all urology-related psychological problems. CONCLUSIONS The performance of LLMs in dealing with standardized clinical problems and providing psychological support shows certain advantages over that of clinicians. AI stands out as a promising tool for potential clinical aid.
Affiliation(s)
- Lei Peng
- Department of Urology, Lanzhou University Second Hospital, Lanzhou, Gansu
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Rui Liang
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Department of Urology, The First Affiliated Hospital of Soochow University
- Anguo Zhao
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Department of Urology, Dushu Lake Hospital Affiliated to Soochow University, Medical Center of Soochow University, Suzhou Dushu Lake Hospital, Suzhou, Jiangsu
- Ruonan Sun
- West China School of Medicine, Sichuan University, Chengdu
- Fulin Yi
- North Sichuan Medical College (University), Nanchong, Sichuan, People’s Republic of China
- Jianye Zhong
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Rongkang Li
- Department of Urology, Lanzhou University Second Hospital, Lanzhou, Gansu
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Shimao Zhu
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Shaohua Zhang
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
- Song Wu
- Department of Urology, Lanzhou University Second Hospital, Lanzhou, Gansu
- Department of Urology, South China Hospital, Shenzhen University, Shenzhen, Guangdong
30
Weichert J, Scharf JL. Advancements in Artificial Intelligence for Fetal Neurosonography: A Comprehensive Review. J Clin Med 2024; 13:5626. [PMID: 39337113 PMCID: PMC11432922 DOI: 10.3390/jcm13185626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Revised: 09/04/2024] [Accepted: 09/16/2024] [Indexed: 09/30/2024] Open
Abstract
The detailed sonographic assessment of the fetal neuroanatomy plays a crucial role in prenatal diagnosis, providing valuable insights into timely, well-coordinated fetal brain development and detecting even subtle anomalies that may impact neurodevelopmental outcomes. With recent advancements in artificial intelligence (AI) in general and medical imaging in particular, there has been growing interest in leveraging AI techniques to enhance the accuracy, efficiency, and clinical utility of fetal neurosonography. The paramount objective of this focusing review is to discuss the latest developments in AI applications in this field, focusing on image analysis, the automation of measurements, prediction models of neurodevelopmental outcomes, visualization techniques, and their integration into clinical routine.
Affiliation(s)
- Jan Weichert
- Division of Prenatal Medicine, Department of Gynecology and Obstetrics, University Hospital of Schleswig-Holstein, Ratzeburger Allee 160, 23538 Luebeck, Germany
- Elbe Center of Prenatal Medicine and Human Genetics, Willy-Brandt-Str. 1, 20457 Hamburg, Germany
- Jann Lennard Scharf
- Division of Prenatal Medicine, Department of Gynecology and Obstetrics, University Hospital of Schleswig-Holstein, Ratzeburger Allee 160, 23538 Luebeck, Germany
31
Zheng C, Ye H, Guo J, Yang J, Fei P, Yuan Y, Huang D, Huang Y, Peng J, Xie X, Xie M, Zhao P, Chen L, Zhang M. Development and evaluation of a large language model of ophthalmology in Chinese. Br J Ophthalmol 2024; 108:1390-1397. [PMID: 39019566 PMCID: PMC11503135 DOI: 10.1136/bjo-2023-324526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Accepted: 06/03/2024] [Indexed: 07/19/2024]
Abstract
BACKGROUND Large language models (LLMs), such as ChatGPT, have considerable implications for various medical applications. However, ChatGPT's training primarily draws from English-centric internet data and is not tailored explicitly to the medical domain. Thus, an ophthalmic LLM in Chinese is clinically essential for both healthcare providers and patients in mainland China. METHODS We developed an LLM of ophthalmology (MOPH) using Chinese corpora and evaluated its performance in three clinical scenarios: ophthalmic board exams in Chinese, answering evidence-based medicine-oriented ophthalmic questions and diagnostic accuracy for clinical vignettes. Additionally, we compared MOPH's performance to that of human doctors. RESULTS In the ophthalmic exam, MOPH's average score closely aligned with the mean score of trainees (64.7 (range 62-68) vs 66.2 (range 50-92), p=0.817), and it achieved a score above 60 in all seven mock exams. In answering ophthalmic questions, 83.3% (25/30) of MOPH's responses adhered to Chinese guidelines (Likert scale 4-5). Only 6.7% (2/30, Likert scale 1-2) and 10% (3/30, Likert scale 3) of responses were rated as 'poor or very poor' or 'potentially misinterpretable inaccuracies' by reviewers. In diagnostic accuracy, although the rate of correct diagnosis by ophthalmologists was higher than that of MOPH (96.1% vs 81.1%), the difference was not statistically significant (p>0.05). CONCLUSION This study demonstrated the promising performance of MOPH, a Chinese-specific ophthalmic LLM, in diverse clinical scenarios. MOPH has potential real-world applications in Chinese-language ophthalmology settings.
Affiliation(s)
- Ce Zheng
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Institute of Hospital Development Strategy, China Hospital Development Institute, Shanghai Jiao Tong University, Shanghai, China
- Hongfei Ye
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Institute of Hospital Development Strategy, China Hospital Development Institute, Shanghai Jiao Tong University, Shanghai, China
- Jinming Guo
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
- Junrui Yang
- Ophthalmology, The 74th Army Group Hospital, Guangzhou, Guangdong, China
- Ping Fei
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Yuanzhi Yuan
- Ophthalmology, Zhongshan Hospital Fudan University, Shanghai, China
- Danqing Huang
- Discipline Inspection & Supervision Office, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Yuqiang Huang
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
- Jie Peng
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Xiaoling Xie
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
- Meng Xie
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Peiquan Zhao
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Li Chen
- Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai, China
- Mingzhi Zhang
- Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong, Shantou, Guangdong, China
32
Ye Z, Zhang B, Zhang K, Méndez MJG, Yan H, Wu T, Qu Y, Jiang Y, Xue P, Qiao Y. An assessment of ChatGPT's responses to frequently asked questions about cervical and breast cancer. BMC Womens Health 2024; 24:482. [PMID: 39223612 PMCID: PMC11367894 DOI: 10.1186/s12905-024-03320-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Accepted: 08/16/2024] [Indexed: 09/04/2024] Open
Abstract
BACKGROUND Cervical cancer (CC) and breast cancer (BC) threaten women's well-being, influenced by health-related stigma and a lack of reliable information, which can cause late diagnosis and early death. ChatGPT is likely to become a key source of health information, although quality concerns could also influence health-seeking behaviours. METHODS This cross-sectional online survey compared ChatGPT's responses to five physicians specializing in mammography and five specializing in gynaecology. Twenty frequently asked questions about CC and BC were asked on 26th and 29th of April, 2023. A panel of seven experts assessed the accuracy, consistency, and relevance of ChatGPT's responses using a 7-point Likert scale. Responses were analyzed for readability, reliability, and efficiency. ChatGPT's responses were synthesized, and findings are presented as a radar chart. RESULTS ChatGPT had an accuracy score of 7.0 (range: 6.6-7.0) for CC and BC questions, surpassing the highest-scoring physicians (P < 0.05). ChatGPT took an average of 13.6 s (range: 7.6-24.0) to answer each of the 20 questions presented. Readability was comparable to that of experts and physicians involved, but ChatGPT generated more extended responses compared to physicians. The consistency of repeated answers was 5.2 (range: 3.4-6.7). With different contexts combined, the overall ChatGPT relevance score was 6.5 (range: 4.8-7.0). Radar plot analysis indicated comparably good accuracy, efficiency, and to a certain extent, relevance. However, there were apparent inconsistencies, and the reliability and readability were considered inadequate. CONCLUSIONS ChatGPT shows promise as an initial source of information for CC and BC. ChatGPT is also highly functional, appears to be superior to physicians, and aligns with expert consensus, although there is room for improvement in readability, reliability, and consistency. Future efforts should focus on developing advanced ChatGPT models explicitly designed to improve medical practice and for those with concerns about symptoms.
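A radar (spider) chart like the one described above can be drawn as in the following sketch; the dimension scores are placeholders on the study's 7-point scale, not the published results.

```python
# Hedged sketch of a radar (spider) chart of evaluation dimensions; the
# scores below are placeholders, not the study's reported values.
import numpy as np
import matplotlib.pyplot as plt

dims   = ["Accuracy", "Consistency", "Relevance", "Readability", "Reliability", "Efficiency"]
scores = [7.0, 5.2, 6.5, 4.5, 4.0, 6.8]

angles = np.linspace(0, 2 * np.pi, len(dims), endpoint=False).tolist()
angles += angles[:1]              # repeat the first angle to close the polygon
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims)
ax.set_ylim(0, 7)
plt.savefig("radar.png", dpi=150)
```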
Affiliation(s)
- Zichen Ye
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Bo Zhang
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Kun Zhang
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- María José González Méndez
- Department of Primary Healthcare and Family Medicine, Faculty of Medicine, Universidad de Chile, Santiago, Chile
- Huijiao Yan
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Tong Wu
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yimin Qu
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yu Jiang
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- School of Health Policy and Management, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Peng Xue
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Youlin Qiao
- School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
33
Matsubara S. ChatGPT use should be prohibited in writing letters. Am J Obstet Gynecol 2024; 231:e110. [PMID: 38710270 DOI: 10.1016/j.ajog.2024.04.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2024] [Accepted: 04/30/2024] [Indexed: 05/08/2024]
Affiliation(s)
- Shigeki Matsubara
- Department of Obstetrics and Gynecology, Jichi Medical University, Tochigi, Japan; Department of Obstetrics and Gynecology, Koga Red Cross Hospital, 1150 Shimoyama, Koga, Ibaraki 306-0014, Japan; Medical Examination Center, Ibaraki Western Medical Center, Chikusei, Japan.
34
Anuk AT, Tanacan A, Kara Ö, Sahin D. Assessing adverse pregnancy outcomes in women with uncontrolled asthma vs. mild asthma: a retrospective comparative analysis. Arch Gynecol Obstet 2024; 310:1433-1440. [PMID: 38276984 DOI: 10.1007/s00404-023-07347-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Accepted: 12/11/2023] [Indexed: 01/27/2024]
Abstract
PURPOSE The aim of this study was to evaluate perinatal outcomes between the uncontrolled asthma group and the mild asthma group and to reveal the relationship between disease severity and adverse maternal-fetal outcomes in this study. METHODS This retrospective cohort study analyzed 180 pregnant women diagnosed with asthma, hospitalized, and delivered at our center between September 1, 2019, and December 1, 2021. We compared two groups: 160 with mild asthma and 20 with uncontrolled asthma. Data encompassed maternal characteristics, obstetrical complications, medication use, emergency department admissions for exacerbations, smoking status, and neonatal outcomes. RESULTS In the uncontrolled asthma group, hospitalization rates, use of inhaled short-acting β-agonist (SABA), and systemic corticosteroids were significantly higher compared to the mild asthma group (p < 0.01). Maternal and fetal complications were more prevalent in the uncontrolled group, including asthma exacerbations (45% vs. 1.2%), anemia (10% vs. 4.4%), prematurity (25% vs. 9.6%), and intrauterine fetal demise (IUFD) (10% vs. 0.6%). Neonatal outcomes in the uncontrolled group showed higher rates of admission to the neonatal intensive care unit (NICU) (50% vs. 25%), respiratory distress syndrome (RDS) (30% vs. 14%), and intraventricular hemorrhage (IVH) (5% vs. 0%) compared to the mild asthma group. CONCLUSION Uncontrolled asthma during pregnancy is associated with higher adverse maternal-fetal and neonatal outcomes compared to mild asthma.
Affiliation(s)
- Ali Taner Anuk
- Division of Perinatology, Department of Obstetrics and Gynecology, Ministry of Health, Ankara City Hospital, Ankara, Türkiye.
- Atakan Tanacan
- Division of Perinatology, Department of Obstetrics and Gynecology, Ministry of Health, Ankara City Hospital, Ankara, Türkiye
- Özgür Kara
- Division of Perinatology, Department of Obstetrics and Gynecology, Ministry of Health, Ankara City Hospital, Ankara, Türkiye
- Dilek Sahin
- Division of Perinatology, Department of Obstetrics and Gynecology, University of Health Sciences, Ministry of Health, Ankara City Hospital, Ankara, Türkiye
35
Grünebaum A, Chervenak FA. The dichotomy between the scientific and artistic aspects of medical writing. Am J Obstet Gynecol 2024; 231:e111. [PMID: 38710266 DOI: 10.1016/j.ajog.2024.04.047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 04/30/2024] [Indexed: 05/08/2024]
Affiliation(s)
- Amos Grünebaum
- Department of Obstetrics and Gynecology, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Lenox Hill Hospital, New York, NY.
- Frank A Chervenak
- Department of Obstetrics and Gynecology, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Lenox Hill Hospital, New York, NY
36
Peled T, Sela HY, Weiss A, Grisaru-Granovsky S, Agrawal S, Rottenstreich M. Evaluating the validity of ChatGPT responses on common obstetric issues: Potential clinical applications and implications. Int J Gynaecol Obstet 2024; 166:1127-1133. [PMID: 38523565 DOI: 10.1002/ijgo.15501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 02/29/2024] [Accepted: 03/10/2024] [Indexed: 03/26/2024]
Abstract
OBJECTIVE To evaluate the quality of ChatGPT responses to common issues in obstetrics and assess its ability to provide reliable responses to pregnant individuals. The study aimed to examine the responses based on expert opinions using predetermined criteria, including "accuracy," "completeness," and "safety." METHODS We curated 15 common and potentially clinically significant questions that pregnant women commonly ask. Two native English-speaking women were asked to reframe the questions in their own words, and we employed the ChatGPT language model to generate responses to the questions. To evaluate the accuracy, completeness, and safety of ChatGPT's generated responses, we developed a questionnaire with a scale of 1 to 5 that obstetrics and gynecology experts from different countries were invited to rate accordingly. The ratings were analyzed to evaluate the average level of agreement and percentage of positive ratings (≥4) for each criterion. RESULTS Of the 42 experts invited, 20 responded to the questionnaire. The combined score for all responses yielded a mean rating of 4, with 75% of responses receiving a positive rating (≥4). When examining specific criteria, the ChatGPT responses were better for the accuracy criterion, with a mean rating of 4.2 and 80% of questions receiving a positive rating. The responses scored lower for the completeness criterion, with a mean rating of 3.8 and 46.7% of questions receiving a positive rating. For safety, the mean rating was 3.9 and 53.3% of questions received a positive rating. No response had an average rating below three. CONCLUSION This study demonstrates promising results regarding the potential use of ChatGPT in providing accurate responses to obstetric clinical questions posed by pregnant women. However, it is crucial to exercise caution when addressing inquiries concerning the safety of the fetus or the mother.
Affiliation(s)
- Tzuria Peled
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Hen Y Sela
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Ari Weiss
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Sorina Grisaru-Granovsky
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Swati Agrawal
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Misgav Rottenstreich
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Department of Nursing, Jerusalem College of Technology, Jerusalem, Israel
37
Hua R, Dong X, Wei Y, Shu Z, Yang P, Hu Y, Zhou S, Sun H, Yan K, Yan X, Chang K, Li X, Bai Y, Zhang R, Wang W, Zhou X. Lingdan: enhancing encoding of traditional Chinese medicine knowledge for clinical reasoning tasks with large language models. J Am Med Inform Assoc 2024; 31:2019-2029. [PMID: 39038795 PMCID: PMC11339528 DOI: 10.1093/jamia/ocae087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Revised: 02/22/2024] [Accepted: 04/06/2024] [Indexed: 07/24/2024] Open
Abstract
OBJECTIVE The recent surge in large language models (LLMs) across various fields has yet to be fully realized in traditional Chinese medicine (TCM). This study aims to bridge this gap by developing a large language model tailored to TCM knowledge, enhancing its performance and accuracy in clinical reasoning tasks such as diagnosis, treatment, and prescription recommendations. MATERIALS AND METHODS This study harnessed a wide array of TCM data resources, including TCM ancient books, textbooks, and clinical data, to create 3 key datasets: the TCM Pre-trained Dataset, the Traditional Chinese Patent Medicine (TCPM) Question Answering Dataset, and the Spleen and Stomach Herbal Prescription Recommendation Dataset. These datasets underpinned the development of the Lingdan Pre-trained LLM and 2 specialized models: the Lingdan-TCPM-Chat Model, which uses a Chain-of-Thought process for symptom analysis and TCPM recommendation, and a Lingdan Prescription Recommendation model (Lingdan-PR) that proposes herbal prescriptions based on electronic medical records. RESULTS The Lingdan-TCPM-Chat and the Lingdan-PR Model, fine-tuned on the Lingdan Pre-trained LLM, demonstrated state-of-the-art performance on the tasks of TCM clinical knowledge answering and herbal prescription recommendation. Notably, Lingdan-PR outperformed all state-of-the-art baseline models, achieving an improvement of 18.39% in the Top@20 F1-score compared with the best baseline. CONCLUSION This study marks a pivotal step in merging advanced LLMs with TCM, showcasing the potential of artificial intelligence to help improve clinical decision-making in medical diagnostics and treatment strategies. The success of the Lingdan Pre-trained LLM and its derivative models, Lingdan-TCPM-Chat and Lingdan-PR, not only revolutionizes TCM practices but also opens new avenues for the application of artificial intelligence in other specialized medical fields. Our project is available at https://github.com/TCMAI-BJTU/LingdanLLM.
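One common way to compute a Top@K F1-score for prescription recommendation is sketched below; the definition (precision and recall over the top-K predicted herbs) and the herb lists are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative Top@K F1-score for herbal prescription recommendation:
# precision and recall over the top-K predicted herbs vs. the ground truth.
def top_k_f1(predicted_ranked, truth, k=20):
    top_k = set(predicted_ranked[:k])
    truth = set(truth)
    hits = len(top_k & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: herb identifiers only, not real model output.
pred  = ["huangqi", "baizhu", "fuling", "gancao", "chenpi", "banxia", "dangshen"]
truth = ["huangqi", "baizhu", "gancao", "shanyao"]
print(f"Top@20 F1 = {top_k_f1(pred, truth, k=20):.3f}")
```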
Affiliation(s)
- Rui Hua
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
| | - Xin Dong
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
| | - Yu Wei
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Zixin Shu
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Pengcheng Yang
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Yunhui Hu
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Shuiping Zhou
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- He Sun
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kaijing Yan
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xijun Yan
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Kai Chang
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
- Xiaodong Li
- Affiliated Hospital of Hubei University of Chinese Medicine, Wuhan 430065, China
- Hubei Academy of Chinese Medicine, Wuhan 430061, China
- Institute of Liver Diseases, Hubei Key Laboratory of Theoretical and Applied Research of Liver and Kidney in Traditional Chinese Medicine, Hubei Provincial Hospital of Traditional Chinese Medicine, Wuhan 430061, China
- Yuning Bai
- Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Runshun Zhang
- Department of Gastroenterology, Guang’anmen Hospital, China Academy of Chinese Medical Sciences, Beijing 100053, China
- Wenjia Wang
- Innovation Center of Digital & Intelligent Chinese Medicine, Tasly Pharmaceutical Group Co., Ltd., Tianjin 300410, China
- Xuezhong Zhou
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China
38
Liu CH, Wang PH. Winners of the 2023 honor awards for excellence at the annual meeting of the Chinese Medical Association-Taipei: Part IV. J Chin Med Assoc 2024; 87:817-818. [PMID: 38965650 DOI: 10.1097/jcma.0000000000001130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 07/06/2024] Open
Affiliation(s)
- Chia-Hao Liu
- Department of Obstetrics and Gynecology, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Institute of Clinical Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Peng-Hui Wang
- Department of Obstetrics and Gynecology, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Institute of Clinical Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Female Cancer Foundation, Taipei, Taiwan, ROC
39
Wang Y, Chen Y, Sheng J. Assessing ChatGPT as a Medical Consultation Assistant for Chronic Hepatitis B: Cross-Language Study of English and Chinese. JMIR Med Inform 2024; 12:e56426. [PMID: 39115930 PMCID: PMC11342014 DOI: 10.2196/56426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 05/24/2024] [Accepted: 07/21/2024] [Indexed: 08/10/2024] Open
Abstract
BACKGROUND Chronic hepatitis B (CHB) imposes substantial economic and social burdens globally. The management of CHB involves intricate monitoring and adherence challenges, particularly in regions like China, where a high prevalence of CHB intersects with health care resource limitations. This study explores the potential of ChatGPT-3.5, an emerging artificial intelligence (AI) assistant, to address these complexities. With notable capabilities in medical education and practice, ChatGPT-3.5's role is examined in managing CHB, particularly in regions with distinct health care landscapes. OBJECTIVE This study aimed to uncover insights into ChatGPT-3.5's potential and limitations in delivering personalized medical consultation assistance for CHB patients across diverse linguistic contexts. METHODS Questions sourced from published guidelines, online CHB communities, and search engines in English and Chinese were refined, translated, and compiled into 96 inquiries. Subsequently, these questions were presented to both ChatGPT-3.5 and ChatGPT-4.0 in independent dialogues. The responses were then evaluated by senior physicians, focusing on informativeness, emotional management, consistency across repeated inquiries, and cautionary statements regarding medical advice. Additionally, a true-or-false questionnaire was employed to further discern the variance in information accuracy for closed questions between ChatGPT-3.5 and ChatGPT-4.0. RESULTS Over half of the responses (228/370, 61.6%) from ChatGPT-3.5 were considered comprehensive. In contrast, ChatGPT-4.0 exhibited a higher percentage at 74.5% (172/222; P<.001). Notably, superior performance was evident in English, particularly in terms of informativeness and consistency across repeated queries. However, deficiencies were identified in emotional management guidance, with only 3.2% (6/186) in ChatGPT-3.5 and 8.1% (15/154) in ChatGPT-4.0 (P=.04). ChatGPT-3.5 included a disclaimer in 10.8% (24/222) of responses, while ChatGPT-4.0 included a disclaimer in 13.1% (29/222) of responses (P=.46). When responding to true-or-false questions, ChatGPT-4.0 achieved an accuracy rate of 93.3% (168/180), significantly surpassing ChatGPT-3.5's accuracy rate of 65.0% (117/180) (P<.001). CONCLUSIONS In this study, ChatGPT demonstrated basic capabilities as a medical consultation assistant for CHB management. The choice of working language for ChatGPT-3.5 was considered a potential factor influencing its performance, particularly in the use of terminology and colloquial language, and this potentially affects its applicability within specific target populations. However, as an updated model, ChatGPT-4.0 exhibits improved information processing capabilities, overcoming the language impact on information accuracy. This suggests that the implications of model advancement on applications need to be considered when selecting large language models as medical consultation assistants. Given that both models performed inadequately in emotional guidance management, this study highlights the importance of providing specific language training and emotional management strategies when deploying ChatGPT for medical purposes. Furthermore, the tendency of these models to use disclaimers in conversations should be further investigated to understand the impact on patients' experiences in practical applications.
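The proportion comparisons above (e.g., comprehensive responses, 228/370 vs 172/222) are the kind of contrast a two-proportion test can check. A minimal sketch, assuming a pooled two-proportion z-test rather than the authors' unstated statistical routine:

```python
# Hedged sketch: pooled two-proportion z-test on the "comprehensive response"
# counts reported in the abstract (ChatGPT-3.5: 228/370, ChatGPT-4.0: 172/222).
from math import sqrt, erf

def two_proportion_z(success1: int, n1: int, success2: int, n2: int) -> tuple[float, float]:
    """Return the z statistic and two-sided p-value of a pooled two-proportion test."""
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_z(228, 370, 172, 222)
print(f"z = {z:.2f}, two-sided p = {p:.5f}")
```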
Affiliation(s)
- Yijie Wang
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Yining Chen
- Department of Urology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Jifang Sheng
- State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
40
Burns C, Bakaj A, Berishaj A, Hristidis V, Deak P, Equils O. Use of Generative AI for Improving Health Literacy in Reproductive Health: Case Study. JMIR Form Res 2024; 8:e59434. [PMID: 38986153 PMCID: PMC11336497 DOI: 10.2196/59434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2024] [Revised: 06/18/2024] [Accepted: 07/10/2024] [Indexed: 07/12/2024] Open
Abstract
BACKGROUND Patients find technology tools to be more approachable for seeking sensitive health-related information, such as reproductive health information. The inventive conversational ability of artificial intelligence (AI) chatbots, such as ChatGPT (OpenAI Inc), offers a potential means for patients to effectively locate answers to their health-related questions digitally. OBJECTIVE A pilot study was conducted to compare the novel ChatGPT with the existing Google Search technology for their ability to offer accurate, effective, and current information regarding the appropriate action after missing a dose of an oral contraceptive pill. METHODS A sequence of 11 questions, mimicking a patient inquiring about the action to take after missing a dose of an oral contraceptive pill, was input into ChatGPT as a cascade, given the conversational ability of ChatGPT. The questions were input into 4 different ChatGPT accounts, with the account holders being of various demographics, to evaluate potential differences and biases in the responses given to different account holders. The leading question, "what should I do if I missed a day of my oral contraception birth control?" alone was then input into Google Search, given its nonconversational nature. The results from the ChatGPT questions and the Google Search results for the leading question were evaluated on their readability, accuracy, and effective delivery of information. RESULTS The ChatGPT results were determined to be at an overall higher-grade reading level, with a longer reading duration, less accurate, less current, and with a less effective delivery of information. In contrast, the resulting Google Search answer box and snippets were at a lower-grade reading level, had a shorter reading duration, were more current, were able to reference the origin of the information (transparent), and provided the information in various formats in addition to text. CONCLUSIONS ChatGPT has room for improvement in accuracy, transparency, recency, and reliability before it can equitably be implemented into health care information delivery and provide the potential benefits it poses. However, AI may be used as a tool for providers to educate their patients in preferred, creative, and efficient ways, such as using AI to generate accessible short educational videos from health care provider-vetted information. Larger studies representing a diverse group of users are needed.
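Reading-grade comparisons like the one described above are usually automated. A minimal sketch, assuming the textstat package and invented reply texts (the abstract does not name the authors' exact tool):

```python
# Hedged sketch: estimate the reading grade level of two hypothetical replies.
# Requires `pip install textstat`; the reply texts below are invented for illustration.
import textstat

chatgpt_style_reply = ("If you miss one active pill, take it as soon as you remember, "
                       "then take the next pill at your usual time, even if that means "
                       "taking two pills in one day.")
search_snippet_style = "Take the missed pill as soon as you remember."

for label, text in [("ChatGPT-style reply", chatgpt_style_reply),
                    ("Search-snippet-style reply", search_snippet_style)]:
    grade = textstat.flesch_kincaid_grade(text)  # approximate US school grade level
    print(f"{label}: Flesch-Kincaid grade {grade:.1f}")
```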
Affiliation(s)
- Christina Burns
- MiOra, Encino, CA, United States
- University of California San Diego, San Diego, CA, United States
- Angela Bakaj
- MiOra, Encino, CA, United States
- Institute for Management & Innovation, University of Toronto, Toronto, ON, Canada
- Amonda Berishaj
- MiOra, Encino, CA, United States
- College of Professional Studies, Northeastern University, Boston, MA, United States
- Vagelis Hristidis
- Computer Science and Engineering, University of California Riverside, Riverside, CA, United States
- Pamela Deak
- Department of Obstetrics, Gynecology and Reproductive Sciences, University of California San Diego, San Diego, CA, United States
41
De Vito A, Geremia N, Marino A, Bavaro DF, Caruana G, Meschiari M, Colpani A, Mazzitelli M, Scaglione V, Venanzi Rullo E, Fiore V, Fois M, Campanella E, Pistarà E, Faltoni M, Nunnari G, Cattelan A, Mussini C, Bartoletti M, Vaira LA, Madeddu G. Assessing ChatGPT's theoretical knowledge and prescriptive accuracy in bacterial infections: a comparative study with infectious diseases residents and specialists. Infection 2024:10.1007/s15010-024-02350-6. [PMID: 38995551 DOI: 10.1007/s15010-024-02350-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2024] [Accepted: 07/06/2024] [Indexed: 07/13/2024]
Abstract
OBJECTIVES Advancements in Artificial Intelligence (AI) have made platforms like ChatGPT increasingly relevant in medicine. This study assesses ChatGPT's utility in addressing bacterial infection-related questions and antibiogram-based clinical cases. METHODS This study was a collaborative effort involving infectious disease (ID) specialists and residents. A group of experts formulated six true/false questions, six open-ended questions, and six clinical cases with antibiograms for four types of infections (endocarditis, pneumonia, intra-abdominal infections, and bloodstream infection), for a total of 96 questions. The questions were submitted to four senior residents and four specialists in ID and inputted into ChatGPT-4 and a trained version of ChatGPT-4. A total of 720 responses were obtained and reviewed by a blinded panel of experts in antibiotic treatments. They evaluated the responses for accuracy and completeness, the ability to identify correct resistance mechanisms from antibiograms, and the appropriateness of antibiotic prescriptions. RESULTS No significant difference was noted among the four groups for true/false questions, with approximately 70% correct answers. The trained ChatGPT-4 and ChatGPT-4 offered more accurate and complete answers to the open-ended questions than both the residents and specialists. Regarding the clinical cases, ChatGPT-4 showed lower accuracy in recognizing the correct resistance mechanism. ChatGPT-4 tended not to prescribe newer antibiotics like cefiderocol or imipenem/cilastatin/relebactam, favoring less recommended options like colistin. Both the trained ChatGPT-4 and ChatGPT-4 recommended longer-than-necessary treatment periods (p-value = 0.022). CONCLUSIONS This study highlights ChatGPT's capabilities and limitations in medical decision-making, specifically regarding bacterial infections and antibiogram analysis. While ChatGPT demonstrated proficiency in answering theoretical questions, it did not consistently align with expert decisions in clinical case management. Despite these limitations, the potential of ChatGPT as a supportive tool in ID education and preliminary analysis is evident. However, it should not replace expert consultation, especially in complex clinical decision-making.
Affiliation(s)
- Andrea De Vito
- Unit of Infectious Diseases, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy.
- PhD School in Biomedical Science, Biomedical Science Department, University of Sassari, Sassari, Italy.
- Nicholas Geremia
- Unit of Infectious Diseases, Department of Clinical Medicine, Ospedale dell'Angelo, Venice, Italy
- Unit of Infectious Diseases, Department of Clinical Medicine, Ospedale Civile S.S. Giovanni e Paolo, Venice, Italy
- Andrea Marino
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, ARNAS Garibaldi Hospital, University of Catania, Catania, Italy
- Davide Fiore Bavaro
- Infectious Diseases Unit - IRCCS Humanitas Research Hospital, Via Manzoni 56, Rozzano, Milan, 20089, Italy
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, Milan, 20090, Italy
- Giorgia Caruana
- Infectious Diseases Service, Cantonal Hospital of Sion and Institut Central des Hôpitaux (ICH), Sion, Switzerland
- Institute of Microbiology, Department of Laboratory Medicine and Pathology, Lausanne University Hospital, Lausanne, Switzerland
- Agnese Colpani
- Unit of Infectious Diseases, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy
- Maria Mazzitelli
- Infectious and Tropical Diseases Unit, Padua University Hospital, Padua, Italy
- Vincenzo Scaglione
- Infectious and Tropical Diseases Unit, Padua University Hospital, Padua, Italy
- Emmanuele Venanzi Rullo
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, University of Messina, Messina, Italy
- Vito Fiore
- Unit of Infectious Diseases, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy
- Marco Fois
- Unit of Infectious Diseases, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy
- Edoardo Campanella
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, ARNAS Garibaldi Hospital, University of Catania, Catania, Italy
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, University of Messina, Messina, Italy
- Eugenia Pistarà
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, ARNAS Garibaldi Hospital, University of Catania, Catania, Italy
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, University of Messina, Messina, Italy
- Giuseppe Nunnari
- Unit of Infectious Diseases, Department of Clinical and Experimental Medicine, ARNAS Garibaldi Hospital, University of Catania, Catania, Italy
- Annamaria Cattelan
- Infectious and Tropical Diseases Unit, Padua University Hospital, Padua, Italy
- Michele Bartoletti
- Infectious Diseases Unit - IRCCS Humanitas Research Hospital, Via Manzoni 56, Rozzano, Milan, 20089, Italy
- Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, Milan, 20090, Italy
- Luigi Angelo Vaira
- Maxillofacial Surgery Unit, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy
- Giordano Madeddu
- Unit of Infectious Diseases, Department of Medicine, Surgery, and Pharmacy, University of Sassari, Sassari, Italy
42
Yilmaz Muluk S, Olcucu N. Comparative Analysis of Artificial Intelligence Platforms: ChatGPT-3.5 and GoogleBard in Identifying Red Flags of Low Back Pain. Cureus 2024; 16:e63580. [PMID: 39087174 PMCID: PMC11290316 DOI: 10.7759/cureus.63580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/25/2024] [Indexed: 08/02/2024] Open
Abstract
BACKGROUND Low back pain (LBP) is a prevalent healthcare concern that is frequently responsive to conservative treatment. However, it can also stem from severe conditions, marked by 'red flags' (RF) such as malignancy, cauda equina syndrome, fractures, infections, spondyloarthropathies, and aneurysm rupture, which physicians should be vigilant about. Given the increasing reliance on online health information, this study assessed ChatGPT-3.5's (OpenAI, San Francisco, CA, USA) and GoogleBard's (Google, Mountain View, CA, USA) accuracy in responding to RF-related LBP questions and their capacity to discriminate the severity of the condition. METHODS We created 70 questions on RF-related symptoms and diseases following the LBP guidelines. Among them, 58 had a single symptom (SS), and 12 had multiple symptoms (MS) of LBP. Questions were posed to ChatGPT and GoogleBard, and responses were assessed by two authors for accuracy, completeness, and relevance (ACR) using a 5-point rubric. RESULTS Cohen's kappa values (0.60-0.81) indicated significant agreement among the authors. The average scores for responses ranged from 3.47 to 3.85 for ChatGPT-3.5 and from 3.36 to 3.76 for GoogleBard for the 58 SS questions, and from 4.04 to 4.29 for ChatGPT-3.5 and from 3.50 to 3.71 for GoogleBard for the 12 MS questions. The ratings for these responses ranged from 'good' to 'excellent'. Most SS responses effectively conveyed the severity of the situation (93.1% for ChatGPT-3.5, 94.8% for GoogleBard), and all MS responses did so. No statistically significant differences were found between ChatGPT-3.5 and GoogleBard scores (p>0.05). CONCLUSIONS In an era characterized by widespread online health information seeking, artificial intelligence (AI) systems play a vital role in delivering precise medical information. These technologies may hold promise in the field of health information if they continue to improve.
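Inter-rater agreement of the kind reported above is commonly quantified with Cohen's kappa. A minimal sketch, assuming scikit-learn and hypothetical rubric scores rather than the study's data:

```python
# Hedged sketch: Cohen's kappa between two raters' 5-point rubric scores.
# The score vectors below are hypothetical illustrations, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

kappa = cohen_kappa_score(rater_a, rater_b)  # 1.0 would be perfect agreement
print(f"Cohen's kappa = {kappa:.2f}")
```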
Affiliation(s)
- Nazli Olcucu
- Physical Medicine and Rehabilitation, Antalya Ataturk State Hospital, Antalya, TUR
43
Safrai M, Orwig KE. Utilizing artificial intelligence in academic writing: an in-depth evaluation of a scientific review on fertility preservation written by ChatGPT-4. J Assist Reprod Genet 2024; 41:1871-1880. [PMID: 38619763 PMCID: PMC11263262 DOI: 10.1007/s10815-024-03089-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2024] [Accepted: 03/07/2024] [Indexed: 04/16/2024] Open
Abstract
PURPOSE To evaluate the ability of ChatGPT-4 to generate a biomedical review article on fertility preservation. METHODS ChatGPT-4 was prompted to create an outline for a review on fertility preservation in men and prepubertal boys. The outline provided by ChatGPT-4 was subsequently used to prompt ChatGPT-4 to write the different parts of the review and provide five references for each section. The different parts of the article and the references provided were combined to create a single scientific review that was evaluated by the authors, who are experts in fertility preservation. The experts assessed the article and the references for accuracy and checked for plagiarism using online tools. In addition, both experts independently scored the relevance, depth, and currentness of the ChatGPT-4's article using a scoring matrix ranging from 0 to 5 where higher scores indicate higher quality. RESULTS ChatGPT-4 successfully generated a relevant scientific article with references. Among 27 statements needing citations, four were inaccurate. Of 25 references, 36% were accurate, 48% had correct titles but other errors, and 16% were completely fabricated. Plagiarism was minimal (mean = 3%). Experts rated the article's relevance highly (5/5) but gave lower scores for depth (2-3/5) and currentness (3/5). CONCLUSION ChatGPT-4 can produce a scientific review on fertility preservation with minimal plagiarism. While precise in content, it showed factual and contextual inaccuracies and inconsistent reference reliability. These issues limit ChatGPT-4 as a sole tool for scientific writing but suggest its potential as an aid in the writing process.
Affiliation(s)
- Myriam Safrai
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA.
- Department of Obstetrics and Gynecology, Chaim Sheba Medical Center (Tel Hashomer), Sackler Faculty of Medicine, Tel Aviv University, 52621, Tel Aviv, Israel.
- Kyle E Orwig
- Department of Obstetrics, Gynecology and Reproductive Sciences, Magee-Womens Research Institute, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
44
Aghamaliyev U, Karimbayli J, Giessen-Jung C, Matthias I, Unger K, Andrade D, Hofmann FO, Weniger M, Angele MK, Benedikt Westphalen C, Werner J, Renz BW. ChatGPT's Gastrointestinal Tumor Board Tango: A limping dance partner? Eur J Cancer 2024; 205:114100. [PMID: 38729055 DOI: 10.1016/j.ejca.2024.114100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Accepted: 04/23/2024] [Indexed: 05/12/2024]
Abstract
OBJECTIVES This study aimed to assess the consistency and replicability of treatment recommendations provided by ChatGPT 3.5 for gastrointestinal tumor cases presented at multidisciplinary tumor boards (MTBs), compared with the boards' decisions. It also aimed to distinguish between general and case-specific responses and investigated the precision of ChatGPT's recommendations in replicating exact treatment plans, particularly regarding chemotherapy regimens and follow-up protocols. MATERIAL AND METHODS A retrospective study was carried out on 115 cases of gastrointestinal malignancies, selected from 448 patients reviewed in MTB meetings. A senior resident fed patient data into ChatGPT 3.5 to produce treatment recommendations, which were then evaluated against the tumor board's decisions by senior oncology fellows. RESULTS In 19% of the examined cases, ChatGPT 3.5 provided only general information about the malignancy without considering individual patient characteristics; in the remaining 81% of cases, it generated responses specific to the individual clinical scenario. In the subset of case-specific responses, 83% of recommendations exhibited overall treatment strategy concordance between ChatGPT and the MTB. However, the exact treatment concordance dropped to 65%, and was notably lower for specific chemotherapy regimens. Cases recommended for surgery showed the highest concordance rates, while those involving chemotherapy recommendations faced challenges in precision. CONCLUSIONS ChatGPT 3.5 demonstrates potential in aligning conceptual approaches to treatment strategies with MTB guidelines. However, it falls short in accurately duplicating specific treatment plans, especially concerning chemotherapy regimens and follow-up procedures. Ethical concerns and challenges in achieving exact replication necessitate prudence when considering ChatGPT 3.5 for direct clinical decision-making in MTBs.
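The distinction drawn above between overall strategy concordance and exact treatment concordance can be made concrete with a small tally. A minimal sketch with invented case labels, not the study's data:

```python
# Hedged sketch: strategy-level vs exact concordance between model recommendations
# and tumor board decisions, using invented case labels for illustration.
cases = [
    {"strategy_match": True,  "exact_match": True},   # same strategy, same regimen
    {"strategy_match": True,  "exact_match": False},  # same strategy, different chemotherapy regimen
    {"strategy_match": True,  "exact_match": True},
    {"strategy_match": False, "exact_match": False},  # different overall strategy
]

n = len(cases)
strategy_rate = sum(c["strategy_match"] for c in cases) / n
exact_rate = sum(c["exact_match"] for c in cases) / n
print(f"strategy concordance: {strategy_rate:.0%}, exact concordance: {exact_rate:.0%}")
```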
Affiliation(s)
- Ughur Aghamaliyev
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany
- Javad Karimbayli
- Division of Molecular Oncology, Centro di Riferimento Oncologico di Aviano (CRO), IRCCS, National Cancer Institute, Aviano, Italy
- Clemens Giessen-Jung
- Comprehensive Cancer Center Munich & Department of Medicine III, LMU University Hospital, LMU Munich, Germany
- Ilmer Matthias
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany; German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany
- Kristian Unger
- German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany; Department of Radiation Oncology, University Hospital, LMU Munich, 81377; Bavarian Cancer Research Center (BZKF), Munich, Germany
- Dorian Andrade
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany
- Felix O Hofmann
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany; German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany
- Maximilian Weniger
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany
- Martin K Angele
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany
- C Benedikt Westphalen
- Comprehensive Cancer Center Munich & Department of Medicine III, LMU University Hospital, LMU Munich, Germany; German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany
- Jens Werner
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany
- Bernhard W Renz
- Department of General, Visceral and Transplantation Surgery, LMU University Hospital, LMU Munich, Germany; German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany.
45
Khromchenko K, Shaikh S, Singh M, Vurture G, Rana RA, Baum JD. ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions? Cureus 2024; 16:e65543. [PMID: 39188430 PMCID: PMC11346960 DOI: 10.7759/cureus.65543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/27/2024] [Indexed: 08/28/2024] Open
Abstract
Large language models (LLM) have been widely used to provide information in many fields, including obstetrics and gynecology. Which model performs best in providing answers to commonly asked pregnancy questions is unknown. A qualitative analysis of Chat Generative Pre-Training Transformer Version 3.5 (ChatGPT-3.5) (OpenAI, Inc., San Francisco, California, United States) and Bard, recently renamed Google Gemini (Google LLC, Mountain View, California, United States), was performed in August of 2023. Each LLM was queried on 12 commonly asked pregnancy questions and asked for their references. Review and grading of the responses and references for both LLMs were performed by the co-authors individually and then as a group to formulate a consensus. Query responses were graded as "acceptable" or "not acceptable" based on correctness and completeness in comparison to American College of Obstetricians and Gynecologists (ACOG) publications, PubMed-indexed evidence, and clinical experience. References were classified as "verified," "broken," "irrelevant," "non-existent," and "no references." Grades of "acceptable" were given to 58% of ChatGPT-3.5 responses (seven out of 12) and 83% of Bard responses (10 out of 12). In regard to references, ChatGPT-3.5 had reference issues in 100% of its references, and Bard had discrepancies in 8% of its references (one out of 12). When comparing ChatGPT-3.5 responses between May 2023 and August 2023, a change in "acceptable" responses was noted: 50% versus 58%, respectively. Bard answered more questions correctly than ChatGPT-3.5 when queried on a small sample of commonly asked pregnancy questions. ChatGPT-3.5 performed poorly in terms of reference verification. The overall performance of ChatGPT-3.5 remained stable over time, with approximately one-half of responses being "acceptable" in both May and August of 2023. Both LLMs need further evaluation and vetting before being accepted as accurate and reliable sources of information for pregnant women.
Affiliation(s)
- Keren Khromchenko
- Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA
- Sameeha Shaikh
- Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA
- Meghana Singh
- Obstetrics and Gynecology, Hackensack Meridian School of Medicine, Nutley, USA
- Gregory Vurture
- Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA
- Rima A Rana
- Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA
- Jonathan D Baum
- Obstetrics and Gynecology, Hackensack Meridian Jersey Shore University Medical Center, Neptune, USA
46
Rodrigues Alessi M, Gomes HA, Lopes de Castro M, Terumy Okamoto C. Performance of ChatGPT in Solving Questions From the Progress Test (Brazilian National Medical Exam): A Potential Artificial Intelligence Tool in Medical Practice. Cureus 2024; 16:e64924. [PMID: 39156244 PMCID: PMC11330648 DOI: 10.7759/cureus.64924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/19/2024] [Indexed: 08/20/2024] Open
Abstract
Background The use of artificial intelligence (AI) is not a recent phenomenon, but the latest advancements in this technology are making a significant impact across various fields of human knowledge. In medicine, this trend is no different, although it has developed at a slower pace. ChatGPT is an example of an AI-based algorithm capable of answering questions, interpreting phrases, and synthesizing complex information, potentially aiding and even replacing humans in various areas of social interest. Some studies have compared its performance in solving medical knowledge exams with medical students and professionals to verify AI accuracy. This study aimed to measure the performance of ChatGPT in answering questions from the Progress Test from 2021 to 2023. Methodology An observational study was conducted in which questions from the 2021 Progress Test and the regional tests (Southern Institutional Pedagogical Support Center II) of 2022 and 2023 were presented to ChatGPT 3.5. The results obtained were compared with the scores of first- to sixth-year medical students from over 120 Brazilian universities. All questions were presented sequentially, without any modification to their structure. After each question was presented, the platform's history was cleared, and the site was restarted. Results The platform achieved an average accuracy rate in 2021, 2022, and 2023 of 69.7%, 68.3%, and 67.2%, respectively, surpassing students from all medical years in the three tests evaluated, reinforcing findings in the current literature. The subject with the best score for the AI was Public Health, with a mean grade of 77.8%. Conclusions ChatGPT demonstrated the ability to answer medical questions with higher accuracy than humans, including students from the last year of medical school.
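Accuracy figures like those above come down to scoring model answers against the official key, optionally grouped by subject. A minimal sketch with invented questions, subjects, and answers (not the Progress Test data):

```python
# Hedged sketch: per-subject accuracy of model answers against an answer key.
# Questions, subjects, and answers below are invented for illustration.
from collections import defaultdict

answer_key    = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
subjects      = {"q1": "Public Health", "q2": "Public Health", "q3": "Surgery", "q4": "Surgery"}
model_answers = {"q1": "A", "q2": "C", "q3": "E", "q4": "D"}

correct = defaultdict(int)
total = defaultdict(int)
for question, key in answer_key.items():
    total[subjects[question]] += 1
    correct[subjects[question]] += int(model_answers.get(question) == key)

for subject in total:
    print(f"{subject}: {correct[subject] / total[subject]:.0%} correct")
```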
Affiliation(s)
- Heitor A Gomes
- School of Medicine, Universidade Positivo, Curitiba, BRA
47
Meyer R, Hamilton KM, Truong MD, Wright KN, Siedhoff MT, Brezinov Y, Levin G. ChatGPT compared with Google Search and healthcare institution as sources of postoperative patient instructions after gynecological surgery. BJOG 2024; 131:1154-1156. [PMID: 38177090 DOI: 10.1111/1471-0528.17746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/13/2023] [Indexed: 01/06/2024]
Affiliation(s)
- Raanan Meyer
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
- The Dr. Pinchas Bornstein Talpiot Medical Leadership Programme, Sheba Medical Center, Ramat-Gan, Israel
- Kacey M Hamilton
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
- Mireille D Truong
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
- Kelly N Wright
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
- Matthew T Siedhoff
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
- Yoav Brezinov
- Lady Davis Institute for Cancer Research, Jewish General Hospital, McGill University, Montreal, Quebec, Canada
- Gabriel Levin
- Lady Davis Institute for Cancer Research, Jewish General Hospital, McGill University, Montreal, Quebec, Canada
48
Kotzur T, Singh A, Parker J, Peterson B, Sager B, Rose R, Corley F, Brady C. Evaluation of a Large Language Model's Ability to Assist in an Orthopedic Hand Clinic. Hand (N Y) 2024:15589447241257643. [PMID: 38907651 PMCID: PMC11571334 DOI: 10.1177/15589447241257643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 06/24/2024]
Abstract
BACKGROUND Advancements in artificial intelligence technology, such as OpenAI's large language model, ChatGPT, could transform medicine through applications in a clinical setting. This study aimed to assess the utility of ChatGPT as a clinical assistant in an orthopedic hand clinic. METHODS Nine clinical vignettes, describing various common and uncommon hand pathologies, were constructed and reviewed by 4 fellowship-trained orthopedic hand surgeons and an orthopedic resident. ChatGPT was given these vignettes and asked to generate a differential diagnosis and a potential workup plan, and to provide treatment options for its top differential. Responses were graded for accuracy, and their overall utility was scored on a 5-point Likert scale. RESULTS ChatGPT made the correct diagnosis in 7 of 9 cases, an overall accuracy rate of 78%. ChatGPT was less reliable with more complex pathologies and failed to identify an intentionally incorrect presentation. ChatGPT received a score of 3.8 ± 1.4 for correct diagnosis, 3.4 ± 1.4 for helpfulness in guiding patient management, 4.1 ± 1.0 for appropriate workup for the actual diagnosis, 4.3 ± 0.8 for an appropriate recommended treatment plan for the diagnosis, and 4.4 ± 0.8 for the helpfulness of treatment options in managing patients. CONCLUSION ChatGPT was successful in diagnosing most of the conditions; however, the overall utility of its advice was variable. While it performed well in recommending treatments, it faced difficulties in providing appropriate diagnoses for uncommon pathologies. In addition, it failed to identify an obvious error in the presenting pathology.
49
Moll M, Heilemann G, Georg D, Kauer-Dorner D, Kuess P. The role of artificial intelligence in informed patient consent for radiotherapy treatments-a case report. Strahlenther Onkol 2024; 200:544-548. [PMID: 38180493 DOI: 10.1007/s00066-023-02190-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Accepted: 12/03/2023] [Indexed: 01/06/2024]
Abstract
Large language models (LLMs; e.g., ChatGPT (OpenAI, San Francisco, California, USA)) have recently seen widespread use in various fields, including healthcare. This case study reports on the first use of an LLM in a pretreatment discussion and in obtaining informed consent for a radiation oncology treatment. Further, the reproducibility of the replies by ChatGPT 3.5 was analyzed. A breast cancer patient, following legal consultation, engaged in a conversation with ChatGPT 3.5 regarding her radiotherapy treatment. The patient posed questions about side effects, prevention, activities, medications, and late effects. While some answers contained inaccuracies, the responses closely resembled doctors' replies. In a final evaluation discussion, however, the patient stated that she preferred the presence of a physician and expressed concerns about the source of the provided information. The reproducibility was tested in ten iterations. Future guidelines for using such models in radiation oncology should be driven by medical professionals. While artificial intelligence (AI) supports essential tasks, human interaction remains crucial.
Affiliation(s)
- M Moll
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria.
- G Heilemann
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- Dietmar Georg
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- D Kauer-Dorner
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
- P Kuess
- Department of Radiation Oncology, Comprehensive Cancer Center Vienna, Medical University Vienna, Vienna, Austria
50
Padmanabhan P, Dasarathan T, Surapaneni KM. Exploring the Potential of ChatGPT in Obstetrics and Gynecology of Undergraduate Medical Curriculum. J Obstet Gynaecol India 2024; 74:281-283. [PMID: 38974749 PMCID: PMC11224185 DOI: 10.1007/s13224-023-01909-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Accepted: 11/06/2023] [Indexed: 07/09/2024] Open
Abstract
ChatGPT, the new buzz in the field of technology, is attracting millions of users worldwide with its impressive ability to perform multiple tasks in a way that mimics human conversation. We conducted this study at two levels, with direct and case-based questions from obstetrics and gynecology, to assess the performance of ChatGPT in the medical field. Our results suggest that ChatGPT has a good comprehension of the subject. However, ChatGPT should be trained on recent updates and further improved so that it generates error-free, up-to-date responses.
Affiliation(s)
- Padmavathy Padmanabhan
- Department of Obstetrics & Gynaecology, Panimalar Medical College Hospital and Research Institute, Varadharajapuram, Poonamallee, Chennai, 600123 India
- Tamilselvi Dasarathan
- Department of Obstetrics & Gynaecology, Panimalar Medical College Hospital and Research Institute, Varadharajapuram, Poonamallee, Chennai, 600123 India
- Krishna Mohan Surapaneni
- Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, Tamil Nadu 600 123 India
- Departments of Medical Education, Clinical Skills & Simulation, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, Tamil Nadu 600 123 India