1
Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and perplexity responses. Medicine (Baltimore) 2024; 103:e38009. [PMID: 38701313 PMCID: PMC11062651 DOI: 10.1097/md.0000000000038009]
Abstract
Subdural hematoma is defined as a collection of blood in the subdural space between the dura mater and the arachnoid. It is a condition that neurosurgeons frequently encounter and has acute, subacute, and chronic forms. The annual incidence in adults is reported to be 1.72-20.60 per 100,000 people. Our study aimed to evaluate the quality, reliability, and readability of the answers given by ChatGPT, Bard, and Perplexity to questions about "subdural hematoma." In this observational and cross-sectional study, we asked ChatGPT, Bard, and Perplexity separately to provide the 100 most frequently asked questions about "subdural hematoma." Responses from all three chatbots were analyzed for readability, quality, reliability, and adequacy. When the median readability scores of the ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed in all formulas (P < .001). All three chatbots' responses were found to be difficult to read. Bard's responses were more readable than ChatGPT's (P < .001) and Perplexity's (P < .001) for all scores evaluated. Although the evaluated formulas differed in their results, Perplexity's answers were more readable than ChatGPT's (P < .05). Bard's answers had the best Global Quality Scale (GQS) scores (P < .001). Perplexity's responses had the best Journal of the American Medical Association and modified DISCERN scores (P < .001). The current capabilities of ChatGPT, Bard, and Perplexity are inadequate in terms of the quality and readability of "subdural hematoma"-related text content. The readability standard for patient education materials, as determined by the American Medical Association, the National Institutes of Health, and the United States Department of Health and Human Services, is at or below grade 6. The readability levels of the responses of artificial intelligence applications such as ChatGPT, Bard, and Perplexity are significantly higher than the recommended sixth-grade level.
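A minimal sketch (not the study's actual pipeline) of the kind of readability screening described here, using the open-source textstat package; the response strings are illustrative placeholders:

```python
# Screen chatbot responses against the grade-6 readability target for
# patient education materials; `textstat` is an assumed tool choice.
import textstat

responses = {
    "ChatGPT": "A subdural hematoma is an accumulation of blood ...",
    "Bard": "Blood can collect between the coverings of the brain ...",
}

TARGET_GRADE = 6  # AMA/NIH-recommended reading level

for model, text in responses.items():
    fkgl = textstat.flesch_kincaid_grade(text)  # US school grade level
    fre = textstat.flesch_reading_ease(text)    # higher = easier (0-100)
    verdict = "OK" if fkgl <= TARGET_GRADE else "too difficult"
    print(f"{model}: FKGL={fkgl:.1f}, FRE={fre:.1f} -> {verdict}")
```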
Affiliation(s)
- Şanser Gül
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
- İsmail Erdemir
- Department of Anesthesiology and Critical Care, Faculty of Medicine, Dokuz Eylül University, Izmir, Turkey
- Volkan Hanci
- Department of Anesthesiology and Reanimation, Ankara Sincan Education and Research Hospital, Ankara, Turkey
- Evren Aydoğmuş
- Department of Neurosurgery, Istanbul Kartal Dr Lütfi Kırdar City Hospital, Istanbul, Turkey
- Yavuz Selim Erkoç
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
2
Deng L, Wang T, Yangzhang, Zhai Z, Tao W, Li J, Zhao Y, Luo S, Xu J. Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2. Int J Surg 2024; 110:1941-1950. [PMID: 38668655 PMCID: PMC11019981 DOI: 10.1097/js9.0000000000001066]
Abstract
BACKGROUND Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer. METHODS Clinical scenarios designed specifically for breast cancer were segmented into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback for various queries related to these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the feedback from the LLMs in terms of quality, relevance, and applicability. RESULTS There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). Regarding response length, GPT-4.0 and GPT-3.5 provided relatively longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. GPT-4.0 markedly surpassed GPT-3.5 in quality in four of the five clinical areas and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making. CONCLUSION This study revealed that in the realm of clinical applications for breast cancer, GPT-4.0 demonstrates superiority in quality and relevance as well as exceptional applicability, especially when compared to GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. With the expanding use of LLMs in the clinical field, ongoing optimization and rigorous accuracy assessments are paramount.
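A minimal sketch of how inter-rater agreement of the kind reported here (Fleiss' kappa for five raters) can be computed with statsmodels; the ratings below are illustrative placeholders, not the study's data:

```python
# Fleiss' kappa for multiple raters scoring each LLM response (1-5 scale).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = responses (subjects), columns = the five specialist raters
ratings = np.array([
    [4, 4, 5, 4, 3],
    [2, 3, 3, 2, 3],
    [5, 5, 4, 5, 5],
    [3, 3, 2, 3, 4],
])

table, _ = aggregate_raters(ratings)  # per-subject counts of each category
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")  # the study reported 0.345
```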
Affiliation(s)
- Linfang Deng
- Department of Nursing, Jinzhou Medical University, Jinzhou
- Yangzhang
- Department of Breast Surgery, Xingtai People’s Hospital of Hebei Medical University, Xingtai, Hebei, People’s Republic of China
- Zhenhua Zhai
- Department of General Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou
- Wei Tao
- Department of Breast Surgery
- Yi Zhao
- Department of Breast Surgery
- Shaoting Luo
- Department of Pediatric Orthopedics, Shengjing Hospital of China Medical University, Shenyang
- Jinjiang Xu
- Department of Health Management Center, The First Hospital of Jinzhou Medical University, Jinzhou, Liaoning
3
Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments. Surgery 2024; 175:936-942. [PMID: 38246839 PMCID: PMC10947829 DOI: 10.1016/j.surg.2023.12.014]
Abstract
BACKGROUND Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries. METHODS We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries. RESULTS A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions. CONCLUSION Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. https://twitter.com/bratogram
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA.
4
Meng J, Zhang Z, Tang H, Xiao Y, Liu P, Gao S, He M. Evaluation of ChatGPT in providing appropriate fracture prevention recommendations and medical science question responses: A quantitative research. Medicine (Baltimore) 2024; 103:e37458. [PMID: 38489735 PMCID: PMC10939678 DOI: 10.1097/md.0000000000037458]
Abstract
Currently, there are limited studies assessing ChatGPT's ability to provide appropriate responses to medical questions. Our study aims to evaluate the adequacy of ChatGPT's responses to questions regarding osteoporotic fracture prevention and medical science. We created a list of 25 questions based on the guidelines and our clinical experience. Additionally, we included 11 medical science questions from the journal Science. Three patients, three non-medical professionals, three specialist doctors, and three scientists evaluated the accuracy and appropriateness of responses given by ChatGPT-3.5 on October 2, 2023. To simulate a consultation, an inquirer (either a patient or non-medical professional) would send their questions to a consultant (specialist doctor or scientist) via a website. The consultant would forward the questions to ChatGPT for answers, which would then be evaluated for accuracy and appropriateness by the consultant before being sent back to the inquirer via the website for further review. The primary outcome is the appropriate, inappropriate, and unreliable rate of ChatGPT responses as evaluated separately by the inquirer and consultant groups. Compared to orthopedic clinicians, the patients rated the appropriateness of ChatGPT's responses to the questions about osteoporotic fracture prevention slightly higher, although the difference was not statistically significant (88% vs 80%, P = .70). For medical science questions, non-medical professionals and medical scientists gave similar ratings. In addition, the experts' ratings of the appropriateness of ChatGPT's responses to osteoporotic fracture prevention and to medical science questions were comparable. On the other hand, the patients perceived the appropriateness of ChatGPT's responses to osteoporotic fracture prevention questions as slightly higher than that to medical science questions (88% vs 72.7%, P = .34). ChatGPT is capable of providing comparable and appropriate responses to medical science questions, as well as to fracture prevention-related issues. Both the inquirers seeking advice and the consultants providing advice recognize ChatGPT's expertise in these areas.
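The abstract does not state which test produced its P values; as a hedged illustration, Fisher's exact test is one standard way to compare two small-sample proportions such as 88% vs 80% (22/25 vs 20/25 questions rated appropriate):

```python
# Compare two appropriateness rates with Fisher's exact test (an assumed
# test choice, not confirmed by the paper).
from scipy import stats

#             appropriate  not appropriate
patients   = [22, 3]   # 88% of 25 questions
clinicians = [20, 5]   # 80% of 25 questions

odds_ratio, p_value = stats.fisher_exact([patients, clinicians])
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.2f}")  # non-significant
```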
Affiliation(s)
- Jiahao Meng
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Ziyi Zhang
- Department of Neurology, The Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- Hang Tang
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Yifan Xiao
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Pan Liu
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Shuguang Gao
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- National Clinical Research Center of Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Miao He
- Department of Neurology, The Second Xiangya Hospital, Central South University, Changsha, Hunan, China
5
Abbas A, Rehman MS, Rehman SS. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus 2024; 16:e55991. [PMID: 38606229 PMCID: PMC11007479 DOI: 10.7759/cureus.55991]
Abstract
INTRODUCTION Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions. METHODS The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA). RESULTS A total of 163 questions were posed to each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5). CONCLUSION GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.
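A hedged sketch of the one-way ANOVA the abstract reports, assuming per-question correctness is coded 1 (correct) / 0 (incorrect); the arrays reproduce the reported totals but are otherwise illustrative:

```python
# One-way ANOVA across the four models' per-question correctness.
import numpy as np
from scipy import stats

gpt4   = np.ones(163)                      # 163/163 correct
gpt35  = np.array([1] * 134 + [0] * 29)    # 134/163
claude = np.array([1] * 138 + [0] * 25)    # 138/163
bard   = np.array([1] * 123 + [0] * 40)    # 123/163

f_stat, p_value = stats.f_oneway(gpt4, gpt35, claude, bard)
print(f"F = {f_stat:.2f}, P = {p_value:.4f}")
```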
Affiliation(s)
- Ali Abbas
- Medical School, University of Texas Southwestern Medical School, Dallas, USA
- Mahad S Rehman
- Medical School, University of Texas Southwestern Medical School, Dallas, USA
- Syed S Rehman
- Nephrology, Baptist Hospitals of Southeast Texas, Beaumont, USA
6
Lum ZC, Collins DP, Dennison S, Guntupalli L, Choudhary S, Saiz AM, Randall RL. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level. Cureus 2024; 16:e56104. [PMID: 38618358 PMCID: PMC11014641 DOI: 10.7759/cureus.56104]
Abstract
Introduction Artificial intelligence (AI) models built on large language models (LLMs) and trained on non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance based on question types, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents. Methods We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis was performed on the number of questions answered correctly by each AI model, each model's performance within each sub-specialty, and each model's performance in comparison to the results of orthopedic residents classified by post-graduate year (PGY) level. Results BARD answered more questions correctly overall (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). The AIs performed better on recall questions than on application-of-knowledge questions (p<0.05). Based on previous data, AI performance ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s. Discussion ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in basic science but poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although the AIs reached the level of a second-year (PGY2) orthopedic resident, they fell short of passing the American Board of Orthopedic Surgery (ABOS) examination. Their strength on recall-based inquiries highlights their potential as orthopedic learning and educational tools.
Affiliation(s)
- Zachary C Lum
- Orthopedic Surgery, University of California (UC) Davis School of Medicine, Sacramento, USA
- Orthopedic Surgery, Nova Southeastern University, Pembroke Pines, USA
- Dylon P Collins
- College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, USA
- Stanley Dennison
- College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, USA
- Lohitha Guntupalli
- Osteopathic Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Clearwater, USA
- Soham Choudhary
- Orthopedic Surgery, University of California, Davis, Davis, USA
- Augustine M Saiz
- Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, USA
- Robert L Randall
- Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, USA
7
Sudharshan R, Shen A, Gupta S, Zhang-Nunes S. Assessing the Utility of ChatGPT in Simplifying Text Complexity of Patient Educational Materials. Cureus 2024; 16:e55304. [PMID: 38559518 PMCID: PMC10981786 DOI: 10.7759/cureus.55304]
Abstract
INTRODUCTION AI chatbots are being increasingly used in healthcare settings, and there is growing interest in using AI to assist in patient education. Extensive healthcare information is available online but is often too complex to understand. Our objective was to determine whether physicians can recommend the free version of ChatGPT (version 3.5; OpenAI, San Francisco, CA, USA) for patients to simplify text from the American Academy of Ophthalmology (AAO) in English and Spanish. This version of ChatGPT was assessed due to its broad accessibility across patient populations. METHODS Fifteen articles were chosen from the AAO in both languages and simplified with ChatGPT 10 times each. The readability of original and simplified articles was assessed with the Flesch Reading Ease and Gunning Fog Index for English, and the Fernández Huerta, Gutiérrez, Szigriszt-Pazo, INFLESZ, and Legibilidad-µ scales for Spanish. Grade levels were calculated with the Flesch-Kincaid Grade Level and Crawford Nivel-de-Grado. Mean, standard deviation, and two-tailed t-tests were used to assess differences before and after simplification. RESULTS Average grade levels before and after simplification were as follows: English, 8.43±1.17 to 8.9±2.1 (p=0.41), and Spanish, 5.3±0.34 to 4.1±1.1 (p=0.0001). Spanish articles were significantly simplified per Legibilidad-µ (p=0.003). No significant difference was noted for the other scales. CONCLUSIONS The readability of AAO articles in English worsened, though not significantly, but significantly improved in Spanish. This may result from simpler syllable structures and a smaller overall vocabulary in Spanish. With further testing, physicians could recommend ChatGPT for Spanish-speaking patients to improve health literacy.
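A minimal sketch of the before/after grade-level comparison; the abstract reports two-tailed t-tests, and a paired test is assumed here since each simplified article derives from its own original (the grade levels are illustrative placeholders):

```python
# Paired two-tailed t-test on article grade levels pre/post simplification.
import numpy as np
from scipy import stats

original_grade   = np.array([5.1, 5.6, 5.0, 5.4, 5.3])  # Spanish originals
simplified_grade = np.array([4.0, 4.5, 3.2, 4.8, 3.9])  # after ChatGPT

t_stat, p_value = stats.ttest_rel(original_grade, simplified_grade)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```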
Affiliation(s)
- Rasika Sudharshan
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Alena Shen
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Shreya Gupta
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Sandy Zhang-Nunes
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
8
Lee GU, Hong DY, Kim SY, Kim JW, Lee YH, Park SO, Lee KR. Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank. Medicine (Baltimore) 2024; 103:e37325. [PMID: 38428889 PMCID: PMC10906566 DOI: 10.1097/md.0000000000037325]
Abstract
Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.
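The abstract does not state which test was used to compare correct-response rates; a hedged sketch of one standard approach, a chi-squared test on the correct/incorrect contingency table reconstructed from the reported percentages of the 123 questions:

```python
# Chi-squared test across the four models' correct/incorrect counts
# (an assumed analysis, not confirmed by the paper).
from scipy.stats import chi2_contingency

#        correct  incorrect (out of 123)
table = [
    [93, 30],   # ChatGPT-4, 75.6%
    [87, 36],   # Bing Chat, 70.7%
    [70, 53],   # ChatGPT-3.5, 56.9%
    [63, 60],   # Bard, 51.2%
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p_value:.4f}")
```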
Affiliation(s)
- Go Un Lee
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
- Dae Young Hong
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Sin Young Kim
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
- Jong Won Kim
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Young Hwan Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Sang O Park
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Kyeong Ryong Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
9
Khatib M, Hasani IW. Acetabular Aneurysmal Bone Cyst During the Syrian Conflict: A Case Report of Surgical Treatment and Outcomes. Cureus 2024; 16:e56474. [PMID: 38638726 PMCID: PMC11025696 DOI: 10.7759/cureus.56474]
Abstract
Aneurysmal bone cysts (ABCs) are uncommon benign bone lesions that consist of blood-filled vascular spaces separated by fibrous tissue septa. Their diagnosis and surgical management are challenging in a war-torn region. In this case report, we present a rare case of a giant aneurysmal bone cyst located around the acetabulum in a 10-year-old girl who presented with an antalgic limp and left hip pain. The lesion was successfully treated with curettage and mixed autologous and synthetic bone grafts, and two-year follow-up revealed complete resolution of symptoms and radiological evidence of bone regeneration. This case highlights the successful surgical treatment of a challenging ABC in a difficult setting during the Syrian conflict.
Affiliation(s)
- Ibrahim W Hasani
- Biochemistry, Idlib University Hospital, Idlib, SYR
- Biochemistry, Mary Private University (MPU), Idlib, SYR
- Biochemistry, Al-Shamal Private University (SPU), Idlib, SYR
10
Nakajima N, Fujimori T, Furuya M, Kanie Y, Imai H, Kita K, Uemura K, Okada S. A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination? Cureus 2024; 16:e56402. [PMID: 38633935 PMCID: PMC11023708 DOI: 10.7759/cureus.56402]
Abstract
Introduction Recently, large language models such as ChatGPT (OpenAI, San Francisco, CA) have evolved rapidly. These models are designed to think and act like humans and possess a broad range of specialized knowledge. GPT-3.5 was reported to perform at a passing level on the United States Medical Licensing Examination. Its capabilities continue to evolve, and in October 2023, GPT-4V became available as a model capable of image recognition. It is therefore important to know the current performance of these models, because they will soon be incorporated into medical practice. We aimed to evaluate the performance of ChatGPT in the field of orthopedic surgery. Methods We used three years' worth of Japanese Board of Orthopaedic Surgery Examinations (JBOSE), conducted in 2021, 2022, and 2023. Questions and their multiple-choice answers were used in their original Japanese form, as was the official examination rubric. We inputted these questions into three versions of ChatGPT: GPT-3.5, GPT-4, and GPT-4V. For image-based questions, we inputted only the textual statements for GPT-3.5 and GPT-4, and both the images and textual statements for GPT-4V. As the minimum scoring rate required to pass is not officially disclosed, it was calculated using publicly available data. Results The estimated minimum scoring rate required to pass was calculated as 50.1% (43.7-53.8%). For GPT-4, even when answering all questions, including the image-based ones, the percentage of correct answers was 59% (55-61%), and GPT-4 was able to achieve the passing line. When image-based questions were excluded, the score reached 67% (63-73%). For GPT-3.5, the percentage was limited to 30% (28-32%), and this version could not pass the examination. There was a significant difference in performance between GPT-4 and GPT-3.5 (p < 0.001). For image-based questions, the percentage of correct answers was 25% for GPT-3.5, 38% for GPT-4, and 38% for GPT-4V. There was no significant difference in performance on image-based questions between GPT-4 and GPT-4V. Conclusions ChatGPT performed well enough to pass the orthopedic specialist examination. With further training data, such as images, ChatGPT is expected to find application in the orthopedics field.
Affiliation(s)
- Takahito Fujimori
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Masayuki Furuya
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Yuya Kanie
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Hirotatsu Imai
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Kosuke Kita
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Keisuke Uemura
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Seiji Okada
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
11
Yalla GR, Hyman N, Hock LE, Zhang Q, Shukla AG, Kolomeyer NN. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures. Cureus 2024; 16:e56766. [PMID: 38650824 PMCID: PMC11034394 DOI: 10.7759/cureus.56766]
Abstract
Introduction With the potential for artificial intelligence (AI) chatbots to serve as a primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from the AI chatbots ChatGPT-4, Bard, and Bing by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials. Methods Section headers from AAO glaucoma-related patient education brochures were adapted into question form and posed five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared to the AAO brochure information, scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), corresponding to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections. Results Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard of 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard of 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts. Conclusion AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of AAO brochures and elevated readability levels, AI chatbots require improvement to be a useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
Affiliation(s)
- Goutham R Yalla
- Department of Ophthalmology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, USA
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Nicholas Hyman
- Department of Ophthalmology, Vagelos College of Physicians and Surgeons, Columbia University, New York, USA
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
- Lauren E Hock
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Qiang Zhang
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Biostatistics Consulting Core, Vickie and Jack Farber Vision Research Center, Wills Eye Hospital, Philadelphia, USA
- Aakriti G Shukla
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
12
Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open 2024; 5:139-146. [PMID: 38354748 PMCID: PMC10867788 DOI: 10.1302/2633-1462.52.bjo-2023-0113.r1]
Abstract
Aims While internet search engines have been the primary information source for patients' questions, artificial intelligence large language models like ChatGPT are trending towards becoming the new primary source. The purpose of this study was to determine if ChatGPT can answer patient questions about total hip (THA) and knee arthroplasty (TKA) with consistent accuracy, comprehensiveness, and easy readability. Methods We posed the 20 most Google-searched questions about THA and TKA, plus ten additional postoperative questions, to ChatGPT. Each question was asked twice to evaluate for consistency in quality. Following each response, we responded with, "Please explain so it is easier to understand," to evaluate ChatGPT's ability to reduce response reading grade level, measured as Flesch-Kincaid Grade Level (FKGL). Five resident physicians rated the 120 responses on 1 to 5 accuracy and comprehensiveness scales. Additionally, they answered a "yes" or "no" question regarding acceptability. Mean scores were calculated for each question, and responses were deemed acceptable if ≥ four raters answered "yes." Results The mean accuracy and comprehensiveness scores were 4.26 (95% confidence interval (CI) 4.19 to 4.33) and 3.79 (95% CI 3.69 to 3.89), respectively. Out of all the responses, 59.2% (71/120; 95% CI 50.0% to 67.7%) were acceptable. ChatGPT was consistent when asked the same question twice, giving no significant difference in accuracy (t = 0.821; p = 0.415), comprehensiveness (t = 1.387; p = 0.171), acceptability (χ2 = 1.832; p = 0.176), and FKGL (t = 0.264; p = 0.793). There was a significantly lower FKGL (t = 2.204; p = 0.029) for easier responses (11.14; 95% CI 10.57 to 11.71) than original responses (12.15; 95% CI 11.45 to 12.85). Conclusion ChatGPT answered THA and TKA patient questions with accuracy comparable to previous reports of websites, with adequate comprehensiveness, but with limited acceptability as the sole information source. ChatGPT has potential for answering patient questions about THA and TKA, but needs improvement.
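A minimal sketch of how the reported acceptability proportion and its 95% confidence interval (71 of 120 responses; 59.2%, CI 50.0% to 67.7%) can be approximately reproduced; the Wilson method is an assumption, chosen because it closely matches the reported interval:

```python
# 95% confidence interval for the acceptability proportion.
from statsmodels.stats.proportion import proportion_confint

acceptable, total = 71, 120
low, high = proportion_confint(acceptable, total, alpha=0.05, method="wilson")
print(f"{acceptable / total:.1%} acceptable, 95% CI {low:.1%} to {high:.1%}")
```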
Affiliation(s)
- Benjamin M. Wright
- Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Michael S. Bodnar
- Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Andrew D. Moore
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Meghan C. Maseda
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Michael P. Kucharik
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Connor C. Diaz
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Christian M. Schmidt
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Hassan R. Mir
- Orthopaedic Trauma Service, Florida Orthopedic Institute, Tampa, Florida, USA
13
Gengatharan D, Saggi SS, Bin Abd Razak HR. Pre-operative Planning of High Tibial Osteotomy With ChatGPT: Are We There Yet? Cureus 2024; 16:e54858. [PMID: 38533173 PMCID: PMC10964394 DOI: 10.7759/cureus.54858]
Abstract
INTRODUCTION ChatGPT (Chat Generative Pre-trained Transformer), developed by OpenAI (San Francisco, CA, USA), has gained attention in the medical field. It has the potential to enhance and simplify tasks such as preoperative planning in orthopedic surgery. We aimed to test ChatGPT's accuracy in measuring the angle of correction for high tibial osteotomy for cases planned and performed at a tertiary teaching hospital in Singapore. MATERIALS AND METHODS Peri-operative angular parameters from 114 consecutive patients who underwent medial opening wedge high tibial osteotomy (MOWHTO) were used to query ChatGPT 3.0. First, ChatGPT 3.0 was asked what information it required to plan a MOWHTO. Based on its response, the pre-operative medial proximal tibial angle (MPTA) and joint line congruence angle (JLCA) were provided. ChatGPT 3.0 then responded with its recommended angle of correction, which was compared against the surgical correction planned manually by our fellowship-trained surgeon. A root mean square error analysis was then performed to compare ChatGPT 3.0 and manual planning. RESULTS The root mean square error (RMSE) of ChatGPT 3.0 in predicting the correction angle in MOWHTO was 2.96, suggesting a very poor model fit. CONCLUSION Although ChatGPT 3.0 represents a significant breakthrough in large language models with extensive capabilities, it is not currently optimized to effectively perform complex pre-operative planning in orthopedic surgery, specifically in the context of MOWHTO. Further refinement and consideration of specific factors are necessary to enhance its accuracy and suitability for such applications.
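A minimal sketch of the RMSE comparison between model-suggested and surgeon-planned correction angles; the angle values below are illustrative placeholders, not the study's data:

```python
# Root mean square error between suggested and planned correction angles.
import numpy as np

surgeon_deg = np.array([8.0, 10.5, 9.0, 12.0, 7.5])   # planned corrections
chatgpt_deg = np.array([11.0, 8.0, 12.5, 9.5, 10.0])  # model suggestions

rmse = np.sqrt(np.mean((chatgpt_deg - surgeon_deg) ** 2))
print(f"RMSE = {rmse:.2f} degrees")  # the study reported 2.96
```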
Affiliation(s)
- Hamid Rahmatullah Bin Abd Razak
- Musculoskeletal Sciences, Duke-Nus Medical School, Singapore, SGP
- Orthopaedic Surgery, Sengkang General Hospital, Singapore, SGP
14
Podda M, Di Martino M, Ielpo B, Catena F, Coccolini F, Pata F, Marchegiani G, De Simone B, Damaskos D, Mole D, Leppaniemi A, Sartelli M, Yang B, Ansaloni L, Biffl W, Kluger Y, Moore EE, Pellino G, Di Saverio S, Pisanu A. The 2023 MANCTRA Acute Biliary Pancreatitis Care Bundle: A Joint Effort Between Human Knowledge and Artificial Intelligence (ChatGPT) to Optimize the Care of Patients With Acute Biliary Pancreatitis in Western Countries. Ann Surg 2024; 279:203-212. [PMID: 37450700 PMCID: PMC10782931 DOI: 10.1097/sla.0000000000006008]
Abstract
OBJECTIVE To generate an up-to-date bundle for managing acute biliary pancreatitis using an evidence-based, artificial intelligence (AI)-assisted GRADE method. BACKGROUND A care bundle is a set of core elements of care distilled from the most solid evidence-based practice guidelines and recommendations. METHODS The research questions addressed in this bundle followed the PICO criteria. The working group summarized the effects of interventions, with the strength of recommendation and quality of evidence, applying the GRADE methodology. The ChatGPT AI system was used to independently assess the quality of evidence for each element in the bundle, together with the strength of the recommendations. RESULTS The 7 elements of the bundle discourage antibiotic prophylaxis in patients with acute biliary pancreatitis, support the use of a full-solid diet in patients with mild to moderately severe acute biliary pancreatitis, and recommend early enteral nutrition in patients unable to feed by mouth. The bundle states that endoscopic retrograde cholangiopancreatography should be performed within the first 48 to 72 hours of hospital admission in patients with cholangitis. Early laparoscopic cholecystectomy should be performed in patients with mild acute biliary pancreatitis. When operative intervention is needed for necrotizing pancreatitis, it should start with the endoscopic step-up approach. CONCLUSIONS We have developed a new care bundle with 7 key elements for managing patients with acute biliary pancreatitis. This new bundle, whose scientific strength has been increased by combining human knowledge with AI from the new ChatGPT software, should be introduced in emergency departments, wards, and intensive care units.
Affiliation(s)
- Mauro Podda
- Department of Surgical Science, Emergency Surgery Unit, Cagliari State University Hospital, Cagliari, Italy
- Marcello Di Martino
- Division of Hepatobiliary and Liver Transplantation Surgery, A.O.R.N. Cardarelli, Naples, Italy
- Benedetto Ielpo
- Hepatobiliary Division, Hospital del Mar, Pompeu Fabra University, Barcelona, Spain
- Fausto Catena
- Department of Emergency and Trauma Surgery, Bufalini Hospital, Cesena, Italy
- Federico Coccolini
- General, Emergency and Trauma Surgery Unit, Pisa University Hospital, Pisa, Italy
- Francesco Pata
- Department of Surgery, University of Calabria, Cosenza, Italy
- Giovanni Marchegiani
- Department of Surgical, Oncological and Gastroenterological Sciences (DISCOG), Hepato-Pancreato-Biliary Surgery and Liver Transplantation Unit, University of Padua, Padua, Italy
- Belinda De Simone
- Department of Emergency and Metabolic Minimally Invasive Surgery, Centre Hospitalier Intercommunal de Poissy/Saint Germain en Laye, Poissy Cedex, France
- Dimitrios Damaskos
- Department of Upper GI Surgery, Royal Infirmary of Edinburgh, Edinburgh, Scotland, UK
- Damian Mole
- Centre for Inflammation Research, Clinical Surgery, University of Edinburgh, Edinburgh, Scotland, UK
- Ari Leppaniemi
- Department of Abdominal Surgery, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland
- Baohong Yang
- Department of Oncology, Weifang People’s Hospital, The First Affiliated Hospital of Weifang Medical University, Weifang, Shandong, China
- Department of Gastroenterology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China
- Luca Ansaloni
- Department of General Surgery, IRCCS Policlinico San Matteo Foundation, Pavia, Italy
- Walter Biffl
- Division of Trauma and Acute Care Surgery, Scripps Memorial Hospital La Jolla, La Jolla, CA
- Yoram Kluger
- Department of General Surgery, Rambam Medical Center, Haifa, Israel
- Ernest E. Moore
- Denver Health System—Denver Health Medical Center, Denver, CO
- Gianluca Pellino
- “Luigi Vanvitelli” University of Campania, Naples, Italy
- Department of Colorectal Surgery, Vall d’Hebron University Hospital, Universitat Autonoma de Barcelona UAB, Barcelona, Spain
- Salomone Di Saverio
- Department of Surgery, Madonna del Soccorso Hospital, San Benedetto del Tronto, Italy
- Adolfo Pisanu
- Department of Surgical Science, Emergency Surgery Unit, Cagliari State University Hospital, Cagliari, Italy
15
Reis F, Lenz C. Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios. Cureus 2024; 16:e53899. [PMID: 38465163 PMCID: PMC10925004 DOI: 10.7759/cureus.53899]
Abstract
Introduction With the expanding awareness and use of AI-powered chatbots, it seems likely that an increasing number of people will use them to assess and evaluate their medical symptoms. If chatbots that have not undergone a thorough medical evaluation for this specific use are used for this purpose, various risks may arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective, and to examine variations in substantive medical assessment accuracy and empathetic communication style among the chatbots' responses. Materials and methods Our study compared three AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports of medical emergencies, as well as three cases without an urgent reason for emergency medical admission, were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The respective answers of the chatbots were qualitatively compared with each other regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language. Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in their urgency assessments. Clear differences were seen in the level of detail of the differential diagnoses, the overall length of the answers, and how each chatbot dealt with the challenge of being confronted with medical issues. All answers were comparable in terms of empathy and comprehensibility. Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications, but should always be provided with a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their answers. Given the lack of medical supervision of many established chatbots, subsequent studies and experience are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.
Affiliation(s)
- Florian Reis
- Medical Affairs, Pfizer Pharma GmbH, Berlin, DEU
16
Kapsali MZ, Livanis E, Tsalikidis C, Oikonomou P, Voultsos P, Tsaroucha A. Ethical Concerns About ChatGPT in Healthcare: A Useful Tool or the Tombstone of Original and Reflective Thinking? Cureus 2024; 16:e54759. [PMID: 38523987 PMCID: PMC10961144 DOI: 10.7759/cureus.54759]
Abstract
Artificial intelligence (AI), the rapidly advancing branch of computer science aiming to create digital systems with human-like behavior and intelligence, seems to have invaded almost every field of modern life. Launched in November 2022, ChatGPT (Chat Generative Pre-trained Transformer) is a textual AI application capable of creating human-like responses characterized by original language and high coherence. Although AI-based language models have demonstrated impressive capabilities in healthcare, ChatGPT has drawn controversy within the scientific and academic communities. This chatbot already appears to have a substantial impact as an educational tool for healthcare professionals and transformative potential for clinical practice, and it could lead to dramatic changes in scientific research. Nevertheless, legitimate concerns have been raised about whether pre-trained, AI-generated text is a threat not only to original thinking and new scientific ideas but also to academic and research integrity, as it becomes increasingly difficult to identify its AI origin owing to the coherence and fluency of the produced text. This short review aims to summarize the potential applications and consequential implications of ChatGPT in the three critical pillars of medicine: education, research, and clinical practice. In addition, this paper discusses whether the current use of this chatbot complies with the ethical principles for the safe use of AI in healthcare, as determined by the World Health Organization. Finally, this review highlights the need for an updated ethical framework and increased vigilance among healthcare stakeholders to harness the potential benefits and limit the imminent dangers of this new innovative technology.
Affiliation(s)
- Marina Z Kapsali
- Postgraduate Program on Bioethics, Laboratory of Bioethics, Democritus University of Thrace, Alexandroupolis, GRC
- Efstratios Livanis
- Department of Accounting and Finance, University of Macedonia, Thessaloniki, GRC
- Christos Tsalikidis
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Panagoula Oikonomou
- Laboratory of Experimental Surgery, Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Polychronis Voultsos
- Laboratory of Forensic Medicine & Toxicology (Medical Law and Ethics), School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, GRC
- Aleka Tsaroucha
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
17
Almagazzachi A, Mustafa A, Eighaei Sedeh A, Vazquez Gonzalez AE, Polianovskaia A, Abood M, Abdelrahman A, Muyolema Arce V, Acob T, Saleem B. Generative Artificial Intelligence in Patient Education: ChatGPT Takes on Hypertension Questions. Cureus 2024; 16:e53441. [PMID: 38435177 PMCID: PMC10909311 DOI: 10.7759/cureus.53441]
Abstract
Introduction Uncontrolled hypertension significantly contributes to the development and deterioration of various medical conditions, such as myocardial infarction, chronic kidney disease, and cerebrovascular events. Despite being the most common preventable risk factor for all-cause mortality, only a fraction of affected individuals maintain their blood pressure in the desired range. In recent times, there has been a growing reliance on online platforms for medical information. While online platforms provide a convenient source of information, differentiating reliable from unreliable information can be daunting for the layperson, and false information can potentially hinder timely diagnosis and management of medical conditions. The surge in accessibility of generative artificial intelligence (GeAI) technology has led to increased use in obtaining health-related information. This has sparked debates among healthcare providers about the potential for misuse and misinformation while recognizing the role of GeAI in improving health literacy. This study aims to investigate the accuracy of AI-generated information specifically related to hypertension. Additionally, it seeks to explore the reproducibility of information provided by GeAI. Method A nonhuman-subject qualitative study was devised to evaluate the accuracy of information provided by ChatGPT regarding hypertension and its secondary complications. Frequently asked questions on hypertension were compiled by three study staff members, internal medicine residents at an ACGME-accredited program, and then reviewed by a physician experienced in treating hypertension, resulting in a final set of 100 questions. Each question was posed to ChatGPT three times, once by each study staff member, and the majority response was then assessed against the recommended guidelines. A board-certified internal medicine physician with over eight years of experience further reviewed the responses and categorized them into two classes based on their clinical appropriateness: appropriate (in line with clinical recommendations) and inappropriate (containing errors). Descriptive statistical analysis was employed to assess ChatGPT responses for accuracy and reproducibility. Result Initially, a pool of 130 questions was gathered, from which a final set of 100 questions was selected for the purpose of this study. When assessed against acceptable standard responses, ChatGPT responses were found to be appropriate in 92.5% of cases and inappropriate in 7.5%. Furthermore, ChatGPT had a reproducibility score of 93%, meaning that it could consistently reproduce answers that conveyed similar meanings across multiple runs. Conclusion ChatGPT showcased commendable accuracy in addressing commonly asked questions about hypertension. These results underscore the potential of GeAI in providing valuable information to patients. However, continued research and refinement are essential to further evaluate the reliability and broader applicability of ChatGPT within the medical field.
Collapse
Affiliation(s)
| | - Ahmed Mustafa
- Internal Medicine, Capital Health System, Trenton, USA
| | | | | | | | - Muhanad Abood
- Internal Medicine, Capital Health System, Trenton, USA
| | | | | | - Talar Acob
- Internal Medicine Residency Program, Capital Health Regional Medical Center, Trenton, USA
| | - Bushra Saleem
- Internal Medicine, Capital Health System, Trenton, USA
| |
Collapse
|
18
|
Rammohan R, Joy MV, Magam SG, Natt D, Magam SR, Pannikodu L, Desai J, Akande O, Bunting S, Yost RM, Mustacchia P. Understanding the Landscape: The Emergence of Artificial Intelligence (AI), ChatGPT, and Google Bard in Gastroenterology. Cureus 2024; 16:e51848. [PMID: 38327910 PMCID: PMC10847895 DOI: 10.7759/cureus.51848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/07/2024] [Indexed: 02/09/2024] Open
Abstract
Introduction Artificial intelligence (AI) integration in healthcare, specifically in gastroenterology, has opened new avenues for enhanced patient care and medical decision-making. This study aims to assess the reliability and accuracy of two prominent AI tools, ChatGPT 4.0 and Google Bard, in answering gastroenterology-related queries, thereby evaluating their potential utility in medical settings. Methods The study employed a structured approach in which typical gastroenterology questions were input into ChatGPT 4.0 and Google Bard. Independent reviewers evaluated responses using a Likert scale and cross-referenced them with guidelines from authoritative gastroenterology bodies. Statistical analysis, including the Mann-Whitney U test, was conducted to assess the significance of differences in ratings. Results ChatGPT 4.0 demonstrated higher reliability and accuracy in its responses than Google Bard, as indicated by higher mean ratings and statistically significant p-values in hypothesis testing. However, limitations in the data structure, such as the inability to conduct detailed correlation analysis, were noted. Conclusion The study concludes that ChatGPT 4.0 outperforms Google Bard in providing reliable and accurate responses to gastroenterology-related queries. This finding underscores the potential of AI tools like ChatGPT in enhancing healthcare delivery. However, the study also highlights the need for a broader and more diverse assessment of AI capabilities in healthcare to fully leverage their potential in clinical practice.
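A Mann-Whitney U test on Likert ratings, as used above, can be run in a few lines with SciPy. This sketch uses made-up ratings, not the study's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-5 Likert ratings assigned by reviewers to each model's answers.
chatgpt_ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
bard_ratings = [3, 4, 3, 2, 4, 3, 3, 4, 2, 3]

stat, p = mannwhitneyu(chatgpt_ratings, bard_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")  # a small p suggests the rating distributions differ
```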
Collapse
Affiliation(s)
- Rajmohan Rammohan
- Gastroenterology, Nassau University Medical Center, East Meadow, USA
| | - Melvin V Joy
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | | | - Dilman Natt
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Sai Reshma Magam
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Leeza Pannikodu
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Jiten Desai
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Olawale Akande
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Susan Bunting
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Robert M Yost
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Paul Mustacchia
- Gastroenterology and Hepatology, Nassau University Medical Center, East Meadow, USA
| |
Collapse
|
19
|
Mediboina A, Badam RK, Chodavarapu S. Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus 2024; 16:e51544. [PMID: 38318564 PMCID: PMC10840059 DOI: 10.7759/cureus.51544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/01/2024] [Indexed: 02/07/2024] Open
Abstract
Background and objective ChatGPT and Google Bard AI are widely used conversational chatbots, even in healthcare. While they have several strengths, they can generate seemingly correct but erroneous responses, warranting caution in medical contexts. In an era where access to abortion care is diminishing, patients may increasingly rely on online resources and AI-driven language models for information on medication abortions. In light of this, this study aimed to compare the accuracy and comprehensiveness of responses generated by ChatGPT 3.5 and Google Bard AI to medical queries about medication abortions. Methods Fourteen open-ended questions about medication abortion were formulated based on the Frequently Asked Questions (FAQs) from the National Abortion Federation (NAF) and the Reproductive Health Access Project (RHAP) websites. These questions were answered using ChatGPT version 3.5 and Google Bard AI on October 7, 2023. The accuracy of the responses was analyzed by cross-referencing the generated answers against the information provided by NAF and RHAP. Any discrepancies were further verified against the guidelines from the American College of Obstetricians and Gynecologists (ACOG). A rating scale used by Johnson et al. was employed for assessment, utilizing a 6-point Likert scale [ranging from 1 (completely incorrect) to 6 (correct)] to evaluate accuracy and a 3-point scale [ranging from 1 (incomplete) to 3 (comprehensive)] to assess completeness. Questions that did not yield answers were assigned a score of 0 and omitted from the correlation analysis. Data analysis and visualization were done using R Software version 4.3.1. Statistical significance was determined using Spearman's R and Mann-Whitney U tests. Results All questions were entered sequentially into both chatbots by the same author. On the initial attempt, ChatGPT successfully generated relevant responses for all questions, while Google Bard AI failed to provide answers for five questions. Repeating the same question in Google Bard AI yielded an answer for one; two were answered with different phrasing; and two remained unanswered despite rephrasing. ChatGPT showed a median accuracy score of 5 (mean: 5.26, SD: 0.73) and a median completeness score of 3 (mean: 2.57, SD: 0.51). It showed the highest accuracy score in six responses and the highest completeness score in eight responses. In contrast, Google Bard AI had a median accuracy score of 5 (mean: 4.5, SD: 2.03) and a median completeness score of 2 (mean: 2.14, SD: 1.03). It achieved the highest accuracy score in five responses and the highest completeness score in six responses. Spearman's correlation coefficient revealed no significant correlation between accuracy and completeness for ChatGPT (rs = -0.46771, p = 0.09171). However, Google Bard AI showed a marginally significant correlation (rs = 0.5738, p = 0.05108). The Mann-Whitney U test indicated no statistically significant differences between ChatGPT and Google Bard AI concerning accuracy (U = 82, p > 0.05) or completeness (U = 78, p > 0.05). Conclusion While both chatbots showed similar levels of accuracy, minor errors were noted, pertaining to finer aspects that demand specialized knowledge of abortion care. This could explain the lack of a significant correlation between accuracy and completeness. Ultimately, AI-driven language models have the potential to provide information on medication abortions, but there is a need for continual refinement and oversight.
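The Spearman correlation between accuracy and completeness, with unanswered (zero-scored) questions omitted, can be reproduced in outline with SciPy; the scores below are illustrative, not the study's:

```python
from scipy.stats import spearmanr

# Hypothetical per-question scores; a 0 marks a question the chatbot did not answer.
accuracy = [6, 5, 5, 6, 4, 5, 6, 3, 5, 6, 0, 5, 4, 6]
completeness = [3, 3, 2, 3, 2, 2, 3, 1, 2, 3, 0, 3, 2, 3]

# Drop zero-scored questions before correlating, mirroring the study's handling.
pairs = [(a, c) for a, c in zip(accuracy, completeness) if a > 0]
rs, p = spearmanr([a for a, _ in pairs], [c for _, c in pairs])
print(f"rs = {rs:.3f}, p = {p:.4f}")
```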
Collapse
Affiliation(s)
- Anjali Mediboina
- Community Medicine, Alluri Sita Ramaraju Academy of Medical Sciences, Eluru, IND
| | - Rajani Kumari Badam
- Obstetrics and Gynaecology, Sri Venkateswara Medical College, Tirupathi, IND
| | - Sailaja Chodavarapu
- Obstetrics and Gynaecology, Government Medical College, Rajamahendravaram, IND
| |
Collapse
|
20
|
Zhu L, Mou W, Wu K, Zhang J, Luo P. Can DALL-E 3 Reliably Generate 12-Lead ECGs and Teaching Illustrations? Cureus 2024; 16:e52748. [PMID: 38384621 PMCID: PMC10879738 DOI: 10.7759/cureus.52748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/22/2024] [Indexed: 02/23/2024] Open
Abstract
The recent integration of the latest image generation model DALL-E 3 into ChatGPT allows text prompts to easily generate corresponding images, enabling multimodal output from ChatGPT. We explored the feasibility of using DALL-E 3 to draw a 12-lead ECG and found that it can draw rudimentary 12-lead electrocardiograms (ECGs) displaying some of the parameters, although the details are not completely accurate. We also explored DALL-E 3's capacity to create vivid illustrations for teaching resuscitation-related medical knowledge. DALL-E 3 produced accurate CPR illustrations emphasizing proper hand placement and technique. For ECG principles, it produced creative heart-shaped waveforms tying ECGs to the heart. With further training, DALL-E 3 shows promise for expanding easy-to-understand visual medical teaching materials and ECG simulations for different disease states. In conclusion, DALL-E 3 has the potential to generate realistic 12-lead ECGs and teaching schematics, but expert validation is still needed.
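Requesting such an illustration programmatically is straightforward; a minimal sketch, assuming the OpenAI Python SDK (v1.x) with an API key in the environment, and an illustrative prompt rather than the study's exact wording:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt=("A labelled teaching illustration of cardiopulmonary resuscitation "
            "showing correct hand placement on the lower half of the sternum"),
    size="1024x1024",
    n=1,  # DALL-E 3 generates one image per request
)
print(response.data[0].url)  # URL of the generated illustration
```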
Collapse
Affiliation(s)
- Lingxuan Zhu
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Weiming Mou
- Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, CHN
| | - Keren Wu
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Jian Zhang
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Peng Luo
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| |
Collapse
|
21
|
George Pallivathukal R, Kyaw Soe HH, Donald PM, Samson RS, Hj Ismail AR. ChatGPT for Academic Purposes: Survey Among Undergraduate Healthcare Students in Malaysia. Cureus 2024; 16:e53032. [PMID: 38410331 PMCID: PMC10895383 DOI: 10.7759/cureus.53032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/27/2024] [Indexed: 02/28/2024] Open
Abstract
BACKGROUND The impact of generative artificial intelligence-based chatbots on medical education, particularly in Southeast Asia, is understudied regarding healthcare students' perceptions of their academic utility. Sociodemographic profiles and educational strategies influence prospective healthcare practitioners' attitudes toward AI tools. AIM AND OBJECTIVES This study aimed to assess healthcare university students' knowledge, attitude, and practice regarding ChatGPT for academic purposes. It explored chatbot usage frequency, purposes, satisfaction levels, and associations between age, gender, and ChatGPT variables. METHODOLOGY Four hundred forty-three undergraduate students at a Malaysian tertiary healthcare institute participated, revealing varying levels of awareness of ChatGPT's academic utility. Despite concerns about accuracy, ethics, and dependency, participants generally held positive attitudes toward ChatGPT in academics. RESULTS Multiple logistic regression highlighted associations between demographics, knowledge, attitude, and academic ChatGPT use. MBBS students were significantly more likely to use ChatGPT for academics than BDS and FIS students. Final-year students exhibited the highest likelihood of academic ChatGPT use. Higher knowledge and positive attitudes correlated with increased academic usage. Most users (45.8%) employed ChatGPT to aid specific assignment sections while completing most work independently. Some did not use it (41.1%), while others relied on it heavily (9.3%). Users also employed it for various purposes, from generating questions to understanding concepts. Thematic analysis of responses showed students' concerns about data accuracy, plagiarism, ethical issues, and dependency on ChatGPT for academic tasks. CONCLUSION This study aids in creating guidelines for implementing GAI chatbots in healthcare education, emphasizing benefits and risks, and informing AI developers and educators about ChatGPT's potential in academia.
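A multiple logistic regression of this kind can be sketched with statsmodels; the variable names and synthetic records below are hypothetical stand-ins for the study's survey data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 443  # matches the study's sample size; the records themselves are synthetic
df = pd.DataFrame({
    "uses_chatgpt": rng.integers(0, 2, n),               # 1 = uses ChatGPT academically
    "programme": rng.choice(["MBBS", "BDS", "FIS"], n),  # hypothetical programme labels
    "year": rng.integers(1, 6, n),                       # year of study
    "knowledge": rng.integers(1, 6, n),                  # knowledge score
})

model = smf.logit("uses_chatgpt ~ C(programme) + year + knowledge", data=df).fit()
print(np.exp(model.params))  # odds ratios for each predictor
```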
Collapse
Affiliation(s)
| | | | - Preethy Mary Donald
- Oral Medicine and Oral Radiology, Manipal University College Malaysia, Melaka, MYS
| | | | | |
Collapse
|
22
|
Yapar D, Demir Avcı Y, Tokur Sonuvar E, Eğerci ÖF, Yapar A. ChatGPT's potential to support home care for patients in the early period after orthopedic interventions and enhance public health. Jt Dis Relat Surg 2024; 35:169-176. [PMID: 38108178 PMCID: PMC10746912 DOI: 10.52312/jdrs.2023.1402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2023] [Accepted: 11/06/2023] [Indexed: 12/19/2023] Open
Abstract
OBJECTIVES This study presents the first investigation into the potential of ChatGPT to provide medical consultation for patients undergoing orthopedic interventions, with the primary objective of evaluating ChatGPT's effectiveness in supporting patient self-management during the essential early recovery phase at home. MATERIALS AND METHODS Seven scenarios, representative of common situations in orthopedics and traumatology, were presented to ChatGPT version 4.0 to obtain advice. These scenarios and ChatGPT's responses were then evaluated by 68 expert orthopedists (67 males, 1 female; mean age: 37.9±5.9 years; range, 30 to 59 years), 40 of whom had at least four years of orthopedic experience, while 28 were associate or full professors. Expert orthopedists used a rubric on a scale of 1 to 5 to evaluate ChatGPT's advice based on accuracy, applicability, comprehensiveness, and clarity. Those who gave ChatGPT a score of 4 or higher considered its performance as above average or excellent. RESULTS In all scenarios, the median evaluation scores were at least 4 across accuracy, applicability, comprehensiveness, and communication. As for mean scores, accuracy was the highest-rated dimension at 4.2±0.8, while mean comprehensiveness was slightly lower at 3.9±0.8. Orthopedist characteristics, such as academic title and prior use of ChatGPT, did not influence their evaluation (all p>0.05). Across all scenarios, ChatGPT demonstrated an accuracy of 79.8%, with applicability at 75.2%, comprehensiveness at 70.6%, and a 75.6% rating for communication clarity. CONCLUSION This study emphasizes ChatGPT's strengths in accuracy and applicability for home care after orthopedic intervention but underscores a need for improved comprehensiveness. This focused evaluation not only sheds light on ChatGPT's potential in specialized medical advice but also suggests its potential to play a broader role in the advancement of public health.
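The rubric summaries reported above are simple descriptive statistics; a sketch with hypothetical scores (not the 68 orthopedists' actual ratings):

```python
import numpy as np

# Hypothetical 1-5 rubric scores for one scenario, a few raters per dimension.
scores = {
    "accuracy": np.array([4, 5, 4, 4, 3, 5, 4, 4]),
    "applicability": np.array([4, 4, 3, 4, 4, 5, 3, 4]),
    "comprehensiveness": np.array([3, 4, 4, 3, 4, 4, 3, 5]),
    "clarity": np.array([4, 4, 4, 5, 3, 4, 4, 4]),
}

for dimension, s in scores.items():
    share_high = (s >= 4).mean()  # raters scoring ChatGPT above average or excellent
    print(f"{dimension}: mean {s.mean():.1f}±{s.std(ddof=1):.1f}, >=4: {share_high:.0%}")
```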
Collapse
Affiliation(s)
| | | | | | | | - Aliekber Yapar
- Department of Orthopedics and Traumatology, Antalya Training and Research Hospital, 07100 Muratpaşa, Antalya, Turkey.
| |
Collapse
|
23
|
Bazzari FH, Bazzari AH. Utilizing ChatGPT in Telepharmacy. Cureus 2024; 16:e52365. [PMID: 38230387 PMCID: PMC10790595 DOI: 10.7759/cureus.52365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/15/2024] [Indexed: 01/18/2024] Open
Abstract
BACKGROUND ChatGPT is an artificial intelligence-powered chatbot that has demonstrated capabilities in numerous fields, including the medical and healthcare sciences. This study evaluates the potential for ChatGPT application in telepharmacy, the delivery of pharmaceutical care via telecommunications, by assessing its interactions, adherence to instructions, and ability to role-play as a pharmacist while handling a series of life-like scenario questions. METHODS Two versions (ChatGPT 3.5 and 4.0, OpenAI) were assessed using two independent trials each. ChatGPT was instructed to act as a pharmacist and answer patient inquiries, followed by a set of 20 assessment questions. Then, ChatGPT was instructed to stop its act, provide feedback, and list its sources for drug information. The responses to the assessment questions were evaluated in terms of accuracy, precision, and clarity using a 4-point Likert-like scale. RESULTS ChatGPT demonstrated the ability to follow detailed instructions, role-play as a pharmacist, and appropriately handle all questions. ChatGPT was able to understand case details; recognize generic and brand drug names; identify drug side effects, interactions, prescription requirements, and precautions; and provide proper point-by-point instructions regarding administration, dosing, storage, and disposal. The overall means of pooled scores were 3.425 (0.712) and 3.7 (0.61) for ChatGPT 3.5 and 4.0, respectively. The rank distribution of scores was not significantly different (P>0.05). None of the answers could be considered directly harmful or labeled as entirely or mostly incorrect, and most point deductions were due to other factors such as indecisiveness, adding immaterial information, missing certain considerations, or partial unclarity. The answers were similar in length across trials and appropriately concise. ChatGPT 4.0 showed superior performance, higher consistency, better character adherence, and the ability to report various reliable information sources. However, it only allowed an input of 40 questions every three hours and provided inaccurate feedback regarding the number of assessed patients, compared to 3.5, which allowed unlimited input but was unable to provide feedback. CONCLUSIONS Integrating ChatGPT in telepharmacy holds promising potential; however, a number of drawbacks must be overcome for it to function effectively.
Collapse
Affiliation(s)
| | - Amjad H Bazzari
- Basic Scientific Sciences, Applied Science Private University, Amman, JOR
| |
Collapse
|
24
|
Janopaul-Naylor JR, Koo A, Qian DC, McCall NS, Liu Y, Patel SA. Physician Assessment of ChatGPT and Bing Answers to American Cancer Society's Questions to Ask About Your Cancer. Am J Clin Oncol 2024; 47:17-21. [PMID: 37823708 PMCID: PMC10841271 DOI: 10.1097/coc.0000000000001050] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
OBJECTIVES Artificial intelligence (AI) chatbots are a new, publicly available tool for patients to access health care-related information, with unknown reliability for cancer-related questions. This study assesses the quality of responses to common questions for patients with cancer. METHODS From February to March 2023, we queried Chat Generative Pretrained Transformer (ChatGPT) from OpenAI and Bing AI from Microsoft with questions from the American Cancer Society's recommended "Questions to Ask About Your Cancer," customized for all stages of breast, colon, lung, and prostate cancer. Questions were additionally grouped by type (prognosis, treatment, or miscellaneous). The quality of AI chatbot responses was assessed by an expert panel using the validated DISCERN criteria. RESULTS Of the 117 questions presented to ChatGPT and Bing, the average score for all questions was 3.9 and 3.2, respectively (P < 0.001), and the overall DISCERN scores were 4.1 and 4.4, respectively. By disease site, the average scores for ChatGPT and Bing, respectively, were 3.9 and 3.6 for prostate cancer (P = 0.02), 3.7 and 3.3 for lung cancer (P < 0.001), 4.1 and 2.9 for breast cancer (P < 0.001), and 3.8 and 3.0 for colorectal cancer (P < 0.001). By type of question, the average scores for ChatGPT and Bing, respectively, were 3.6 and 3.4 for prognostic questions (P = 0.12), 3.9 and 3.1 for treatment questions (P < 0.001), and 4.2 and 3.3 for miscellaneous questions (P = 0.001). For 3 responses (3%) by ChatGPT and 18 responses (15%) by Bing, at least one panelist rated them as having serious or extensive shortcomings. CONCLUSIONS AI chatbots provide multiple opportunities for innovating health care. This analysis suggests a critical need, particularly around cancer prognostication, for continual refinement to limit misleading counseling, confusion, and emotional distress to patients and families.
Collapse
Affiliation(s)
- James R Janopaul-Naylor
- Department of Radiation Oncology, Emory University
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center
| | - Andee Koo
- Department of Radiation Oncology, Emory University
| | - David C Qian
- Department of Radiation Oncology, Emory University
| | | | - Yuan Liu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University
| | | |
Collapse
|
25
|
Coraci D, Maccarone MC, Regazzo G, Accordi G, Papathanasiou JV, Masiero S. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol 2023; 33:12114. [PMID: 38112605 PMCID: PMC10811646 DOI: 10.4081/ejtm.2023.12114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 12/04/2023] [Indexed: 12/21/2023] Open
Abstract
In the last year, Chat Generative Pre-Trained Transformer (ChatGPT), a web-based artificial intelligence application, has shown high potential in every field of knowledge. In the medical area, its possible applications are the subject of many studies with promising results. We performed the current study to investigate the possible usefulness of ChatGPT in assessing low back pain. We asked ChatGPT to generate a questionnaire about this clinical condition and compared the obtained questions and results with those obtained with other validated questionnaires: the Oswestry Disability Index, the Quebec Back Pain Disability Scale, the Roland-Morris Disability Questionnaire, and the Numeric Rating Scale for pain. We enrolled 20 subjects with low back pain and found important consistencies among the validated questionnaires. The ChatGPT questionnaire showed an acceptable, significant correlation only with the Oswestry Disability Index and the Quebec Back Pain Disability Scale. ChatGPT showed some peculiarities, especially in the assessment of quality of life and medical consultation and treatments. Our study shows that ChatGPT can help evaluate patients, including from multilevel perspectives. However, its power is limited, and further research and validation are required.
Collapse
Affiliation(s)
- Daniele Coraci
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | | | - Gianluca Regazzo
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | - Giorgia Accordi
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | - Jannis V Papathanasiou
- Department of Kinesiotherapy, Faculty of Public Health, Medical University of Sofia, Sofia, Bulgaria; Department of Medical Imaging, Allergology and Physiotherapy, Faculty of Dental Medicine, Medical University of Plovdiv, Plovdiv.
| | - Stefano Masiero
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| |
Collapse
|
26
|
Ho SYC, Chien TW, Chou W. Circle packing charts generated by ChatGPT to identify the characteristics of articles by anesthesiology authors in 2022: Bibliometric analysis. Medicine (Baltimore) 2023; 102:e34511. [PMID: 38115345 PMCID: PMC10727539 DOI: 10.1097/md.0000000000034511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/15/2023] [Revised: 07/03/2023] [Accepted: 07/05/2023] [Indexed: 12/21/2023] Open
Abstract
BACKGROUND ChatGPT (OpenAI, San Francisco, CA), short for Chat Generative Pretrained Transformer, has been a hot topic of discussion over the past few months. Verification is needed of whether the code for drawing circle packing charts (CPCs) with R can be generated by ChatGPT and used to identify characteristics of articles by anesthesiology authors. This study aimed to provide insights into article characteristics in the field of anesthesiology and to highlight the potential of ChatGPT for data visualization techniques (e.g., CPCs) in bibliometric analysis. METHODS A total of 23,012 articles were indexed in PubMed in 2022 by authors in the field of anesthesiology. The code for drawing CPCs with R was generated by ChatGPT and then modified by the authors to identify the characteristics of articles in two forms: all 23,012 articles and the 100 with the highest journal impact factors (T100IF). Using CPCs and three other visualizations (network charts, impact beam plots, and Sankey diagrams), we were able to display article features commonly used in bibliometric analysis. The author-weighted scheme and absolute advantage coefficient were used to assess dominant entities, such as countries, institutes, authors, and themes (defined by PubMed and MeSH terms). RESULTS Our findings indicate the following: further modifications should be made to the code generated by ChatGPT for drawing CPCs in R; publications in the field of anesthesiology are dominated by China, followed by the United States and Japan; Capital Medical University (China) and Showa University Hospital (Japan) dominate research institutes in terms of publications and IF, respectively; and COVID-19 is the most frequently reported theme in the T100IF, accounting for 29%. CONCLUSIONS No articles using CPCs for bibliometrics have previously been found in PubMed. The code for drawing CPCs with R can be generated by ChatGPT, but further modification is required for implementation in bibliometrics. CPCs should be used in future studies to identify the characteristics of articles in other areas of research rather than being limited to anesthesiology, as in this study.
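The study drew its CPCs in R; the underlying idea can be sketched self-containedly in Python with a greedy spiral placement (a simplification of proper circle-packing layouts, with hypothetical publication counts):

```python
import math
import matplotlib.pyplot as plt

def pack_circles(radii):
    """Greedy packing: walk outward along a spiral and drop each circle
    at the first point where it overlaps none of those already placed."""
    placed = []  # (x, y, r) triples
    for r in sorted(radii, reverse=True):
        t = 0.0
        while True:
            x, y = t * math.cos(t), t * math.sin(t)
            if all(math.hypot(x - px, y - py) >= r + pr for px, py, pr in placed):
                placed.append((x, y, r))
                break
            t += 0.05
    return placed

# Hypothetical counts, listed largest-first so labels line up with sorted radii;
# sqrt makes circle area roughly proportional to the count.
counts = {"China": 120, "United States": 95, "Japan": 60, "Others": 30}
circles = pack_circles([math.sqrt(v) for v in counts.values()])

fig, ax = plt.subplots()
for (x, y, r), label in zip(circles, counts):
    ax.add_patch(plt.Circle((x, y), r, alpha=0.4))
    ax.annotate(label, (x, y), ha="center", va="center")
lim = max(max(abs(x), abs(y)) + r for x, y, r in circles)
ax.set_xlim(-lim, lim); ax.set_ylim(-lim, lim); ax.set_aspect("equal")
plt.show()
```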
Collapse
Affiliation(s)
- Sam Yu-Chieh Ho
- Department of Emergency Medicine, Chi-Mei Medical Center, Tainan, Taiwan
- Department of Geriatrics and Gerontology, ChiMei Medical Center, Tainan, Taiwan
| | - Tsair-Wei Chien
- Department of Medical Research, Chi-Mei Medical Center, Tainan, Taiwan
| | - Willy Chou
- Department of Physical Medicine and Rehabilitation, Chiali Chi-Mei Hospital, Tainan 710, Taiwan
- Department of Physical Medicine and Rehabilitation, Chung San Medical University Hospital, Taichung, Taiwan
| |
Collapse
|
27
|
Alanzi TM, Alzahrani W, Albalawi NS, Allahyani T, Alghamdi A, Al-Zahrani H, Almutairi A, Alzahrani H, Almulhem L, Alanzi N, Al Moarfeg A, Farhah N. Public Awareness of Obesity as a Risk Factor for Cancer in Central Saudi Arabia: Feasibility of ChatGPT as an Educational Intervention. Cureus 2023; 15:e50781. [PMID: 38239542 PMCID: PMC10795720 DOI: 10.7759/cureus.50781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/17/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND While the link between obesity and chronic diseases such as diabetes and cardiovascular disorders is well-documented, there is a growing body of evidence connecting obesity with an increased risk of cancer. However, public awareness of this connection remains limited. STUDY PURPOSE To analyze public awareness of overweight/obesity as a risk factor for cancer and to analyze public perceptions of the feasibility of ChatGPT, an artificial intelligence-based conversational agent, as an educational intervention tool. METHODS This study used a mixed-methods design: a deductive quantitative cross-sectional approach to draw precise, empirically grounded conclusions about public awareness of the link between obesity and cancer, and an inductive qualitative approach to interpret public perceptions of using ChatGPT to create awareness of obesity, cancer, and its risk factors. Participants were adult residents of Saudi Arabia. A total of 486 individuals participated in the survey and 21 in the semi-structured interviews. RESULTS About 65% of the participants were not completely aware of cancer and its risk factors. Significant differences in awareness were observed concerning age groups (p < .0001), socio-economic status (p = .041), and regional distribution (p = .0351). A total of 10 themes were identified from the interview data, comprising five positive factors (accessibility, personalization, cost-effectiveness, anonymity and privacy, and multi-language support) and five negative factors (information inaccuracy, lack of emotional intelligence, dependency and overreliance, data privacy and security, and inability to provide physical support or diagnosis). CONCLUSION This study has underscored the potential of leveraging ChatGPT as a valuable public awareness tool for cancer in Saudi Arabia.
Collapse
Affiliation(s)
- Turki M Alanzi
- Department of Health Information Management and Technology, College of Public Health, Imam Abdulrahman Bin Faisal University, Dammam, SAU
| | - Wala Alzahrani
- Department of Clinical Nutrition, College of Applied Medical Sciences, King Abdulaziz University, Jeddah, SAU
| | | | - Taif Allahyani
- College of Applied Medical Sciences, Umm Al-Qura University, Makkah, SAU
| | | | - Haneen Al-Zahrani
- Department of Hematology, Armed Forces Hospital at King Abdulaziz Airbase Dhahran, Dhahran, SAU
| | - Awatif Almutairi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | | | - Nouf Alanzi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | - Nesren Farhah
- Department of Health Informatics, College of Health Sciences, Saudi Electronic University, Riyadh, SAU
| |
Collapse
|
28
|
Mondal H, Mondal S. ChatGPT in academic writing: Maximizing its benefits and minimizing the risks. Indian J Ophthalmol 2023; 71:3600-3606. [PMID: 37991290 PMCID: PMC10788737 DOI: 10.4103/ijo.ijo_718_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2023] [Revised: 08/11/2023] [Accepted: 08/21/2023] [Indexed: 11/23/2023] Open
Abstract
This review article explores the use of ChatGPT in academic writing and provides insights on how to use it judiciously. With the increasing popularity of AI-powered language models, ChatGPT has emerged as a potential tool for assisting writers in the research and writing process. We provide a list of potential uses of ChatGPT by a novice researcher seeking help with research proposal preparation and manuscript writing. However, there are concerns regarding its reliability and potential risks associated with its use. The review highlights the importance of maintaining human judgment in the writing process and of using ChatGPT as a complementary tool rather than a replacement for human effort. The article concludes with recommendations for researchers and writers to ensure responsible and effective use of ChatGPT in academic writing.
Collapse
Affiliation(s)
- Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Shaikat Mondal
- Department of Physiology, Raiganj Government Medical College and Hospital, West Bengal, India
| |
Collapse
|
29
|
Sakai D, Maeda T, Ozaki A, Kanda GN, Kurimoto Y, Takahashi M. Performance of ChatGPT in Board Examinations for Specialists in the Japanese Ophthalmology Society. Cureus 2023; 15:e49903. [PMID: 38174202 PMCID: PMC10763518 DOI: 10.7759/cureus.49903] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
We investigated the potential of ChatGPT in the ophthalmological field in the Japanese language using board examinations for specialists in the Japanese Ophthalmology Society. We tested GPT-3.5- and GPT-4-based ChatGPT on five sets of past board examination problems in July 2023. Japanese text was used as the prompt, adopting two strategies: zero- and few-shot prompting. We compared the correct answer rate of ChatGPT with that of actual examinees, and the performance characteristics in 10 subspecialties were assessed. ChatGPT-3.5 and ChatGPT-4 correctly answered 112 (22.4%) and 229 (45.8%) of 500 questions with simple zero-shot prompting, respectively, and ChatGPT-4 correctly answered 231 (46.2%) questions with few-shot prompting. The correct answer rates of ChatGPT-3.5 were roughly one-half to one-third of those of the actual examinees for each examination set (p = 0.001), whereas the correct answer rates of ChatGPT-4 reached approximately 70% of those of the examinees. ChatGPT-4 had the highest correct answer rate (71.4% with zero-shot prompting and 61.9% with few-shot prompting) in "blepharoplasty, orbit, and ocular oncology," and the lowest (30.0% with zero-shot prompting and 23.3% with few-shot prompting) in "pediatric ophthalmology." We conclude that ChatGPT could become a practical tool in Japanese ophthalmology.
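The zero- versus few-shot comparison amounts to prepending worked examples to the prompt. A minimal sketch, assuming the OpenAI Python SDK (v1.x); the placeholder strings stand in for the Japanese examination text, which is not reproduced here:

```python
from openai import OpenAI

client = OpenAI()

system = {"role": "system",
          "content": "Answer the ophthalmology board question with one letter, a-e."}
question = "(a board examination question in Japanese, with choices a-e)"
worked_examples = [  # omitted entirely in the zero-shot condition
    {"role": "user", "content": "(an example question)"},
    {"role": "assistant", "content": "(its correct lettered answer)"},
]

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

zero_shot = ask([system, {"role": "user", "content": question}])
few_shot = ask([system, *worked_examples, {"role": "user", "content": question}])
```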
Collapse
Affiliation(s)
- Daiki Sakai
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Kobe City Medical Center General Hospital, Kobe, JPN
- Department of Surgery, Division of Ophthalmology, Kobe University Graduate School of Medicine, Kobe, JPN
| | - Tadao Maeda
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
| | - Atsuta Ozaki
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Mie University Graduate School of Medicine, Tsu, JPN
| | - Genki N Kanda
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Laboratory for Biologically Inspired Computing, RIKEN Center for Biosystems Dynamics Research, Kobe, JPN
| | - Yasuo Kurimoto
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Kobe City Medical Center General Hospital, Kobe, JPN
| | | |
Collapse
|
30
|
Melnyk O, Ismail A, Ghorashi NS, Heekin M, Javan R. Generative Artificial Intelligence Terminology: A Primer for Clinicians and Medical Researchers. Cureus 2023; 15:e49890. [PMID: 38174178 PMCID: PMC10762565 DOI: 10.7759/cureus.49890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
Generative artificial intelligence (AI) is rapidly transforming the medical field, as advanced tools powered by large language models (LLMs) make their way into clinical practice, research, and education. Chatbots, which can generate human-like responses, have gained attention for their potential applications. Therefore, familiarity with LLMs and other promising generative AI tools is crucial to harnessing their potential safely and effectively. As these AI-based technologies continue to evolve, medical professionals must develop a strong understanding of AI terminology and concepts, particularly generative AI, to effectively tackle real-world challenges and create solutions. This knowledge will enable healthcare professionals to utilize AI-driven innovations for improved patient care and increased productivity in the future. In this brief technical report, we explore 20 of the most relevant terms associated with the underlying technology behind LLMs and generative AI as they relate to the medical field, and we provide examples of how these topics relate to healthcare applications to aid understanding.
Collapse
Affiliation(s)
- Oleksiy Melnyk
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Ahmed Ismail
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Nima S Ghorashi
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Mary Heekin
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Ramin Javan
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| |
Collapse
|
31
|
Sarangi PK, Lumbani A, Swarup MS, Panda S, Sahoo SS, Hui P, Choudhary A, Mohakud S, Patel RK, Mondal H. Assessing ChatGPT's Proficiency in Simplifying Radiological Reports for Healthcare Professionals and Patients. Cureus 2023; 15:e50881. [PMID: 38249202 PMCID: PMC10799309 DOI: 10.7759/cureus.50881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/21/2023] [Indexed: 01/23/2024] Open
Abstract
Background Clear communication of radiological findings is crucial for effective healthcare decision-making. However, radiological reports are often complex, with technical terminology that makes them challenging for non-radiology healthcare professionals and patients to comprehend. Large language models like ChatGPT (Chat Generative Pre-trained Transformer, by OpenAI, San Francisco, CA) offer a potential solution by translating intricate reports into simplified language. This study aimed to assess the capability of ChatGPT-3.5 to simplify radiological reports so that healthcare professionals and patients can understand them better. Materials and methods Nine radiological reports, spanning various imaging modalities and medical conditions, were used in this study. For each report, ChatGPT was asked a set of seven questions (describe the procedure, mention the key findings, express the report in simple language, suggest further investigations, state whether further investigation is needed, identify grammatical or typing errors, and translate into Hindi). A total of eight radiologists rated the generated content on detailing, summarizing, simplifying content and language, factual correctness, further investigation, grammatical errors, and translation to Hindi. Results The highest score was obtained for detailing the report (94.17% accuracy) and the lowest score for drawing conclusions for the patient (85% accuracy); case-wise scores were similar (p-value = 0.97). The Hindi translation by ChatGPT was not suitable for patient communication. Conclusion The current free version of ChatGPT-3.5 was able to simplify radiological reports effectively, removing technical jargon while preserving essential diagnostic information, thereby enhancing accessibility for healthcare professionals and patients. Hence, it has the potential to improve medical communication and facilitate informed decision-making by healthcare professionals and patients.
Collapse
Affiliation(s)
| | - Amrita Lumbani
- Physiology, Mayo Institute of Medical Sciences, Barabanki, IND
| | - M Sarthak Swarup
- Radiodiagnosis, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, IND
| | - Suvankar Panda
- Radiodiagnosis, SCB (Srirama Chandra Bhanja) Medical College and Hospital, Cuttack, IND
| | - Smruti Snigdha Sahoo
- Radiodiagnosis, SCB (Srirama Chandra Bhanja) Medical College and Hospital, Cuttack, IND
| | - Pratisruti Hui
- Radiodiagnosis, All India Institute of Medical Sciences, Kalyani, Kalyani, IND
| | - Anish Choudhary
- Radiodiagnosis, Central Institute of Psychiatry, Ranchi, IND
| | - Sudipta Mohakud
- Radiodiagnosis, All India Institute of Medical Sciences, Bhubaneswar, Bhubaneswar, IND
| | - Ranjan Kumar Patel
- Radiodiagnosis, All India Institute of Medical Sciences, Bhubaneswar, Bhubaneswar, IND
| | - Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND
| |
Collapse
|
32
|
Tanaka OM, Gasparello GG, Hartmann GC, Casagrande FA, Pithon MM. Assessing the reliability of ChatGPT: a content analysis of self-generated and self-answered questions on clear aligners, TADs and digital imaging. Dental Press J Orthod 2023; 28:e2323183. [PMID: 37937680 PMCID: PMC10627416 DOI: 10.1590/2177-6709.28.5.e2323183.oar] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 09/04/2023] [Indexed: 11/09/2023] Open
Abstract
INTRODUCTION Artificial intelligence (AI) is a tool that is already part of our reality, and this is an opportunity to understand how it can be useful in interacting with patients and providing valuable information about orthodontics. OBJECTIVE This study evaluated ChatGPT's ability to provide accurate, high-quality answers to questions on clear aligners, temporary anchorage devices, and digital imaging in orthodontics. METHODS Forty-five questions and answers were generated by ChatGPT 4.0 and analyzed separately by five orthodontists. The evaluators independently rated the quality of the information provided on a Likert scale, in which higher scores indicated greater quality of information (1 = very poor; 2 = poor; 3 = acceptable; 4 = good; 5 = very good). The Kruskal-Wallis H test (p < 0.05) and post-hoc pairwise comparisons with the Bonferroni correction were performed. RESULTS Of the 225 evaluations by the five evaluators, 11 (4.9%) were rated very poor, 4 (1.8%) poor, and 15 (6.7%) acceptable. The majority were rated good [34 (15.1%)] or very good [161 (71.6%)]. Agreement among evaluators was only slight, with a Fleiss's kappa of 0.004. CONCLUSIONS ChatGPT proved effective in providing quality answers related to clear aligners, temporary anchorage devices, and digital imaging within the context of orthodontics.
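Fleiss's kappa for five raters scoring the same 45 items can be computed with statsmodels; the scores below are invented for illustration:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical Likert scores: one row per question, one column per orthodontist.
ratings = np.array([
    [5, 5, 4, 5, 5],
    [4, 5, 5, 3, 4],
    [5, 4, 5, 5, 2],
    [3, 4, 5, 4, 5],
])

table, _ = aggregate_raters(ratings)  # item-by-category count table
print(f"Fleiss's kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
```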
Collapse
|
33
|
Haidar O, Jaques A, McCaughran PW, Metcalfe MJ. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model. Cureus 2023; 15:e49764. [PMID: 38046759 PMCID: PMC10691169 DOI: 10.7759/cureus.49764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/30/2023] [Indexed: 12/05/2023] Open
Abstract
Introduction Ensuring access to high-quality information is paramount to facilitating informed surgical decision-making. The use of the internet to access health-related information is increasing, along with the growing prevalence of AI language models such as ChatGPT. We aim to assess the standard of AI-generated patient-facing information through a qualitative analysis of its readability and quality. Materials and methods We performed a retrospective qualitative analysis of information regarding three common vascular procedures: endovascular aortic repair (EVAR), endovenous laser ablation (EVLA), and femoro-popliteal bypass (FPBP). The ChatGPT responses were compared to patient information leaflets provided by the vascular charity Circulation Foundation UK. Readability was assessed using four readability scores: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality was assessed using the DISCERN tool by two independent assessors. Results The mean FKRE score was 33.3, compared to 59.1 for the information provided by the Circulation Foundation (SD=14.5, p=0.025), indicating poor readability of the AI-generated information. The FKGL indicated that the expected grade of students likely to read and understand ChatGPT responses was consistently higher than for the information leaflets, at 12.7 vs. 9.4 (SD=1.9, p=0.002). Two metrics measure readability in terms of the number of years of education required to understand a piece of writing: the GFS and the SMOG index. Both scores indicated that AI-generated answers were less accessible. The GFS for ChatGPT-provided information was 16.7 years versus 12.8 years for the leaflets (SD=2.2, p=0.002), and the SMOG index scores were 12.2 and 9.4 years for ChatGPT and the patient information leaflets, respectively (SD=1.7, p=0.001). The DISCERN scores were consistently higher for human-generated patient information leaflets than for AI-generated information across all procedures; the mean score for the information provided by ChatGPT was 50.3 vs. 56.0 for the Circulation Foundation information leaflets (SD=3.38, p<0.001). Conclusion We conclude that AI-generated information about vascular surgical procedures is currently poor in both readability and quality. Patients should be directed to reputable, human-generated information sources from trusted professional bodies to supplement direct education from the clinician during the pre-procedure consultation process.
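All four readability indices are closed-form formulas over sentence, word, and syllable counts. A self-contained sketch (the syllable counter is a crude vowel-group heuristic; published scores use dictionary-grade syllabification, e.g., via the textstat package):

```python
import math
import re

def count_syllables(word):
    # Crude heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    s = max(1, len(re.findall(r"[.!?]+", text)))  # sentence count
    words = re.findall(r"[A-Za-z']+", text)
    w = len(words)
    syl = sum(count_syllables(word) for word in words)
    poly = sum(count_syllables(word) >= 3 for word in words)  # "complex" words
    return {
        "FKRE": 206.835 - 1.015 * w / s - 84.6 * syl / w,
        "FKGL": 0.39 * w / s + 11.8 * syl / w - 15.59,
        "GFS": 0.4 * (w / s + 100 * poly / w),
        "SMOG": 1.043 * math.sqrt(poly * 30 / s) + 3.1291,
    }

print(readability("An endovascular aortic repair is a minimally invasive "
                  "procedure used to treat an aneurysm of the aorta."))
```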
Collapse
Affiliation(s)
- Omar Haidar
- Vascular Surgery, Lister Hospital, Stevenage, GBR
| | | | | | | |
Collapse
|
34
|
Murphy Lonergan R, Curry J, Dhas K, Simmons BI. Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps. Cureus 2023; 15:e48788. [PMID: 38098921 PMCID: PMC10720372 DOI: 10.7759/cureus.48788] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/14/2023] [Indexed: 12/17/2023] Open
Abstract
Large language models (LLMs) have broad potential applications in medicine, such as aiding with education, providing reassurance to patients, and supporting clinical decision-making. However, there is a notable gap in understanding their applicability and performance in the surgical domain and how their performance varies across specialties. This paper aims to evaluate the performance of LLMs in answering surgical questions relevant to clinical practice and to assess how this performance varies across different surgical specialties. We used the MedMCQA dataset, a large-scale multiple-choice question-answering (MCQA) dataset consisting of clinical questions across all areas of medicine. We extracted the relevant 23,035 surgical questions and submitted them to the popular LLMs Generative Pre-trained Transformers (GPT)-3.5 and GPT-4 (OpenAI OpCo, LLC, San Francisco, CA). A Generative Pre-trained Transformer is a large language model that can generate human-like text by predicting subsequent words in a sentence based on the context of the words that come before them. It is pre-trained on a diverse range of texts and can perform a variety of tasks, such as answering questions, without needing task-specific training. The question-answering accuracy of GPT was calculated and compared between the two models and across surgical specialties. GPT-3.5 and GPT-4 achieved accuracies of 53.3% and 64.4%, respectively, on surgical questions, a statistically significant difference in performance. When compared to their performance on the full MedMCQA dataset, the two models behaved differently: GPT-4 performed worse on surgical questions than on the dataset as a whole, while GPT-3.5 showed the opposite pattern. Significant variations in accuracy were also observed across surgical specialties, with strong performances in anatomy, vascular, and paediatric surgery and worse performances in orthopaedics, ENT, and neurosurgery. Large language models exhibit promising capabilities in addressing surgical questions, although the variability in their performance between specialties cannot be ignored. The lower performance of the latest GPT-4 model on surgical questions relative to questions across all of medicine highlights the need for targeted improvements and continuous updates to ensure relevance and accuracy in surgical applications. Further research and continuous monitoring of LLM performance in surgical domains are crucial to fully harnessing their potential and mitigating the risks of misinformation.
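Per-specialty accuracy and the overall model comparison reduce to a groupby and a two-proportion test. A sketch with toy data (the MedMCQA results themselves are not reproduced); since both models answered the same questions, a paired test such as McNemar's would be stricter, but the unpaired z-test is kept here for brevity:

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-question results: specialty label plus a correctness flag per model.
df = pd.DataFrame({
    "specialty": ["anatomy", "anatomy", "ENT", "ENT", "vascular", "vascular"],
    "gpt35_correct": [1, 1, 0, 1, 1, 0],
    "gpt4_correct": [1, 1, 1, 0, 1, 1],
})

print(df.groupby("specialty")[["gpt35_correct", "gpt4_correct"]].mean())

# Two-proportion z-test on the models' overall accuracies.
successes = [df["gpt35_correct"].sum(), df["gpt4_correct"].sum()]
trials = [len(df)] * 2
stat, p = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p = {p:.3f}")
```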
Collapse
Affiliation(s)
- Rebecca Murphy Lonergan
- Department of Medical Education, Chelsea and Westminster Hospital NHS Foundation Trust, London, GBR
| | - Jake Curry
- Centre for Ecology and Conservation, University of Exeter, Penryn, GBR
| | - Kallpana Dhas
- Department of Medical Education, Chelsea and Westminster Hospital NHS Foundation Trust, London, GBR
| | - Benno I Simmons
- Centre for Ecology and Conservation, University of Exeter, Penryn, GBR
| |
Collapse
|
35
|
Mondal H, Dash I, Mondal S, Behera JK. ChatGPT in Answering Queries Related to Lifestyle-Related Diseases and Disorders. Cureus 2023; 15:e48296. [PMID: 38058315 PMCID: PMC10696911 DOI: 10.7759/cureus.48296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/04/2023] [Indexed: 12/08/2023] Open
Abstract
Background Lifestyle-related diseases and disorders have become a significant global health burden. However, much of the population ignores such conditions or does not consult doctors about them. An artificial intelligence (AI)-based large language model (LLM) like ChatGPT (GPT-3.5) is capable of generating customized responses to a user's queries. Hence, it can act as a virtual telehealth agent. Its capability to answer questions about lifestyle-related diseases or disorders has not been explored. Objective This study aimed to evaluate the effectiveness of ChatGPT, an LLM, in answering queries related to lifestyle-related diseases or disorders. Methods A set of 20 lifestyle-related disease or disorder cases covering a wide range of topics, such as obesity, diabetes, cardiovascular health, and mental health, was prepared, each case with four questions. Each case and its questions were presented to ChatGPT, which was asked to answer them. Two physicians rated the content on a three-point Likert-like scale of accurate (2), partially accurate (1), and inaccurate (0). Further, the content was rated as adequate (2), inadequate (1), or misleading (0) to test the applicability of the guidance for patients. The readability of the text was analyzed with the Flesch-Kincaid Ease Score (FKES). Results Among the 20 cases, the average score for accuracy was 1.83±0.37 and for guidance was 1.9±0.21. Both scores were higher than the hypothetical median of 1.5 (p=0.004 and p<0.0001, respectively). ChatGPT answered the questions with a natural tone in 11 cases and with a positive tone in nine. The text was understandable by college graduates, with a mean FKES of 27.8±5.74. Conclusion The analysis of content accuracy revealed that ChatGPT provided reasonably accurate information in the majority of cases, successfully addressing queries related to lifestyle-related diseases or disorders. Hence, patients can obtain initial guidance about their condition when they have little time to consult a doctor or are waiting for an appointment.
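Comparing scores against a hypothetical median of 1.5, as above, is a one-sample Wilcoxon signed-rank test. A sketch with invented ratings on the same 0/1/2 scale:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-case accuracy ratings (0 = inaccurate, 1 = partially accurate,
# 2 = accurate) for the 20 cases.
scores = np.array([2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2])

# Test whether scores exceed the hypothetical median of 1.5; no difference can
# equal zero on this scale, so no observations are discarded.
stat, p = wilcoxon(scores - 1.5, alternative="greater")
print(f"W = {stat:.1f}, p = {p:.4f}")
```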
Collapse
Affiliation(s)
- Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, IND
| | - Ipsita Dash
- Biochemistry, Saheed Laxman Nayak Medical College and Hospital, Koraput, IND
| | - Shaikat Mondal
- Physiology, Raiganj Government Medical College and Hospital, Raiganj, IND
| | | |
Collapse
|
36
|
Hernandez CA, Vazquez Gonzalez AE, Polianovskaia A, Amoro Sanchez R, Muyolema Arce V, Mustafa A, Vypritskaya E, Perez Gutierrez O, Bashir M, Eighaei Sedeh A. The Future of Patient Education: AI-Driven Guide for Type 2 Diabetes. Cureus 2023; 15:e48919. [PMID: 38024047 PMCID: PMC10654048 DOI: 10.7759/cureus.48919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/16/2023] [Indexed: 12/01/2023] Open
Abstract
Introduction and aim The surging incidence of type 2 diabetes has become a growing concern for the healthcare sector. This chronic ailment, characterized by its complex blend of genetic and lifestyle determinants, has witnessed a notable increase in recent times, exerting substantial pressure on healthcare resources. As more individuals turn to online platforms for health guidance and embrace the use of Chat Generative Pre-trained Transformer (ChatGPT; San Francisco, CA: OpenAI), a text-generating AI (TGAI), to gain insights into their well-being, evaluating its effectiveness and reliability becomes crucial. This research primarily aimed to evaluate the correctness of TGAI responses to type 2 diabetes (T2DM) inquiries via ChatGPT. Furthermore, this study aimed to examine the consistency of TGAI in addressing common queries on T2DM complications for patient education. Material and methods Questions on T2DM were formulated by experienced physicians and screened by research personnel before being put to ChatGPT. Each question was posed three times, and the collected answers were summarized. Responses were then sorted by two seasoned physicians into three distinct categories: (a) appropriate, (b) inappropriate, and (c) unreliable. In instances of differing opinions, a third physician was consulted to achieve consensus. Results From the initial set of 110 T2DM questions, 40 were dismissed by experts as not relevant, resulting in a final count of 70. An overwhelming 98.5% of the AI's answers were judged appropriate, underscoring its reliability relative to traditional online search engines. Nonetheless, the 1.5% rate of inappropriate responses underlines the importance of ongoing AI improvements and strict adherence to medical protocols. Conclusion TGAI provides medical information of high quality and reliability. This study underscores TGAI's impressive effectiveness in delivering reliable information about T2DM, with 98.5% of responses aligning with the standard of care. These results hold promise for integrating AI platforms as supplementary tools to enhance patient education and outcomes.
Collapse
|
37
|
Sikander B, Baker JJ, Deveci CD, Lund L, Rosenberg J. ChatGPT-4 and Human Researchers Are Equal in Writing Scientific Introduction Sections: A Blinded, Randomized, Non-inferiority Controlled Study. Cureus 2023; 15:e49019. [PMID: 38111405 PMCID: PMC10727453 DOI: 10.7759/cureus.49019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/18/2023] [Indexed: 12/20/2023] Open
Abstract
Background Natural language processing models are increasingly used in scientific research, and their ability to perform various tasks in the research process is rapidly advancing. This study aims to investigate whether Generative Pre-trained Transformer 4 (GPT-4) is equal to humans in writing introduction sections for scientific articles. Methods This randomized non-inferiority study was reported according to the Consolidated Standards of Reporting Trials for non-inferiority trials and artificial intelligence (AI) guidelines. GPT-4 was instructed to synthesize 18 introduction sections based on the aims of previously published studies, and these sections were compared to the human-written introductions already published in a medical journal. Eight blinded assessors evaluated the introduction sections in random order using 1-10 Likert scales. Results There was no significant difference between GPT-4 and human introductions regarding publishability and content quality. GPT-4 scored one point significantly higher in readability, a difference considered not meaningful. The majority of assessors (59%) preferred GPT-4, while 33% preferred human-written introductions. Based on Lix and Flesch-Kincaid scores, GPT-4 introductions were 10 and two points higher, respectively, indicating longer sentences and longer words. Conclusion GPT-4 was found to be equal to humans in writing introductions with regard to publishability, readability, and content quality. The majority of assessors preferred GPT-4 introductions, and fewer than half could determine which introductions were written by GPT-4 and which by humans. These findings suggest that GPT-4 can be a useful tool for writing introduction sections, and further studies should evaluate its ability to write other parts of scientific articles.
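A non-inferiority conclusion of this kind rests on the confidence interval for the mean difference staying above a pre-set margin. A simplified normal-approximation sketch with invented ratings and an assumed margin of one Likert point:

```python
import numpy as np

# Hypothetical 1-10 publishability ratings for the 18 introductions per arm.
gpt4 = np.array([7, 8, 6, 7, 7, 8, 6, 7, 7, 8, 7, 6, 8, 7, 7, 6, 7, 8])
human = np.array([7, 7, 7, 8, 6, 7, 7, 6, 8, 7, 7, 7, 6, 8, 7, 7, 6, 7])
margin = -1.0  # assumed non-inferiority margin on the 10-point scale

diff = gpt4.mean() - human.mean()
se = np.sqrt(gpt4.var(ddof=1) / len(gpt4) + human.var(ddof=1) / len(human))
lower = diff - 1.645 * se  # lower bound of the one-sided 95% CI
print(f"difference = {diff:.2f}, lower bound = {lower:.2f}, "
      f"non-inferior: {lower > margin}")
```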
Affiliation(s)
- Lars Lund
- Urology, Odense University Hospital, Odense, DNK
38
Abujaber AA, Abd-Alrazaq A, Al-Qudimat AR, Nashwan AJ. A Strengths, Weaknesses, Opportunities, and Threats (SWOT) Analysis of ChatGPT Integration in Nursing Education: A Narrative Review. Cureus 2023; 15:e48643. [PMID: 38090452 PMCID: PMC10711690 DOI: 10.7759/cureus.48643]
Abstract
Amidst evolving healthcare demands, nursing education plays a pivotal role in preparing future nurses for complex challenges. Traditional approaches, however, must be revised to meet modern healthcare needs. ChatGPT, an AI-based chatbot, has garnered significant attention for its ability to personalize learning experiences, enhance virtual clinical simulations, and foster collaborative learning in nursing education. This review aims to thoroughly assess the potential impact of integrating ChatGPT into nursing education. The premise is that a comprehensive SWOT analysis examining the strengths, weaknesses, opportunities, and threats associated with ChatGPT can provide stakeholders with valuable insights, enabling informed decisions about its integration that prioritize improved learning outcomes. A thorough narrative literature review was undertaken to provide a solid foundation for the SWOT analysis. The materials included scholarly articles and reports, ensuring the study's credibility and allowing for a holistic and unbiased assessment. The analysis identified accessibility, consistency, adaptability, cost-effectiveness, and staying up-to-date as crucial factors influencing the strengths, weaknesses, opportunities, and threats associated with ChatGPT integration in nursing education. These themes provided a framework for understanding the potential risks and benefits of integrating ChatGPT into nursing education. This review highlights the importance of responsible and effective use of ChatGPT in nursing education and the need for collaboration among educators, policymakers, and AI developers. Addressing the identified challenges and leveraging the strengths of ChatGPT can lead to improved learning outcomes and enriched educational experiences for students, balancing technological advancement with careful consideration of the associated risks.
Affiliation(s)
- Alaa Abd-Alrazaq
- AI Center for Precision Health, Weill Cornell Medicine-Qatar, Doha, QAT
- Ahmad R Al-Qudimat
- Department of Public Health, Qatar University, Doha, QAT
- Surgical Research Section, Department of Surgery, Hamad Medical Corporation, Doha, QAT
39
Kaneda Y, Takita M, Hamaki T, Ozaki A, Tanimoto T. ChatGPT's Potential in Enhancing Physician Efficiency: A Japanese Case Study. Cureus 2023; 15:e48235. [PMID: 38050503 PMCID: PMC10693924 DOI: 10.7759/cureus.48235]
Abstract
Artificial intelligence (AI), particularly ChatGPT, developed by OpenAI (San Francisco, CA, USA), is making significant strides in the medical field. In a simulated case study, the dialogue between a 66-year-old Japanese female patient and a physician was transcribed and input into ChatGPT to assess its efficacy in drafting medical records, formulating differential diagnoses, and establishing treatment plans. The results showed high similarity between the medical summaries generated by ChatGPT and those of the attending physician. This suggests that ChatGPT has the potential to assist physicians in clinical reasoning and reduce the administrative burden, allowing them to spend more time with patients. However, there are limitations, such as the system's reliance on linguistic data and occasional inaccuracies. Despite its potential, the ethical implications of using patient data and the risk of AI replacing clinicians emphasize the need for continuous evaluation, rigorous oversight, and the establishment of comprehensive guidelines. As AI continues to integrate into healthcare, it is crucial for physicians to ensure that technology complements, rather than replaces, human expertise, with the primary focus remaining on delivering high-quality patient care.
Affiliation(s)
- Yudai Kaneda
- Epidemiology and Public Health, School of Medicine, Hokkaido University, Hokkaido, JPN
- Morihito Takita
- Internal Medicine, Medical Governance Research Institute, Tokyo, JPN
- Tamae Hamaki
- Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Shinjuku, Tokyo, JPN
- Akihiko Ozaki
- Breast and Thyroid Surgery, Jyoban Hospital of Tokiwa Foundation, Fukushima, JPN
- Tetsuya Tanimoto
- Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Kawasaki, Kanagawa, JPN
40
Makiev KG, Asimakidou M, Vasios IS, Keskinis A, Petkidis G, Tilkeridis K, Ververidis A, Iliopoulos E. A Study on Distinguishing ChatGPT-Generated and Human-Written Orthopaedic Abstracts by Reviewers: Decoding the Discrepancies. Cureus 2023; 15:e49166. [PMID: 38130535 PMCID: PMC10733892 DOI: 10.7759/cureus.49166]
Abstract
BACKGROUND ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI)-based language model that generates human-like text. Such AI-generated writing is comprehensible, contextually relevant, and difficult to differentiate from human-written content. ChatGPT has risen in popularity lately and is widely utilized in scholarly manuscript drafting. The aims of this study were to identify whether 1) human reviewers can differentiate between AI-generated and human-written abstracts and 2) AI detectors are currently reliable in detecting AI-generated abstracts. METHODS Seven blinded reviewers were asked to read 21 abstracts and determine which were AI-generated and which were human-written. The first group consisted of three orthopaedic residents with limited research experience (OR). The second group included three orthopaedic professors with extensive research experience (OP). The seventh reviewer was a non-orthopaedic doctor and acted as a control in terms of expertise. All abstracts were scanned by a plagiarism detection program. The performance of two different AI detectors in detecting AI-generated abstracts was also analyzed. A structured interview was conducted at the end of the survey to evaluate the decision-making process used by each reviewer. RESULTS The OR group correctly identified the authorship of 34.9% of the abstracts and the OP group 31.7%. The non-orthopaedic control correctly identified 76.2%. All AI-generated abstracts were 100% unique (0% plagiarism). The first AI detector correctly identified the authors of only 9/21 (42.9%) abstracts, whereas the second AI detector identified 14/21 (66.6%). CONCLUSION The inability to correctly identify AI-generated content poses a significant scientific risk, as "false" abstracts can end up in scientific conferences or publications. Neither expertise nor research background was shown to have any meaningful impact on the predictive outcome. Focusing on how statistical data are presented may help the differentiation process. Further research is warranted to highlight which elements could help reveal an AI-generated abstract.
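The headline percentages above are simple label-agreement rates. A minimal Python sketch with hypothetical ground-truth and detector labels for 21 abstracts (the per-abstract labels are not published in the abstract itself):

    # Hypothetical "ai"/"human" ground truth for 21 abstracts and one
    # detector's calls; accuracy is the fraction of matching labels.
    truth     = ["ai", "human", "ai", "human", "ai", "human", "human"] * 3
    predicted = ["ai", "human", "human", "human", "ai", "ai", "human"] * 3

    correct = sum(t == p for t, p in zip(truth, predicted))
    print(f"{correct}/{len(truth)} correct = {correct / len(truth):.1%}")

With two classes, 50% is chance-level, which puts the residents' and professors' rates (34.9% and 31.7%) below chance and only the second detector's 66.6% meaningfully above it.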
Affiliation(s)
- Konstantinos G Makiev
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Maria Asimakidou
- School of Medicine, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Ioannis S Vasios
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Anthimos Keskinis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Georgios Petkidis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Konstantinos Tilkeridis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Athanasios Ververidis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Efthymios Iliopoulos
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
41
Aliyeva A. "Bot or Not": Turing Problem in Otolaryngology. Cureus 2023; 15:e48170. [PMID: 38046723 PMCID: PMC10693309 DOI: 10.7759/cureus.48170]
Abstract
The aim of this article is to shed light on the evolving landscape of artificial intelligence (AI) integration in otolaryngology and its implications, with a particular focus on the ethical considerations surrounding AI applications. It also highlights the potential benefits of ChatGPT in patient management and scientific research within otolaryngology while emphasizing the necessity for ethical guidelines and validation processes. Ultimately, the article seeks to encourage a responsible and informed approach to AI adoption in otolaryngology, promoting collaboration between AI and healthcare professionals for the betterment of science and human well-being.
Affiliation(s)
- Aynur Aliyeva
- Otolaryngology - Head and Neck Surgery, Cincinnati Children's Hospital Medical Center, Ohio, USA
42
Diane A, Gencarelli P, Lee JM, Mittal R. Utilizing ChatGPT to Streamline the Generation of Prior Authorization Letters and Enhance Clerical Workflow in Orthopedic Surgery Practice: A Case Report. Cureus 2023; 15:e49680. [PMID: 38161881 PMCID: PMC10756745 DOI: 10.7759/cureus.49680]
Abstract
Prior authorization is a cumbersome process that requires clinicians to create an individualized letter containing detailed information about the patient's medical condition, the proposed treatment plan, and any supplemental information required to obtain approval from the patient's insurance company before services or procedures may be provided. Drafting authorization letters is time-consuming clerical work that places an increased administrative burden on orthopedic surgeons and office staff while taking time away from patient care. There is therefore a need to improve this process by streamlining workflows so that healthcare providers can prioritize direct patient care. In this report, we present a case utilizing OpenAI's ChatGPT (OpenAI, L.L.C., San Francisco, CA, USA) to draft a prior authorization request letter for the use of matrix-induced autologous chondrocyte implantation to treat a cartilage injury of the knee.
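A draft like the one described could also be requested programmatically rather than through the chat interface. A minimal sketch assuming the OpenAI Python SDK (v1.x interface), an OPENAI_API_KEY in the environment, and entirely hypothetical prompt text; the case report does not specify its exact prompts or tooling:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical prompts; real use would insert de-identified chart details.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You draft formal prior authorization letters "
                        "for an orthopedic surgery practice."},
            {"role": "user",
             "content": "Draft a prior authorization request for "
                        "matrix-induced autologous chondrocyte implantation "
                        "to treat a knee cartilage injury."},
        ],
    )
    print(response.choices[0].message.content)

Any real workflow would need to keep protected health information out of third-party APIs unless a compliant data-handling agreement is in place.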
Affiliation(s)
- Alioune Diane
- Department of Orthopaedic Surgery, Rutgers Robert Wood Johnson Medical School, New Brunswick, USA
- Pasquale Gencarelli
- Department of Orthopaedic Surgery, Rutgers Robert Wood Johnson Medical School, New Brunswick, USA
- James M Lee
- Department of Orthopaedic Surgery, Orange Orthopaedic Associates, West Orange, USA
- Rahul Mittal
- Department of Health Informatics, Rutgers School of Health Professions, Newark, USA
43
Lenihan D. Three Effective, Efficient, and Easily Implementable Ways to Integrate A.I. Into Medical Education. Cureus 2023; 15:e47204. [PMID: 37854479 PMCID: PMC10581027 DOI: 10.7759/cureus.47204]
Abstract
As a medical school CEO who is following the development of A.I. very closely, I believe that medical students are eager to embrace the possibilities that A.I. tools can deliver in their training. Not only do these students already use variations of A.I. in other areas of their lives, but they also embrace advanced technology and understand how to use it. With the tech readiness of today's students in mind, I have devised three recommendations for how best to infuse A.I. into medical education. This strategic guidance can deliver significant benefits to today's tech-fluent medical school students and enhance their training on their journeys to becoming doctors.
Affiliation(s)
- David Lenihan
- Neurology, Ponce Health Sciences University, Ponce, PRI
- Neurology, Ponce Health Sciences University, St. Louis, USA
44
Li J, Zhong J, Li Z, Xiao Y, Wang S. Ectopic Pituitary Neuroendocrine Tumor: A Case Report Written With the Help of ChatGPT. Cureus 2023; 15:e46999. [PMID: 37965416 PMCID: PMC10641033 DOI: 10.7759/cureus.46999]
Abstract
ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot program developed by OpenAI (OpenAI Incorporated, Mission District, San Francisco, USA). It is a natural-language processing tool driven by artificial intelligence technology that can complete tasks such as writing emails, video scripts, copywriting, translations, code, and papers. As several studies have now examined the use of this tool for academic writing, we report a rare case of a suprasellar ectopic growth hormone-secreting pituitary neuroendocrine tumor (PitNET) in a 53-year-old female patient with facial changes and hypertrophy of the hands and feet for six years. When asked about unusual topics, ChatGPT can provide answers but not specific ones, and, as with most machine learning solutions, verifying the correctness of its information is a problem. Question asked of ChatGPT: Summarize ectopic pituitary neuroendocrine tumor. "Suprasellar heterotopic pituitary neuroendocrine tumors refer to tumors formed by the heterotopic growth of pituitary neuroendocrine tumor cells in the sellar region other than the pituitary glands. This tumor is common in middle-aged women, and it often manifests as headache, visual impairment, and endocrine dysfunction. Diagnosis is usually made by combining history, physical examination, blood tests, and imaging. Treatment is primarily performed through surgical resection of the tumor, radiotherapy, and drug therapy, and the prognosis is mostly good." The content in quotation marks was generated by ChatGPT.
Affiliation(s)
- Jun Li
- Neurosurgery, Department of Neurosurgery, Fuzhou 900th Hospital, Fuzong Clinical Medical College of Fujian Medical University, Fuzhou, CHN
- Jiansheng Zhong
- Neurosurgery, Department of Neurosurgery, Fuzhou 900th Hospital, Fuzong Clinical Medical College of Fujian Medical University, Fuzhou, CHN
- Ziqi Li
- Neurosurgery, Department of Neurosurgery, Oriental Hospital Affiliated to Xiamen University, Fuzhou, CHN
- Yong Xiao
- Neurosurgery, Central Institute for Mental Health, University of Heidelberg, Heidelberg, DEU
- Shousen Wang
- Neurosurgery, Department of Neurosurgery, Oriental Hospital Affiliated to Xiamen University, Fuzhou, CHN
45
Kuang YR, Zou MX, Niu HQ, Zheng BY, Zhang TL, Zheng BW. ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 2023; 109:2886-2891. [PMID: 37352529 PMCID: PMC10583932 DOI: 10.1097/js9.0000000000000571]
Abstract
BACKGROUND ChatGPT, powered by the GPT model and Transformer architecture, has demonstrated remarkable performance in medicine and healthcare, providing customized and informative responses. In this study, we investigated the potential of ChatGPT in the field of neurosurgery, focusing on its applications at the patient, neurosurgery student/resident, and neurosurgeon levels. METHODS The authors conducted inquiries with ChatGPT from the viewpoints of patients, neurosurgery students/residents, and neurosurgeons, covering a range of topics such as disease diagnosis, treatment options, prognosis, rehabilitation, and patient care. The authors also explored concepts related to neurosurgery, including fundamental principles and clinical aspects, as well as tools and techniques to enhance the skills of neurosurgery students/residents. Additionally, the authors examined disease-specific medical interventions and the decision-making processes involved in clinical practice. RESULTS The authors received individual responses from ChatGPT, but these tended to be shallow and repetitive, lacking depth and personalization. Furthermore, ChatGPT may struggle to discern a patient's emotional state, hindering the establishment of rapport and the delivery of appropriate care. The language used in the medical field is influenced by technical and cultural factors, and biases in the training data can result in skewed or inaccurate responses. Additionally, ChatGPT's limitations include the inability to conduct physical examinations or interpret diagnostic images, potentially overlooking complex details and individual nuances in each patient's case. Moreover, its absence from the surgical setting limits its practical utility. CONCLUSION Although ChatGPT is a powerful language model, it cannot substitute for the expertise and experience of trained medical professionals. It lacks the capability to perform physical examinations, make diagnoses, administer treatments, establish trust, provide emotional support, and assist in the recovery process. Moreover, the implementation of artificial intelligence in healthcare necessitates careful consideration of legal and ethical concerns. While recognizing the potential of ChatGPT, additional training with comprehensive data is necessary to maximize its capabilities fully.
Affiliation(s)
- Yi-Rui Kuang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Ming-Xiang Zou
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Hua-Qing Niu
- Department of Ophthalmology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Bo-Yv Zheng
- Department of Orthopedics Surgery, General Hospital of the Central Theater Command, Wuhan, China
- Tao-Lan Zhang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Pharmacy, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Bo-Wen Zheng
- Department of Musculoskeletal Tumor Center, People's Hospital, Peking University, Beijing Key Laboratory of Musculoskeletal Tumor, Beijing, China
46
Cankurtaran RE, Polat YH, Aydemir NG, Umay E, Yurekli OT. Reliability and Usefulness of ChatGPT for Inflammatory Bowel Diseases: An Analysis for Patients and Healthcare Professionals. Cureus 2023; 15:e46736. [PMID: 38022227 PMCID: PMC10630704 DOI: 10.7759/cureus.46736]
Abstract
AIM We aimed to evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT) within the context of inflammatory bowel disease (IBD), which is expected to become an increasingly significant health issue in the future. In addition, the study assessed whether ChatGPT serves as a reliable and useful resource for both patients and healthcare professionals. METHODS Twenty specific questions were identified for the two main forms of IBD, Crohn's disease (CD) and ulcerative colitis (UC). The questions were divided into two sets: one containing questions directed at healthcare professionals and the other containing questions directed at patients. The responses were evaluated with seven-point Likert-type reliability and usefulness scales. RESULTS Reliability and usefulness scores were calculated for four groups (two diseases and two question sources) by averaging the scores of both raters. The highest reliability and usefulness scores were obtained for the professional-directed questions (5.00±1.21 and 5.15±1.08, respectively), followed by the CD questions (4.70±1.26 and 4.75±1.06) and the UC questions (4.40±1.21 and 4.55±1.31). The reliability scores of the answers for professionals were significantly higher than those for patients (both raters, p=0.032). CONCLUSION Despite its capacity for reliability and usefulness in the context of IBD, ChatGPT still has limitations and deficiencies. Correcting these deficiencies and enhancing the model with more detailed and up-to-date information could make it a significant source of information for both patients and medical professionals.
Affiliation(s)
- Yunus Halil Polat
- Department of Gastroenterology, Ankara Training and Research Hospital, Ankara, TUR
- Ebru Umay
- Physical Medicine and Rehabilitation, University of Health Sciences, Ankara Etlik City Hospital, Ankara, TUR
- Oyku Tayfur Yurekli
- Department of Gastroenterology, Ankara Yildirim Beyazit University Faculty of Medicine, Ankara, TUR
47
Irfan B, Yaqoob A. ChatGPT's Epoch in Rheumatological Diagnostics: A Critical Assessment in the Context of Sjögren's Syndrome. Cureus 2023; 15:e47754. [PMID: 38022092 PMCID: PMC10676288 DOI: 10.7759/cureus.47754]
Abstract
INTRODUCTION The rise of artificial intelligence in medical practice is reshaping clinical care. Large language models (LLMs) like ChatGPT have the potential to assist in rheumatology by personalizing scientific information retrieval, particularly in the context of Sjögren's Syndrome. This study aimed to evaluate the efficacy of ChatGPT in providing insights into Sjögren's Syndrome and differentiating it from other rheumatological conditions. MATERIALS AND METHODS A database of peer-reviewed articles and clinical guidelines focused on Sjögren's Syndrome was compiled. Clinically relevant questions were presented to ChatGPT, with responses assessed for accuracy, relevance, and comprehensiveness. Techniques such as blinding, random control queries, and temporal analysis ensured unbiased evaluation. ChatGPT's responses were also assessed using the 15-item DISCERN tool. RESULTS ChatGPT effectively highlighted key immunopathological and histopathological characteristics of Sjögren's Syndrome, although some inconsistencies in crucial data and citations were noted. For a given clinical vignette, ChatGPT correctly identified potential etiological considerations, with Sjögren's Syndrome prominent among them. DISCUSSION LLMs like ChatGPT offer rapid access to vast amounts of data, which benefits both patients and providers. While they democratize information, limitations such as potential oversimplification and reference inaccuracies were observed. The balance between LLM insights and clinical judgment, as well as continuous model refinement, is crucial. CONCLUSION LLMs like ChatGPT offer significant potential in rheumatology, providing swift and broad medical insights. However, a cautious approach is vital, ensuring rigorous training and ethical application for optimal patient care and clinical practice.
Affiliation(s)
- Bilal Irfan
- Microbiology and Immunology, University of Michigan, Ann Arbor, USA
48
Köroğlu EY, Fakı S, Beştepe N, Tam AA, Çuhacı Seyrek N, Topaloglu O, Ersoy R, Cakir B. A Novel Approach: Evaluating ChatGPT's Utility for the Management of Thyroid Nodules. Cureus 2023; 15:e47576. [PMID: 38021609 PMCID: PMC10666652 DOI: 10.7759/cureus.47576]
Abstract
Background and objective Artificial intelligence (AI) applications such as Chat Generative Pre-Trained Transformer (ChatGPT), created by OpenAI, represent a revolutionary aspect of today's technology and have benefitted professionals in many fields and society at large. In this study, we aimed to assess how effective ChatGPT is in helping both the patient and the physician manage thyroid nodules, a very common pathology. Methods Fifty-five questions frequently asked by patients were identified and put to ChatGPT. Subsequently, three cases of thyroid nodules were progressively presented to ChatGPT. The answers to the patient questions were scored for correctness and reliability by two endocrinologists. As for the cases, the diagnostic and therapeutic approaches provided by ChatGPT were analyzed and scored by two endocrinologists for correctness, safety, and usability. The responses were evaluated using 7-point Likert-type scales that we designed. Results The answers to patient questions were found to be mostly correct and reliable by both raters (Rater #1: 6.47±0.50 and 6.27±0.52; Rater #2: 6.18±0.92 and 6.09±0.96). Regarding the management of cases, ChatGPT's approach was found to be largely correct, safe, and usable by Rater #1, while Rater #2 evaluated the approaches as partially or mostly correct, safe, and usable. Conclusion Based on our findings, ChatGPT can be used as an informative and reliable resource for managing patients with thyroid nodules. While it is not suitable as a primary resource for physicians, it has the potential to be a helpful and supportive tool.
Affiliation(s)
- Ekin Y Köroğlu
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Sevgül Fakı
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Nagihan Beştepe
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Abbas A Tam
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Neslihan Çuhacı Seyrek
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Oya Topaloglu
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Reyhan Ersoy
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Bekir Cakir
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
49
Sultan I, Al-Abdallat H, Alnajjar Z, Ismail L, Abukhashabeh R, Bitar L, Abu Shanap M. Using ChatGPT to Predict Cancer Predisposition Genes: A Promising Tool for Pediatric Oncologists. Cureus 2023; 15:e47594. [PMID: 38021917 PMCID: PMC10666922 DOI: 10.7759/cureus.47594]
Abstract
BACKGROUND Determining genetic susceptibility to cancer predisposition syndromes (CPS) through cancer predisposition gene (CPG) testing is critical in facilitating appropriate prevention and surveillance strategies. This study investigates the use of ChatGPT, a large language model, in predicting CPGs from clinical notes. METHODS Our study involved 53 patients with pathogenic CPG mutations. Two kinds of clinical notes were used: the first-visit note, containing a thorough history and physical examination, and the genetic clinic note, summarizing the patient's diagnosis and family history. We asked ChatGPT to recommend CPS genes based on these notes and compared its predictions with the previously identified mutations. RESULTS RB1 was the most frequently mutated gene in our cohort (34%), followed by NF1 (9.4%), TP53 (5.7%), and VHL (5.7%). Of the 53 patients, 30 had genetic clinic notes, with a median length of 54 words. ChatGPT correctly predicted the gene in 93% of these cases, although it failed to predict the EPCAM and VHL genes in specific patients. For the first-visit notes (median length: 461 words), ChatGPT correctly predicted the gene in 64% of cases. CONCLUSION ChatGPT shows promise in predicting CPGs from clinical notes, particularly genetic clinic notes. This approach may be useful in enhancing CPG testing, especially in areas lacking genetic testing resources. With further training, ChatGPT may improve its predictive potential and expand its clinical applicability. However, additional research is needed to explore the full potential and applicability of ChatGPT.
Affiliation(s)
- Iyad Sultan
- Department of Pediatrics, King Hussein Cancer Center, Amman, JOR
- Zaina Alnajjar
- Department of Medicine, Hashemite University, Zarqa, JOR
- Layan Ismail
- Department of Medicine, University of Jordan, Amman, JOR
- Razan Abukhashabeh
- Department of Cell Therapy and Applied Genomics, King Hussein Cancer Center, Amman, JOR
- Layla Bitar
- Department of Pediatric Oncology, King Hussein Cancer Center, Amman, JOR
- Mayada Abu Shanap
- Department of Pediatric Oncology, King Hussein Cancer Center, Amman, JOR
50
Biri SK, Kumar S, Panigrahi M, Mondal S, Behera JK, Mondal H. Assessing the Utilization of Large Language Models in Medical Education: Insights From Undergraduate Medical Students. Cureus 2023; 15:e47468. [PMID: 38021810 PMCID: PMC10662537 DOI: 10.7759/cureus.47468]
Abstract
Background Artificial intelligence (AI) has the potential to be integrated into medical education. Among AI-based technologies, large language models (LLMs) such as ChatGPT, Google Bard, Microsoft Bing, and Perplexity have emerged as powerful tools with natural language processing capabilities. Against this background, this study investigated the knowledge, attitude, and practice of undergraduate medical students regarding the utilization of LLMs in medical education at a medical college in Jharkhand, India. Methods A cross-sectional online survey was sent to 370 undergraduate medical students via Google Forms. The questionnaire comprised the following three domains: knowledge, attitude, and practice, each containing six questions. Cronbach's alpha for the knowledge, attitude, and practice domains was 0.703, 0.707, and 0.809, respectively, and the intraclass correlation coefficients were 0.82, 0.87, and 0.78, respectively. The average scores in the three domains were compared using ANOVA. Results A total of 172 students participated in the study (response rate: 46.49%). The largest share of students (45.93%) rarely used LLMs for teaching-learning purposes (chi-square (3) = 41.44, p < 0.0001). The overall scores for knowledge (3.21±0.55), attitude (3.47±0.54), and practice (3.26±0.61) differed significantly (ANOVA F (2, 513) = 10.2, p < 0.0001), with the highest score in attitude and the lowest in knowledge. Conclusion While there is a generally positive attitude toward the incorporation of LLMs in medical education, concerns about overreliance and potential inaccuracies are evident. LLMs offer the potential to enhance learning resources and provide accessible education, but their integration requires further planning. Further studies are required to explore the long-term impact of LLMs in diverse educational contexts.
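Cronbach's alpha, used above to check the internal consistency of each six-question domain, is a short computation over a respondents-by-items matrix. A minimal sketch with hypothetical ratings (numpy assumed; the study's raw data are not published in the abstract):

    import numpy as np

    def cronbach_alpha(scores):
        # scores: respondents x items matrix of Likert responses.
        # alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical six-item knowledge-domain answers from five students
    demo = np.array([
        [3, 4, 3, 3, 4, 3],
        [2, 3, 2, 3, 3, 2],
        [4, 4, 5, 4, 4, 5],
        [3, 3, 3, 2, 3, 3],
        [5, 4, 4, 5, 5, 4],
    ])
    print(round(cronbach_alpha(demo), 3))

Values around 0.7 or higher, like the 0.703-0.809 reported, are conventionally read as acceptable internal consistency for a short survey domain.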
Affiliation(s)
- Subir Kumar
- Pharmacology, Phulo Jhano Medical College, Dumka, IND
- Shaikat Mondal
- Physiology, Raiganj Government Medical College & Hospital, Raiganj, IND
- Joshil Kumar Behera
- Physiology, Nagaland Institute of Medical Sciences and Research, Kohima, IND
- Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, IND