1
Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and perplexity responses. Medicine (Baltimore) 2024; 103:e38009. [PMID: 38701313 PMCID: PMC11062651 DOI: 10.1097/md.0000000000038009]
Abstract
Subdural hematoma is defined as a collection of blood in the subdural space between the dura mater and the arachnoid. It is a condition that neurosurgeons frequently encounter and has acute, subacute, and chronic forms. The annual incidence in adults is reported to be 1.72-20.60 per 100,000 people. Our study aimed to evaluate the quality, reliability, and readability of the answers given by ChatGPT, Bard, and Perplexity to questions about "subdural hematoma." In this observational and cross-sectional study, we asked ChatGPT, Bard, and Perplexity separately to provide the 100 most frequently asked questions about "subdural hematoma." Responses from all three chatbots were analyzed for readability, quality, reliability, and adequacy. When the median readability scores of the ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed in all formulas (P < .001). All three chatbots' responses were found to be difficult to read. Bard's responses were more readable than ChatGPT's (P < .001) and Perplexity's (P < .001) for all scores evaluated. Although the evaluated formulas differed in their results, Perplexity's answers were more readable than ChatGPT's (P < .05). Bard's answers had the best Global Quality Scale (GQS) scores (P < .001). Perplexity's responses had the best Journal of the American Medical Association and modified DISCERN scores (P < .001). The current capabilities of ChatGPT, Bard, and Perplexity are inadequate in terms of the quality and readability of "subdural hematoma"-related text content. The readability standard for patient education materials, as determined by the American Medical Association, the National Institutes of Health, and the United States Department of Health and Human Services, is at or below grade 6. The readability levels of the responses of artificial intelligence applications such as ChatGPT, Bard, and Perplexity are significantly higher than the recommended sixth-grade level.
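A minimal sketch (not the study's actual pipeline) of the kind of readability screening described here, using the open-source textstat package; the response strings are illustrative placeholders:

```python
# Screen chatbot responses against the grade-6 readability target for
# patient education materials; `textstat` is an assumed tool choice.
import textstat

responses = {
    "ChatGPT": "A subdural hematoma is an accumulation of blood ...",
    "Bard": "Blood can collect between the coverings of the brain ...",
}

TARGET_GRADE = 6  # AMA/NIH-recommended reading level

for model, text in responses.items():
    fkgl = textstat.flesch_kincaid_grade(text)  # US school grade level
    fre = textstat.flesch_reading_ease(text)    # higher = easier (0-100)
    verdict = "OK" if fkgl <= TARGET_GRADE else "too difficult"
    print(f"{model}: FKGL={fkgl:.1f}, FRE={fre:.1f} -> {verdict}")
```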
Affiliation(s)
- Şanser Gül
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
- İsmail Erdemir
- Department of Anesthesiology and Critical Care, Faculty of Medicine, Dokuz Eylül University, Izmir, Turkey
- Volkan Hanci
- Department of Anesthesiology and Reanimation, Ankara Sincan Education and Research Hospital, Ankara, Turkey
- Evren Aydoğmuş
- Department of Neurosurgery, Istanbul Kartal Dr Lütfi Kırdar City Hospital, Istanbul, Turkey
- Yavuz Selim Erkoç
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
2
Deng L, Wang T, Yangzhang, Zhai Z, Tao W, Li J, Zhao Y, Luo S, Xu J. Evaluation of large language models in breast cancer clinical scenarios: a comparative analysis based on ChatGPT-3.5, ChatGPT-4.0, and Claude2. Int J Surg 2024; 110:1941-1950. [PMID: 38668655 PMCID: PMC11019981 DOI: 10.1097/js9.0000000000001066]
Abstract
BACKGROUND Large language models (LLMs) have garnered significant attention in the AI domain owing to their exemplary context recognition and response capabilities. However, the potential of LLMs in specific clinical scenarios, particularly in breast cancer diagnosis, treatment, and care, has not been fully explored. This study aimed to compare the performances of three major LLMs in the clinical context of breast cancer. METHODS Clinical scenarios designed specifically for breast cancer were segmented into five pivotal domains (nine cases): assessment and diagnosis, treatment decision-making, postoperative care, psychosocial support, and prognosis and rehabilitation. The LLMs were used to generate feedback for various queries related to these domains. For each scenario, a panel of five breast cancer specialists, each with over a decade of experience, evaluated the feedback from the LLMs in terms of quality, relevance, and applicability. RESULTS There was a moderate level of agreement among the raters (Fleiss' kappa=0.345, P<0.05). Regarding response length, GPT-4.0 and GPT-3.5 provided relatively longer feedback than Claude2. Furthermore, across the nine case analyses, GPT-4.0 significantly outperformed the other two models in average quality, relevance, and applicability. GPT-4.0 markedly surpassed GPT-3.5 in quality in four of the five clinical areas and scored higher than Claude2 in tasks related to psychosocial support and treatment decision-making. CONCLUSION This study revealed that in the realm of clinical applications for breast cancer, GPT-4.0 demonstrates superiority in quality and relevance as well as exceptional applicability, especially when compared to GPT-3.5. Relative to Claude2, GPT-4.0 holds advantages in specific domains. With the expanding use of LLMs in the clinical field, ongoing optimization and rigorous accuracy assessments are paramount.
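A minimal sketch of how inter-rater agreement of the kind reported here (Fleiss' kappa for five raters) can be computed with statsmodels; the ratings below are illustrative placeholders, not the study's data:

```python
# Fleiss' kappa for multiple raters scoring each LLM response (1-5 scale).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = responses (subjects), columns = the five specialist raters
ratings = np.array([
    [4, 4, 5, 4, 3],
    [2, 3, 3, 2, 3],
    [5, 5, 4, 5, 5],
    [3, 3, 2, 3, 4],
])

table, _ = aggregate_raters(ratings)  # per-subject counts of each category
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")  # the study reported 0.345
```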
Affiliation(s)
- Linfang Deng
- Department of Nursing, Jinzhou Medical University, Jinzhou
- Yangzhang
- Department of Breast Surgery, Xingtai People’s Hospital of Hebei Medical University, Xingtai, Hebei, People’s Republic of China
- Zhenhua Zhai
- Department of General Surgery, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou
- Wei Tao
- Department of Breast Surgery
- Yi Zhao
- Department of Breast Surgery
- Shaoting Luo
- Department of Pediatric Orthopedics, Shengjing Hospital of China Medical University, Shenyang
- Jinjiang Xu
- Department of Health Management Center, The First Hospital of Jinzhou Medical University, Jinzhou, Liaoning
3
Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments. Surgery 2024; 175:936-942. [PMID: 38246839 PMCID: PMC10947829 DOI: 10.1016/j.surg.2023.12.014]
Abstract
BACKGROUND Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries. METHODS We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries. RESULTS A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6/16 (37.5%) questions. CONCLUSION Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions regarding its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA. https://twitter.com/bratogram
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA.
4
Meng J, Zhang Z, Tang H, Xiao Y, Liu P, Gao S, He M. Evaluation of ChatGPT in providing appropriate fracture prevention recommendations and medical science question responses: A quantitative research. Medicine (Baltimore) 2024; 103:e37458. [PMID: 38489735 PMCID: PMC10939678 DOI: 10.1097/md.0000000000037458]
Abstract
Currently, there are limited studies assessing ChatGPT's ability to provide appropriate responses to medical questions. Our study aims to evaluate the adequacy of ChatGPT's responses to questions regarding osteoporotic fracture prevention and medical science. We created a list of 25 questions based on the guidelines and our clinical experience. Additionally, we included 11 medical science questions from the journal Science. Three patients, three non-medical professionals, three specialist doctors, and three scientists evaluated the accuracy and appropriateness of responses given by ChatGPT-3.5 on October 2, 2023. To simulate a consultation, an inquirer (either a patient or non-medical professional) would send their questions to a consultant (specialist doctor or scientist) via a website. The consultant would forward the questions to ChatGPT for answers, which would then be evaluated for accuracy and appropriateness by the consultant before being sent back to the inquirer via the website for further review. The primary outcome is the appropriate, inappropriate, and unreliable rate of ChatGPT responses as evaluated separately by the inquirer and consultant groups. Compared to orthopedic clinicians, the patients rated the appropriateness of ChatGPT's responses to the questions about osteoporotic fracture prevention slightly higher, although the difference was not statistically significant (88% vs 80%, P = .70). For medical science questions, non-medical professionals and medical scientists gave similar ratings. In addition, the experts' ratings of the appropriateness of ChatGPT's responses to osteoporotic fracture prevention and to medical science questions were comparable. On the other hand, the patients perceived the appropriateness of ChatGPT's responses to osteoporotic fracture prevention questions as slightly higher than that to medical science questions (88% vs 72.7%, P = .34). ChatGPT is capable of providing comparable and appropriate responses to medical science questions, as well as to fracture prevention-related issues. Both the inquirers seeking advice and the consultants providing advice recognize ChatGPT's expertise in these areas.
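The abstract does not state which test produced its P values; as a hedged illustration, Fisher's exact test is one standard way to compare two small-sample proportions such as 88% vs 80% (22/25 vs 20/25 questions rated appropriate):

```python
# Compare two appropriateness rates with Fisher's exact test (an assumed
# test choice, not confirmed by the paper).
from scipy import stats

#             appropriate  not appropriate
patients   = [22, 3]   # 88% of 25 questions
clinicians = [20, 5]   # 80% of 25 questions

odds_ratio, p_value = stats.fisher_exact([patients, clinicians])
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.2f}")  # non-significant
```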
Affiliation(s)
- Jiahao Meng
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Ziyi Zhang
- Department of Neurology, The Second Xiangya Hospital, Central South University, Changsha, Hunan, China
- Hang Tang
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Yifan Xiao
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Pan Liu
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- Shuguang Gao
- Department of Orthopaedics, Xiangya Hospital, Central South University, #87 Xiangya Road, Changsha, Hunan, China
- National Clinical Research Center of Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Miao He
- Department of Neurology, The Second Xiangya Hospital, Central South University, Changsha, Hunan, China
5
Abbas A, Rehman MS, Rehman SS. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. Cureus 2024; 16:e55991. [PMID: 38606229 PMCID: PMC11007479 DOI: 10.7759/cureus.55991]
Abstract
INTRODUCTION Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions. METHODS The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA). RESULTS A total of 163 questions were posed to each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5). CONCLUSION GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.
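A hedged sketch of the one-way ANOVA the abstract reports, assuming per-question correctness is coded 1 (correct) / 0 (incorrect); the arrays reproduce the reported totals but are otherwise illustrative:

```python
# One-way ANOVA across the four models' per-question correctness.
import numpy as np
from scipy import stats

gpt4   = np.ones(163)                      # 163/163 correct
gpt35  = np.array([1] * 134 + [0] * 29)    # 134/163
claude = np.array([1] * 138 + [0] * 25)    # 138/163
bard   = np.array([1] * 123 + [0] * 40)    # 123/163

f_stat, p_value = stats.f_oneway(gpt4, gpt35, claude, bard)
print(f"F = {f_stat:.2f}, P = {p_value:.4f}")
```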
Affiliation(s)
- Ali Abbas
- Medical School, University of Texas Southwestern Medical School, Dallas, USA
- Mahad S Rehman
- Medical School, University of Texas Southwestern Medical School, Dallas, USA
- Syed S Rehman
- Nephrology, Baptist Hospitals of Southeast Texas, Beaumont, USA
6
Lum ZC, Collins DP, Dennison S, Guntupalli L, Choudhary S, Saiz AM, Randall RL. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level. Cureus 2024; 16:e56104. [PMID: 38618358 PMCID: PMC11014641 DOI: 10.7759/cureus.56104]
Abstract
Introduction Artificial intelligence (AI) models built on large language models (LLMs) and trained on non-specific domains have gained attention for their innovative information processing. As AI advances, it is essential to regularly evaluate these tools' competency to maintain high standards, prevent errors or biases, and avoid flawed reasoning or misinformation that could harm patients or spread inaccuracies. Our study aimed to determine the performance of Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI and Google BARD (BARD) in orthopedic surgery, assess performance based on question types, contrast performance between the two AIs, and compare AI performance to that of orthopedic residents. Methods We administered 757 Orthopedic In-Training Examination (OITE) questions to ChatGPT and BARD. After excluding image-related questions, the AIs answered 390 multiple-choice questions, all categorized within 10 sub-specialties (basic science, trauma, sports medicine, spine, hip and knee, pediatrics, oncology, shoulder and elbow, hand, and foot and ankle) and three taxonomy classes (recall, interpretation, and application of knowledge). Statistical analysis was performed on the number of questions answered correctly by each AI model, each model's performance within each sub-specialty, and each model's performance in comparison to the results of orthopedic residents classified by post-graduate year (PGY) level. Results BARD answered more questions correctly overall (58% vs 54%, p<0.001). ChatGPT performed better in sports medicine and basic science and worse in hand surgery, while BARD performed better in basic science (p<0.05). The AIs performed better on recall questions than on application-of-knowledge questions (p<0.05). Based on previous data, AI performance ranked in the 42nd-96th percentile for post-graduate year ones (PGY1s), 27th-58th for PGY2s, 3rd-29th for PGY3s, 1st-21st for PGY4s, and 1st-17th for PGY5s. Discussion ChatGPT excelled in sports medicine but fell short in hand surgery, while both AIs performed well in basic science but poorly on application-of-knowledge taxonomy questions. BARD performed better than ChatGPT overall. Although the AIs reached the level of a second-year (PGY2) orthopedic resident, they fell short of passing the American Board of Orthopedic Surgery (ABOS) examination. Their strength on recall-based inquiries highlights their potential as orthopedic learning and educational tools.
Affiliation(s)
- Zachary C Lum
- Orthopedic Surgery, University of California (UC) Davis School of Medicine, Sacramento, USA
- Orthopedic Surgery, Nova Southeastern University, Pembroke Pines, USA
- Dylon P Collins
- College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, USA
- Stanley Dennison
- College of Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Fort Lauderdale, USA
- Lohitha Guntupalli
- Osteopathic Medicine, Nova Southeastern University Dr. Kiran C. Patel College of Osteopathic Medicine, Clearwater, USA
- Soham Choudhary
- Orthopedic Surgery, University of California, Davis, Davis, USA
- Augustine M Saiz
- Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, USA
- Robert L Randall
- Orthopedic Surgery, University of California (UC) Davis Health, Sacramento, USA
7
Sudharshan R, Shen A, Gupta S, Zhang-Nunes S. Assessing the Utility of ChatGPT in Simplifying Text Complexity of Patient Educational Materials. Cureus 2024; 16:e55304. [PMID: 38559518 PMCID: PMC10981786 DOI: 10.7759/cureus.55304]
Abstract
INTRODUCTION AI chatbots are being increasingly used in healthcare settings, and there is growing interest in using AI to assist in patient education. Extensive healthcare information is available online but is often too complex to understand. Our objective was to determine whether physicians can recommend the free version of ChatGPT (version 3.5; OpenAI, San Francisco, CA, USA) for patients to simplify text from the American Academy of Ophthalmology (AAO) in English and Spanish. This version of ChatGPT was assessed due to its broad accessibility across patient populations. METHODS Fifteen articles were chosen from the AAO in both languages and simplified with ChatGPT 10 times each. The readability of original and simplified articles was assessed with the Flesch Reading Ease and Gunning Fog Index for English, and the Fernández Huerta, Gutiérrez, Szigriszt-Pazo, INFLESZ, and Legibilidad-µ scales for Spanish. Grade levels were calculated with the Flesch-Kincaid Grade Level and Crawford Nivel-de-Grado. Mean, standard deviation, and two-tailed t-tests were used to assess differences before and after simplification. RESULTS Average grade levels before and after simplification were as follows: English, 8.43±1.17 to 8.9±2.1 (p=0.41), and Spanish, 5.3±0.34 to 4.1±1.1 (p=0.0001). Spanish articles were significantly simplified per Legibilidad-µ (p=0.003). No significant difference was noted for the other scales. CONCLUSIONS The readability of AAO articles in English worsened, though not significantly, but significantly improved in Spanish. This may result from simpler syllable structures and a smaller overall vocabulary in Spanish. With further testing, physicians could recommend ChatGPT for Spanish-speaking patients to improve health literacy.
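A minimal sketch of the before/after grade-level comparison; the abstract reports two-tailed t-tests, and a paired test is assumed here since each simplified article derives from its own original (the grade levels are illustrative placeholders):

```python
# Paired two-tailed t-test on article grade levels pre/post simplification.
import numpy as np
from scipy import stats

original_grade   = np.array([5.1, 5.6, 5.0, 5.4, 5.3])  # Spanish originals
simplified_grade = np.array([4.0, 4.5, 3.2, 4.8, 3.9])  # after ChatGPT

t_stat, p_value = stats.ttest_rel(original_grade, simplified_grade)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```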
Affiliation(s)
- Rasika Sudharshan
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Alena Shen
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Shreya Gupta
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
- Sandy Zhang-Nunes
- Ophthalmology, University of Southern California (USC) Roski Eye Institute, Los Angeles, USA
8
Lee GU, Hong DY, Kim SY, Kim JW, Lee YH, Park SO, Lee KR. Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank. Medicine (Baltimore) 2024; 103:e37325. [PMID: 38428889 PMCID: PMC10906566 DOI: 10.1097/md.0000000000037325]
Abstract
Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.
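The abstract does not state which test was used to compare correct-response rates; a hedged sketch of one standard approach, a chi-squared test on the correct/incorrect contingency table reconstructed from the reported percentages of the 123 questions:

```python
# Chi-squared test across the four models' correct/incorrect counts
# (an assumed analysis, not confirmed by the paper).
from scipy.stats import chi2_contingency

#        correct  incorrect (out of 123)
table = [
    [93, 30],   # ChatGPT-4, 75.6%
    [87, 36],   # Bing Chat, 70.7%
    [70, 53],   # ChatGPT-3.5, 56.9%
    [63, 60],   # Bard, 51.2%
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p_value:.4f}")
```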
Affiliation(s)
- Go Un Lee
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
- Dae Young Hong
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Sin Young Kim
- Department of Emergency Medicine, Konkuk University Medical Center, Seoul, Republic of Korea
- Jong Won Kim
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Young Hwan Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Sang O Park
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
- Kyeong Ryong Lee
- Department of Emergency Medicine, Konkuk University School of Medicine, Seoul, Republic of Korea
9
Khatib M, Hasani IW. Acetabular Aneurysmal Bone Cyst During the Syrian Conflict: A Case Report of Surgical Treatment and Outcomes. Cureus 2024; 16:e56474. [PMID: 38638726 PMCID: PMC11025696 DOI: 10.7759/cureus.56474]
Abstract
Aneurysmal bone cysts (ABCs) are uncommon benign bone lesions that consist of blood-filled vascular spaces separated by fibrous tissue septa. Their diagnosis and surgical management are challenging in a war-torn region. In this case report, we present a rare case of a giant aneurysmal bone cyst located around the acetabulum in a 10-year-old girl who presented with an antalgic limp and left hip pain. The lesion was successfully treated with curettage and mixed autologous and synthetic bone grafts, and two-year follow-up revealed complete resolution of symptoms and radiological evidence of bone regeneration. This case highlights the successful surgical treatment of a challenging ABC in a difficult setting during the Syrian conflict.
Affiliation(s)
- Ibrahim W Hasani
- Biochemistry, Idlib University Hospital, Idlib, SYR
- Biochemistry, Mary Private University (MPU), Idlib, SYR
- Biochemistry, Al-Shamal Private University (SPU), Idlib, SYR
10
Nakajima N, Fujimori T, Furuya M, Kanie Y, Imai H, Kita K, Uemura K, Okada S. A Comparison Between GPT-3.5, GPT-4, and GPT-4V: Can the Large Language Model (ChatGPT) Pass the Japanese Board of Orthopaedic Surgery Examination? Cureus 2024; 16:e56402. [PMID: 38633935 PMCID: PMC11023708 DOI: 10.7759/cureus.56402]
Abstract
Introduction Recently, large language models such as ChatGPT (OpenAI, San Francisco, CA) have evolved rapidly. These models are designed to think and act like humans and possess a broad range of specialized knowledge. GPT-3.5 was reported to perform at a passing level on the United States Medical Licensing Examination. Its capabilities continue to evolve, and in October 2023, GPT-4V became available as a model capable of image recognition. It is therefore important to know the current performance of these models, because they will soon be incorporated into medical practice. We aimed to evaluate the performance of ChatGPT in the field of orthopedic surgery. Methods We used three years' worth of Japanese Board of Orthopaedic Surgery Examinations (JBOSE), conducted in 2021, 2022, and 2023. Questions and their multiple-choice answers were used in their original Japanese form, as was the official examination rubric. We inputted these questions into three versions of ChatGPT: GPT-3.5, GPT-4, and GPT-4V. For image-based questions, we inputted only the textual statements for GPT-3.5 and GPT-4, and both the images and textual statements for GPT-4V. As the minimum scoring rate required to pass is not officially disclosed, it was calculated using publicly available data. Results The estimated minimum scoring rate required to pass was calculated as 50.1% (43.7-53.8%). For GPT-4, even when answering all questions, including the image-based ones, the percentage of correct answers was 59% (55-61%), and GPT-4 was able to achieve the passing line. When image-based questions were excluded, the score reached 67% (63-73%). For GPT-3.5, the percentage was limited to 30% (28-32%), and this version could not pass the examination. There was a significant difference in performance between GPT-4 and GPT-3.5 (p < 0.001). For image-based questions, the percentage of correct answers was 25% for GPT-3.5, 38% for GPT-4, and 38% for GPT-4V. There was no significant difference in performance on image-based questions between GPT-4 and GPT-4V. Conclusions ChatGPT performed well enough to pass the orthopedic specialist examination. With further training data, such as images, ChatGPT is expected to find application in the orthopedics field.
Affiliation(s)
- Takahito Fujimori
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Masayuki Furuya
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Yuya Kanie
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Hirotatsu Imai
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Kosuke Kita
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Keisuke Uemura
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
- Seiji Okada
- Orthopaedic Surgery, Osaka University, Graduate School of Medicine, Suita, JPN
11
Yalla GR, Hyman N, Hock LE, Zhang Q, Shukla AG, Kolomeyer NN. Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures. Cureus 2024; 16:e56766. [PMID: 38650824 PMCID: PMC11034394 DOI: 10.7759/cureus.56766]
Abstract
Introduction With the potential for artificial intelligence (AI) chatbots to serve as a primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from the AI chatbots ChatGPT-4, Bard, and Bing by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials. Methods Section headers from AAO glaucoma-related patient education brochures were adapted into question form and posed five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared to the AAO brochure information, scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), corresponding to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections. Results Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard of 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard of 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts. Conclusion AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of AAO brochures and elevated readability levels, AI chatbots require improvement to be a useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
Affiliation(s)
- Goutham R Yalla
- Department of Ophthalmology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, USA
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Nicholas Hyman
- Department of Ophthalmology, Vagelos College of Physicians and Surgeons, Columbia University, New York, USA
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
- Lauren E Hock
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Qiang Zhang
- Glaucoma Research Center, Wills Eye Hospital, Philadelphia, USA
- Biostatistics Consulting Core, Vickie and Jack Farber Vision Research Center, Wills Eye Hospital, Philadelphia, USA
- Aakriti G Shukla
- Department of Ophthalmology, Glaucoma Division, Columbia University Irving Medical Center, New York, USA
12
Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open 2024; 5:139-146. [PMID: 38354748 PMCID: PMC10867788 DOI: 10.1302/2633-1462.52.bjo-2023-0113.r1]
Abstract
Aims While internet search engines have been the primary information source for patients' questions, artificial intelligence large language models like ChatGPT are trending towards becoming the new primary source. The purpose of this study was to determine if ChatGPT can answer patient questions about total hip (THA) and knee arthroplasty (TKA) with consistent accuracy, comprehensiveness, and easy readability. Methods We posed the 20 most Google-searched questions about THA and TKA, plus ten additional postoperative questions, to ChatGPT. Each question was asked twice to evaluate for consistency in quality. Following each response, we responded with, "Please explain so it is easier to understand," to evaluate ChatGPT's ability to reduce response reading grade level, measured as Flesch-Kincaid Grade Level (FKGL). Five resident physicians rated the 120 responses on 1 to 5 accuracy and comprehensiveness scales. Additionally, they answered a "yes" or "no" question regarding acceptability. Mean scores were calculated for each question, and responses were deemed acceptable if ≥ four raters answered "yes." Results The mean accuracy and comprehensiveness scores were 4.26 (95% confidence interval (CI) 4.19 to 4.33) and 3.79 (95% CI 3.69 to 3.89), respectively. Out of all the responses, 59.2% (71/120; 95% CI 50.0% to 67.7%) were acceptable. ChatGPT was consistent when asked the same question twice, giving no significant difference in accuracy (t = 0.821; p = 0.415), comprehensiveness (t = 1.387; p = 0.171), acceptability (χ2 = 1.832; p = 0.176), and FKGL (t = 0.264; p = 0.793). There was a significantly lower FKGL (t = 2.204; p = 0.029) for easier responses (11.14; 95% CI 10.57 to 11.71) than original responses (12.15; 95% CI 11.45 to 12.85). Conclusion ChatGPT answered THA and TKA patient questions with accuracy comparable to previous reports of websites, with adequate comprehensiveness, but with limited acceptability as the sole information source. ChatGPT has potential for answering patient questions about THA and TKA, but needs improvement.
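A minimal sketch of how the reported acceptability proportion and its 95% confidence interval (71 of 120 responses; 59.2%, CI 50.0% to 67.7%) can be approximately reproduced; the Wilson method is an assumption, chosen because it closely matches the reported interval:

```python
# 95% confidence interval for the acceptability proportion.
from statsmodels.stats.proportion import proportion_confint

acceptable, total = 71, 120
low, high = proportion_confint(acceptable, total, alpha=0.05, method="wilson")
print(f"{acceptable / total:.1%} acceptable, 95% CI {low:.1%} to {high:.1%}")
```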
Affiliation(s)
- Benjamin M. Wright
- Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Michael S. Bodnar
- Morsani College of Medicine, University of South Florida, Tampa, Florida, USA
- Andrew D. Moore
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Meghan C. Maseda
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Michael P. Kucharik
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Connor C. Diaz
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Christian M. Schmidt
- Department of Orthopaedic Surgery, University of South Florida, Tampa, Florida, USA
- Hassan R. Mir
- Orthopaedic Trauma Service, Florida Orthopedic Institute, Tampa, Florida, USA
13
Gengatharan D, Saggi SS, Bin Abd Razak HR. Pre-operative Planning of High Tibial Osteotomy With ChatGPT: Are We There Yet? Cureus 2024; 16:e54858. [PMID: 38533173 PMCID: PMC10964394 DOI: 10.7759/cureus.54858]
Abstract
INTRODUCTION ChatGPT (Chat Generative Pre-trained Transformer), developed by OpenAI (San Francisco, CA, USA), has gained attention in the medical field. It has the potential to enhance and simplify tasks such as preoperative planning in orthopedic surgery. We aimed to test ChatGPT's accuracy in measuring the angle of correction for high tibial osteotomy for cases planned and performed at a tertiary teaching hospital in Singapore. MATERIALS AND METHODS Peri-operative angular parameters from 114 consecutive patients who underwent medial opening wedge high tibial osteotomy (MOWHTO) were used to query ChatGPT 3.0. First, ChatGPT 3.0 was asked what information it required to plan a MOWHTO. Based on its response, the pre-operative medial proximal tibial angle (MPTA) and joint line congruence angle (JLCA) were provided. ChatGPT 3.0 then responded with its recommended angle of correction, which was compared against the surgical correction planned manually by our fellowship-trained surgeon. A root mean square error analysis was then performed to compare ChatGPT 3.0 and manual planning. RESULTS The root mean square error (RMSE) of ChatGPT 3.0 in predicting the correction angle in MOWHTO was 2.96, suggesting a very poor model fit. CONCLUSION Although ChatGPT 3.0 represents a significant breakthrough in large language models with extensive capabilities, it is not currently optimized to effectively perform complex pre-operative planning in orthopedic surgery, specifically in the context of MOWHTO. Further refinement and consideration of specific factors are necessary to enhance its accuracy and suitability for such applications.
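A minimal sketch of the RMSE comparison between model-suggested and surgeon-planned correction angles; the angle values below are illustrative placeholders, not the study's data:

```python
# Root mean square error between suggested and planned correction angles.
import numpy as np

surgeon_deg = np.array([8.0, 10.5, 9.0, 12.0, 7.5])   # planned corrections
chatgpt_deg = np.array([11.0, 8.0, 12.5, 9.5, 10.0])  # model suggestions

rmse = np.sqrt(np.mean((chatgpt_deg - surgeon_deg) ** 2))
print(f"RMSE = {rmse:.2f} degrees")  # the study reported 2.96
```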
Affiliation(s)
- Hamid Rahmatullah Bin Abd Razak
- Musculoskeletal Sciences, Duke-Nus Medical School, Singapore, SGP
- Orthopaedic Surgery, Sengkang General Hospital, Singapore, SGP
14
Podda M, Di Martino M, Ielpo B, Catena F, Coccolini F, Pata F, Marchegiani G, De Simone B, Damaskos D, Mole D, Leppaniemi A, Sartelli M, Yang B, Ansaloni L, Biffl W, Kluger Y, Moore EE, Pellino G, Di Saverio S, Pisanu A. The 2023 MANCTRA Acute Biliary Pancreatitis Care Bundle: A Joint Effort Between Human Knowledge and Artificial Intelligence (ChatGPT) to Optimize the Care of Patients With Acute Biliary Pancreatitis in Western Countries. Ann Surg 2024; 279:203-212. [PMID: 37450700 PMCID: PMC10782931 DOI: 10.1097/sla.0000000000006008]
Abstract
OBJECTIVE To generate an up-to-date bundle for managing acute biliary pancreatitis using an evidence-based, artificial intelligence (AI)-assisted GRADE method. BACKGROUND A care bundle is a set of core elements of care distilled from the most solid evidence-based practice guidelines and recommendations. METHODS The research questions addressed in this bundle followed the PICO criteria. The working group summarized the effects of interventions, with the strength of recommendation and quality of evidence, applying the GRADE methodology. The ChatGPT AI system was used to independently assess the quality of evidence for each element in the bundle, together with the strength of the recommendations. RESULTS The 7 elements of the bundle discourage antibiotic prophylaxis in patients with acute biliary pancreatitis, support the use of a full-solid diet in patients with mild to moderately severe acute biliary pancreatitis, and recommend early enteral nutrition in patients unable to feed by mouth. The bundle states that endoscopic retrograde cholangiopancreatography should be performed within the first 48 to 72 hours of hospital admission in patients with cholangitis. Early laparoscopic cholecystectomy should be performed in patients with mild acute biliary pancreatitis. When operative intervention is needed for necrotizing pancreatitis, it should start with the endoscopic step-up approach. CONCLUSIONS We have developed a new care bundle with 7 key elements for managing patients with acute biliary pancreatitis. This new bundle, whose scientific strength has been increased by combining human knowledge with AI from the new ChatGPT software, should be introduced in emergency departments, wards, and intensive care units.
Affiliation(s)
- Mauro Podda
- Department of Surgical Science, Emergency Surgery Unit, Cagliari State University Hospital, Cagliari, Italy
- Marcello Di Martino
- Division of Hepatobiliary and Liver Transplantation Surgery, A.O.R.N. Cardarelli, Naples, Italy
- Benedetto Ielpo
- Hepatobiliary Division, Hospital del Mar, Pompeu Fabra University, Barcelona, Spain
- Fausto Catena
- Department of Emergency and Trauma Surgery, Bufalini Hospital, Cesena, Italy
- Federico Coccolini
- General, Emergency and Trauma Surgery Unit, Pisa University Hospital, Pisa, Italy
- Francesco Pata
- Department of Surgery, University of Calabria, Cosenza, Italy
- Giovanni Marchegiani
- Department of Surgical, Oncological and Gastroenterological Sciences (DISCOG), Hepato-Pancreato-Biliary Surgery and Liver Transplantation Unit, University of Padua, Padua, Italy
- Belinda De Simone
- Department of Emergency and Metabolic Minimally Invasive Surgery, Centre Hospitalier Intercommunal de Poissy/Saint Germain en Laye, Poissy Cedex, France
- Dimitrios Damaskos
- Department of Upper GI Surgery, Royal Infirmary of Edinburgh, Edinburgh, Scotland, UK
- Damian Mole
- Centre for Inflammation Research, Clinical Surgery, University of Edinburgh, Edinburgh, Scotland, UK
- Ari Leppaniemi
- Department of Abdominal Surgery, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland
- Baohong Yang
- Department of Oncology, Weifang People’s Hospital, The First Affiliated Hospital of Weifang Medical University, Weifang, Shandong, China
- Department of Gastroenterology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, Henan, China
- Luca Ansaloni
- Department of General Surgery, IRCCS Policlinico San Matteo Foundation, Pavia, Italy
- Walter Biffl
- Division of Trauma and Acute Care Surgery, Scripps Memorial Hospital La Jolla, La Jolla, CA
- Yoram Kluger
- Department of General Surgery, Rambam Medical Center, Haifa, Israel
- Ernest E. Moore
- Denver Health System—Denver Health Medical Center, Denver, CO
- Gianluca Pellino
- “Luigi Vanvitelli” University of Campania, Naples, Italy
- Department of Colorectal Surgery, Vall d’Hebron University Hospital, Universitat Autonoma de Barcelona UAB, Barcelona, Spain
- Salomone Di Saverio
- Department of Surgery, Madonna del Soccorso Hospital, San Benedetto del Tronto, Italy
- Adolfo Pisanu
- Department of Surgical Science, Emergency Surgery Unit, Cagliari State University Hospital, Cagliari, Italy
15
Reis F, Lenz C. Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios. Cureus 2024; 16:e53899. [PMID: 38465163 PMCID: PMC10925004 DOI: 10.7759/cureus.53899]
Abstract
Introduction With the expanding awareness and use of AI-powered chatbots, it seems likely that an increasing number of people will use them to assess and evaluate their medical symptoms. If chatbots that have not undergone a thorough medical evaluation for this specific use are used for this purpose, various risks may arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective, and to examine variations in substantive medical assessment accuracy and empathetic communication style among the chatbots' responses. Materials and methods Our study compared three AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports of medical emergencies, as well as three cases without an urgent reason for emergency medical admission, were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The respective answers of the chatbots were qualitatively compared with each other regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language. Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in their urgency assessments. Clear differences were seen in the level of detail of the differential diagnoses, the overall length of the answers, and how each chatbot dealt with the challenge of being confronted with medical issues. All answers were comparable in terms of empathy and comprehensibility. Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications, but should always be provided with a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their answers. Given the lack of medical supervision of many established chatbots, subsequent studies and experience are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.
Affiliation(s)
- Florian Reis
- Medical Affairs, Pfizer Pharma GmbH, Berlin, DEU
16
Kapsali MZ, Livanis E, Tsalikidis C, Oikonomou P, Voultsos P, Tsaroucha A. Ethical Concerns About ChatGPT in Healthcare: A Useful Tool or the Tombstone of Original and Reflective Thinking? Cureus 2024; 16:e54759. [PMID: 38523987 PMCID: PMC10961144 DOI: 10.7759/cureus.54759]
Abstract
Artificial intelligence (AI), the rapidly advancing branch of computer science aiming to create digital systems with human-like behavior and intelligence, seems to have invaded almost every field of modern life. Launched in November 2022, ChatGPT (Chat Generative Pre-trained Transformer) is a textual AI application capable of creating human-like responses characterized by original language and high coherence. Although AI-based language models have demonstrated impressive capabilities in healthcare, ChatGPT has drawn controversy within the scientific and academic communities. This chatbot already appears to have a substantial impact as an educational tool for healthcare professionals and transformative potential for clinical practice, and it could lead to dramatic changes in scientific research. Nevertheless, legitimate concerns have been raised about whether pre-trained, AI-generated text is a threat not only to original thinking and new scientific ideas but also to academic and research integrity, as it becomes increasingly difficult to identify its AI origin owing to the coherence and fluency of the produced text. This short review aims to summarize the potential applications and consequential implications of ChatGPT in the three critical pillars of medicine: education, research, and clinical practice. In addition, this paper discusses whether the current use of this chatbot complies with the ethical principles for the safe use of AI in healthcare, as determined by the World Health Organization. Finally, this review highlights the need for an updated ethical framework and increased vigilance among healthcare stakeholders to harness the potential benefits and limit the imminent dangers of this new innovative technology.
Affiliation(s)
- Marina Z Kapsali
- Postgraduate Program on Bioethics, Laboratory of Bioethics, Democritus University of Thrace, Alexandroupolis, GRC
- Efstratios Livanis
- Department of Accounting and Finance, University of Macedonia, Thessaloniki, GRC
- Christos Tsalikidis
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Panagoula Oikonomou
- Laboratory of Experimental Surgery, Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Polychronis Voultsos
- Laboratory of Forensic Medicine & Toxicology (Medical Law and Ethics), School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, GRC
- Aleka Tsaroucha
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
17
Almagazzachi A, Mustafa A, Eighaei Sedeh A, Vazquez Gonzalez AE, Polianovskaia A, Abood M, Abdelrahman A, Muyolema Arce V, Acob T, Saleem B. Generative Artificial Intelligence in Patient Education: ChatGPT Takes on Hypertension Questions. Cureus 2024; 16:e53441. [PMID: 38435177 PMCID: PMC10909311 DOI: 10.7759/cureus.53441]
Abstract
Introduction Uncontrolled hypertension significantly contributes to the development and deterioration of various medical conditions, such as myocardial infarction, chronic kidney disease, and cerebrovascular events. Despite being the most common preventable risk factor for all-cause mortality, only a fraction of affected individuals maintain their blood pressure in the desired range. In recent times, there has been a growing reliance on online platforms for medical information. While online platforms provide a convenient source of information, differentiating reliable from unreliable information can be daunting for the layperson, and false information can potentially hinder timely diagnosis and management of medical conditions. The surge in accessibility of generative artificial intelligence (GeAI) technology has led to increased use in obtaining health-related information. This has sparked debates among healthcare providers about the potential for misuse and misinformation while recognizing the role of GeAI in improving health literacy. This study aims to investigate the accuracy of AI-generated information specifically related to hypertension. Additionally, it seeks to explore the reproducibility of information provided by GeAI. Method A nonhuman-subject qualitative study was devised to evaluate the accuracy of information provided by ChatGPT regarding hypertension and its secondary complications. Frequently asked questions on hypertension were compiled by three study staff members, internal medicine residents at an ACGME-accredited program, and then reviewed by a physician experienced in treating hypertension, resulting in a final set of 100 questions. Each question was posed to ChatGPT three times, once by each study staff member, and the majority response was then assessed against the recommended guidelines. A board-certified internal medicine physician with over eight years of experience further reviewed the responses and categorized them into two classes based on their clinical appropriateness: appropriate (in line with clinical recommendations) and inappropriate (containing errors). Descriptive statistical analysis was employed to assess ChatGPT responses for accuracy and reproducibility. Result Initially, a pool of 130 questions was gathered, from which a final set of 100 questions was selected for the purpose of this study. When assessed against acceptable standard responses, ChatGPT responses were found to be appropriate in 92.5% of cases and inappropriate in 7.5%. Furthermore, ChatGPT had a reproducibility score of 93%, meaning that it could consistently reproduce answers that conveyed similar meanings across multiple runs. Conclusion ChatGPT showcased commendable accuracy in addressing commonly asked questions about hypertension. These results underscore the potential of GeAI in providing valuable information to patients. However, continued research and refinement are essential to further evaluate the reliability and broader applicability of ChatGPT within the medical field.
Collapse
Affiliation(s)
| | - Ahmed Mustafa
- Internal Medicine, Capital Health System, Trenton, USA
| | | | | | | | - Muhanad Abood
- Internal Medicine, Capital Health System, Trenton, USA
| | | | | | - Talar Acob
- Internal Medicine Residency Program, Capital Health Regional Medical Center, Trenton, USA
| | - Bushra Saleem
- Internal Medicine, Capital Health System, Trenton, USA
| |
Collapse
|
18
|
Rammohan R, Joy MV, Magam SG, Natt D, Magam SR, Pannikodu L, Desai J, Akande O, Bunting S, Yost RM, Mustacchia P. Understanding the Landscape: The Emergence of Artificial Intelligence (AI), ChatGPT, and Google Bard in Gastroenterology. Cureus 2024; 16:e51848. [PMID: 38327910 PMCID: PMC10847895 DOI: 10.7759/cureus.51848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/07/2024] [Indexed: 02/09/2024] Open
Abstract
Introduction Artificial intelligence (AI) integration in healthcare, specifically in gastroenterology, has opened new avenues for enhanced patient care and medical decision-making. This study aims to assess the reliability and accuracy of two prominent AI tools, ChatGPT 4.0 and Google Bard, in answering gastroenterology-related queries, thereby evaluating their potential utility in medical settings. Methods The study employed a structured approach in which typical gastroenterology questions were input into ChatGPT 4.0 and Google Bard. Independent reviewers evaluated responses using a Likert scale and cross-referenced them with guidelines from authoritative gastroenterology bodies. Statistical analysis, including the Mann-Whitney U test, was conducted to assess the significance of differences in ratings. Results ChatGPT 4.0 demonstrated higher reliability and accuracy in its responses than Google Bard, as indicated by higher mean ratings and statistically significant p-values in hypothesis testing. However, limitations in the data structure, such as the inability to conduct detailed correlation analysis, were noted. Conclusion The study concludes that ChatGPT 4.0 outperforms Google Bard in providing reliable and accurate responses to gastroenterology-related queries. This finding underscores the potential of AI tools like ChatGPT in enhancing healthcare delivery. However, the study also highlights the need for a broader and more diverse assessment of AI capabilities in healthcare to fully leverage their potential in clinical practice.
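A Mann-Whitney U test on Likert ratings, as used above, can be run in a few lines with SciPy. This sketch uses made-up ratings, not the study's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical 1-5 Likert ratings assigned by reviewers to each model's answers.
chatgpt_ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5]
bard_ratings = [3, 4, 3, 2, 4, 3, 3, 4, 2, 3]

stat, p = mannwhitneyu(chatgpt_ratings, bard_ratings, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")  # a small p suggests the rating distributions differ
```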
Collapse
Affiliation(s)
- Rajmohan Rammohan
- Gastroenterology, Nassau University Medical Center, East Meadow, USA
| | - Melvin V Joy
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | | | - Dilman Natt
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Sai Reshma Magam
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Leeza Pannikodu
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Jiten Desai
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Olawale Akande
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Susan Bunting
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Robert M Yost
- Internal Medicine, Nassau University Medical Center, East Meadow, USA
| | - Paul Mustacchia
- Gastroenterology and Hepatology, Nassau University Medical Center, East Meadow, USA
| |
Collapse
|
19
|
Mediboina A, Badam RK, Chodavarapu S. Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI. Cureus 2024; 16:e51544. [PMID: 38318564 PMCID: PMC10840059 DOI: 10.7759/cureus.51544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/01/2024] [Indexed: 02/07/2024] Open
Abstract
Background and objective ChatGPT and Google Bard AI are widely used conversational chatbots, even in healthcare. While they have several strengths, they can generate seemingly correct but erroneous responses, warranting caution in medical contexts. In an era where access to abortion care is diminishing, patients may increasingly rely on online resources and AI-driven language models for information on medication abortions. In light of this, this study aimed to compare the accuracy and comprehensiveness of responses generated by ChatGPT 3.5 and Google Bard AI to medical queries about medication abortions. Methods Fourteen open-ended questions about medication abortion were formulated based on the Frequently Asked Questions (FAQs) from the National Abortion Federation (NAF) and the Reproductive Health Access Project (RHAP) websites. These questions were answered using ChatGPT version 3.5 and Google Bard AI on October 7, 2023. The accuracy of the responses was analyzed by cross-referencing the generated answers against the information provided by NAF and RHAP. Any discrepancies were further verified against the guidelines from the American College of Obstetricians and Gynecologists (ACOG). A rating scale used by Johnson et al. was employed for assessment, utilizing a 6-point Likert scale [ranging from 1 (completely incorrect) to 6 (correct)] to evaluate accuracy and a 3-point scale [ranging from 1 (incomplete) to 3 (comprehensive)] to assess completeness. Questions that did not yield answers were assigned a score of 0 and omitted from the correlation analysis. Data analysis and visualization were done using R Software version 4.3.1. Statistical significance was determined using Spearman's R and Mann-Whitney U tests. Results All questions were entered sequentially into both chatbots by the same author. On the initial attempt, ChatGPT successfully generated relevant responses for all questions, while Google Bard AI failed to provide answers for five questions. Repeating the same question in Google Bard AI yielded an answer for one; two were answered with different phrasing; and two remained unanswered despite rephrasing. ChatGPT showed a median accuracy score of 5 (mean: 5.26, SD: 0.73) and a median completeness score of 3 (mean: 2.57, SD: 0.51). It showed the highest accuracy score in six responses and the highest completeness score in eight responses. In contrast, Google Bard AI had a median accuracy score of 5 (mean: 4.5, SD: 2.03) and a median completeness score of 2 (mean: 2.14, SD: 1.03). It achieved the highest accuracy score in five responses and the highest completeness score in six responses. Spearman's correlation coefficient revealed no significant correlation between accuracy and completeness for ChatGPT (rs = -0.46771, p = 0.09171). However, Google Bard AI showed a marginally significant correlation (rs = 0.5738, p = 0.05108). The Mann-Whitney U test indicated no statistically significant differences between ChatGPT and Google Bard AI concerning accuracy (U = 82, p > 0.05) or completeness (U = 78, p > 0.05). Conclusion While both chatbots showed similar levels of accuracy, minor errors were noted, pertaining to finer aspects that demand specialized knowledge of abortion care. This could explain the lack of a significant correlation between accuracy and completeness. Ultimately, AI-driven language models have the potential to provide information on medication abortions, but there is a need for continual refinement and oversight.
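The Spearman correlation between accuracy and completeness, with unanswered (zero-scored) questions omitted, can be reproduced in outline with SciPy; the scores below are illustrative, not the study's:

```python
from scipy.stats import spearmanr

# Hypothetical per-question scores; a 0 marks a question the chatbot did not answer.
accuracy = [6, 5, 5, 6, 4, 5, 6, 3, 5, 6, 0, 5, 4, 6]
completeness = [3, 3, 2, 3, 2, 2, 3, 1, 2, 3, 0, 3, 2, 3]

# Drop zero-scored questions before correlating, mirroring the study's handling.
pairs = [(a, c) for a, c in zip(accuracy, completeness) if a > 0]
rs, p = spearmanr([a for a, _ in pairs], [c for _, c in pairs])
print(f"rs = {rs:.3f}, p = {p:.4f}")
```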
Collapse
Affiliation(s)
- Anjali Mediboina
- Community Medicine, Alluri Sita Ramaraju Academy of Medical Sciences, Eluru, IND
| | - Rajani Kumari Badam
- Obstetrics and Gynaecology, Sri Venkateswara Medical College, Tirupathi, IND
| | - Sailaja Chodavarapu
- Obstetrics and Gynaecology, Government Medical College, Rajamahendravaram, IND
| |
Collapse
|
20
|
Zhu L, Mou W, Wu K, Zhang J, Luo P. Can DALL-E 3 Reliably Generate 12-Lead ECGs and Teaching Illustrations? Cureus 2024; 16:e52748. [PMID: 38384621 PMCID: PMC10879738 DOI: 10.7759/cureus.52748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/22/2024] [Indexed: 02/23/2024] Open
Abstract
The recent integration of the latest image generation model DALL-E 3 into ChatGPT allows text prompts to easily generate corresponding images, enabling multimodal output from ChatGPT. We explored the feasibility of using DALL-E 3 to draw a 12-lead ECG and found that it can draw rudimentary 12-lead electrocardiograms (ECGs) displaying some of the parameters, although the details are not completely accurate. We also explored DALL-E 3's capacity to create vivid illustrations for teaching resuscitation-related medical knowledge. DALL-E 3 produced accurate CPR illustrations emphasizing proper hand placement and technique. For ECG principles, it produced creative heart-shaped waveforms tying ECGs to the heart. With further training, DALL-E 3 shows promise for expanding easy-to-understand visual medical teaching materials and ECG simulations for different disease states. In conclusion, DALL-E 3 has the potential to generate realistic 12-lead ECGs and teaching schematics, but expert validation is still needed.
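Requesting such an illustration programmatically is straightforward; a minimal sketch, assuming the OpenAI Python SDK (v1.x) with an API key in the environment, and an illustrative prompt rather than the study's exact wording:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt=("A labelled teaching illustration of cardiopulmonary resuscitation "
            "showing correct hand placement on the lower half of the sternum"),
    size="1024x1024",
    n=1,  # DALL-E 3 generates one image per request
)
print(response.data[0].url)  # URL of the generated illustration
```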
Collapse
Affiliation(s)
- Lingxuan Zhu
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Weiming Mou
- Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, CHN
| | - Keren Wu
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Jian Zhang
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| | - Peng Luo
- Department of Oncology, Zhujiang Hospital of Southern Medical University, Guangzhou, CHN
| |
Collapse
|
21
|
George Pallivathukal R, Kyaw Soe HH, Donald PM, Samson RS, Hj Ismail AR. ChatGPT for Academic Purposes: Survey Among Undergraduate Healthcare Students in Malaysia. Cureus 2024; 16:e53032. [PMID: 38410331 PMCID: PMC10895383 DOI: 10.7759/cureus.53032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/27/2024] [Indexed: 02/28/2024] Open
Abstract
BACKGROUND The impact of generative artificial intelligence-based chatbots on medical education, particularly in Southeast Asia, is understudied regarding healthcare students' perceptions of their academic utility. Sociodemographic profiles and educational strategies influence prospective healthcare practitioners' attitudes toward AI tools. AIM AND OBJECTIVES This study aimed to assess healthcare university students' knowledge, attitude, and practice regarding ChatGPT for academic purposes. It explored chatbot usage frequency, purposes, satisfaction levels, and associations between age, gender, and ChatGPT variables. METHODOLOGY Four hundred forty-three undergraduate students at a Malaysian tertiary healthcare institute participated, revealing varying levels of awareness of ChatGPT's academic utility. Despite concerns about accuracy, ethics, and dependency, participants generally held positive attitudes toward ChatGPT in academics. RESULTS Multiple logistic regression highlighted associations between demographics, knowledge, attitude, and academic ChatGPT use. MBBS students were significantly more likely to use ChatGPT for academics than BDS and FIS students. Final-year students exhibited the highest likelihood of academic ChatGPT use. Higher knowledge and positive attitudes correlated with increased academic usage. Most users (45.8%) employed ChatGPT to aid specific assignment sections while completing most work independently. Some did not use it (41.1%), while others relied on it heavily (9.3%). Users also employed it for various purposes, from generating questions to understanding concepts. Thematic analysis of responses showed students' concerns about data accuracy, plagiarism, ethical issues, and dependency on ChatGPT for academic tasks. CONCLUSION This study aids in creating guidelines for implementing GAI chatbots in healthcare education, emphasizing benefits and risks, and informing AI developers and educators about ChatGPT's potential in academia.
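A multiple logistic regression of this kind can be sketched with statsmodels; the variable names and synthetic records below are hypothetical stand-ins for the study's survey data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 443  # matches the study's sample size; the records themselves are synthetic
df = pd.DataFrame({
    "uses_chatgpt": rng.integers(0, 2, n),               # 1 = uses ChatGPT academically
    "programme": rng.choice(["MBBS", "BDS", "FIS"], n),  # hypothetical programme labels
    "year": rng.integers(1, 6, n),                       # year of study
    "knowledge": rng.integers(1, 6, n),                  # knowledge score
})

model = smf.logit("uses_chatgpt ~ C(programme) + year + knowledge", data=df).fit()
print(np.exp(model.params))  # odds ratios for each predictor
```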
Collapse
Affiliation(s)
| | | | - Preethy Mary Donald
- Oral Medicine and Oral Radiology, Manipal University College Malaysia, Melaka, MYS
| | | | | |
Collapse
|
22
|
Yapar D, Demir Avcı Y, Tokur Sonuvar E, Eğerci ÖF, Yapar A. ChatGPT's potential to support home care for patients in the early period after orthopedic interventions and enhance public health. Jt Dis Relat Surg 2024; 35:169-176. [PMID: 38108178 PMCID: PMC10746912 DOI: 10.52312/jdrs.2023.1402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2023] [Accepted: 11/06/2023] [Indexed: 12/19/2023] Open
Abstract
OBJECTIVES This study presents the first investigation into the potential of ChatGPT to provide medical consultation for patients undergoing orthopedic interventions, with the primary objective of evaluating ChatGPT's effectiveness in supporting patient self-management during the essential early recovery phase at home. MATERIALS AND METHODS Seven scenarios, representative of common situations in orthopedics and traumatology, were presented to ChatGPT version 4.0 to obtain advice. These scenarios and ChatGPT's responses were then evaluated by 68 expert orthopedists (67 males, 1 female; mean age: 37.9±5.9 years; range, 30 to 59 years), 40 of whom had at least four years of orthopedic experience, while 28 were associate or full professors. Expert orthopedists used a rubric on a scale of 1 to 5 to evaluate ChatGPT's advice based on accuracy, applicability, comprehensiveness, and clarity. Those who gave ChatGPT a score of 4 or higher considered its performance as above average or excellent. RESULTS In all scenarios, the median evaluation scores were at least 4 across accuracy, applicability, comprehensiveness, and communication. As for mean scores, accuracy was the highest-rated dimension at 4.2±0.8, while mean comprehensiveness was slightly lower at 3.9±0.8. Orthopedist characteristics, such as academic title and prior use of ChatGPT, did not influence their evaluation (all p>0.05). Across all scenarios, ChatGPT demonstrated an accuracy of 79.8%, with applicability at 75.2%, comprehensiveness at 70.6%, and a 75.6% rating for communication clarity. CONCLUSION This study emphasizes ChatGPT's strengths in accuracy and applicability for home care after orthopedic intervention but underscores a need for improved comprehensiveness. This focused evaluation not only sheds light on ChatGPT's potential in specialized medical advice but also suggests its potential to play a broader role in the advancement of public health.
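The rubric summaries reported above are simple descriptive statistics; a sketch with hypothetical scores (not the 68 orthopedists' actual ratings):

```python
import numpy as np

# Hypothetical 1-5 rubric scores for one scenario, a few raters per dimension.
scores = {
    "accuracy": np.array([4, 5, 4, 4, 3, 5, 4, 4]),
    "applicability": np.array([4, 4, 3, 4, 4, 5, 3, 4]),
    "comprehensiveness": np.array([3, 4, 4, 3, 4, 4, 3, 5]),
    "clarity": np.array([4, 4, 4, 5, 3, 4, 4, 4]),
}

for dimension, s in scores.items():
    share_high = (s >= 4).mean()  # raters scoring ChatGPT above average or excellent
    print(f"{dimension}: mean {s.mean():.1f}±{s.std(ddof=1):.1f}, >=4: {share_high:.0%}")
```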
Collapse
Affiliation(s)
| | | | | | | | - Aliekber Yapar
- Department of Orthopedics and Traumatology, Antalya Training and Research Hospital, 07100 Muratpaşa, Antalya, Turkey.
| |
Collapse
|
23
|
Bazzari FH, Bazzari AH. Utilizing ChatGPT in Telepharmacy. Cureus 2024; 16:e52365. [PMID: 38230387 PMCID: PMC10790595 DOI: 10.7759/cureus.52365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/15/2024] [Indexed: 01/18/2024] Open
Abstract
BACKGROUND ChatGPT is an artificial intelligence-powered chatbot that has demonstrated capabilities in numerous fields, including the medical and healthcare sciences. This study evaluates the potential for ChatGPT application in telepharmacy, the delivery of pharmaceutical care via telecommunications, by assessing its interactions, adherence to instructions, and ability to role-play as a pharmacist while handling a series of life-like scenario questions. METHODS Two versions (ChatGPT 3.5 and 4.0, OpenAI) were assessed using two independent trials each. ChatGPT was instructed to act as a pharmacist and answer patient inquiries, followed by a set of 20 assessment questions. Then, ChatGPT was instructed to stop its act, provide feedback, and list its sources for drug information. The responses to the assessment questions were evaluated in terms of accuracy, precision, and clarity using a 4-point Likert-like scale. RESULTS ChatGPT demonstrated the ability to follow detailed instructions, role-play as a pharmacist, and appropriately handle all questions. ChatGPT was able to understand case details; recognize generic and brand drug names; identify drug side effects, interactions, prescription requirements, and precautions; and provide proper point-by-point instructions regarding administration, dosing, storage, and disposal. The overall means of pooled scores were 3.425 (0.712) and 3.7 (0.61) for ChatGPT 3.5 and 4.0, respectively. The rank distribution of scores was not significantly different (P>0.05). None of the answers could be considered directly harmful or labeled as entirely or mostly incorrect, and most point deductions were due to other factors such as indecisiveness, adding immaterial information, missing certain considerations, or partial unclarity. The answers were similar in length across trials and appropriately concise. ChatGPT 4.0 showed superior performance, higher consistency, better character adherence, and the ability to report various reliable information sources. However, it only allowed an input of 40 questions every three hours and provided inaccurate feedback regarding the number of assessed patients, compared to 3.5, which allowed unlimited input but was unable to provide feedback. CONCLUSIONS Integrating ChatGPT in telepharmacy holds promising potential; however, a number of drawbacks must be overcome for it to function effectively.
Collapse
Affiliation(s)
| | - Amjad H Bazzari
- Basic Scientific Sciences, Applied Science Private University, Amman, JOR
| |
Collapse
|
24
|
Janopaul-Naylor JR, Koo A, Qian DC, McCall NS, Liu Y, Patel SA. Physician Assessment of ChatGPT and Bing Answers to American Cancer Society's Questions to Ask About Your Cancer. Am J Clin Oncol 2024; 47:17-21. [PMID: 37823708 PMCID: PMC10841271 DOI: 10.1097/coc.0000000000001050] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
OBJECTIVES Artificial intelligence (AI) chatbots are a new, publicly available tool for patients to access health care-related information, with unknown reliability for cancer-related questions. This study assesses the quality of responses to common questions for patients with cancer. METHODS From February to March 2023, we queried Chat Generative Pretrained Transformer (ChatGPT) from OpenAI and Bing AI from Microsoft with questions from the American Cancer Society's recommended "Questions to Ask About Your Cancer," customized for all stages of breast, colon, lung, and prostate cancer. Questions were additionally grouped by type (prognosis, treatment, or miscellaneous). The quality of AI chatbot responses was assessed by an expert panel using the validated DISCERN criteria. RESULTS Of the 117 questions presented to ChatGPT and Bing, the average score for all questions was 3.9 and 3.2, respectively (P < 0.001), and the overall DISCERN scores were 4.1 and 4.4, respectively. By disease site, the average scores for ChatGPT and Bing, respectively, were 3.9 and 3.6 for prostate cancer (P = 0.02), 3.7 and 3.3 for lung cancer (P < 0.001), 4.1 and 2.9 for breast cancer (P < 0.001), and 3.8 and 3.0 for colorectal cancer (P < 0.001). By type of question, the average scores for ChatGPT and Bing, respectively, were 3.6 and 3.4 for prognostic questions (P = 0.12), 3.9 and 3.1 for treatment questions (P < 0.001), and 4.2 and 3.3 for miscellaneous questions (P = 0.001). For 3 responses (3%) by ChatGPT and 18 responses (15%) by Bing, at least one panelist rated them as having serious or extensive shortcomings. CONCLUSIONS AI chatbots provide multiple opportunities for innovating health care. This analysis suggests a critical need, particularly around cancer prognostication, for continual refinement to limit misleading counseling, confusion, and emotional distress to patients and families.
Collapse
Affiliation(s)
- James R Janopaul-Naylor
- Department of Radiation Oncology, Emory University
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center
| | - Andee Koo
- Department of Radiation Oncology, Emory University
| | - David C Qian
- Department of Radiation Oncology, Emory University
| | | | - Yuan Liu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University
| | | |
Collapse
|
25
|
Coraci D, Maccarone MC, Regazzo G, Accordi G, Papathanasiou JV, Masiero S. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol 2023; 33:12114. [PMID: 38112605 PMCID: PMC10811646 DOI: 10.4081/ejtm.2023.12114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 12/04/2023] [Indexed: 12/21/2023] Open
Abstract
In the last year, Chat Generative Pre-Trained Transformer (ChatGPT), a web-based artificial intelligence application, has shown high potential in every field of knowledge. In the medical area, its possible applications are the subject of many studies with promising results. We performed the current study to investigate the possible usefulness of ChatGPT in assessing low back pain. We asked ChatGPT to generate a questionnaire about this clinical condition and compared the obtained questions and results with those obtained with other validated questionnaires: the Oswestry Disability Index, the Quebec Back Pain Disability Scale, the Roland-Morris Disability Questionnaire, and the Numeric Rating Scale for pain. We enrolled 20 subjects with low back pain and found important consistencies among the validated questionnaires. The ChatGPT questionnaire showed an acceptable, significant correlation only with the Oswestry Disability Index and the Quebec Back Pain Disability Scale. ChatGPT showed some peculiarities, especially in the assessment of quality of life and medical consultation and treatments. Our study shows that ChatGPT can help evaluate patients, including from multilevel perspectives. However, its power is limited, and further research and validation are required.
Collapse
Affiliation(s)
- Daniele Coraci
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | | | - Gianluca Regazzo
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | - Giorgia Accordi
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| | - Jannis V Papathanasiou
- Department of Kinesiotherapy, Faculty of Public Health, Medical University of Sofia, Sofia, Bulgaria; Department of Medical Imaging, Allergology and Physiotherapy, Faculty of Dental Medicine, Medical University of Plovdiv, Plovdiv.
| | - Stefano Masiero
- Department of Neuroscience, Section of Rehabilitation, University of Padova, Padua.
| |
Collapse
|
26
|
Ho SYC, Chien TW, Chou W. Circle packing charts generated by ChatGPT to identify the characteristics of articles by anesthesiology authors in 2022: Bibliometric analysis. Medicine (Baltimore) 2023; 102:e34511. [PMID: 38115345 PMCID: PMC10727539 DOI: 10.1097/md.0000000000034511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/15/2023] [Revised: 07/03/2023] [Accepted: 07/05/2023] [Indexed: 12/21/2023] Open
Abstract
BACKGROUND ChatGPT (OpenAI, San Francisco, CA), short for Chat Generative Pretrained Transformer, has been a hot topic of discussion over the past few months. Verification is needed of whether the code for drawing circle packing charts (CPCs) with R can be generated by ChatGPT and used to identify characteristics of articles by anesthesiology authors. This study aimed to provide insights into article characteristics in the field of anesthesiology and to highlight the potential of ChatGPT for data visualization techniques (e.g., CPCs) in bibliometric analysis. METHODS A total of 23,012 articles were indexed in PubMed in 2022 by authors in the field of anesthesiology. The code for drawing CPCs with R was generated by ChatGPT and then modified by the authors to identify the characteristics of articles in two forms: all 23,012 articles and the 100 with the highest journal impact factors (T100IF). Using CPCs and three other visualizations (network charts, impact beam plots, and Sankey diagrams), we were able to display article features commonly used in bibliometric analysis. The author-weighted scheme and absolute advantage coefficient were used to assess dominant entities, such as countries, institutes, authors, and themes (defined by PubMed and MeSH terms). RESULTS Our findings indicate the following: further modifications should be made to the code generated by ChatGPT for drawing CPCs in R; publications in the field of anesthesiology are dominated by China, followed by the United States and Japan; Capital Medical University (China) and Showa University Hospital (Japan) dominate research institutes in terms of publications and IF, respectively; and COVID-19 is the most frequently reported theme in the T100IF, accounting for 29%. CONCLUSIONS No articles using CPCs for bibliometrics have previously been found in PubMed. The code for drawing CPCs with R can be generated by ChatGPT, but further modification is required for implementation in bibliometrics. CPCs should be used in future studies to identify the characteristics of articles in other areas of research rather than being limited to anesthesiology, as in this study.
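The study drew its CPCs in R; the underlying idea can be sketched self-containedly in Python with a greedy spiral placement (a simplification of proper circle-packing layouts, with hypothetical publication counts):

```python
import math
import matplotlib.pyplot as plt

def pack_circles(radii):
    """Greedy packing: walk outward along a spiral and drop each circle
    at the first point where it overlaps none of those already placed."""
    placed = []  # (x, y, r) triples
    for r in sorted(radii, reverse=True):
        t = 0.0
        while True:
            x, y = t * math.cos(t), t * math.sin(t)
            if all(math.hypot(x - px, y - py) >= r + pr for px, py, pr in placed):
                placed.append((x, y, r))
                break
            t += 0.05
    return placed

# Hypothetical counts, listed largest-first so labels line up with sorted radii;
# sqrt makes circle area roughly proportional to the count.
counts = {"China": 120, "United States": 95, "Japan": 60, "Others": 30}
circles = pack_circles([math.sqrt(v) for v in counts.values()])

fig, ax = plt.subplots()
for (x, y, r), label in zip(circles, counts):
    ax.add_patch(plt.Circle((x, y), r, alpha=0.4))
    ax.annotate(label, (x, y), ha="center", va="center")
lim = max(max(abs(x), abs(y)) + r for x, y, r in circles)
ax.set_xlim(-lim, lim); ax.set_ylim(-lim, lim); ax.set_aspect("equal")
plt.show()
```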
Collapse
Affiliation(s)
- Sam Yu-Chieh Ho
- Department of Emergency Medicine, Chi-Mei Medical Center, Tainan, Taiwan
- Department of Geriatrics and Gerontology, ChiMei Medical Center, Tainan, Taiwan
| | - Tsair-Wei Chien
- Department of Medical Research, Chi-Mei Medical Center, Tainan, Taiwan
| | - Willy Chou
- Department of Physical Medicine and Rehabilitation, Chiali Chi-Mei Hospital, Tainan 710, Taiwan
- Department of Physical Medicine and Rehabilitation, Chung San Medical University Hospital, Taichung, Taiwan
| |
Collapse
|
27
|
Alanzi TM, Alzahrani W, Albalawi NS, Allahyani T, Alghamdi A, Al-Zahrani H, Almutairi A, Alzahrani H, Almulhem L, Alanzi N, Al Moarfeg A, Farhah N. Public Awareness of Obesity as a Risk Factor for Cancer in Central Saudi Arabia: Feasibility of ChatGPT as an Educational Intervention. Cureus 2023; 15:e50781. [PMID: 38239542 PMCID: PMC10795720 DOI: 10.7759/cureus.50781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/17/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND While the link between obesity and chronic diseases such as diabetes and cardiovascular disorders is well-documented, there is a growing body of evidence connecting obesity with an increased risk of cancer. However, public awareness of this connection remains limited. STUDY PURPOSE To analyze public awareness of overweight/obesity as a risk factor for cancer and to analyze public perceptions of the feasibility of ChatGPT, an artificial intelligence-based conversational agent, as an educational intervention tool. METHODS This study used a mixed-methods design: a deductive quantitative cross-sectional approach to draw precise, empirically grounded conclusions about public awareness of the link between obesity and cancer, and an inductive qualitative approach to interpret public perceptions of using ChatGPT to create awareness of obesity, cancer, and its risk factors. Participants were adult residents of Saudi Arabia. A total of 486 individuals participated in the survey and 21 in the semi-structured interviews. RESULTS About 65% of the participants were not completely aware of cancer and its risk factors. Significant differences in awareness were observed concerning age groups (p < .0001), socio-economic status (p = .041), and regional distribution (p = .0351). A total of 10 themes were identified from the interview data, comprising five positive factors (accessibility, personalization, cost-effectiveness, anonymity and privacy, and multi-language support) and five negative factors (information inaccuracy, lack of emotional intelligence, dependency and overreliance, data privacy and security, and inability to provide physical support or diagnosis). CONCLUSION This study has underscored the potential of leveraging ChatGPT as a valuable public awareness tool for cancer in Saudi Arabia.
Collapse
Affiliation(s)
- Turki M Alanzi
- Department of Health Information Management and Technology, College of Public Health, Imam Abdulrahman Bin Faisal University, Dammam, SAU
| | - Wala Alzahrani
- Department of Clinical Nutrition, College of Applied Medical Sciences, King Abdulaziz University, Jeddah, SAU
| | | | - Taif Allahyani
- College of Applied Medical Sciences, Umm Al-Qura University, Makkah, SAU
| | | | - Haneen Al-Zahrani
- Department of Hematology, Armed Forces Hospital at King Abdulaziz Airbase Dhahran, Dhahran, SAU
| | - Awatif Almutairi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | | | - Nouf Alanzi
- Department of Clinical Laboratories Sciences, College of Applied Medical Sciences, Jouf University, Jouf, SAU
| | | | - Nesren Farhah
- Department of Health Informatics, College of Health Sciences, Saudi Electronic University, Riyadh, SAU
| |
Collapse
|
28
|
Mondal H, Mondal S. ChatGPT in academic writing: Maximizing its benefits and minimizing the risks. Indian J Ophthalmol 2023; 71:3600-3606. [PMID: 37991290 PMCID: PMC10788737 DOI: 10.4103/ijo.ijo_718_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2023] [Revised: 08/11/2023] [Accepted: 08/21/2023] [Indexed: 11/23/2023] Open
Abstract
This review article explores the use of ChatGPT in academic writing and provides insights on how to use it judiciously. With the increasing popularity of AI-powered language models, ChatGPT has emerged as a potential tool for assisting writers in the research and writing process. We provide a list of potential uses of ChatGPT by a novice researcher seeking help with research proposal preparation and manuscript writing. However, there are concerns regarding its reliability and potential risks associated with its use. The review highlights the importance of maintaining human judgment in the writing process and of using ChatGPT as a complementary tool rather than a replacement for human effort. The article concludes with recommendations for researchers and writers to ensure responsible and effective use of ChatGPT in academic writing.
Collapse
Affiliation(s)
- Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Shaikat Mondal
- Department of Physiology, Raiganj Government Medical College and Hospital, West Bengal, India
| |
Collapse
|
29
|
Sakai D, Maeda T, Ozaki A, Kanda GN, Kurimoto Y, Takahashi M. Performance of ChatGPT in Board Examinations for Specialists in the Japanese Ophthalmology Society. Cureus 2023; 15:e49903. [PMID: 38174202 PMCID: PMC10763518 DOI: 10.7759/cureus.49903] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
We investigated the potential of ChatGPT in the ophthalmological field in the Japanese language using board examinations for specialists in the Japanese Ophthalmology Society. We tested GPT-3.5- and GPT-4-based ChatGPT on five sets of past board examination problems in July 2023. Japanese text was used as the prompt, adopting two strategies: zero- and few-shot prompting. We compared the correct answer rate of ChatGPT with that of actual examinees, and the performance characteristics in 10 subspecialties were assessed. ChatGPT-3.5 and ChatGPT-4 correctly answered 112 (22.4%) and 229 (45.8%) of 500 questions with simple zero-shot prompting, respectively, and ChatGPT-4 correctly answered 231 (46.2%) questions with few-shot prompting. The correct answer rates of ChatGPT-3.5 were roughly one-half to one-third of those of the actual examinees for each examination set (p = 0.001), whereas the correct answer rates of ChatGPT-4 reached approximately 70% of those of the examinees. ChatGPT-4 had the highest correct answer rate (71.4% with zero-shot prompting and 61.9% with few-shot prompting) in "blepharoplasty, orbit, and ocular oncology," and the lowest (30.0% with zero-shot prompting and 23.3% with few-shot prompting) in "pediatric ophthalmology." We conclude that ChatGPT could become a practical tool in Japanese ophthalmology.
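The zero- versus few-shot comparison amounts to prepending worked examples to the prompt. A minimal sketch, assuming the OpenAI Python SDK (v1.x); the placeholder strings stand in for the Japanese examination text, which is not reproduced here:

```python
from openai import OpenAI

client = OpenAI()

system = {"role": "system",
          "content": "Answer the ophthalmology board question with one letter, a-e."}
question = "(a board examination question in Japanese, with choices a-e)"
worked_examples = [  # omitted entirely in the zero-shot condition
    {"role": "user", "content": "(an example question)"},
    {"role": "assistant", "content": "(its correct lettered answer)"},
]

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

zero_shot = ask([system, {"role": "user", "content": question}])
few_shot = ask([system, *worked_examples, {"role": "user", "content": question}])
```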
Collapse
Affiliation(s)
- Daiki Sakai
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Kobe City Medical Center General Hospital, Kobe, JPN
- Department of Surgery, Division of Ophthalmology, Kobe University Graduate School of Medicine, Kobe, JPN
| | - Tadao Maeda
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
| | - Atsuta Ozaki
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Mie University Graduate School of Medicine, Tsu, JPN
| | - Genki N Kanda
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Laboratory for Biologically Inspired Computing, RIKEN Center for Biosystems Dynamics Research, Kobe, JPN
| | - Yasuo Kurimoto
- Department of Ophthalmology, Kobe City Eye Hospital, Kobe, JPN
- Department of Ophthalmology, Kobe City Medical Center General Hospital, Kobe, JPN
| | | |
Collapse
|
30
|
Melnyk O, Ismail A, Ghorashi NS, Heekin M, Javan R. Generative Artificial Intelligence Terminology: A Primer for Clinicians and Medical Researchers. Cureus 2023; 15:e49890. [PMID: 38174178 PMCID: PMC10762565 DOI: 10.7759/cureus.49890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/04/2023] [Indexed: 01/05/2024] Open
Abstract
Generative artificial intelligence (AI) is rapidly transforming the medical field, as advanced tools powered by large language models (LLMs) make their way into clinical practice, research, and education. Chatbots, which can generate human-like responses, have gained attention for their potential applications. Therefore, familiarity with LLMs and other promising generative AI tools is crucial to harnessing their potential safely and effectively. As these AI-based technologies continue to evolve, medical professionals must develop a strong understanding of AI terminology and concepts, particularly generative AI, to effectively tackle real-world challenges and create solutions. This knowledge will enable healthcare professionals to utilize AI-driven innovations for improved patient care and increased productivity in the future. In this brief technical report, we explore 20 of the most relevant terms associated with the underlying technology behind LLMs and generative AI as they relate to the medical field, and we provide examples of how these topics relate to healthcare applications to aid understanding.
Collapse
Affiliation(s)
- Oleksiy Melnyk
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Ahmed Ismail
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Nima S Ghorashi
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Mary Heekin
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| | - Ramin Javan
- Department of Radiology, George Washington University School of Medicine and Health Sciences, Washington D.C., USA
| |
Collapse
|
31
|
Sarangi PK, Lumbani A, Swarup MS, Panda S, Sahoo SS, Hui P, Choudhary A, Mohakud S, Patel RK, Mondal H. Assessing ChatGPT's Proficiency in Simplifying Radiological Reports for Healthcare Professionals and Patients. Cureus 2023; 15:e50881. [PMID: 38249202 PMCID: PMC10799309 DOI: 10.7759/cureus.50881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/21/2023] [Indexed: 01/23/2024] Open
Abstract
Background Clear communication of radiological findings is crucial for effective healthcare decision-making. However, radiological reports are often complex, with technical terminology that makes them challenging for non-radiology healthcare professionals and patients to comprehend. Large language models like ChatGPT (Chat Generative Pre-trained Transformer, by OpenAI, San Francisco, CA) offer a potential solution by translating intricate reports into simplified language. This study aimed to assess the capability of ChatGPT-3.5 to simplify radiological reports so that healthcare professionals and patients can understand them better. Materials and methods Nine radiological reports, spanning various imaging modalities and medical conditions, were used in this study. For each report, ChatGPT was asked a set of seven questions (describe the procedure, mention the key findings, express the report in simple language, suggest further investigations, state whether further investigation is needed, identify grammatical or typing errors, and translate into Hindi). A total of eight radiologists rated the generated content on detailing, summarizing, simplifying content and language, factual correctness, further investigation, grammatical errors, and translation to Hindi. Results The highest score was obtained for detailing the report (94.17% accuracy) and the lowest score for drawing conclusions for the patient (85% accuracy); case-wise scores were similar (p-value = 0.97). The Hindi translation by ChatGPT was not suitable for patient communication. Conclusion The current free version of ChatGPT-3.5 was able to simplify radiological reports effectively, removing technical jargon while preserving essential diagnostic information, thereby enhancing accessibility for healthcare professionals and patients. Hence, it has the potential to improve medical communication and facilitate informed decision-making by healthcare professionals and patients.
Collapse
Affiliation(s)
| | - Amrita Lumbani
- Physiology, Mayo Institute of Medical Sciences, Barabanki, IND
| | - M Sarthak Swarup
- Radiodiagnosis, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, IND
| | - Suvankar Panda
- Radiodiagnosis, SCB (Srirama Chandra Bhanja) Medical College and Hospital, Cuttack, IND
| | - Smruti Snigdha Sahoo
- Radiodiagnosis, SCB (Srirama Chandra Bhanja) Medical College and Hospital, Cuttack, IND
| | - Pratisruti Hui
- Radiodiagnosis, All India Institute of Medical Sciences, Kalyani, Kalyani, IND
| | - Anish Choudhary
- Radiodiagnosis, Central Institute of Psychiatry, Ranchi, IND
| | - Sudipta Mohakud
- Radiodiagnosis, All India Institute of Medical Sciences, Bhubaneswar, Bhubaneswar, IND
| | - Ranjan Kumar Patel
- Radiodiagnosis, All India Institute of Medical Sciences, Bhubaneswar, Bhubaneswar, IND
| | - Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, Deoghar, IND
| |
Collapse
|
32
|
Tanaka OM, Gasparello GG, Hartmann GC, Casagrande FA, Pithon MM. Assessing the reliability of ChatGPT: a content analysis of self-generated and self-answered questions on clear aligners, TADs and digital imaging. Dental Press J Orthod 2023; 28:e2323183. [PMID: 37937680 PMCID: PMC10627416 DOI: 10.1590/2177-6709.28.5.e2323183.oar] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 09/04/2023] [Indexed: 11/09/2023] Open
Abstract
INTRODUCTION Artificial intelligence (AI) is a tool that is already part of our reality, and this is an opportunity to understand how it can be useful in interacting with patients and providing valuable information about orthodontics. OBJECTIVE This study evaluated ChatGPT's ability to provide accurate, high-quality answers to questions on clear aligners, temporary anchorage devices, and digital imaging in orthodontics. METHODS Forty-five questions and answers were generated by ChatGPT 4.0 and analyzed separately by five orthodontists. The evaluators independently rated the quality of the information provided on a Likert scale, in which higher scores indicated greater quality of information (1 = very poor; 2 = poor; 3 = acceptable; 4 = good; 5 = very good). The Kruskal-Wallis H test (p < 0.05) and post-hoc pairwise comparisons with the Bonferroni correction were performed. RESULTS Of the 225 evaluations by the five evaluators, 11 (4.9%) were rated very poor, 4 (1.8%) poor, and 15 (6.7%) acceptable. The majority were rated good [34 (15.1%)] or very good [161 (71.6%)]. Agreement among evaluators was only slight, with a Fleiss's kappa of 0.004. CONCLUSIONS ChatGPT proved effective in providing quality answers related to clear aligners, temporary anchorage devices, and digital imaging within the context of orthodontics.
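Fleiss's kappa for five raters scoring the same 45 items can be computed with statsmodels; the scores below are invented for illustration:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical Likert scores: one row per question, one column per orthodontist.
ratings = np.array([
    [5, 5, 4, 5, 5],
    [4, 5, 5, 3, 4],
    [5, 4, 5, 5, 2],
    [3, 4, 5, 4, 5],
])

table, _ = aggregate_raters(ratings)  # item-by-category count table
print(f"Fleiss's kappa = {fleiss_kappa(table, method='fleiss'):.3f}")
```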
Collapse
|
33
|
Haidar O, Jaques A, McCaughran PW, Metcalfe MJ. AI-Generated Information for Vascular Patients: Assessing the Standard of Procedure-Specific Information Provided by the ChatGPT AI-Language Model. Cureus 2023; 15:e49764. [PMID: 38046759 PMCID: PMC10691169 DOI: 10.7759/cureus.49764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/30/2023] [Indexed: 12/05/2023] Open
Abstract
Introduction Ensuring access to high-quality information is paramount to facilitating informed surgical decision-making. The use of the internet to access health-related information is increasing, along with the growing prevalence of AI language models such as ChatGPT. We aim to assess the standard of AI-generated patient-facing information through a qualitative analysis of its readability and quality. Materials and methods We performed a retrospective qualitative analysis of information regarding three common vascular procedures: endovascular aortic repair (EVAR), endovenous laser ablation (EVLA), and femoro-popliteal bypass (FPBP). The ChatGPT responses were compared to patient information leaflets provided by the vascular charity Circulation Foundation UK. Readability was assessed using four readability scores: the Flesch-Kincaid reading ease (FKRE) score, the Flesch-Kincaid grade level (FKGL), the Gunning fog score (GFS), and the simple measure of gobbledygook (SMOG) index. Quality was assessed using the DISCERN tool by two independent assessors. Results The mean FKRE score was 33.3, compared to 59.1 for the information provided by the Circulation Foundation (SD=14.5, p=0.025), indicating poor readability of the AI-generated information. The FKGL indicated that the expected grade of students likely to read and understand ChatGPT responses was consistently higher than for the information leaflets, at 12.7 vs. 9.4 (SD=1.9, p=0.002). Two metrics measure readability in terms of the number of years of education required to understand a piece of writing: the GFS and the SMOG index. Both scores indicated that AI-generated answers were less accessible. The GFS for ChatGPT-provided information was 16.7 years versus 12.8 years for the leaflets (SD=2.2, p=0.002), and the SMOG index scores were 12.2 and 9.4 years for ChatGPT and the patient information leaflets, respectively (SD=1.7, p=0.001). The DISCERN scores were consistently higher for human-generated patient information leaflets than for AI-generated information across all procedures; the mean score for the information provided by ChatGPT was 50.3 vs. 56.0 for the Circulation Foundation information leaflets (SD=3.38, p<0.001). Conclusion We conclude that AI-generated information about vascular surgical procedures is currently poor in both readability and quality. Patients should be directed to reputable, human-generated information sources from trusted professional bodies to supplement direct education from the clinician during the pre-procedure consultation process.
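All four readability indices are closed-form formulas over sentence, word, and syllable counts. A self-contained sketch (the syllable counter is a crude vowel-group heuristic; published scores use dictionary-grade syllabification, e.g., via the textstat package):

```python
import math
import re

def count_syllables(word):
    # Crude heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    s = max(1, len(re.findall(r"[.!?]+", text)))  # sentence count
    words = re.findall(r"[A-Za-z']+", text)
    w = len(words)
    syl = sum(count_syllables(word) for word in words)
    poly = sum(count_syllables(word) >= 3 for word in words)  # "complex" words
    return {
        "FKRE": 206.835 - 1.015 * w / s - 84.6 * syl / w,
        "FKGL": 0.39 * w / s + 11.8 * syl / w - 15.59,
        "GFS": 0.4 * (w / s + 100 * poly / w),
        "SMOG": 1.043 * math.sqrt(poly * 30 / s) + 3.1291,
    }

print(readability("An endovascular aortic repair is a minimally invasive "
                  "procedure used to treat an aneurysm of the aorta."))
```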
Collapse
Affiliation(s)
- Omar Haidar
- Vascular Surgery, Lister Hospital, Stevenage, GBR
| | | | | | | |
Collapse
|
34
|
Murphy Lonergan R, Curry J, Dhas K, Simmons BI. Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps. Cureus 2023; 15:e48788. [PMID: 38098921 PMCID: PMC10720372 DOI: 10.7759/cureus.48788] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/14/2023] [Indexed: 12/17/2023] Open
Abstract
Large language models (LLMs) have broad potential applications in medicine, such as aiding with education, providing reassurance to patients, and supporting clinical decision-making. However, there is a notable gap in understanding their applicability and performance in the surgical domain and how their performance varies across specialties. This paper aims to evaluate the performance of LLMs in answering surgical questions relevant to clinical practice and to assess how this performance varies across different surgical specialties. We used the MedMCQA dataset, a large-scale multiple-choice question-answering (MCQA) dataset consisting of clinical questions across all areas of medicine. We extracted the relevant 23,035 surgical questions and submitted them to the popular LLMs Generative Pre-trained Transformers (GPT)-3.5 and GPT-4 (OpenAI OpCo, LLC, San Francisco, CA). A Generative Pre-trained Transformer is a large language model that can generate human-like text by predicting subsequent words in a sentence based on the context of the words that come before them. It is pre-trained on a diverse range of texts and can perform a variety of tasks, such as answering questions, without needing task-specific training. The question-answering accuracy of GPT was calculated and compared between the two models and across surgical specialties. GPT-3.5 and GPT-4 achieved accuracies of 53.3% and 64.4%, respectively, on surgical questions, a statistically significant difference in performance. When compared to their performance on the full MedMCQA dataset, the two models behaved differently: GPT-4 performed worse on surgical questions than on the dataset as a whole, while GPT-3.5 showed the opposite pattern. Significant variations in accuracy were also observed across surgical specialties, with strong performances in anatomy, vascular, and paediatric surgery and worse performances in orthopaedics, ENT, and neurosurgery. Large language models exhibit promising capabilities in addressing surgical questions, although the variability in their performance between specialties cannot be ignored. The lower performance of the latest GPT-4 model on surgical questions relative to questions across all of medicine highlights the need for targeted improvements and continuous updates to ensure relevance and accuracy in surgical applications. Further research and continuous monitoring of LLM performance in surgical domains are crucial to fully harnessing their potential and mitigating the risks of misinformation.
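Per-specialty accuracy and the overall model comparison reduce to a groupby and a two-proportion test. A sketch with toy data (the MedMCQA results themselves are not reproduced); since both models answered the same questions, a paired test such as McNemar's would be stricter, but the unpaired z-test is kept here for brevity:

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-question results: specialty label plus a correctness flag per model.
df = pd.DataFrame({
    "specialty": ["anatomy", "anatomy", "ENT", "ENT", "vascular", "vascular"],
    "gpt35_correct": [1, 1, 0, 1, 1, 0],
    "gpt4_correct": [1, 1, 1, 0, 1, 1],
})

print(df.groupby("specialty")[["gpt35_correct", "gpt4_correct"]].mean())

# Two-proportion z-test on the models' overall accuracies.
successes = [df["gpt35_correct"].sum(), df["gpt4_correct"].sum()]
trials = [len(df)] * 2
stat, p = proportions_ztest(successes, trials)
print(f"z = {stat:.2f}, p = {p:.3f}")
```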
Collapse
Affiliation(s)
- Rebecca Murphy Lonergan
- Department of Medical Education, Chelsea and Westminster Hospital NHS Foundation Trust, London, GBR
| | - Jake Curry
- Centre for Ecology and Conservation, University of Exeter, Penryn, GBR
| | - Kallpana Dhas
- Department of Medical Education, Chelsea and Westminster Hospital NHS Foundation Trust, London, GBR
| | - Benno I Simmons
- Centre for Ecology and Conservation, University of Exeter, Penryn, GBR
| |
Collapse
|
35
|
Mondal H, Dash I, Mondal S, Behera JK. ChatGPT in Answering Queries Related to Lifestyle-Related Diseases and Disorders. Cureus 2023; 15:e48296. [PMID: 38058315 PMCID: PMC10696911 DOI: 10.7759/cureus.48296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/04/2023] [Indexed: 12/08/2023] Open
Abstract
Background Lifestyle-related diseases and disorders have become a significant global health burden. However, much of the population ignores such conditions or does not consult doctors about them. An artificial intelligence (AI)-based large language model (LLM) like ChatGPT (GPT-3.5) is capable of generating customized responses to a user's queries. Hence, it can act as a virtual telehealth agent. Its capability to answer questions about lifestyle-related diseases or disorders has not been explored. Objective This study aimed to evaluate the effectiveness of ChatGPT, an LLM, in answering queries related to lifestyle-related diseases or disorders. Methods A set of 20 lifestyle-related disease or disorder cases covering a wide range of topics, such as obesity, diabetes, cardiovascular health, and mental health, was prepared, each case with four questions. Each case and its questions were presented to ChatGPT, which was asked to answer them. Two physicians rated the content on a three-point Likert-like scale of accurate (2), partially accurate (1), and inaccurate (0). Further, the content was rated as adequate (2), inadequate (1), or misleading (0) to test the applicability of the guidance for patients. The readability of the text was analyzed with the Flesch-Kincaid Ease Score (FKES). Results Among the 20 cases, the average score for accuracy was 1.83±0.37 and for guidance was 1.9±0.21. Both scores were higher than the hypothetical median of 1.5 (p=0.004 and p<0.0001, respectively). ChatGPT answered the questions with a natural tone in 11 cases and with a positive tone in nine. The text was understandable by college graduates, with a mean FKES of 27.8±5.74. Conclusion The analysis of content accuracy revealed that ChatGPT provided reasonably accurate information in the majority of cases, successfully addressing queries related to lifestyle-related diseases or disorders. Hence, patients can obtain initial guidance about their condition when they have little time to consult a doctor or are waiting for an appointment.
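Comparing scores against a hypothetical median of 1.5, as above, is a one-sample Wilcoxon signed-rank test. A sketch with invented ratings on the same 0/1/2 scale:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-case accuracy ratings (0 = inaccurate, 1 = partially accurate,
# 2 = accurate) for the 20 cases.
scores = np.array([2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2])

# Test whether scores exceed the hypothetical median of 1.5; no difference can
# equal zero on this scale, so no observations are discarded.
stat, p = wilcoxon(scores - 1.5, alternative="greater")
print(f"W = {stat:.1f}, p = {p:.4f}")
```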
Collapse
Affiliation(s)
- Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, IND
| | - Ipsita Dash
- Biochemistry, Saheed Laxman Nayak Medical College and Hospital, Koraput, IND
| | - Shaikat Mondal
- Physiology, Raiganj Government Medical College and Hospital, Raiganj, IND
| | | |
Collapse
|
36
|
Hernandez CA, Vazquez Gonzalez AE, Polianovskaia A, Amoro Sanchez R, Muyolema Arce V, Mustafa A, Vypritskaya E, Perez Gutierrez O, Bashir M, Eighaei Sedeh A. The Future of Patient Education: AI-Driven Guide for Type 2 Diabetes. Cureus 2023; 15:e48919. [PMID: 38024047 PMCID: PMC10654048 DOI: 10.7759/cureus.48919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/16/2023] [Indexed: 12/01/2023] Open
Abstract
Introduction and aim The surging incidence of type 2 diabetes has become a growing concern for the healthcare sector. This chronic ailment, characterized by its complex blend of genetic and lifestyle determinants, has witnessed a notable increase in recent times, exerting substantial pressure on healthcare resources. As more individuals turn to online platforms for health guidance and embrace the use of Chat Generative Pre-trained Transformer (ChatGPT; San Francisco, CA: OpenAI), a text-generating AI (TGAI), to gain insights into their well-being, evaluating its effectiveness and reliability becomes crucial. This research primarily aimed to evaluate the correctness of TGAI responses to type 2 diabetes (T2DM) inquiries via ChatGPT. Furthermore, this study aimed to examine the consistency of TGAI in addressing common queries on T2DM complications for patient education. Material and methods Questions on T2DM were formulated by experienced physicians and screened by research personnel before being put to ChatGPT. Each question was posed three times, and the collected answers were summarized. Responses were then sorted by two seasoned physicians into three distinct categories: (a) appropriate, (b) inappropriate, and (c) unreliable. In instances of differing opinions, a third physician was consulted to achieve consensus. Results From the initial set of 110 T2DM questions, 40 were dismissed by experts as not relevant, resulting in a final count of 70. An overwhelming 98.5% of the AI's answers were judged appropriate, underscoring its reliability relative to traditional online search engines. Nonetheless, the 1.5% rate of inappropriate responses underlines the importance of ongoing AI improvements and strict adherence to medical protocols. Conclusion TGAI provides medical information of high quality and reliability. This study underscores TGAI's impressive effectiveness in delivering reliable information about T2DM, with 98.5% of responses aligning with the standard of care. These results hold promise for integrating AI platforms as supplementary tools to enhance patient education and outcomes.
Collapse
|
37
|
Sikander B, Baker JJ, Deveci CD, Lund L, Rosenberg J. ChatGPT-4 and Human Researchers Are Equal in Writing Scientific Introduction Sections: A Blinded, Randomized, Non-inferiority Controlled Study. Cureus 2023; 15:e49019. [PMID: 38111405 PMCID: PMC10727453 DOI: 10.7759/cureus.49019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/18/2023] [Indexed: 12/20/2023] Open
Abstract
Background Natural language processing models are increasingly used in scientific research, and their ability to perform various tasks in the research process is rapidly advancing. This study aims to investigate whether Generative Pre-trained Transformer 4 (GPT-4) is equal to humans in writing introduction sections for scientific articles. Methods This randomized non-inferiority study was reported according to the Consolidated Standards of Reporting Trials for non-inferiority trials and artificial intelligence (AI) guidelines. GPT-4 was instructed to synthesize 18 introduction sections based on the aims of previously published studies, and these sections were compared to the human-written introductions already published in a medical journal. Eight blinded assessors evaluated the introduction sections in random order using 1-10 Likert scales. Results There was no significant difference between GPT-4 and human introductions regarding publishability and content quality. GPT-4 scored one point significantly higher in readability, a difference considered not meaningful. The majority of assessors (59%) preferred GPT-4, while 33% preferred human-written introductions. Based on Lix and Flesch-Kincaid scores, GPT-4 introductions were 10 and two points higher, respectively, indicating longer sentences and longer words. Conclusion GPT-4 was found to be equal to humans in writing introductions with regard to publishability, readability, and content quality. The majority of assessors preferred GPT-4 introductions, and fewer than half could determine which introductions were written by GPT-4 and which by humans. These findings suggest that GPT-4 can be a useful tool for writing introduction sections, and further studies should evaluate its ability to write other parts of scientific articles.
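A non-inferiority conclusion of this kind rests on the confidence interval for the mean difference staying above a pre-set margin. A simplified normal-approximation sketch with invented ratings and an assumed margin of one Likert point:

```python
import numpy as np

# Hypothetical 1-10 publishability ratings for the 18 introductions per arm.
gpt4 = np.array([7, 8, 6, 7, 7, 8, 6, 7, 7, 8, 7, 6, 8, 7, 7, 6, 7, 8])
human = np.array([7, 7, 7, 8, 6, 7, 7, 6, 8, 7, 7, 7, 6, 8, 7, 7, 6, 7])
margin = -1.0  # assumed non-inferiority margin on the 10-point scale

diff = gpt4.mean() - human.mean()
se = np.sqrt(gpt4.var(ddof=1) / len(gpt4) + human.var(ddof=1) / len(human))
lower = diff - 1.645 * se  # lower bound of the one-sided 95% CI
print(f"difference = {diff:.2f}, lower bound = {lower:.2f}, "
      f"non-inferior: {lower > margin}")
```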
Affiliation(s)
- Lars Lund
- Urology, Odense University Hospital, Odense, DNK
38
Abujaber AA, Abd-Alrazaq A, Al-Qudimat AR, Nashwan AJ. A Strengths, Weaknesses, Opportunities, and Threats (SWOT) Analysis of ChatGPT Integration in Nursing Education: A Narrative Review. Cureus 2023; 15:e48643. [PMID: 38090452 PMCID: PMC10711690 DOI: 10.7759/cureus.48643]
Abstract
Amidst evolving healthcare demands, nursing education plays a pivotal role in preparing future nurses for complex challenges. Traditional approaches, however, must be revised to meet modern healthcare needs. ChatGPT, an AI-based chatbot, has garnered significant attention for its ability to personalize learning experiences, enhance virtual clinical simulations, and foster collaborative learning in nursing education. This review aims to thoroughly assess the potential impact of integrating ChatGPT into nursing education. The premise is that a comprehensive SWOT analysis examining the strengths, weaknesses, opportunities, and threats associated with ChatGPT can provide stakeholders with valuable insights, enabling informed decisions about its integration that prioritize improved learning outcomes. A thorough narrative literature review was undertaken to provide a solid foundation for the SWOT analysis. The materials included scholarly articles and reports, ensuring the study's credibility and allowing for a holistic and unbiased assessment. The analysis identified accessibility, consistency, adaptability, cost-effectiveness, and staying up-to-date as crucial factors influencing the strengths, weaknesses, opportunities, and threats associated with ChatGPT integration in nursing education. These themes provided a framework for understanding the potential risks and benefits of integrating ChatGPT into nursing education. This review highlights the importance of responsible and effective use of ChatGPT in nursing education and the need for collaboration among educators, policymakers, and AI developers. Addressing the identified challenges and leveraging the strengths of ChatGPT can lead to improved learning outcomes and enriched educational experiences for students, balancing technological advancement with careful consideration of the associated risks.
Affiliation(s)
- Alaa Abd-Alrazaq
- AI Center for Precision Health, Weill Cornell Medicine-Qatar, Doha, QAT
- Ahmad R Al-Qudimat
- Department of Public Health, Qatar University, Doha, QAT
- Surgical Research Section, Department of Surgery, Hamad Medical Corporation, Doha, QAT
39
Kaneda Y, Takita M, Hamaki T, Ozaki A, Tanimoto T. ChatGPT's Potential in Enhancing Physician Efficiency: A Japanese Case Study. Cureus 2023; 15:e48235. [PMID: 38050503 PMCID: PMC10693924 DOI: 10.7759/cureus.48235]
Abstract
Artificial intelligence (AI), particularly ChatGPT, developed by OpenAI (San Francisco, CA, USA), is making significant strides in the medical field. In a simulated case study, the dialogue between a 66-year-old Japanese female patient and a physician was transcribed and input into ChatGPT to assess its efficacy in drafting medical records, formulating differential diagnoses, and establishing treatment plans. The results showed high similarity between the medical summaries generated by ChatGPT and those of the attending physician. This suggests that ChatGPT has the potential to assist physicians in clinical reasoning and reduce the administrative burden, allowing them to spend more time with patients. However, there are limitations, such as the system's reliance on linguistic data and occasional inaccuracies. Despite its potential, the ethical implications of using patient data and the risk of AI replacing clinicians emphasize the need for continuous evaluation, rigorous oversight, and the establishment of comprehensive guidelines. As AI continues to integrate into healthcare, it is crucial for physicians to ensure that technology complements, rather than replaces, human expertise, with the primary focus remaining on delivering high-quality patient care.
Affiliation(s)
- Yudai Kaneda
- Epidemiology and Public Health, School of Medicine, Hokkaido University, Hokkaido, JPN
- Morihito Takita
- Internal Medicine, Medical Governance Research Institute, Tokyo, JPN
- Tamae Hamaki
- Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Shinjuku, Tokyo, JPN
- Akihiko Ozaki
- Breast and Thyroid Surgery, Jyoban Hospital of Tokiwa Foundation, Fukushima, JPN
- Tetsuya Tanimoto
- Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Kawasaki, Kanagawa, JPN
40
Makiev KG, Asimakidou M, Vasios IS, Keskinis A, Petkidis G, Tilkeridis K, Ververidis A, Iliopoulos E. A Study on Distinguishing ChatGPT-Generated and Human-Written Orthopaedic Abstracts by Reviewers: Decoding the Discrepancies. Cureus 2023; 15:e49166. [PMID: 38130535 PMCID: PMC10733892 DOI: 10.7759/cureus.49166]
Abstract
BACKGROUND ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI)-based language model that generates human-like text. Such AI-generated writing is comprehensible, contextually relevant, and difficult to differentiate from human-written content. ChatGPT has risen in popularity lately and is widely utilized in scholarly manuscript drafting. The aims of this study were to identify whether 1) human reviewers can differentiate between AI-generated and human-written abstracts and 2) AI detectors are currently reliable in detecting AI-generated abstracts. METHODS Seven blinded reviewers were asked to read 21 abstracts and determine which were AI-generated and which were human-written. The first group consisted of three orthopaedic residents with limited research experience (OR). The second group included three orthopaedic professors with extensive research experience (OP). The seventh reviewer was a non-orthopaedic doctor and acted as a control in terms of expertise. All abstracts were scanned by a plagiarism detection program. The performance of two different AI detectors in detecting AI-generated abstracts was also analyzed. A structured interview was conducted at the end of the survey to evaluate the decision-making process used by each reviewer. RESULTS The OR group correctly identified the authorship of 34.9% of the abstracts and the OP group 31.7%. The non-orthopaedic control correctly identified 76.2%. All AI-generated abstracts were 100% unique (0% plagiarism). The first AI detector correctly identified the authors of only 9/21 (42.9%) abstracts, whereas the second AI detector identified 14/21 (66.6%). CONCLUSION The inability to correctly identify AI-generated content poses a significant scientific risk, as "false" abstracts can end up in scientific conferences or publications. Neither expertise nor research background was shown to have any meaningful impact on the predictive outcome. Focusing on how statistical data are presented may help the differentiation process. Further research is warranted to highlight which elements could help reveal an AI-generated abstract.
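The headline percentages above are simple label-agreement rates. A minimal Python sketch with hypothetical ground-truth and detector labels for 21 abstracts (the per-abstract labels are not published in the abstract itself):

    # Hypothetical "ai"/"human" ground truth for 21 abstracts and one
    # detector's calls; accuracy is the fraction of matching labels.
    truth     = ["ai", "human", "ai", "human", "ai", "human", "human"] * 3
    predicted = ["ai", "human", "human", "human", "ai", "ai", "human"] * 3

    correct = sum(t == p for t, p in zip(truth, predicted))
    print(f"{correct}/{len(truth)} correct = {correct / len(truth):.1%}")

With two classes, 50% is chance-level, which puts the residents' and professors' rates (34.9% and 31.7%) below chance and only the second detector's 66.6% meaningfully above it.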
Affiliation(s)
- Konstantinos G Makiev
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Maria Asimakidou
- School of Medicine, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Ioannis S Vasios
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Anthimos Keskinis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Georgios Petkidis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Konstantinos Tilkeridis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Athanasios Ververidis
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
- Efthymios Iliopoulos
- Department of Orthopaedics, University General Hospital of Alexandroupolis, Democritus University of Thrace, Alexandroupoli, GRC
41
Aliyeva A. "Bot or Not": Turing Problem in Otolaryngology. Cureus 2023; 15:e48170. [PMID: 38046723 PMCID: PMC10693309 DOI: 10.7759/cureus.48170]
Abstract
The aim of this article is to shed light on the evolving landscape of artificial intelligence (AI) integration in otolaryngology and its implications, with a particular focus on the ethical considerations surrounding AI applications. It also highlights the potential benefits of ChatGPT in patient management and scientific research within otolaryngology while emphasizing the necessity for ethical guidelines and validation processes. Ultimately, the article seeks to encourage a responsible and informed approach to AI adoption in otolaryngology, promoting collaboration between AI and healthcare professionals for the betterment of science and human well-being.
Affiliation(s)
- Aynur Aliyeva
- Otolaryngology - Head and Neck Surgery, Cincinnati Children's Hospital Medical Center, Ohio, USA
42
Diane A, Gencarelli P, Lee JM, Mittal R. Utilizing ChatGPT to Streamline the Generation of Prior Authorization Letters and Enhance Clerical Workflow in Orthopedic Surgery Practice: A Case Report. Cureus 2023; 15:e49680. [PMID: 38161881 PMCID: PMC10756745 DOI: 10.7759/cureus.49680]
Abstract
Prior authorization is a cumbersome process that requires clinicians to create an individualized letter containing detailed information about the patient's medical condition, the proposed treatment plan, and any supplemental information required to obtain approval from the patient's insurance company before services or procedures may be provided. Drafting authorization letters is time-consuming clerical work that places an increased administrative burden on orthopedic surgeons and office staff while taking time away from patient care. There is therefore a need to improve this process by streamlining workflows so that healthcare providers can prioritize direct patient care. In this report, we present a case utilizing OpenAI's ChatGPT (OpenAI, L.L.C., San Francisco, CA, USA) to draft a prior authorization request letter for the use of matrix-induced autologous chondrocyte implantation to treat a cartilage injury of the knee.
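A draft like the one described could also be requested programmatically rather than through the chat interface. A minimal sketch assuming the OpenAI Python SDK (v1.x interface), an OPENAI_API_KEY in the environment, and entirely hypothetical prompt text; the case report does not specify its exact prompts or tooling:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical prompts; real use would insert de-identified chart details.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You draft formal prior authorization letters "
                        "for an orthopedic surgery practice."},
            {"role": "user",
             "content": "Draft a prior authorization request for "
                        "matrix-induced autologous chondrocyte implantation "
                        "to treat a knee cartilage injury."},
        ],
    )
    print(response.choices[0].message.content)

Any real workflow would need to keep protected health information out of third-party APIs unless a compliant data-handling agreement is in place.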
Affiliation(s)
- Alioune Diane
- Department of Orthopaedic Surgery, Rutgers Robert Wood Johnson Medical School, New Brunswick, USA
- Pasquale Gencarelli
- Department of Orthopaedic Surgery, Rutgers Robert Wood Johnson Medical School, New Brunswick, USA
- James M Lee
- Department of Orthopaedic Surgery, Orange Orthopaedic Associates, West Orange, USA
- Rahul Mittal
- Department of Health Informatics, Rutgers School of Health Professions, Newark, USA
43
Lenihan D. Three Effective, Efficient, and Easily Implementable Ways to Integrate A.I. Into Medical Education. Cureus 2023; 15:e47204. [PMID: 37854479 PMCID: PMC10581027 DOI: 10.7759/cureus.47204]
Abstract
As a medical school CEO who is following the development of A.I. very closely, I believe that medical students are eager to embrace the possibilities that A.I. tools can deliver in their training. Not only do these students already use variations of A.I. in other areas of their lives, but they also embrace advanced technology and understand how to use it. With the tech readiness of today's students in mind, I have devised three recommendations for how best to infuse A.I. into medical education. This strategic guidance can deliver significant benefits to today's tech-fluent medical school students and enhance their training on their journeys to becoming doctors.
Affiliation(s)
- David Lenihan
- Neurology, Ponce Health Sciences University, Ponce, PRI
- Neurology, Ponce Health Sciences University, St. Louis, USA
44
Li J, Zhong J, Li Z, Xiao Y, Wang S. Ectopic Pituitary Neuroendocrine Tumor: A Case Report Written With the Help of ChatGPT. Cureus 2023; 15:e46999. [PMID: 37965416 PMCID: PMC10641033 DOI: 10.7759/cureus.46999]
Abstract
ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot program developed by OpenAI (OpenAI Incorporated, Mission District, San Francisco, USA). It is a natural-language processing tool driven by artificial intelligence technology that can complete tasks such as writing emails, video scripts, copywriting, translations, code, and papers. As several studies have now examined the use of this tool for academic writing, we report a rare case of a suprasellar ectopic growth hormone-secreting pituitary neuroendocrine tumor (PitNET) in a 53-year-old female patient with facial changes and hypertrophy of the hands and feet for six years. When asked about unusual topics, ChatGPT can provide answers but not specific ones, and, as with most machine learning solutions, verifying the correctness of its information is a problem. Question asked of ChatGPT: Summarize ectopic pituitary neuroendocrine tumor. "Suprasellar heterotopic pituitary neuroendocrine tumors refer to tumors formed by the heterotopic growth of pituitary neuroendocrine tumor cells in the sellar region other than the pituitary glands. This tumor is common in middle-aged women, and it often manifests as headache, visual impairment, and endocrine dysfunction. Diagnosis is usually made by combining history, physical examination, blood tests, and imaging. Treatment is primarily performed through surgical resection of the tumor, radiotherapy, and drug therapy, and the prognosis is mostly good." The content in quotation marks was generated by ChatGPT.
Affiliation(s)
- Jun Li
- Neurosurgery, Department of Neurosurgery, Fuzhou 900th Hospital, Fuzong Clinical Medical College of Fujian Medical University, Fuzhou, CHN
- Jiansheng Zhong
- Neurosurgery, Department of Neurosurgery, Fuzhou 900th Hospital, Fuzong Clinical Medical College of Fujian Medical University, Fuzhou, CHN
- Ziqi Li
- Neurosurgery, Department of Neurosurgery, Oriental Hospital Affiliated to Xiamen University, Fuzhou, CHN
- Yong Xiao
- Neurosurgery, Central Institute for Mental Health, University of Heidelberg, Heidelberg, DEU
- Shousen Wang
- Neurosurgery, Department of Neurosurgery, Oriental Hospital Affiliated to Xiamen University, Fuzhou, CHN
45
Kuang YR, Zou MX, Niu HQ, Zheng BY, Zhang TL, Zheng BW. ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 2023; 109:2886-2891. [PMID: 37352529 PMCID: PMC10583932 DOI: 10.1097/js9.0000000000000571]
Abstract
BACKGROUND ChatGPT, powered by the GPT model and Transformer architecture, has demonstrated remarkable performance in medicine and healthcare, providing customized and informative responses. In this study, we investigated the potential of ChatGPT in the field of neurosurgery, focusing on its applications at the patient, neurosurgery student/resident, and neurosurgeon levels. METHODS The authors conducted inquiries with ChatGPT from the viewpoints of patients, neurosurgery students/residents, and neurosurgeons, covering a range of topics such as disease diagnosis, treatment options, prognosis, rehabilitation, and patient care. The authors also explored concepts related to neurosurgery, including fundamental principles and clinical aspects, as well as tools and techniques to enhance the skills of neurosurgery students/residents. Additionally, the authors examined disease-specific medical interventions and the decision-making processes involved in clinical practice. RESULTS The authors received individual responses from ChatGPT, but these tended to be shallow and repetitive, lacking depth and personalization. Furthermore, ChatGPT may struggle to discern a patient's emotional state, hindering the establishment of rapport and the delivery of appropriate care. The language used in the medical field is influenced by technical and cultural factors, and biases in the training data can result in skewed or inaccurate responses. Additionally, ChatGPT's limitations include the inability to conduct physical examinations or interpret diagnostic images, potentially overlooking complex details and individual nuances in each patient's case. Moreover, its absence from the surgical setting limits its practical utility. CONCLUSION Although ChatGPT is a powerful language model, it cannot substitute for the expertise and experience of trained medical professionals. It lacks the capability to perform physical examinations, make diagnoses, administer treatments, establish trust, provide emotional support, and assist in the recovery process. Moreover, the implementation of artificial intelligence in healthcare necessitates careful consideration of legal and ethical concerns. While recognizing the potential of ChatGPT, additional training with comprehensive data is necessary to maximize its capabilities fully.
Affiliation(s)
- Yi-Rui Kuang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Ming-Xiang Zou
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Hua-Qing Niu
- Department of Ophthalmology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Bo-Yv Zheng
- Department of Orthopedics Surgery, General Hospital of the Central Theater Command, Wuhan, China
- Tao-Lan Zhang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Pharmacy, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Bo-Wen Zheng
- Department of Musculoskeletal Tumor Center, People's Hospital, Peking University, Beijing Key Laboratory of Musculoskeletal Tumor, Beijing, China
46
Cankurtaran RE, Polat YH, Aydemir NG, Umay E, Yurekli OT. Reliability and Usefulness of ChatGPT for Inflammatory Bowel Diseases: An Analysis for Patients and Healthcare Professionals. Cureus 2023; 15:e46736. [PMID: 38022227 PMCID: PMC10630704 DOI: 10.7759/cureus.46736]
Abstract
AIM We aimed to evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT) within the context of inflammatory bowel disease (IBD), which is expected to become an increasingly significant health issue in the future. In addition, the study assessed whether ChatGPT serves as a reliable and useful resource for both patients and healthcare professionals. METHODS Twenty specific questions were identified for the two main forms of IBD, Crohn's disease (CD) and ulcerative colitis (UC). The questions were divided into two sets: one containing questions directed at healthcare professionals and the other containing questions directed at patients. The responses were evaluated with seven-point Likert-type reliability and usefulness scales. RESULTS Reliability and usefulness scores were calculated for four groups (two diseases and two question sources) by averaging the scores of both raters. The highest reliability and usefulness scores were obtained for the professional-directed questions (5.00±1.21 and 5.15±1.08, respectively), followed by the CD questions (4.70±1.26 and 4.75±1.06) and the UC questions (4.40±1.21 and 4.55±1.31). The reliability scores of the answers for professionals were significantly higher than those for patients (both raters, p=0.032). CONCLUSION Despite its capacity for reliability and usefulness in the context of IBD, ChatGPT still has limitations and deficiencies. Correcting these deficiencies and enhancing the model with more detailed and up-to-date information could make it a significant source of information for both patients and medical professionals.
Affiliation(s)
- Yunus Halil Polat
- Department of Gastroenterology, Ankara Training and Research Hospital, Ankara, TUR
- Ebru Umay
- Physical Medicine and Rehabilitation, University of Health Sciences, Ankara Etlik City Hospital, Ankara, TUR
- Oyku Tayfur Yurekli
- Department of Gastroenterology, Ankara Yildirim Beyazit University Faculty of Medicine, Ankara, TUR
47
Irfan B, Yaqoob A. ChatGPT's Epoch in Rheumatological Diagnostics: A Critical Assessment in the Context of Sjögren's Syndrome. Cureus 2023; 15:e47754. [PMID: 38022092 PMCID: PMC10676288 DOI: 10.7759/cureus.47754]
Abstract
INTRODUCTION The rise of artificial intelligence in medical practice is reshaping clinical care. Large language models (LLMs) like ChatGPT have the potential to assist in rheumatology by personalizing scientific information retrieval, particularly in the context of Sjögren's Syndrome. This study aimed to evaluate the efficacy of ChatGPT in providing insights into Sjögren's Syndrome and differentiating it from other rheumatological conditions. MATERIALS AND METHODS A database of peer-reviewed articles and clinical guidelines focused on Sjögren's Syndrome was compiled. Clinically relevant questions were presented to ChatGPT, with responses assessed for accuracy, relevance, and comprehensiveness. Techniques such as blinding, random control queries, and temporal analysis ensured unbiased evaluation. ChatGPT's responses were also assessed using the 15-item DISCERN tool. RESULTS ChatGPT effectively highlighted key immunopathological and histopathological characteristics of Sjögren's Syndrome, although some inconsistencies in crucial data and citations were noted. For a given clinical vignette, ChatGPT correctly identified potential etiological considerations, with Sjögren's Syndrome prominent among them. DISCUSSION LLMs like ChatGPT offer rapid access to vast amounts of data, which benefits both patients and providers. While they democratize information, limitations such as potential oversimplification and reference inaccuracies were observed. The balance between LLM insights and clinical judgment, as well as continuous model refinement, is crucial. CONCLUSION LLMs like ChatGPT offer significant potential in rheumatology, providing swift and broad medical insights. However, a cautious approach is vital, ensuring rigorous training and ethical application for optimal patient care and clinical practice.
Affiliation(s)
- Bilal Irfan
- Microbiology and Immunology, University of Michigan, Ann Arbor, USA
48
Köroğlu EY, Fakı S, Beştepe N, Tam AA, Çuhacı Seyrek N, Topaloglu O, Ersoy R, Cakir B. A Novel Approach: Evaluating ChatGPT's Utility for the Management of Thyroid Nodules. Cureus 2023; 15:e47576. [PMID: 38021609 PMCID: PMC10666652 DOI: 10.7759/cureus.47576]
Abstract
Background and objective Artificial intelligence (AI) applications such as Chat Generative Pre-Trained Transformer (ChatGPT), created by OpenAI, represent a revolutionary aspect of today's technology and have benefitted professionals in many fields and society at large. In this study, we aimed to assess how effective ChatGPT is in helping both the patient and the physician manage thyroid nodules, a very common pathology. Methods Fifty-five questions frequently asked by patients were identified and put to ChatGPT. Subsequently, three cases of thyroid nodules were progressively presented to ChatGPT. The answers to the patient questions were scored for correctness and reliability by two endocrinologists. As for the cases, the diagnostic and therapeutic approaches provided by ChatGPT were analyzed and scored by two endocrinologists for correctness, safety, and usability. The responses were evaluated using 7-point Likert-type scales that we designed. Results The answers to patient questions were found to be mostly correct and reliable by both raters (Rater #1: 6.47±0.50 and 6.27±0.52; Rater #2: 6.18±0.92 and 6.09±0.96). Regarding the management of cases, ChatGPT's approach was found to be largely correct, safe, and usable by Rater #1, while Rater #2 evaluated the approaches as partially or mostly correct, safe, and usable. Conclusion Based on our findings, ChatGPT can be used as an informative and reliable resource for managing patients with thyroid nodules. While it is not suitable as a primary resource for physicians, it has the potential to be a helpful and supportive tool.
Affiliation(s)
- Ekin Y Köroğlu
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Sevgül Fakı
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Nagihan Beştepe
- Endocrinology and Metabolism, Ankara City Hospital, Ankara, TUR
- Abbas A Tam
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Neslihan Çuhacı Seyrek
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Oya Topaloglu
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Reyhan Ersoy
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
- Bekir Cakir
- Endocrinology and Metabolism, Ankara Yıldırım Beyazıt University School of Medicine, Ankara, TUR
49
Sultan I, Al-Abdallat H, Alnajjar Z, Ismail L, Abukhashabeh R, Bitar L, Abu Shanap M. Using ChatGPT to Predict Cancer Predisposition Genes: A Promising Tool for Pediatric Oncologists. Cureus 2023; 15:e47594. [PMID: 38021917 PMCID: PMC10666922 DOI: 10.7759/cureus.47594]
Abstract
BACKGROUND Determining genetic susceptibility to cancer predisposition syndromes (CPS) through cancer predisposition gene (CPG) testing is critical in facilitating appropriate prevention and surveillance strategies. This study investigates the use of ChatGPT, a large language model, in predicting CPGs from clinical notes. METHODS Our study involved 53 patients with pathogenic CPG mutations. Two kinds of clinical notes were used: the first-visit note, containing a thorough history and physical examination, and the genetic clinic note, summarizing the patient's diagnosis and family history. We asked ChatGPT to recommend CPS genes based on these notes and compared its predictions with the previously identified mutations. RESULTS RB1 was the most frequently mutated gene in our cohort (34%), followed by NF1 (9.4%), TP53 (5.7%), and VHL (5.7%). Of the 53 patients, 30 had genetic clinic notes, with a median length of 54 words. ChatGPT correctly predicted the gene in 93% of these cases, although it failed to predict the EPCAM and VHL genes in specific patients. For the first-visit notes (median length: 461 words), ChatGPT correctly predicted the gene in 64% of cases. CONCLUSION ChatGPT shows promise in predicting CPGs from clinical notes, particularly genetic clinic notes. This approach may be useful in enhancing CPG testing, especially in areas lacking genetic testing resources. With further training, ChatGPT may improve its predictive potential and expand its clinical applicability. However, additional research is needed to explore the full potential and applicability of ChatGPT.
Affiliation(s)
- Iyad Sultan
- Department of Pediatrics, King Hussein Cancer Center, Amman, JOR
- Zaina Alnajjar
- Department of Medicine, Hashemite University, Zarqa, JOR
- Layan Ismail
- Department of Medicine, University of Jordan, Amman, JOR
- Razan Abukhashabeh
- Department of Cell Therapy and Applied Genomics, King Hussein Cancer Center, Amman, JOR
- Layla Bitar
- Department of Pediatric Oncology, King Hussein Cancer Center, Amman, JOR
- Mayada Abu Shanap
- Department of Pediatric Oncology, King Hussein Cancer Center, Amman, JOR
50
Biri SK, Kumar S, Panigrahi M, Mondal S, Behera JK, Mondal H. Assessing the Utilization of Large Language Models in Medical Education: Insights From Undergraduate Medical Students. Cureus 2023; 15:e47468. [PMID: 38021810 PMCID: PMC10662537 DOI: 10.7759/cureus.47468]
Abstract
Background Artificial intelligence (AI) has the potential to be integrated into medical education. Among AI-based technologies, large language models (LLMs) such as ChatGPT, Google Bard, Microsoft Bing, and Perplexity have emerged as powerful tools with natural language processing capabilities. Against this background, this study investigated the knowledge, attitude, and practice of undergraduate medical students regarding the utilization of LLMs in medical education at a medical college in Jharkhand, India. Methods A cross-sectional online survey was sent to 370 undergraduate medical students via Google Forms. The questionnaire comprised the following three domains: knowledge, attitude, and practice, each containing six questions. Cronbach's alpha for the knowledge, attitude, and practice domains was 0.703, 0.707, and 0.809, respectively, and the intraclass correlation coefficients were 0.82, 0.87, and 0.78, respectively. The average scores in the three domains were compared using ANOVA. Results A total of 172 students participated in the study (response rate: 46.49%). The largest share of students (45.93%) rarely used LLMs for teaching-learning purposes (chi-square (3) = 41.44, p < 0.0001). The overall scores for knowledge (3.21±0.55), attitude (3.47±0.54), and practice (3.26±0.61) differed significantly (ANOVA F (2, 513) = 10.2, p < 0.0001), with the highest score in attitude and the lowest in knowledge. Conclusion While there is a generally positive attitude toward the incorporation of LLMs in medical education, concerns about overreliance and potential inaccuracies are evident. LLMs offer the potential to enhance learning resources and provide accessible education, but their integration requires further planning. Further studies are required to explore the long-term impact of LLMs in diverse educational contexts.
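Cronbach's alpha, used above to check the internal consistency of each six-question domain, is a short computation over a respondents-by-items matrix. A minimal sketch with hypothetical ratings (numpy assumed; the study's raw data are not published in the abstract):

    import numpy as np

    def cronbach_alpha(scores):
        # scores: respondents x items matrix of Likert responses.
        # alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical six-item knowledge-domain answers from five students
    demo = np.array([
        [3, 4, 3, 3, 4, 3],
        [2, 3, 2, 3, 3, 2],
        [4, 4, 5, 4, 4, 5],
        [3, 3, 3, 2, 3, 3],
        [5, 4, 4, 5, 5, 4],
    ])
    print(round(cronbach_alpha(demo), 3))

Values around 0.7 or higher, like the 0.703-0.809 reported, are conventionally read as acceptable internal consistency for a short survey domain.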
Affiliation(s)
- Subir Kumar
- Pharmacology, Phulo Jhano Medical College, Dumka, IND
- Shaikat Mondal
- Physiology, Raiganj Government Medical College & Hospital, Raiganj, IND
- Joshil Kumar Behera
- Physiology, Nagaland Institute of Medical Sciences and Research, Kohima, IND
- Himel Mondal
- Physiology, All India Institute of Medical Sciences, Deoghar, IND