1
Mastrokostas PG, Mastrokostas LE, Emara AK, Wellington IJ, Ginalis E, Houten JK, Khalsa AS, Saleh A, Razi AE, Ng MK. GPT-4 as a Source of Patient Information for Anterior Cervical Discectomy and Fusion: A Comparative Analysis Against Google Web Search. Global Spine J 2024; 14:2389-2398. PMID: 38513636; PMCID: PMC11529100; DOI: 10.1177/21925682241241241.
Abstract
STUDY DESIGN Comparative study. OBJECTIVES This study aims to compare Google and GPT-4 in terms of (1) question types, (2) response readability, (3) source quality, and (4) numerical response accuracy for the top 10 most frequently asked questions (FAQs) about anterior cervical discectomy and fusion (ACDF). METHODS "Anterior cervical discectomy and fusion" was searched on Google and GPT-4 on December 18, 2023. Top 10 FAQs were classified according to the Rothwell system. Source quality was evaluated using JAMA benchmark criteria and readability was assessed using Flesch Reading Ease and Flesch-Kincaid grade level. Differences in JAMA scores, Flesch-Kincaid grade level, Flesch Reading Ease, and word count between platforms were analyzed using Student's t-tests. Statistical significance was set at the .05 level. RESULTS Frequently asked questions from Google were varied, while GPT-4 focused on technical details and indications/management. GPT-4 showed a higher Flesch-Kincaid grade level (12.96 vs 9.28, P = .003), lower Flesch Reading Ease score (37.07 vs 54.85, P = .005), and higher JAMA scores for source quality (3.333 vs 1.800, P = .016). Numerically, 6 out of 10 responses varied between platforms, with GPT-4 providing broader recovery timelines for ACDF. CONCLUSIONS This study demonstrates GPT-4's ability to elevate patient education by providing high-quality, diverse information tailored to those with advanced literacy levels. As AI technology evolves, refining these tools for accuracy and user-friendliness remains crucial, catering to patients' varying literacy levels and information needs in spine surgery.
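For reference, the two readability indices used in this study (and in several entries below) are simple functions of word, sentence, and syllable counts. The sketch below implements the standard Flesch formulas with hypothetical counts; published scores such as those above also depend on how the specific software counts sentences and syllables.

```python
# Illustrative sketch (not from the study): the two Flesch formulas the abstract
# relies on, computed from raw counts. Real tools differ mainly in how they
# count syllables and sentences, so scores such as 37.07 vs 54.85 depend on
# the counting rules of the software used.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores mean easier text (90-100 ~ 5th grade, 30-50 ~ college)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate U.S. school grade level needed to understand the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

if __name__ == "__main__":
    # Hypothetical counts for a short patient-education paragraph.
    w, s, syl = 120, 8, 190
    print(round(flesch_reading_ease(w, s, syl), 2))   # about 57.66
    print(round(flesch_kincaid_grade(w, s, syl), 2))  # about 8.94
```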
Affiliation(s)
- Paul G. Mastrokostas
- College of Medicine, State University of New York (SUNY) Downstate, Brooklyn, NY, USA
- Ahmed K. Emara
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Ian J. Wellington
- Department of Orthopaedic Surgery, University of Connecticut, Hartford, CT, USA
- John K. Houten
- Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, USA
- Amrit S. Khalsa
- Department of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
- Ahmed Saleh
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Afshin E. Razi
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell K. Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
2
Guirguis PG, Youssef MP, Punreddy A, Botros M, Raiford M, McDowell S. Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients? Clin Orthop Relat Res 2024:00003086-990000000-01751. PMID: 39330944; DOI: 10.1097/corr.0000000000003263.
Abstract
BACKGROUND Patients and caregivers may experience immense distress when receiving the diagnosis of a primary musculoskeletal malignancy and subsequently turn to internet resources for more information. It is not clear whether these resources, including Google and ChatGPT, offer patients information that is readable, a measure of how easy text is to understand. Since many patients turn to Google and artificial intelligence resources for healthcare information, we thought it was important to ascertain whether the information they find is readable and easy to understand. The objective of this study was to compare readability of Google search results and ChatGPT answers to frequently asked questions and assess whether these sources meet NIH recommendations for readability. QUESTIONS/PURPOSES (1) What is the readability of ChatGPT-3.5 as a source of patient information for the three most common primary bone malignancies compared with top online resources from Google search? (2) Do ChatGPT-3.5 responses and online resources meet NIH readability guidelines for patient education materials? METHODS This was a cross-sectional analysis of the 12 most common online questions about osteosarcoma, chondrosarcoma, and Ewing sarcoma. To be consistent with other studies of similar design that utilized national society frequently asked questions lists, questions were selected from the American Cancer Society and categorized based on content, including diagnosis, treatment, and recovery and prognosis. Google was queried using all 36 questions, and top responses were recorded. Author types, such as hospital systems, national health organizations, or independent researchers, were recorded. ChatGPT-3.5 was provided each question in independent queries without further prompting. Responses were assessed with validated reading indices to determine readability by grade level. An independent t-test was performed with significance set at p < 0.05. RESULTS Google (n = 36) and ChatGPT-3.5 (n = 36) answers were recorded, 12 for each of the three cancer types. Reading grade levels based on mean readability scores were 11.0 ± 2.9 and 16.1 ± 3.6, respectively. This corresponds to the eleventh grade reading level for Google and a fourth-year undergraduate student level for ChatGPT-3.5. Google answers were more readable across all individual indices, without differences in word count. No difference in readability was present across author type, question category, or cancer type. Of 72 total responses across both search modalities, none met NIH readability criteria at the sixth-grade level. CONCLUSION Google material was presented at a high school reading level, whereas ChatGPT-3.5 was at an undergraduate reading level. The readability of both resources was inadequate based on NIH recommendations. Improving readability is crucial for better patient understanding during cancer treatment. Physicians should assess patients' needs, offer them tailored materials, and guide them to reliable resources to prevent reliance on online information that is hard to understand. LEVEL OF EVIDENCE Level III, prognostic study.
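For reference, the independent t-test described in the methods can be approximated directly from the summary statistics reported in the abstract (11.0 ± 2.9 vs 16.1 ± 3.6, n = 36 per group). The sketch below uses SciPy's summary-statistics form of the test; it is an illustration from the published means and SDs, not a re-analysis of the study data.

```python
# Rough check of the kind of independent t-test described in the abstract,
# using only the summary statistics reported there. This approximates the
# published comparison; it does not reproduce the authors' analysis.
from scipy import stats

res = stats.ttest_ind_from_stats(
    mean1=11.0, std1=2.9, nobs1=36,   # Google reading grade level
    mean2=16.1, std2=3.6, nobs2=36,   # ChatGPT-3.5 reading grade level
    equal_var=True,
)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.1e}")  # p well below 0.05
```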
Affiliation(s)
- Paul G Guirguis
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Mark P Youssef
- A.T. Still School of Osteopathic Medicine, Mesa, AZ, USA
- Ankit Punreddy
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Mina Botros
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Mattie Raiford
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Susan McDowell
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
3
Isch EL, Sarikonda A, Sambangi A, Carreras A, Sircar A, Self DM, Habarth-Morales TE, Caterson EJ, Aycart M. Evaluating the Efficacy of Large Language Models in CPT Coding for Craniofacial Surgery: A Comparative Analysis. J Craniofac Surg 2024:00001665-990000000-01868. PMID: 39221924; DOI: 10.1097/scs.0000000000010575.
Abstract
BACKGROUND The advent of Large Language Models (LLMs) like ChatGPT has introduced significant advancements in various surgical disciplines. These developments have led to an increased interest in the utilization of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. METHODS This observational study evaluated the effectiveness of 5 publicly available large language models-Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0-in accurately identifying CPT codes for craniofacial procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. RESULTS The results indicate that while there is no overall significant association between the type of AI model and the correctness of CPT code identification, there are notable differences in performance for simple and complex CPT codes among the models. Specifically, ChatGPT 4.0 showed higher accuracy for complex codes, whereas Perplexity.AI and Bard were more consistent with simple codes. DISCUSSION The use of AI chatbots for CPT coding in craniofacial surgery presents a promising avenue for reducing the administrative burden and associated costs of manual coding. Despite the lower accuracy rates compared with specialized, trained algorithms, the accessibility and minimal training requirements of the AI chatbots make them attractive alternatives. The study also suggests that priming AI models with operative notes may enhance their accuracy, offering a resource-efficient strategy for improving CPT coding in clinical practice. CONCLUSIONS This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for craniofacial surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care.
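For reference, the "association between the type of AI model and the correctness of CPT code identification" mentioned in the results is the kind of question a chi-square test on a model-by-correctness contingency table answers. The counts in the sketch below are hypothetical placeholders, since the abstract does not report the underlying tallies.

```python
# Illustrative sketch of an association test of this kind: a chi-square test
# on a model x correctness contingency table. All counts are hypothetical.
from scipy.stats import chi2_contingency

# rows: Perplexity.AI, Bard, BingAI, ChatGPT 3.5, ChatGPT 4.0
# cols: correct, partially correct, incorrect
table = [
    [10, 4, 6],
    [11, 3, 6],
    [9, 5, 6],
    [12, 3, 5],
    [14, 3, 3],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```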
Affiliation(s)
- Emily L Isch
- Department of General Surgery, Thomas Jefferson University
- Adrija Sircar
- Sidney Kimmel Medical College at Thomas Jefferson University
- D Mitchell Self
- Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- E J Caterson
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
- Mario Aycart
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
4
Ward M, Unadkat P, Toscano D, Kashanian A, Lynch DG, Horn AC, D'Amico RS, Mittler M, Baum GR. A Quantitative Assessment of ChatGPT as a Neurosurgical Triaging Tool. Neurosurgery 2024; 95:487-495. PMID: 38353523; DOI: 10.1227/neu.0000000000002867.
Abstract
BACKGROUND AND OBJECTIVES ChatGPT is a natural language processing chatbot with increasing applicability to the medical workflow. Although ChatGPT has been shown to be capable of passing the American Board of Neurological Surgery board examination, there has never been an evaluation of the chatbot in triaging and diagnosing novel neurosurgical scenarios without defined answer choices. In this study, we assess ChatGPT's capability to determine the emergent nature of neurosurgical scenarios and make diagnoses based on information one would find in a neurosurgical consult. METHODS Thirty clinical scenarios were given to 3 attendings, 4 residents, 2 physician assistants, and 2 subinterns. Participants were asked to determine if the scenario constituted an urgent neurosurgical consultation and what the most likely diagnosis was. Attending responses provided a consensus to use as the answer key. Generative pre-trained transformer (GPT) 3.5 and GPT 4 were given the same questions, and their responses were compared with those of the other participants. RESULTS GPT 4 was 100% accurate in both diagnosis and triage of the scenarios. GPT 3.5 had an accuracy of 92.59%, slightly below that of a PGY1 (96.3%), with an 88.24% sensitivity, 100% specificity, 100% positive predictive value, and 83.3% negative predictive value in triaging each situation. When making a diagnosis, GPT 3.5 had an accuracy of 92.59%, which was higher than the subinterns and similar to resident responders. CONCLUSION GPT 4 is able to diagnose and triage neurosurgical scenarios at the level of a senior neurosurgical resident. There has been a clear improvement from GPT 3.5 to GPT 4. It is likely that the recent updates in internet access and directing the functionality of ChatGPT will further improve its utility in neurosurgical triage.
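For reference, the triage metrics quoted in the results (sensitivity, specificity, positive and negative predictive value) all derive from a 2x2 confusion matrix with "urgent consult" treated as the positive class. The sketch below uses hypothetical counts; the abstract reports only the derived percentages.

```python
# Minimal sketch of the triage metrics quoted in the abstract, computed from
# a 2x2 confusion matrix. The counts here are hypothetical examples.

def triage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Treat 'urgent consult' as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),  # urgent cases correctly flagged
        "specificity": tn / (tn + fp),  # non-urgent cases correctly cleared
        "ppv": tp / (tp + fp),          # flagged cases that were truly urgent
        "npv": tn / (tn + fn),          # cleared cases that were truly non-urgent
    }

if __name__ == "__main__":
    # Example: 15 urgent and 12 non-urgent scenarios, with 2 urgent cases missed.
    print(triage_metrics(tp=13, fp=0, tn=12, fn=2))
```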
Affiliation(s)
- Max Ward
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Prashin Unadkat
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Elmezzi Graduate School of Molecular Medicine, Feinstein Institutes of Medical Research, Northwell Health, Manhasset, New York, USA
- Daniel Toscano
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Alon Kashanian
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Daniel G Lynch
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Alexander C Horn
- Department of Neurological Surgery, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
- Randy S D'Amico
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Neurological Surgery, Lenox Hill Hospital, New York, New York, USA
- Mark Mittler
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Pediatric Neurosurgery, Cohen Children's Medical Center, Queens, New York, USA
- Griffin R Baum
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Neurological Surgery, Lenox Hill Hospital, New York, New York, USA
5
Tuttle JJ, Moshirfar M, Garcia J, Altaf AW, Omidvarnia S, Hoopes PC. Learning the Randleman Criteria in Refractive Surgery: Utilizing ChatGPT-3.5 Versus Internet Search Engine. Cureus 2024; 16:e64768. PMID: 39156271; PMCID: PMC11329333; DOI: 10.7759/cureus.64768.
Abstract
Introduction Large language models such as OpenAI's (San Francisco, CA) ChatGPT-3.5 hold immense potential to augment self-directed learning in medicine, but concerns have arisen regarding its accuracy in specialized fields. This study compares ChatGPT-3.5 with an internet search engine in their ability to define the Randleman criteria and its five parameters within a self-directed learning environment. Methods Twenty-three medical students gathered information on the Randleman criteria. Each student was allocated 10 minutes to interact with ChatGPT-3.5, followed by 10 minutes to search the internet independently. Each ChatGPT-3.5 conversation, student summary, and internet reference was subsequently analyzed for accuracy, efficiency, and reliability. Results ChatGPT-3.5 provided the correct definition for 26.1% of students (6/23, 95% CI: 12.3% to 46.8%), while an independent internet search resulted in sources containing the correct definition for 100% of students (23/23, 95% CI: 87.5% to 100%, p = 0.0001). ChatGPT-3.5 incorrectly identified the Randleman criteria as a corneal ectasia staging system for 17.4% of students (4/23), fabricated a "Randleman syndrome" for 4.3% of students (1/23), and gave no definition for 52.2% of students (12/23). When a definition was given (47.8%, 11/23), a median of two of the five correct parameters was provided, along with a median of two additional falsified parameters. Conclusion The internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Learners should exercise discernment when using ChatGPT-3.5. Future initiatives should evaluate the implementation of prompt engineering and updated large language models.
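For reference, binomial confidence intervals like the 6/23 (95% CI 12.3% to 46.8%) figure above can be computed with a Wilson score interval, sketched below. The authors' exact interval method is not stated, so these bounds only approximate the reported values.

```python
# Sketch of a 95% confidence interval for a proportion such as 6/23 (26.1%),
# using the Wilson score interval. The study's exact method is not stated,
# so this will only approximate the reported 12.3%-46.8%.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - margin) / denom, (centre + margin) / denom

low, high = wilson_ci(6, 23)
print(f"{low:.1%} to {high:.1%}")  # about 12.5% to 46.5%
```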
Affiliation(s)
- Jared J Tuttle
- Ophthalmology, University of Texas Health Science Center at San Antonio, San Antonio, USA
- Majid Moshirfar
- Hoopes Vision Research Center, Hoopes Vision, Draper, USA
- John A. Moran Eye Center, University of Utah School of Medicine, Salt Lake City, USA
- Eye Banking and Corneal Transplantation, Utah Lions Eye Bank, Murray, USA
- James Garcia
- Ophthalmology, University of Texas Health Science Center at San Antonio, San Antonio, USA
- Amal W Altaf
- Medicine, University of Arizona College of Medicine - Phoenix, Phoenix, USA
6
Şahin Ş, Tekin MS, Yigit YE, Erkmen B, Duymaz YK, Bahşi İ. Evaluating the Success of ChatGPT in Addressing Patient Questions Concerning Thyroid Surgery. J Craniofac Surg 2024:00001665-990000000-01698. PMID: 38861337; DOI: 10.1097/scs.0000000000010395.
Abstract
OBJECTIVE This study aimed to evaluate the utility and efficacy of ChatGPT in addressing questions related to thyroid surgery, taking into account accuracy, readability, and relevance. METHODS A simulated physician-patient consultation on thyroidectomy was conducted by posing 21 hypothetical questions to ChatGPT. Responses were evaluated by 3 independent ear, nose and throat specialists using the DISCERN score. Readability measures, including the Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman-Liau Index, and Automated Readability Index, were also applied. RESULTS The majority of ChatGPT responses were rated fair or above using the DISCERN system, with an average score of 45.44 ± 11.24. However, the readability scores were consistently higher than the recommended grade 6 level, indicating the information may not be easily comprehensible to the general public. CONCLUSION While ChatGPT exhibits potential in answering patient queries related to thyroid surgery, its current formulation is not yet optimally tailored for patient comprehension. Further refinements are necessary for its efficient application in the medical domain.
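For reference, all six readability indices listed in the methods can be computed in one pass with the commonly used textstat Python package; the function names below assume that package, and the sample text is a placeholder rather than a response from the study.

```python
# Sketch of computing the six readability indices named in the abstract,
# assuming the `textstat` package (pip install textstat). The sample text
# is a hypothetical placeholder, not a ChatGPT response from the study.
import textstat

response = (
    "After thyroid surgery you may have a sore throat and mild neck swelling. "
    "Most people go home the same day or the next morning."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(response),
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(response),
    "Gunning Fog Index": textstat.gunning_fog(response),
    "SMOG": textstat.smog_index(response),
    "Coleman-Liau Index": textstat.coleman_liau_index(response),
    "Automated Readability Index": textstat.automated_readability_index(response),
}
for name, value in scores.items():
    print(f"{name}: {value}")
```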
Affiliation(s)
- Şamil Şahin
- Ear, Nose and Throat Specialist, Private Practice
- Yesim Esen Yigit
- Department of Otolaryngology, Umraniye Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
- Burak Erkmen
- Ear, Nose and Throat Specialist, Private Practice
- Yasar Kemal Duymaz
- Department of Otolaryngology, Umraniye Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
- İlhan Bahşi
- Department of Anatomy, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey
7
Huang KT, Mehta NH, Gupta S, See AP, Arnaout O. Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 Large Language Model in neurosurgery. J Clin Neurosci 2024; 123:151-156. PMID: 38574687; DOI: 10.1016/j.jocn.2024.03.021.
Abstract
BACKGROUND Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery. METHODS A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references. RESULTS GPT-4.0 responses were consistent with current medical guidelines and accounted for recent advances in the field 92.8% and 78.6% of the time, respectively. Neurosurgeons reported GPT-4.0 responses as providing unrealistic or potentially risky information 14.3% and 7.1% of the time, respectively. Assessed on 5-point scales, responses suggested that GPT-4.0 was clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation. CONCLUSION Current general LLM technology can offer generally accurate, safe, and helpful neurosurgical information, but may not fully evaluate medical literature or recent field advances. Citation generation and usage remain unreliable. As this technology becomes more ubiquitous, clinicians will need to exercise caution when dealing with it in practice.
Affiliation(s)
- Kevin T Huang
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Neel H Mehta
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Saksham Gupta
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Alfred P See
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States; Boston Children's Hospital, Department of Neurosurgery, 300 Longwood Avenue, Boston, MA 02115, United States
- Omar Arnaout
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
8
Parikh AO, Oca MC, Conger JR, McCoy A, Chang J, Zhang-Nunes S. Accuracy and Bias in Artificial Intelligence Chatbot Recommendations for Oculoplastic Surgeons. Cureus 2024; 16:e57611. PMID: 38707042; PMCID: PMC11069401; DOI: 10.7759/cureus.57611.
Abstract
Purpose The purpose of this study is to assess the accuracy of and bias in recommendations for oculoplastic surgeons from three artificial intelligence (AI) chatbot systems. Methods ChatGPT, Microsoft Bing Balanced, and Google Bard were asked for recommendations for oculoplastic surgeons practicing in the 20 most populous cities in the United States. Three prompts were used: "can you help me find (an oculoplastic surgeon)/(a doctor who does eyelid lifts)/(an oculofacial plastic surgeon) in (city)." Results A total of 672 suggestions were made across the three prompts (oculoplastic surgeon; doctor who does eyelid lifts; oculofacial plastic surgeon); 19.8% of suggestions were excluded, leaving 539 suggested physicians. Of these, 64.1% were oculoplastics specialists (of which 70.1% were American Society of Ophthalmic Plastic and Reconstructive Surgery (ASOPRS) members); 16.1% were general plastic surgery trained, 9.0% were ENT trained, 8.8% were ophthalmology but not oculoplastics trained, and 1.9% were trained in another specialty. Across all AI systems, 27.7% of recommended surgeons were female. Conclusions Among the chatbot systems tested, there were high rates of inaccuracy: up to 38% of recommended surgeons were nonexistent or not practicing in the city requested, and 35.9% of those recommended as oculoplastic/oculofacial plastic surgeons were not oculoplastics specialists. Choice of prompt affected the result, with requests for "a doctor who does eyelid lifts" resulting in more plastic surgeons and ENTs and fewer oculoplastic surgeons. It is important to identify inaccuracies and biases in recommendations provided by AI systems as more patients may start using them to choose a surgeon.
Affiliation(s)
- Alomi O Parikh
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Michael C Oca
- Ophthalmology, University of California San Diego School of Medicine, La Jolla, USA
- Jordan R Conger
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Allison McCoy
- Oculofacial Plastic Surgery, Del Mar Plastic Surgery, San Diego, USA
- Jessica Chang
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Sandy Zhang-Nunes
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
9
Lee KH, Lee RW. ChatGPT's Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type. Diagnostics (Basel) 2024; 14:171. PMID: 38248048; PMCID: PMC10814518; DOI: 10.3390/diagnostics14020171.
Abstract
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on evaluating ChatGPT's performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and ChatGPT's answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed, and the answers were similarly categorized. The study utilized Cohen's kappa coefficient to assess interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward MRI questions, with over 85% classified as correct. However, its performance varied significantly across multiple-choice questions, with accuracy rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring deeper understanding and context. In conclusion, this study critically evaluates the accuracy of ChatGPT in addressing questions related to Magnetic Resonance Imaging (MRI), highlighting its potential and limitations in the healthcare sector, particularly in radiology. Our findings demonstrate that ChatGPT, while proficient in responding to straightforward MRI-related questions, exhibits variability in its ability to accurately answer complex multiple-choice questions that require more profound, specialized knowledge of MRI. This discrepancy underscores the nuanced role AI can play in medical education and healthcare decision-making, necessitating a balanced approach to its application.
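For reference, the Cohen's kappa coefficient used here to quantify interobserver agreement compares the observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard formula on hypothetical labels.

```python
# Minimal sketch of Cohen's kappa for two raters who independently label
# answers as correct / partial / incorrect. The example labels are hypothetical.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["correct", "correct", "partial", "incorrect", "correct", "partial"]
b = ["correct", "partial", "partial", "incorrect", "correct", "correct"]
print(round(cohens_kappa(a, b), 2))  # about 0.45 for these example labels
```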
Affiliation(s)
- Ro-Woon Lee
- Department of Radiology, Inha University College of Medicine, Incheon 22212, Republic of Korea
10
Singh A, Das S, Mishra RK, Agrawal A. Artificial intelligence and machine learning in healthcare: Scope and opportunities to use ChatGPT. J Neurosci Rural Pract 2023; 14:391-392. PMID: 37692807; PMCID: PMC10483215; DOI: 10.25259/jnrp_391_2023.
Affiliation(s)
- Ajai Singh
- Executive Director and CEO, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India
- Saikat Das
- Department of Radiation Oncology, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India
- Rakesh Kumar Mishra
- Department of Neurosurgery, Institute of Medical Sciences, Banaras Hindu University, Varanasi, Uttar Pradesh, India
- Amit Agrawal
- Department of Neurosurgery, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India