1
Vaishya R, Iyengar KP, Patralekh MK, Botchu R, Shirodkar K, Jain VK, Vaish A, Scarlat MM. Effectiveness of AI-powered chatbots in responding to orthopaedic postgraduate exam questions: an observational study. International Orthopaedics 2024; 48:1963-1969. [PMID: 38619565 DOI: 10.1007/s00264-024-06182-9]
Abstract
PURPOSE This study analyses the performance and proficiency of three Artificial Intelligence (AI) generative chatbots (ChatGPT-3.5, ChatGPT-4.0, and Bard Google AI®) in answering the Multiple Choice Questions (MCQs) of postgraduate (PG) level orthopaedic qualifying examinations. METHODS A series of 120 mock 'Single Best Answer' (SBA) MCQs, each with four answer options labelled A, B, C and D, covering various musculoskeletal (MSK) conditions across Trauma and Orthopaedic curricula was compiled. A standardised text prompt was used to feed the questions to ChatGPT (versions 3.5 and 4.0) and Google Bard, and the responses were statistically analysed. RESULTS Significant differences were found between the responses of ChatGPT 3.5 and ChatGPT 4.0 (chi-square = 27.2, P < 0.001), and on comparing both ChatGPT 3.5 (chi-square = 63.852, P < 0.001) and ChatGPT 4.0 (chi-square = 44.246, P < 0.001) with Bard Google AI®. Bard Google AI® answered 100% of the questions correctly and was significantly more efficient than both ChatGPT 3.5 and ChatGPT 4.0 (p < 0.0001). CONCLUSION The results demonstrate the variable potential of the different AI generative chatbots (ChatGPT 3.5, ChatGPT 4.0 and Google Bard) in answering the MCQs of PG-level orthopaedic qualifying examinations. Bard Google AI® showed superior performance to both ChatGPT versions, underlining the potential of such large language models in processing and applying orthopaedic subspecialty knowledge at a PG level.
Affiliation(s)
- Raju Vaishya
- Department of Orthopaedics, Indraprastha Apollo Hospitals, Sarita Vihar, New Delhi, 110076, India.
- Karthikeyan P Iyengar
- Department of Orthopaedics, Southport and Ormskirk Hospital, Mersey West Lancashire Teaching NHS Trust, Southport, UK
- Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK
- Kapil Shirodkar
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK
- Abhishek Vaish
- Department of Orthopaedics, Indraprastha Apollo Hospitals, Sarita Vihar, New Delhi, 110076, India
2
Horiuchi D, Tatekawa H, Oura T, Shimono T, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Miki Y, Ueda D. ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol 2024. [PMID: 38995378 DOI: 10.1007/s00330-024-10902-5]
Abstract
OBJECTIVES To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology. MATERIALS AND METHODS We included 106 "Test Yourself" cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, and both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. Diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106), respectively. The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident and lower than that of the board-certified radiologist, although neither difference was significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than those of both radiologists (p < 0.001 and < 0.001, respectively). CONCLUSION GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While GPT-4-based ChatGPT's diagnostic performance was comparable to that of radiology residents, it did not reach the performance level of board-certified radiologists in musculoskeletal radiology. CLINICAL RELEVANCE STATEMENT GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and was comparable to radiology residents, but it did not reach the level of board-certified radiologists in musculoskeletal radiology. Radiologists should understand ChatGPT's current performance as a diagnostic tool for optimal utilization. KEY POINTS This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to radiology residents, but did not reach the level of board-certified radiologists. When utilizing ChatGPT, it is crucial to input appropriate descriptions of imaging findings rather than the images.
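For readers who want to see how such a comparison is typically run, the following is a minimal sketch of a chi-square test on the counts reported in the abstract (46/106 vs 9/106); the 2x2 contingency-table layout is an assumption about the analysis, not the authors' actual code.

```python
# Hedged sketch of the chi-square comparison described in the abstract:
# GPT-4-based ChatGPT answered 46/106 cases correctly vs 9/106 for
# GPT-4V-based ChatGPT. The 2x2 table layout is an assumption.
from scipy.stats import chi2_contingency

table = [
    [46, 106 - 46],  # GPT-4-based ChatGPT: correct, incorrect
    [9, 106 - 9],    # GPT-4V-based ChatGPT: correct, incorrect
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")  # p < 0.001, as reported
```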
Affiliation(s)
- Daisuke Horiuchi
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Tatsushi Oura
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Taro Shimono
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Shu Matsushita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
3
Passby L, Madhwapathi V, Tso S, Wernham A. Appraisal of AI-generated dermatology literature reviews. J Eur Acad Dermatol Venereol 2024. [PMID: 38994876 DOI: 10.1111/jdv.20237]
Abstract
BACKGROUND Artificial intelligence (AI) tools have the potential to revolutionize many facets of medicine and medical sciences research. Numerous AI tools have been developed and are in continuous states of iterative improvement in their functionality. OBJECTIVES This study aimed to assess the performance of three AI tools (The Literature, Microsoft's Copilot and Google's Gemini) in performing literature reviews on a range of dermatology topics. METHODS Each tool was asked to write a literature review on five topics, chosen because peer-reviewed systematic reviews on them have recently been published. The outputs of each tool were graded on evidence and analysis, conclusions, and references, each on a 5-point Likert scale, by three dermatologists who work in clinical practice, have completed the UK dermatology postgraduate training examination and partake in continued professional development. RESULTS Across all five topics chosen, the literature reviews written by Gemini scored the highest. The mean score for Gemini for each review was 10.53, significantly higher than the mean scores achieved by The Literature (7.73) and Copilot (7.4) (p < 0.001). CONCLUSIONS This paper shows that AI-generated literature reviews can provide real-time summaries of medical literature across a range of dermatology topics, but limitations to their comprehensiveness and accuracy are apparent.
Affiliation(s)
- Lauren Passby
- University Hospitals Birmingham NHS Foundation Trust, Solihull, UK
- Vidya Madhwapathi
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Simon Tso
- Jephson Dermatology Centre, South Warwickshire NHS Foundation Trust, Warwick, UK
4
Chen Z, Chen C, Yang G, He X, Chi X, Zeng Z, Chen X. Research integrity in the era of artificial intelligence: Challenges and responses. Medicine (Baltimore) 2024; 103:e38811. [PMID: 38968491 PMCID: PMC11224801 DOI: 10.1097/md.0000000000038811]
Abstract
The application of artificial intelligence (AI) technologies in scientific research has significantly enhanced efficiency and accuracy but has also introduced new forms of academic misconduct, such as data fabrication and text plagiarism using AI algorithms. These practices jeopardize research integrity and can mislead scientific directions. This study addresses these challenges, underscoring the need for the academic community to strengthen ethical norms, enhance researcher qualifications, and establish rigorous review mechanisms. To ensure responsible and transparent research processes, we recommend the following key actions: (1) development and enforcement of comprehensive AI research integrity guidelines that include clear protocols for AI use in data analysis and publication, ensuring transparency and accountability in AI-assisted research; (2) implementation of mandatory AI ethics and integrity training for researchers, aimed at fostering an in-depth understanding of potential AI misuses and promoting ethical research practices; and (3) establishment of international collaboration frameworks to facilitate the exchange of best practices and the development of unified ethical standards for AI in research. Protecting research integrity is paramount for maintaining public trust in science, making these recommendations urgent matters for the scientific community's consideration and action.
Affiliation(s)
- Ziyu Chen
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Changye Chen
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Guozhao Yang
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Xiangpeng He
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Xiaoxia Chi
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Zhuoying Zeng
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
- Chemical Analysis & Physical Testing Institute, Shenzhen Center for Disease Control and Prevention, Shenzhen, China
- Xuhong Chen
- The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen, China
5
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024; 105:251-265. [PMID: 38679540 DOI: 10.1016/j.diii.2024.04.003]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performance of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of the PubMed, Web of Science, Embase, and Google Scholar databases, a cohort of published studies utilizing ChatGPT for clinical radiology applications was identified up to January 1, 2024. RESULTS Of the 861 studies retrieved, 44 evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) indicated lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and on patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported ChatGPT's performance as a proportion. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and in five (5/24; 20.8%) studies there was a median agreement of 83.6% between ChatGPT outcomes and reference standards (radiologists' decisions or guidelines), generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPT-4 outperformed ChatGPT-3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, there are still multiple pitfalls and limitations to address. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
- Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
- Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
- Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
6
Botchu B, Iyengar KP, Botchu R. Can ChatGPT empower people with dyslexia? Disabil Rehabil Assist Technol 2024; 19:2131-2132. [PMID: 37697967 DOI: 10.1080/17483107.2023.2256805]
Affiliation(s)
- Karthikeyan P Iyengar
- Department of Orthopaedics, Southport and Ormskirk Hospital NHS Trust, Southport, UK
- Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK
7
Kotzur T, Singh A, Parker J, Peterson B, Sager B, Rose R, Corley F, Brady C. Evaluation of a Large Language Model's Ability to Assist in an Orthopedic Hand Clinic. Hand (N Y) 2024:15589447241257643. [PMID: 38907651 DOI: 10.1177/15589447241257643]
Abstract
BACKGROUND Advancements in artificial intelligence technology, such as OpenAI's large language model ChatGPT, could transform medicine through applications in a clinical setting. This study aimed to assess the utility of ChatGPT as a clinical assistant in an orthopedic hand clinic. METHODS Nine clinical vignettes, describing various common and uncommon hand pathologies, were constructed and reviewed by 4 fellowship-trained orthopedic hand surgeons and an orthopedic resident. ChatGPT was given these vignettes and asked to generate a differential diagnosis and a potential workup plan, and to provide treatment options for its top differential. Responses were graded for accuracy, and the overall utility was scored on a 5-point Likert scale. RESULTS ChatGPT diagnosed 7 of the 9 cases correctly, an overall accuracy rate of 78%. ChatGPT was less reliable with more complex pathologies and failed to identify an intentionally incorrect presentation. ChatGPT received a score of 3.8 ± 1.4 for correct diagnosis, 3.4 ± 1.4 for helpfulness in guiding patient management, 4.1 ± 1.0 for appropriate workup for the actual diagnosis, 4.3 ± 0.8 for an appropriate recommended treatment plan for the diagnosis, and 4.4 ± 0.8 for the helpfulness of treatment options in managing patients. CONCLUSION ChatGPT was successful in diagnosing most of the conditions; however, the overall utility of its advice was variable. While it performed well in recommending treatments, it faced difficulties in providing appropriate diagnoses for uncommon pathologies. In addition, it failed to identify an obvious error in the presenting pathology.
8
Wang ZP, Bhandary P, Wang Y, Moore JH. Using GPT-4 to write a scientific review article: a pilot evaluation study. BioData Min 2024; 17:16. [PMID: 38890715 PMCID: PMC11184879 DOI: 10.1186/s13040-024-00371-3]
Abstract
GPT-4, as the most advanced version of OpenAI's large language models, has attracted widespread attention, rapidly becoming an indispensable AI tool across various areas. This includes its exploration by scientists for diverse applications. Our study focused on assessing GPT-4's capabilities in generating text, tables, and diagrams for biomedical review papers. We also assessed the consistency in text generation by GPT-4, along with potential plagiarism issues when employing this model for the composition of scientific review papers. Based on the results, we suggest the development of enhanced functionalities in ChatGPT, aiming to meet the needs of the scientific community more effectively. This includes enhancements in uploaded document processing for reference materials, a deeper grasp of intricate biomedical concepts, more precise and efficient information distillation for table generation, and a further refined model specifically tailored for scientific diagram creation.
Affiliation(s)
- Zhiping Paul Wang
- Department of Computational Biomedicine, Cedars Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center, Suite G-541, West Hollywood, CA, 90069, USA
- Priyanka Bhandary
- Department of Computational Biomedicine, Cedars Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center, Suite G-541, West Hollywood, CA, 90069, USA
- Yizhou Wang
- Department of Computational Biomedicine, Cedars Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center, Suite G-541, West Hollywood, CA, 90069, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center, Suite G-541, West Hollywood, CA, 90069, USA.
9
DeCook R, Muffly BT, Mahmood S, Holland CT, Ayeni AM, Ast MP, Bolognese MP, Guild GN, Sheth NP, Pean CA, Premkumar A. AI-Generated Graduate Medical Education Content for Total Joint Arthroplasty: Comparing ChatGPT Against Orthopaedic Fellows. Arthroplast Today 2024; 27:101412. [PMID: 38912098 PMCID: PMC11190484 DOI: 10.1016/j.artd.2024.101412]
Abstract
Background Artificial intelligence (AI) in medicine has primarily focused on diagnosing and treating diseases and assisting in the development of academic scholarly work. This study aimed to evaluate a new use of AI in orthopaedics: content generation for professional medical education. Quality, accuracy, and time were compared between content created by ChatGPT and orthopaedic surgery clinical fellows. Methods ChatGPT and 3 orthopaedic adult reconstruction fellows were tasked with creating educational summaries of 5 total joint arthroplasty-related topics. Responses were evaluated across 5 domains by 4 blinded reviewers from different institutions who are all current or former total joint arthroplasty fellowship directors or national arthroplasty board review course directors. Results ChatGPT created better orthopaedic content than fellows when mean aggregate scores for all 5 topics and domains were compared (P ≤ .001). The only domain in which fellows outperformed ChatGPT was the integration of key points and references (P = .006). ChatGPT outperformed the fellows in response time, averaging 16.6 seconds vs the fellows' 94 minutes per prompt (P = .002). Conclusions With its efficient and accurate content generation, the current findings underscore ChatGPT's potential as an adjunctive tool to enhance orthopaedic arthroplasty graduate medical education. Future studies are warranted to explore AI's role further and optimize its utility in augmenting the educational development of arthroplasty trainees.
Affiliation(s)
- Ryan DeCook
- Philadelphia College of Osteopathic Medicine, Suwanee, GA, USA
- Brian T. Muffly
- Department of Orthopaedic Surgery, Emory University School of Medicine, Atlanta, GA, USA
- Sania Mahmood
- Department of Orthopaedic Surgery, Emory University School of Medicine, Atlanta, GA, USA
- Ayomide M. Ayeni
- Department of Orthopaedic Surgery, Emory University School of Medicine, Atlanta, GA, USA
- Michael P. Bolognese
- Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, NC, USA
- George N. Guild
- Department of Orthopaedic Surgery, Emory University School of Medicine, Atlanta, GA, USA
- Neil P. Sheth
- Department of Orthopaedic Surgery, Perelman School of Medicine, Philadelphia, PA, USA
- Christian A. Pean
- Department of Orthopaedic Surgery, Duke University School of Medicine, Durham, NC, USA
- Ajay Premkumar
- Department of Orthopaedic Surgery, Emory University School of Medicine, Atlanta, GA, USA
10
Jenko N, Ariyaratne S, Mark Davies A, Iyengar KP, Botchu R. An evaluation of AI generated literature reviews in musculoskeletal radiology. Surgeon 2024; 22:194-197. [PMID: 38218659 DOI: 10.1016/j.surge.2023.12.005]
Abstract
PURPOSE The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. It is hoped that some of these drawbacks can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) with the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics that the authors have significantly contributed to. METHODS The AI tools were asked to produce a literature review on 7 topics. These were selected as research topics that the authors are intimately familiar with and have contributed to through their own publications. The outputs produced by the AI tools were graded on a 1-5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists. RESULTS The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary which was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883). CONCLUSION Summaries produced by AI in its current state require careful human validation. GPT-4 on average provides higher quality summaries. Neither tool can reliably identify all relevant publications.
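The inter-rater agreement reported above (ICC = 0.883) can be reproduced with standard tools; the sketch below uses the pingouin package on hypothetical ratings for seven summaries, purely to illustrate the calculation, and is not the study's data or code.

```python
# Hedged sketch of an intraclass correlation (ICC) calculation like the one
# the abstract reports. The scores below are invented for illustration.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "summary": list(range(1, 8)) * 2,              # 7 summaries, two raters
    "rater": ["radiologist_A"] * 7 + ["radiologist_B"] * 7,
    "score": [5, 2, 4, 1, 3, 5, 4,                 # hypothetical Likert scores
              5, 2, 5, 1, 3, 4, 4],
})
icc = pg.intraclass_corr(data=ratings, targets="summary",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])  # the ICC2 row is the two-way random-effects estimate
```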
Affiliation(s)
- N Jenko
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK.
- S Ariyaratne
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
- L Jeys
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
- S Evans
- Orthopaedic Surgery, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
- K P Iyengar
- Orthopaedic Surgery, Mersey and West Lancashire Teaching Hospitals NHS Trust, Southport, UK
- R Botchu
- Radiology, Royal Orthopaedic Hospital NHS Foundation Trust, Birmingham, UK
11
Vaira LA, Lechien JR, Abbate V, Allevi F, Audino G, Beltramini GA, Bergonzani M, Bolzoni A, Committeri U, Crimi S, Gabriele G, Lonardi F, Maglitto F, Petrocelli M, Pucci R, Saponaro G, Tel A, Vellone V, Chiesa-Estomba CM, Boscolo-Rizzo P, Salzano G, De Riu G. Accuracy of ChatGPT-Generated Information on Head and Neck and Oromaxillofacial Surgery: A Multicenter Collaborative Analysis. Otolaryngol Head Neck Surg 2024; 170:1492-1503. [PMID: 37595113 DOI: 10.1002/ohn.489]
Abstract
OBJECTIVE To investigate the accuracy of Chat-Based Generative Pre-trained Transformer (ChatGPT) in answering questions and solving clinical scenarios of head and neck surgery. STUDY DESIGN Observational and evaluative study. SETTING Eighteen surgeons from 14 Italian head and neck surgery units. METHODS A total of 144 clinical questions encompassing different subspecialities of head and neck surgery and 15 comprehensive clinical scenarios were developed. Questions and scenarios were input into ChatGPT-4, and the resulting answers were evaluated by the researchers using Likert scales for accuracy (range 1-6), completeness (range 1-3), and quality of references. RESULTS The overall median score of open-ended questions was 6 (interquartile range [IQR]: 5-6) for accuracy and 3 (IQR: 2-3) for completeness. Overall, the reviewers rated the answer as entirely or nearly entirely correct in 87.2% of cases and as comprehensive, covering all aspects of the question, in 73% of cases. The artificial intelligence (AI) model achieved a correct response in 84.7% of the closed-ended questions (11 wrong answers). As for the clinical scenarios, ChatGPT provided a fully or nearly fully correct diagnosis in 81.7% of cases. The proposed diagnostic or therapeutic procedure was judged to be complete in 56.7% of cases. The overall quality of the bibliographic references was poor, and sources were nonexistent in 46.4% of the cases. CONCLUSION The results generally demonstrate a good level of accuracy in the AI's answers. The AI's ability to resolve complex clinical scenarios is promising, but it still falls short of being considered a reliable support for the decision-making process of specialists in head and neck surgery.
Affiliation(s)
- Luigi Angelo Vaira
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy
- Biomedical Sciences Department, PhD School of Biomedical Science, University of Sassari, Sassari, Italy
- Jerome R Lechien
- Department of Anatomy and Experimental Oncology, Mons School of Medicine, UMONS, Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otolaryngology-Head Neck Surgery, Elsan Polyclinic of Poitiers, Poitiers, France
- Vincenzo Abbate
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
- Fabiana Allevi
- Maxillofacial Surgery Department, ASST Santi Paolo e Carlo, University of Milan, Milan, Italy
- Giovanni Audino
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
- Giada Anna Beltramini
- Department of Biomedical, Surgical and Dental Sciences, University of Milan, Milan, Italy
- Maxillofacial and Dental Unit, Fondazione IRCCS Cà Granda Ospedale Maggiore Policlinico, Milan, Italy
- Michela Bergonzani
- Maxillo-Facial Surgery Division, Head and Neck Department, University Hospital of Parma, Parma, Italy
- Alessandro Bolzoni
- Department of Biomedical, Surgical and Dental Sciences, University of Milan, Milan, Italy
- Umberto Committeri
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
- Salvatore Crimi
- Operative Unit of Maxillofacial Surgery, Policlinico San Marco, University of Catania, Catania, Italy
- Guido Gabriele
- Department of Maxillofacial Surgery, University of Siena, Siena, Italy
- Fabio Lonardi
- Department of Maxillofacial Surgery, University of Verona, Verona, Italy
- Fabio Maglitto
- Maxillo-Facial Surgery Unit, University of Bari "Aldo Moro", Bari, Italy
- Marzia Petrocelli
- Maxillofacial Surgery Operative Unit, Bellaria and Maggiore Hospital, Bologna, Italy
- Resi Pucci
- Maxillofacial Surgery Unit, San Camillo-Forlanini Hospital, Rome, Italy
- Gianmarco Saponaro
- Maxillo-Facial Surgery Unit, IRCSS "A. Gemelli" Foundation-Catholic, University of the Sacred Heart, Rome, Italy
- Alessandro Tel
- Department of Head and Neck Surgery and Neuroscience, Clinic of Maxillofacial Surgery, University Hospital of Udine, Udine, Italy
- Paolo Boscolo-Rizzo
- Department of Medical, Surgical and Health Sciences, Section of Otolaryngology, University of Trieste, Trieste, Italy
- Giovanni Salzano
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
- Giacomo De Riu
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy
12
Şan H, Bayrakcı Ö, Çağdaş B, Serdengeçti M, Alagöz E. Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients. Rev Esp Med Nucl Imagen Mol 2024:500021. [PMID: 38821410 DOI: 10.1016/j.remnie.2024.500021]
Abstract
PURPOSE Searching for online health information is a popular approach employed by patients to enhance their knowledge of their diseases. Recently developed AI chatbots are probably the easiest route in this regard. The purpose of this study is to analyze the reliability and readability of AI chatbot responses concerning the most commonly applied radionuclide treatments in cancer patients. METHODS Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were asked one by one to GPT-4 and Bard in January 2024. The reliability and readability of the responses were assessed using the DISCERN scale, the Flesch Reading Ease (FRE) score and the Flesch-Kincaid Reading Grade Level (FKRGL). RESULTS The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability, the FKRGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were above the general public reading grade level. The mean (SD) DISCERN scores assessed by a nuclear medicine physician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficients of DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512 (95% CI 0.296-0.704), 0.695 (95% CI 0.518-0.829), 0.687 (95% CI 0.511-0.823) and 0.649 (95% CI 0.462-0.798), respectively (p < 0.01). The corresponding coefficients for the responses of Bard were 0.753 (95% CI 0.602-0.863), 0.812 (95% CI 0.686-0.899), 0.804 (95% CI 0.677-0.894) and 0.671 (95% CI 0.489-0.812), respectively (p < 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments was moderate to good. Further, consulting a nuclear medicine physician was rarely emphasized by either GPT-4 or Google Bard, and references were included in some responses of Google Bard but in none from GPT-4. CONCLUSION Although the information provided by AI chatbots may be acceptable in medical terms, it may not be easy for the general public to read, which may prevent it from being understood. Effective prompts using 'prompt engineering' may refine the responses in a more comprehensible manner. Since radionuclide treatments are specific to nuclear medicine expertise, nuclear medicine physicians need to be stated as consultants in responses in order to guide patients and caregivers to obtain accurate medical advice. Referencing is significant for the confidence and satisfaction of patients and caregivers seeking information.
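The two readability indices used above follow the standard Flesch formulas; the sketch below shows how they might be computed with the textstat package on a hypothetical chatbot response. It is an illustration of the metrics, not the authors' pipeline.

```python
# Hedged sketch of FRE/FKRGL scoring as described in the abstract.
# Standard formulas:
#   FRE   = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
#   FKRGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
import textstat

response = ("Radioactive iodine therapy uses iodine-131 to ablate thyroid "
            "tissue remaining after surgery for differentiated thyroid cancer.")

fre = textstat.flesch_reading_ease(response)     # higher = easier to read
fkrgl = textstat.flesch_kincaid_grade(response)  # US school grade level
print(f"FRE = {fre:.1f}, FKRGL = {fkrgl:.1f}")   # grade > 12 exceeds lay reading level
```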
Affiliation(s)
- Hüseyin Şan
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey.
- Özkan Bayrakcı
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
- Berkay Çağdaş
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
- Mustafa Serdengeçti
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
- Engin Alagöz
- Gulhane Training and Research Hospital, Department of Nuclear Medicine, Ankara, Turkey
13
Horiuchi D, Tatekawa H, Oura T, Oue S, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Shimono T, Miki Y, Ueda D. Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin Neuroradiol 2024. [PMID: 38806794 DOI: 10.1007/s00062-024-01426-y]
Abstract
PURPOSE To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in challenging neuroradiology cases. METHODS We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases from the journal Clinical Neuroradiology between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, and both generated a diagnosis for each case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy rates of ChatGPT and the radiologists were evaluated based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The radiologists achieved the following accuracy rates: three radiology residents 28% (9/32), 31% (10/32), and 28% (9/32); and three board-certified radiologists 38% (12/32), 47% (15/32), and 44% (14/32). GPT-4-based ChatGPT's diagnostic accuracy was lower than that of each radiologist, although not significantly (all p > 0.07). GPT-4V-based ChatGPT's diagnostic accuracy was also lower than that of each radiologist, and significantly lower than that of two board-certified radiologists (p = 0.02 and 0.03); the differences were not significant for the radiology residents and one board-certified radiologist (all p > 0.09). CONCLUSION While GPT-4-based ChatGPT demonstrated relatively higher diagnostic performance than GPT-4V-based ChatGPT, neither reached the performance level of radiology residents or board-certified radiologists in challenging neuroradiology cases.
Affiliation(s)
- Daisuke Horiuchi
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Tatsushi Oura
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Satoshi Oue
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Shu Matsushita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Taro Shimono
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan.
14
Ariyaratne S, Jenko N, Mark Davies A, Iyengar KP, Botchu R. Could ChatGPT Pass the UK Radiology Fellowship Examinations? Acad Radiol 2024; 31:2178-2182. [PMID: 38160089 DOI: 10.1016/j.acra.2023.11.026]
Abstract
RATIONALE AND OBJECTIVES Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence (AI) tool which utilises machine learning to generate original text resembling human language. AI models have recently demonstrated remarkable ability at analysing and solving problems, including passing professional examinations. We investigate the performance of ChatGPT on questions equivalent to those in the UK radiology fellowship examinations. METHODS ChatGPT was asked to answer questions from question banks resembling the Fellowship of the Royal College of Radiologists (FRCR) examination. The entire physics part 1 question bank (203 five-part true/false questions) was answered by the GPT-4 model and the answers recorded. 240 single best answer questions (SBAs), representing the true length of the FRCR 2A examination, were answered by both the GPT-3.5 and GPT-4 models. RESULTS ChatGPT 4 answered 74.8% of part 1 true/false statements correctly. The spring 2023 passing mark of the part 1 examination was 75.5%, and ChatGPT thus narrowly failed. In the 2A examination, ChatGPT 3.5 answered 50.8% of SBAs correctly, while GPT-4 answered 74.2% correctly. The winter 2022 2A pass mark was 63.3%, and GPT-4 thus clearly passed. CONCLUSION AI models such as ChatGPT are able to answer the majority of questions in an FRCR-style examination. It is reasonable to assume that further developments will make AI more likely to succeed in comprehending and solving questions related to medicine, and specifically clinical radiology. ADVANCES IN KNOWLEDGE Our findings outline the unprecedented capabilities of AI, adding to the currently small body of literature on the subject, which in turn can play a role in medical training, evaluation and practice. This can undoubtedly have implications for radiology.
Affiliation(s)
- Sisith Ariyaratne
- Department of Musculoskeletal Radiology, The Royal Orthopaedic Hospital NHS Foundation Trust, Northfield, Birmingham, UK (S.A., N.J., A.M.D., R.B.).
- Nathan Jenko
- Department of Musculoskeletal Radiology, The Royal Orthopaedic Hospital NHS Foundation Trust, Northfield, Birmingham, UK (S.A., N.J., A.M.D., R.B.)
- A Mark Davies
- Department of Musculoskeletal Radiology, The Royal Orthopaedic Hospital NHS Foundation Trust, Northfield, Birmingham, UK (S.A., N.J., A.M.D., R.B.)
- Karthikeyan P Iyengar
- Department of Trauma & Orthopaedics, Southport & Ormskirk Hospitals, Mersey and West Lancashire Teaching NHS Trust, Southport, UK (K.P.I.)
- Rajesh Botchu
- Department of Musculoskeletal Radiology, The Royal Orthopaedic Hospital NHS Foundation Trust, Northfield, Birmingham, UK (S.A., N.J., A.M.D., R.B.)
15
Dağci M, Çam F, Dost A. Reliability and Quality of the Nursing Care Planning Texts Generated by ChatGPT. Nurse Educ 2024; 49:E109-E114. [PMID: 37994523 DOI: 10.1097/nne.0000000000001566]
Abstract
BACKGROUND Research on ChatGPT-generated nursing care planning texts is critical for enhancing nursing education through innovative and accessible learning methods and for improving reliability and quality. PURPOSE The aim of the study was to examine the quality, authenticity, and reliability of nursing care planning texts produced using ChatGPT. METHODS The study sample comprised 40 texts generated by ChatGPT for selected nursing diagnoses included in NANDA 2021-2023. The texts were evaluated using a descriptive criteria form and the DISCERN tool for evaluating health information. RESULTS The DISCERN total average score of the texts was 45.93 ± 4.72. All texts had a moderate level of reliability, and 97.5% of them had a moderate information-quality subscale score. A statistically significant relationship was found between the number of accessible references and both the reliability (r = 0.408) and quality subscale scores (r = 0.379) of the texts (P < .05). CONCLUSION ChatGPT-generated texts exhibited moderate reliability, moderate quality of nursing care information, and moderate overall quality, despite low similarity rates.
Affiliation(s)
- Mahmut Dağci
- Department of Nursing, Bezmialem Vakif University, Faculty of Health Sciences, Istanbul, Turkey
16
Carobene A, Padoan A, Cabitza F, Banfi G, Plebani M. Rising adoption of artificial intelligence in scientific publishing: evaluating the role, risks, and ethical implications in paper drafting and review process. Clin Chem Lab Med 2024; 62:835-843. [PMID: 38019961 DOI: 10.1515/cclm-2023-1136]
Abstract
BACKGROUND In the rapidly evolving landscape of artificial intelligence (AI), scientific publishing is experiencing significant transformations. AI tools, while offering unparalleled efficiencies in paper drafting and peer review, also introduce notable ethical concerns. CONTENT This study delineates AI's dual role in scientific publishing: as a co-creator in the writing and review of scientific papers and as an ethical challenge. We first explore the potential of AI as an enhancer of efficiency, efficacy, and quality in creating scientific papers. A critical assessment follows, evaluating the risks vs. rewards for researchers, especially those early in their careers, emphasizing the need to maintain a balance between AI's capabilities and fostering independent reasoning and creativity. Subsequently, we delve into the ethical dilemmas of AI's involvement, particularly concerning originality, plagiarism, and preserving the genuine essence of scientific discourse. The evolving dynamics further highlight an overlooked aspect: the inadequate recognition of human reviewers in the academic community. With the increasing volume of scientific literature, tangible metrics and incentives for reviewers are proposed as essential to ensure a balanced academic environment. SUMMARY AI's incorporation in scientific publishing is promising yet comes with significant ethical and operational challenges. The role of human reviewers is accentuated, ensuring authenticity in an AI-influenced environment. OUTLOOK As the scientific community treads the path of AI integration, a balanced symbiosis between AI's efficiency and human discernment is pivotal. Emphasizing human expertise, while exploiting artificial intelligence responsibly, will determine the trajectory of an ethically sound and efficient AI-augmented future in scientific publishing.
Affiliation(s)
- Anna Carobene
- Laboratory Medicine, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Andrea Padoan
- Department of Medicine-DIMED, University of Padova, Padova, Italy
- Laboratory Medicine Unit, University Hospital of Padova, Padova, Italy
- Federico Cabitza
- DISCo, Università Degli Studi di Milano-Bicocca, Milan, Italy
- IRCCS Ospedale Galeazzi - Sant'Ambrogio, Milan, Italy
- Giuseppe Banfi
- IRCCS Ospedale Galeazzi - Sant'Ambrogio, Milan, Italy
- University Vita-Salute San Raffaele, Milan, Italy
- Mario Plebani
- Laboratory Medicine Unit, University Hospital of Padova, Padova, Italy
- University of Padova, Padova, Italy
17
Graf EM, McKinney JA, Dye AB, Lin L, Sanchez-Ramos L. Exploring the Limits of Artificial Intelligence for Referencing Scientific Articles. Am J Perinatol 2024. [PMID: 38653452 DOI: 10.1055/s-0044-1786033]
Abstract
OBJECTIVE To evaluate the reliability of three artificial intelligence (AI) chatbots (ChatGPT, Google Bard, and Chatsonic) in generating accurate references from existing obstetric literature. STUDY DESIGN Between mid-March and late April 2023, ChatGPT, Google Bard, and Chatsonic were prompted to provide references for specific obstetrical randomized controlled trials (RCTs) published in 2020. RCTs were considered for inclusion if they were mentioned in a previous article that primarily evaluated RCTs published by the top medical and obstetrics and gynecology journals with the highest impact factors in 2020 as well as RCTs published in a new journal focused on publishing obstetric RCTs. The selection of the three AI models was based on their popularity, performance in natural language processing, and public availability. Data collection involved prompting the AI chatbots to provide references according to a standardized protocol. The primary evaluation metric was the accuracy of each AI model in correctly citing references, including authors, publication title, journal name, and digital object identifier (DOI). Statistical analysis was performed using a permutation test to compare the performance of the AI models. RESULTS Among the 44 RCTs analyzed, Google Bard demonstrated the highest accuracy, correctly citing 13.6% of the requested RCTs, whereas ChatGPT and Chatsonic exhibited lower accuracy rates of 2.4% and 0%, respectively. Google Bard often substantially outperformed Chatsonic and ChatGPT in correctly citing the studied reference components. The majority of references from all AI models studied were noted to provide DOIs for unrelated studies or DOIs that do not exist. CONCLUSION To ensure the reliability of scientific information being disseminated, authors must exercise caution when utilizing AI for scientific writing and literature search. However, despite their limitations, collaborative partnerships between AI systems and researchers have the potential to drive synergistic advancements, leading to improved patient care and outcomes. KEY POINTS · AI chatbots often cite scientific articles incorrectly. · AI chatbots can create false references. · Responsible AI use in research is vital.
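A two-sample permutation test of the kind mentioned in the study design can be sketched as follows; the 0/1 outcome vectors are back-calculated from the reported rates (Bard roughly 6/44 correct, ChatGPT roughly 1/44), so they are assumptions for illustration rather than the study data.

```python
# Hedged sketch of a permutation test on citation accuracy.
import numpy as np

rng = np.random.default_rng(0)
bard = np.array([1] * 6 + [0] * 38)     # ~13.6% of 44 RCTs cited correctly
chatgpt = np.array([1] * 1 + [0] * 43)  # ~2.4% cited correctly

observed = bard.mean() - chatgpt.mean()
pooled = np.concatenate([bard, chatgpt])
count = 0
for _ in range(10_000):
    rng.shuffle(pooled)                 # randomly reassign outcomes to the two chatbots
    if abs(pooled[:44].mean() - pooled[44:].mean()) >= abs(observed):
        count += 1
print(f"observed diff = {observed:.3f}, permutation p = {count / 10_000:.3f}")
```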
Affiliation(s)
- Emily M Graf
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Jordan A McKinney
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Alexander B Dye
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
- Lifeng Lin
- Department of Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona
- Luis Sanchez-Ramos
- Department of Obstetrics and Gynecology, University of Florida College of Medicine, Jacksonville, Florida
18
Ni Z, Peng R, Zheng X, Xie P. Embracing the future: Integrating ChatGPT into China's nursing education system. Int J Nurs Sci 2024; 11:295-299. [PMID: 38707690 PMCID: PMC11064564 DOI: 10.1016/j.ijnss.2024.03.006]
Abstract
This article delves into the role of ChatGPT within the rapidly evolving field of artificial intelligence, highlighting its significant potential in nursing education. Initially, the paper presents the notable advancements ChatGPT has achieved in facilitating interactive learning and providing real-time feedback, along with the academic community's growing interest in this technology. Subsequently, it summarizes the research outcomes of ChatGPT's applications in nursing education across various clinical disciplines and scenarios, showcasing its enormous potential for multidisciplinary education and for addressing clinical issues. Comparing the performance of several Large Language Models (LLMs) on China's National Nursing Licensure Examination, we observed that ChatGPT demonstrated a higher accuracy rate than its counterparts, providing a solid theoretical foundation for its application in Chinese nursing education and clinical settings. Educational institutions should establish a targeted and effective regulatory framework to leverage ChatGPT in localized nursing education while assuming corresponding responsibilities. Through standardized training for users and adjustments to existing educational assessment methods aimed at preventing potential misuse and abuse, the full potential of ChatGPT as an innovative auxiliary tool in China's nursing education system can be realized, in line with the developmental needs of modern teaching methodologies.
Affiliation(s)
- Zhengxin Ni
- School of Nursing, Yangzhou University, Yangzhou, China
- Rui Peng
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Xiaofei Zheng
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Ping Xie
- Department of External Cooperation, Northern Jiangsu People's Hospital, Nanjing, China
19
Shen OY, Pratap JS, Li X, Chen NC, Bhashyam AR. How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information. Clin Orthop Relat Res 2024; 482:578-588. [PMID: 38517757 PMCID: PMC10936961 DOI: 10.1097/corr.0000000000002995]
Abstract
BACKGROUND The lay public is increasingly using ChatGPT (a large language model) as a source of medical information. Traditional search engines such as Google provide several distinct responses to each search query and indicate the source for each response, but ChatGPT provides responses in paragraph form in prose without providing the sources used, which makes it difficult or impossible to ascertain whether those sources are reliable. One practical method to infer the sources used by ChatGPT is text network analysis. By understanding how ChatGPT uses source information in relation to traditional search engines, physicians and physician organizations can better counsel patients on the use of this new tool. QUESTIONS/PURPOSES (1) In terms of key content words, how similar are ChatGPT and Google Search responses for queries related to topics in orthopaedic surgery? (2) Does the source distribution (academic, governmental, commercial, or form of a scientific manuscript) differ for Google Search responses based on the topic's level of medical consensus, and how is this reflected in the text similarity between ChatGPT and Google Search responses? (3) Do these results vary between different versions of ChatGPT? METHODS We evaluated three search queries relating to orthopaedic conditions: "What is the cause of carpal tunnel syndrome?," "What is the cause of tennis elbow?," and "Platelet-rich plasma for thumb arthritis?" These were selected because of their relatively high, medium, and low consensus in the medical evidence, respectively. Each question was posed to ChatGPT version 3.5 and version 4.0 20 times for a total of 120 responses. Text network analysis using term frequency-inverse document frequency (TF-IDF) was used to compare text similarity between responses from ChatGPT and Google Search. In the field of information retrieval, TF-IDF is a weighted statistical measure of the importance of a key word to a document in a collection of documents. Higher TF-IDF scores indicate greater similarity between two sources. TF-IDF scores are most often used to compare and rank the text similarity of documents. Using this type of text network analysis, text similarity between ChatGPT and Google Search can be determined by calculating and summing the TF-IDF for all keywords in a ChatGPT response and comparing it with each Google search result to assess their text similarity to each other. In this way, text similarity can be used to infer relative content similarity. To answer our first question, we characterized the text similarity between ChatGPT and Google Search responses by finding the TF-IDF scores of the ChatGPT response and each of the 20 Google Search results for each question. Using these scores, we could compare the similarity of each ChatGPT response to the Google Search results. To provide a reference point for interpreting TF-IDF values, we generated randomized text samples with the same term distribution as the Google Search results. By comparing ChatGPT TF-IDF to the random text sample, we could assess whether TF-IDF values were statistically significant from TF-IDF values obtained by random chance, and it allowed us to test whether text similarity was an appropriate quantitative statistical measure of relative content similarity. To answer our second question, we classified the Google Search results to better understand sourcing. Google Search provides 20 or more distinct sources of information, but ChatGPT gives only a single prose paragraph in response to each query. 
So, to answer this question, we used TF-IDF to ascertain whether the ChatGPT response was principally driven by one of four source categories: academic, government, commercial, or material that took the form of a scientific manuscript but was not peer-reviewed or indexed on a government site (such as PubMed). We then compared the TF-IDF similarity between ChatGPT responses and the source category. To answer our third research question, we repeated both analyses and compared the results when using ChatGPT 3.5 versus ChatGPT 4.0. RESULTS The ChatGPT response was dominated by the top Google Search result. For example, for carpal tunnel syndrome, the top result was an academic website with a mean TF-IDF of 7.2. A similar result was observed for the other search topics. To provide a reference point for interpreting TF-IDF values, a randomly generated sample of text compared with Google Search would have a mean TF-IDF of 2.7 ± 1.9, controlling for text length and keyword distribution. The observed TF-IDF distribution was higher for ChatGPT responses than for random text samples, supporting the claim that keyword text similarity is a measure of relative content similarity. When comparing source distribution, the ChatGPT response was most similar to the most common source category from Google Search. For the subject where there was strong consensus (carpal tunnel syndrome), the ChatGPT response was most similar to high-quality academic sources rather than lower-quality commercial sources (TF-IDF 8.6 versus 2.2). For topics with low consensus, the ChatGPT response paralleled lower-quality commercial websites compared with higher-quality academic websites (TF-IDF 14.6 versus 0.2). ChatGPT 4.0 had higher text similarity to Google Search results than ChatGPT 3.5 (mean increase in TF-IDF similarity of 0.80 to 0.91; p < 0.001). The ChatGPT 4.0 response was still dominated by the top Google Search result and reflected the most common search category for all search topics. CONCLUSION ChatGPT responses are similar to individual Google Search results for queries related to orthopaedic surgery, but the distribution of source information can vary substantially based on the relative level of consensus on a topic. For example, for carpal tunnel syndrome, where there is widely accepted medical consensus, ChatGPT responses had higher similarity to academic sources and therefore used those sources more. When fewer academic or government sources are available, especially in our search related to platelet-rich plasma, ChatGPT appears to have relied more heavily on a small number of nonacademic sources. These findings persisted even as ChatGPT was updated from version 3.5 to version 4.0. CLINICAL RELEVANCE Physicians should be aware that ChatGPT and Google likely use the same sources for a specific question. The main difference is that ChatGPT can draw upon multiple sources to create one aggregate response, while Google maintains its distinctness by providing multiple results. For topics with a low consensus and therefore a low number of quality sources, there is a much higher chance that ChatGPT will use less-reliable sources, in which case physicians should take the time to educate patients on the topic or provide resources that give more reliable information. Physician organizations should make it clear when the evidence is limited so that ChatGPT can reflect the lack of quality information or evidence.
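As an illustration of the method described in this abstract, the following is a minimal sketch of a TF-IDF keyword-similarity comparison between one chatbot response and a set of search results. It assumes scikit-learn is available; the texts are hypothetical placeholders, and summing the TF-IDF weights of shared terms is only one plausible reading of the scoring described above, not the authors' actual pipeline.

```python
# Minimal sketch of a TF-IDF keyword-similarity comparison between one
# chatbot response and several search results. Summing the TF-IDF weights
# of shared terms is an assumption about the scoring; the texts below are
# hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

chatgpt_response = (
    "Carpal tunnel syndrome is caused by compression of the median nerve "
    "as it passes through the carpal tunnel at the wrist."
)
google_results = [
    "The median nerve is compressed within the carpal tunnel of the wrist, "
    "causing numbness and tingling in the hand.",
    "Our wrist brace relieves carpal tunnel pain fast - order today.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([chatgpt_response] + google_results)

chatgpt_vec = tfidf[0].toarray().ravel()
for idx, result_vec in enumerate(tfidf[1:].toarray()):
    shared = (chatgpt_vec > 0) & (result_vec > 0)   # keywords common to both texts
    score = result_vec[shared].sum()                # summed TF-IDF weight of shared terms
    print(f"Search result {idx + 1}: shared-keyword TF-IDF score = {score:.2f}")
```

Higher summed scores for a given search result indicate that the chatbot response shares more of that result's weighted vocabulary, which is the basis on which the abstract infers likely sourcing.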
Collapse
Affiliation(s)
- Oscar Y. Shen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
| | - Jayanth S. Pratap
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Harvard University, Cambridge, MA, USA
| | - Xiang Li
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Neal C. Chen
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Abhiram R. Bhashyam
- Department of Orthopaedic Surgery, Hand and Upper Extremity Service, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| |
Collapse
|
20
|
Temperley HC, O'Sullivan NJ, Mac Curtain BM, Corr A, Meaney JF, Kelly ME, Brennan I. Current applications and future potential of ChatGPT in radiology: A systematic review. J Med Imaging Radiat Oncol 2024; 68:257-264. [PMID: 38243605 DOI: 10.1111/1754-9485.13621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Accepted: 12/29/2023] [Indexed: 01/21/2024]
Abstract
This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement and collaboration, were assessed. Limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature. A total of 551 ChatGPT (version 3.0 to 4.0) assessment events were included in our analysis. Considering the generation of academic papers, ChatGPT was found to output data inaccuracies 80% of the time. When ChatGPT was asked questions regarding common interventional radiology procedures, its responses contained entirely incorrect information 45% of the time. ChatGPT was seen to better answer US board-style questions when lower-order thinking was required (P = 0.002). Improvements were seen between ChatGPT 3.5 and 4.0 in regard to imaging questions, with accuracy rates of 61% versus 85% (P = 0.009). ChatGPT was observed to have an average translational ability score of 4.27/5 on the Likert scale regarding CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While ChatGPT's promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.
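For readers unfamiliar with how accuracy comparisons of this kind are typically tested, the sketch below shows a generic two-proportion chi-square comparison of the sort that could sit behind a figure such as "61% versus 85% (P = 0.009)". The counts are hypothetical; the review does not report the underlying data or the exact test used in the primary studies.

```python
# Generic two-proportion comparison of the kind that could underlie a result
# such as "61% versus 85% (P = 0.009)". The counts below are hypothetical.
from scipy.stats import chi2_contingency

#              correct  incorrect
contingency = [[61,      39],    # ChatGPT 3.5 (hypothetical counts out of 100)
               [85,      15]]    # ChatGPT 4.0

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```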
Collapse
Affiliation(s)
- Hugo C Temperley
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Department of Surgery, St. James's Hospital, Dublin, Ireland
| | | | | | - Alison Corr
- Department of Radiology, St. James's Hospital, Dublin, Ireland
| | - James F Meaney
- Department of Radiology, St. James's Hospital, Dublin, Ireland
| | - Michael E Kelly
- Department of Surgery, St. James's Hospital, Dublin, Ireland
| | - Ian Brennan
- Department of Radiology, St. James's Hospital, Dublin, Ireland
| |
Collapse
|
21
|
Lin KC, Chen TA, Lin MH, Chen YC, Chen TJ. Integration and Assessment of ChatGPT in Medical Case Reporting: A Multifaceted Approach. Eur J Investig Health Psychol Educ 2024; 14:888-901. [PMID: 38667812 PMCID: PMC11049282 DOI: 10.3390/ejihpe14040057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2024] [Revised: 03/18/2024] [Accepted: 03/19/2024] [Indexed: 04/28/2024] Open
Abstract
ChatGPT, a large language model, has gained significance in medical writing, particularly in case reports that document the course of an illness. This article explores the integration of ChatGPT and how ChatGPT shapes the process, product, and politics of medical writing in the real world. We conducted a bibliometric analysis on case reports utilizing ChatGPT and indexed in PubMed, encompassing publication information. Furthermore, an in-depth analysis was conducted to categorize the applications and limitations of ChatGPT and the publication trend of application categories. A total of 66 case reports utilizing ChatGPT were identified, with a predominant preference for the online version and English input by the authors. The prevalent application categories were information retrieval and content generation. Notably, this trend remained consistent across different months. Within the subset of 32 articles addressing ChatGPT limitations in case report writing, concerns related to inaccuracies and a lack of clinical context were prominently emphasized. This pointed out the important role of clinical thinking and professional expertise, representing the foundational tenets of medical education, while also accentuating the distinction between physicians and generative artificial intelligence.
Collapse
Affiliation(s)
- Kuan-Chen Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Tsung-An Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
| | - Ming-Hwai Lin
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
| | - Yu-Chun Chen
- Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; (K.-C.L.); (T.-A.C.); (M.-H.L.)
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Institute of Hospital and Health Care Administration, School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Big Data Center, Taipei Veterans General Hospital, Taipei 11217, Taiwan
| | - Tzeng-Ji Chen
- Department of Family Medicine, Taipei Veterans General Hospital Hsinchu Branch, No. 81, Sec. 1, Zhongfeng Road, Zhudong Township, Hsinchu 310403, Taiwan
- Department of Post-Baccalaureate Medicine, National Chung Hsing University, No. 145, Xingda Road, South District, Taichung 402202, Taiwan
| |
Collapse
|
22
|
Yahagi M, Hiruta R, Miyauchi C, Tanaka S, Taguchi A, Yaguchi Y. Comparison of Conventional Anesthesia Nurse Education and an Artificial Intelligence Chatbot (ChatGPT) Intervention on Preoperative Anxiety: A Randomized Controlled Trial. J Perianesth Nurs 2024:S1089-9472(23)01073-0. [PMID: 38520470 DOI: 10.1016/j.jopan.2023.12.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2023] [Revised: 12/03/2023] [Accepted: 12/04/2023] [Indexed: 03/25/2024]
Abstract
PURPOSE This study aimed to evaluate the effects of an artificial intelligence (AI) chatbot (ChatGPT-3.5, OpenAI) on preoperative anxiety reduction and patient satisfaction in adult patients undergoing surgery under general anesthesia. DESIGN The study used a single-blind, randomized controlled trial design. METHODS In this study, 100 adult patients were enrolled and divided into two groups: 50 in the control group, in which patients received standard preoperative information from anesthesia nurses, and 50 in the intervention group, in which patients interacted with ChatGPT. The primary outcome, preoperative anxiety reduction, was measured using the Japanese State-Trait Anxiety Inventory (STAI) self-report questionnaire. The secondary endpoints included participant satisfaction (Q1), comprehension of the treatment process (Q2), and the perception of the AI chatbot's responses as more relevant than those of the nurses (Q3). FINDINGS Among the 85 participants who completed the study, STAI scores in the control group remained stable, whereas those in the intervention group decreased. The mixed-effects model showed significant effects of time and of the group-by-time interaction on STAI scores; however, no main effect of group was observed. The secondary endpoints revealed mixed results; some patients found the chatbot's responses more relevant, whereas others were dissatisfied or experienced difficulties. CONCLUSIONS The ChatGPT intervention reduced preoperative anxiety over time relative to the control group; however, no overall between-group difference in STAI scores was observed. The mixed secondary endpoint results highlight the need for refining chatbot algorithms and knowledge bases to improve performance and satisfaction. AI chatbots should complement, rather than replace, human health care providers. Seamless integration and effective communication among AI chatbots, patients, and health care providers are essential for optimizing patient outcomes.
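The group-by-time analysis described in the findings can be illustrated with a linear mixed-effects model of the kind available in statsmodels. This is a hedged sketch with invented data and variable names; the authors' exact model specification is not given in the abstract.

```python
# Sketch of a group-by-time linear mixed-effects model for repeated STAI
# scores, with a random intercept per participant. All data and variable
# names are invented; the authors' exact model is not specified here.
import pandas as pd
import statsmodels.formula.api as smf

rows = []
scores = {"control": [(44, 43), (46, 45), (41, 42), (47, 46)],
          "chatbot": [(45, 38), (43, 37), (48, 41), (44, 39)]}
pid = 0
for group, pairs in scores.items():
    for pre, post in pairs:
        pid += 1
        rows.append({"participant": pid, "group": group, "time": "pre", "stai": pre})
        rows.append({"participant": pid, "group": group, "time": "post", "stai": post})
df = pd.DataFrame(rows)

# Fixed effects: group, time, and their interaction; random intercept: participant.
model = smf.mixedlm("stai ~ group * time", data=df, groups=df["participant"])
print(model.fit().summary())
```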
Collapse
Affiliation(s)
- Musashi Yahagi
- Department of Anesthesiology, Hitachi General Hospital, Hitachi, Ibaraki, Japan.
| | - Rie Hiruta
- Department of Surgery, Hitachi General Hospital, Hitachi, Ibaraki, Japan
| | - Chisato Miyauchi
- Department of Surgery, Hitachi General Hospital, Hitachi, Ibaraki, Japan
| | - Shoko Tanaka
- Department of Surgery, Hitachi General Hospital, Hitachi, Ibaraki, Japan
| | - Aya Taguchi
- Department of Surgery, Hitachi General Hospital, Hitachi, Ibaraki, Japan
| | - Yuichi Yaguchi
- Department of Anesthesiology, Hitachi General Hospital, Hitachi, Ibaraki, Japan
| |
Collapse
|
23
|
Bukar UA, Sayeed MS, Razak SFA, Yogarayan S, Amodu OA. An integrative decision-making framework to guide policies on regulating ChatGPT usage. PeerJ Comput Sci 2024; 10:e1845. [PMID: 38440047 PMCID: PMC10911759 DOI: 10.7717/peerj-cs.1845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2023] [Accepted: 01/09/2024] [Indexed: 03/06/2024]
Abstract
Generative artificial intelligence has created a moment in history where human beings have begun to closely interact with artificial intelligence (AI) tools, putting policymakers in a position to restrict or legislate such tools. One particular example of such a tool is ChatGPT, which is the first and the world's most popular multipurpose generative AI tool. This study aims to put forward a policy-making framework for generative artificial intelligence based on the risk, reward, and resilience framework. A systematic search was conducted using carefully chosen keywords, excluding non-English content, conference articles, book chapters, and editorials. Published research was filtered based on its relevance to ChatGPT ethics, yielding a total of 41 articles. Key elements surrounding ChatGPT concerns and motivations were systematically deduced and classified under the risk, reward, and resilience categories to serve as ingredients for the proposed decision-making framework. The decision-making process and rules were developed as a primer to help policymakers navigate decision-making conundrums. Then, the framework was practically tailored towards some of the concerns surrounding ChatGPT in the context of higher education. In the case of the interconnection between risk and reward, the findings show that providing students with access to ChatGPT presents an opportunity for increased efficiency in tasks such as text summarization and workload reduction. However, this exposes them to risks such as plagiarism and cheating. Similarly, pursuing certain opportunities, such as accessing vast amounts of information, can lead to rewards, but it also introduces risks like misinformation and copyright issues. Likewise, focusing on specific capabilities of ChatGPT, such as developing tools to detect plagiarism and misinformation, may enhance resilience in some areas (e.g., academic integrity). However, it may also create vulnerabilities in other domains, such as the digital divide, educational equity, and job losses. Furthermore, the findings indicate second-order effects of legislation regarding ChatGPT, with both positive and negative implications. One potential effect is a decrease in rewards due to the limitations imposed by the legislation, which may hinder individuals from fully capitalizing on the opportunities provided by ChatGPT. Hence, the risk, reward, and resilience framework provides a comprehensive and flexible decision-making model that allows policymakers, and in this use case higher education institutions, to navigate the complexities and trade-offs associated with ChatGPT, with theoretical and practical implications for the future.
Collapse
Affiliation(s)
- Umar Ali Bukar
- Centre for Intelligent Cloud Computing (CICC), Faculty of Information Science & Technology, Multimedia University, Melaka, Malaysia
| | - Md Shohel Sayeed
- Centre for Intelligent Cloud Computing (CICC), Faculty of Information Science & Technology, Multimedia University, Melaka, Malaysia
| | - Siti Fatimah Abdul Razak
- Centre for Intelligent Cloud Computing (CICC), Faculty of Information Science & Technology, Multimedia University, Melaka, Malaysia
| | - Sumendra Yogarayan
- Centre for Intelligent Cloud Computing (CICC), Faculty of Information Science & Technology, Multimedia University, Melaka, Malaysia
| | - Oluwatosin Ahmed Amodu
- Information and Communication Engineering Department, Elizade University, Ilara-Mokin, Ondo State, Nigeria
| |
Collapse
|
24
|
Saad A, Jenko N, Ariyaratne S, Birch N, Iyengar KP, Davies AM, Vaishya R, Botchu R. Exploring the potential of ChatGPT in the peer review process: An observational study. Diabetes Metab Syndr 2024; 18:102946. [PMID: 38330745 DOI: 10.1016/j.dsx.2024.102946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Revised: 01/09/2024] [Accepted: 01/10/2024] [Indexed: 02/10/2024]
Abstract
BACKGROUND Peer review is the established method for evaluating the quality and validity of research manuscripts in scholarly publishing. However, scientific peer review faces challenges as the volume of submitted research has steadily increased in recent years. Time constraints and peer review quality assurance can place burdens on reviewers, potentially discouraging their participation. Some artificial intelligence (AI) tools might assist in relieving these pressures. This study explores the efficiency and effectiveness of one such AI chatbot, ChatGPT (Generative Pre-trained Transformer), in the peer review process. METHODS Twenty-one peer-reviewed research articles were anonymised to ensure unbiased evaluation. Each article was reviewed by two humans and by versions 3.5 and 4.0 of ChatGPT. The AI was instructed to provide three positive and three negative comments on the articles and to recommend whether they should be accepted or rejected. The human and AI results were compared using a 5-point Likert scale to determine the level of agreement. The correlation between ChatGPT responses and the acceptance or rejection of the papers was also examined. RESULTS Subjective review similarity between human reviewers and ChatGPT showed a mean score of 3.6/5 for ChatGPT 3.5 and 3.76/5 for ChatGPT 4.0. The correlation between human and AI review scores was statistically significant for ChatGPT 3.5, but not for ChatGPT 4.0. CONCLUSION ChatGPT can complement human scientific peer review, enhancing efficiency and promptness in the editorial process. However, a fully automated AI review process is currently not advisable, and ChatGPT's role should be regarded as highly constrained for the present and near future.
Collapse
Affiliation(s)
- Ahmed Saad
- Department of Orthopedics, Royal Orthopaedic Hospital, Birmingham, B31 2AP, UK.
| | - Nathan Jenko
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, B31 2AP, UK.
| | - Sisith Ariyaratne
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, B31 2AP, UK.
| | - Nick Birch
- East Midlands Spine, Bragborough Hall Health & Wellbeing Centre, Welton Road, Braunston, Daventry, Northants, NN117JG, UK.
| | - Karthikeyan P Iyengar
- Department of Orthopedics, Mersey and West Lancashire Teaching Hospitals NHS Trust, Southport, PR8 6PN, UK.
| | - Arthur Mark Davies
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, B31 2AP, UK.
| | - Raju Vaishya
- Department of Orthopedics, Indraprastha Apollo Hospital, Mathura Rd, New Delhi, 110076, India.
| | - Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, B31 2AP, UK.
| |
Collapse
|
25
|
Hatia A, Doldo T, Parrini S, Chisci E, Cipriani L, Montagna L, Lagana G, Guenza G, Agosta E, Vinjolli F, Hoxha M, D’Amelio C, Favaretto N, Chisci G. Accuracy and Completeness of ChatGPT-Generated Information on Interceptive Orthodontics: A Multicenter Collaborative Study. J Clin Med 2024; 13:735. [PMID: 38337430 PMCID: PMC10856539 DOI: 10.3390/jcm13030735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Revised: 01/21/2024] [Accepted: 01/25/2024] [Indexed: 02/12/2024] Open
Abstract
Background: This study aims to investigate the accuracy and completeness of ChatGPT in answering questions and solving clinical scenarios of interceptive orthodontics. Materials and Methods: Ten specialized orthodontists from ten Italian postgraduate orthodontics schools developed 21 clinical open-ended questions encompassing all of the subspecialities of interceptive orthodontics and 7 comprehensive clinical cases. Questions and scenarios were inputted into ChatGPT-4, and the resulting answers were evaluated by the researchers using predefined accuracy (range 1-6) and completeness (range 1-3) Likert scales. Results: For the open-ended questions, the overall median score was 4.9/6 for accuracy and 2.4/3 for completeness. In addition, the reviewers rated the accuracy of open-ended answers as entirely correct (score 6 on the Likert scale) in 40.5% of cases and completeness as entirely correct (score 3 on the Likert scale) in 50.5% of cases. As for the clinical cases, the overall median score was 4.9/6 for accuracy and 2.5/3 for completeness. Overall, the reviewers rated the accuracy of clinical case answers as entirely correct in 46% of cases and the completeness of clinical case answers as entirely correct in 54.3% of cases. Conclusions: The results showed a high level of accuracy and completeness in AI responses and a great ability to solve difficult clinical cases, but the answers were not 100% accurate and complete. ChatGPT is not yet sophisticated enough to replace the intellectual work of human beings.
Collapse
Affiliation(s)
- Arjeta Hatia
- Orthodontics Postgraduate School, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (T.D.); (L.C.)
| | - Tiziana Doldo
- Orthodontics Postgraduate School, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (T.D.); (L.C.)
| | - Stefano Parrini
- Oral Surgery Postgraduate School, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy;
| | - Elettra Chisci
- Orthodontics Postgraduate School, University of Ferrara, 44121 Ferrara, Italy
| | - Linda Cipriani
- Orthodontics Postgraduate School, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (T.D.); (L.C.)
| | - Livia Montagna
- Orthodontics Postgraduate School, University of Cagliari, 09121 Cagliari, Italy;
| | - Giuseppina Lagana
- Orthodontics Postgraduate School, “Sapienza” University of Rome, 00185 Rome, Italy;
| | - Guia Guenza
- Orthodontics Postgraduate School, University of Milano, 20019 Milan, Italy
| | - Edoardo Agosta
- Orthodontics Postgraduate School, University of Torino, 10024 Turin, Italy
| | - Franceska Vinjolli
- Orthodontics Postgraduate School, University of Roma Tor Vergata, 00133 Rome, Italy;
| | - Meladiona Hoxha
- Orthodontics Postgraduate School, “Cattolica” University of Rome, 00168 Rome, Italy;
| | - Claudio D’Amelio
- Orthodontics Postgraduate School, University of Chieti, 66100 Chieti, Italy;
| | - Nicolò Favaretto
- Orthodontics Postgraduate School, University of Trieste, 34100 Trieste, Italy
| | - Glauco Chisci
- Oral Surgery Postgraduate School, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy;
| |
Collapse
|
26
|
Mese I. Tracing the Footprints of AI in Radiology Literature: A Detailed Analysis of Journal Abstracts. ROFO-FORTSCHR RONTG 2024. [PMID: 38228155 DOI: 10.1055/a-2224-9230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2024]
Abstract
PURPOSE To assess and compare the probabilities of AI-generated content within scientific abstracts from selected Q1 journals in the fields of radiology, nuclear medicine, and imaging, published between May and August 2022 and May and August 2023. MATERIALS AND METHODS An extensive list of Q1 journals was acquired from Scopus in the fields of radiology, nuclear medicine, and imaging. All articles in these journals were acquired from the Medline databases, focusing on articles published between May and August in 2022 and 2023. The study analyzed abstracts, rather than full texts, owing to the AI detection tool's word-count limitations. Extracted abstracts from the two periods were categorized into two groups, and each abstract was analyzed using the AI detection tool, a system capable of distinguishing between human and AI-generated content with a validated accuracy of 97.06%. This tool assessed the probability of each abstract being AI-generated, enabling an in-depth comparison between the two groups in terms of the prevalence of AI-generated content probability. RESULTS Group 1 and Group 2 exhibited significant differences in AI-generated content probability. Group 1, consisting of 4,727 abstracts, had a median AI-generated content probability of 3.8% (IQR 1.9-9.9%) and peaked at 49.9%, with computation times contained within a range of 2 to 10 seconds (IQR 3-8 s). In contrast, Group 2, composed of 3,917 abstracts, displayed a significantly higher median AI-generated content probability of 5.7% (IQR 2.8-12.9%), surging to a maximum of 69.9%, with computation times spanning from 2 to 14 seconds (IQR 4-11 s). This comparison yielded a statistically significant difference in median AI-generated content probability between the two groups (p = 0.005). No significant correlation was observed between word count and AI probability, or between article type (primarily original articles and reviews) and AI probability, indicating that AI probability is independent of these factors. CONCLUSION The comprehensive analysis reveals significant differences in AI-generated content probabilities between 2022 and 2023, indicating a growing presence of AI-generated content. However, it also illustrates that abstract length and article type do not affect the likelihood of content being AI-generated. KEY POINTS · The study examines AI-generated content probability in scientific abstracts from Q1 journals between 2022 and 2023. · The AI detection tool indicates an increase in median AI content probability from 3.8% to 5.7%. · No correlation was found between abstract length or article type and AI probability.
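The abstract reports medians, interquartile ranges, and a p-value but does not name the statistical test. A Mann-Whitney U comparison of two skewed score distributions, as sketched below with randomly generated data, is one common approach to this kind of group comparison and is shown only as an illustration, not as the authors' method.

```python
# Comparing two skewed distributions of AI-probability scores with a
# Mann-Whitney U test. The abstract does not name the test used, and the
# data here are randomly generated rather than the study's values.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
probs_2022 = rng.lognormal(mean=1.3, sigma=0.9, size=4727)  # hypothetical % scores
probs_2023 = rng.lognormal(mean=1.7, sigma=0.9, size=3917)

stat, p = mannwhitneyu(probs_2022, probs_2023, alternative="two-sided")
print(f"median 2022 = {np.median(probs_2022):.1f}%, "
      f"median 2023 = {np.median(probs_2023):.1f}%, p = {p:.3g}")
```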
Collapse
Affiliation(s)
- Ismail Mese
- Department of Radiology, Istanbul Erenkoy Mental and Nervous Diseases Training and Research Hospital, Istanbul, Turkey
| |
Collapse
|
27
|
Warren E, Hurley ET, Park CN, Crook BS, Lorentz S, Levin JM, Anakwenze O, MacDonald PB, Klifto CS. Evaluation of information from artificial intelligence on rotator cuff repair surgery. JSES Int 2024; 8:53-57. [PMID: 38312282 PMCID: PMC10837709 DOI: 10.1016/j.jseint.2023.09.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2024] Open
Abstract
Purpose The purpose of this study was to analyze the quality and readability of information regarding rotator cuff repair surgery available from online AI software. Methods An open AI model (ChatGPT) was used to answer 24 commonly asked questions from patients on rotator cuff repair. Questions were stratified into one of three categories based on the Rothwell classification system: fact, policy, or value. The answers for each category were evaluated for reliability, quality and readability using The Journal of the American Medical Association Benchmark criteria, DISCERN score, Flesch-Kincaid Reading Ease Score and Grade Level. Results The Journal of the American Medical Association Benchmark criteria score for all three categories was 0, the lowest possible score, indicating that no reliable resources were cited. The DISCERN score was 51 for fact, 53 for policy, and 55 for value questions, all of which are considered good scores. Across question categories, the reliability portion of the DISCERN score was low, due to a lack of resources. The Flesch-Kincaid Reading Ease Score (and Flesch-Kincaid Grade Level) was 48.3 (10.3) for the fact class, 42.0 (10.9) for the policy class, and 38.4 (11.6) for the value class. Conclusion The quality of information provided by the open AI chat system was generally high across all question types but had significant shortcomings in reliability due to the absence of source material citations. The DISCERN scores of the AI-generated responses matched or exceeded previously published results of studies evaluating the quality of online information about rotator cuff repairs. The responses were written at a U.S. 10th-grade or higher reading level, which is above the AMA and NIH recommendation of a 6th-grade reading level for patient materials. The AI software commonly directed users to seek advice from orthopedic surgeons to improve their chances of a successful outcome.
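The readability metrics quoted above follow the standard Flesch formulas, which can be computed with the widely used textstat package as in the sketch below. The sample text is hypothetical, and this illustrates the metrics themselves rather than the authors' scoring workflow.

```python
# Flesch Reading Ease and Flesch-Kincaid Grade Level for a sample answer,
# computed with the textstat package. The sample text is hypothetical and
# this is not the authors' scoring workflow.
import textstat

sample_answer = (
    "Rotator cuff repair is a surgical procedure that reattaches a torn "
    "tendon to the bone of the upper arm, and recovery usually involves a "
    "structured rehabilitation programme lasting several months."
)

# Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
print("Flesch Reading Ease:", textstat.flesch_reading_ease(sample_answer))
# Grade Level = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(sample_answer))
```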
Collapse
Affiliation(s)
- Eric Warren
- Duke University School of Medicine, Duke University, Durham, NC, USA
| | - Eoghan T. Hurley
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Caroline N. Park
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Bryan S. Crook
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Samuel Lorentz
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Jay M. Levin
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Oke Anakwenze
- Department of Orthopaedic Surgery, Duke University, Durham, NC, USA
| | - Peter B. MacDonald
- Section of Orthopaedic Surgery & The Pan Am Clinic, University of Manitoba, Winnipeg, MB, Canada
| | | |
Collapse
|
28
|
Ferreira RM. New evidence-based practice: Artificial intelligence as a barrier breaker. World J Methodol 2023; 13:384-389. [PMID: 38229944 PMCID: PMC10789101 DOI: 10.5662/wjm.v13.i5.384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 10/24/2023] [Accepted: 11/08/2023] [Indexed: 12/20/2023] Open
Abstract
The concept of evidence-based practice has persisted over several years and remains a cornerstone in clinical practice, representing the gold standard for optimal patient care. However, despite widespread recognition of its significance, practical application faces various challenges and barriers, including a lack of skills in interpreting studies, limited resources, time constraints, linguistic competencies, and more. Recently, we have witnessed the emergence of a groundbreaking technological revolution known as artificial intelligence. Although artificial intelligence has become increasingly integrated into our daily lives, some reluctance persists among certain segments of the public. This article explores the potential of artificial intelligence as a solution to some of the main barriers encountered in the application of evidence-based practice. It highlights how artificial intelligence can assist in staying updated with the latest evidence, enhancing clinical decision-making, addressing patient misinformation, and mitigating time constraints in clinical practice. The integration of artificial intelligence into evidence-based practice has the potential to revolutionize healthcare, leading to more precise diagnoses, personalized treatment plans, and improved doctor-patient interactions. This proposed synergy between evidence-based practice and artificial intelligence may necessitate adjustments to its core concept, heralding a new era in healthcare.
Collapse
Affiliation(s)
- Ricardo Maia Ferreira
- Department of Sports and Exercise, Polytechnic Institute of Maia (N2i), Maia 4475-690, Porto, Portugal
- Department of Physioterapy, Polytechnic Institute of Coimbra, Coimbra Health School, Coimbra 3046-854, Coimbra, Portugal
- Department of Physioterapy, Polytechnic Institute of Castelo Branco, Dr. Lopes Dias Health School, Castelo Branco 6000-767, Castelo Branco, Portugal
- Sport Physical Activity and Health Research & Innovation Center, Polytechnic Institute of Viana do Castelo, Melgaço, 4960-320, Viana do Castelo, Portugal
| |
Collapse
|
29
|
Ariyaratne S, Iyengar KP, Nischal N, Babu NC, Botchu R. Authors' response to the Letter to the Editor: Re-evaluating the role of AI in scientific writing: a critical analysis on ChatGPT. Skeletal Radiol 2023; 52:2489. [PMID: 37462695 DOI: 10.1007/s00256-023-04405-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 10/18/2023]
Affiliation(s)
- Sisith Ariyaratne
- Department of Musculoskeletal Radiology, The Royal Orthopedic Hospital, Bristol Road South, Northfield, Birmingham, UK
| | | | - Neha Nischal
- Department of Radiology, Holy Family Hospital, New Delhi, India
| | - Naparla Chitti Babu
- Department of Radiology, Srinivas Institute of Medical Sciences & Research Centre, Mukka, Mangalore, India
| | - Rajesh Botchu
- Department of Musculoskeletal Radiology, The Royal Orthopedic Hospital, Bristol Road South, Northfield, Birmingham, UK.
| |
Collapse
|
30
|
Ariyaratne S, Iyengar KP, Nischal N, Babu NC, Botchu R. Authors' response to the Letter to the Editor. Skeletal Radiol 2023; 52:2491. [PMID: 37515642 DOI: 10.1007/s00256-023-04418-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 07/31/2023]
Affiliation(s)
- Sisith Ariyaratne
- Department of Musculoskeletal Radiology, The Royal Orthopedic Hospital, Bristol Road South, Northfield, Birmingham, UK
| | | | - Neha Nischal
- Department of Radiology, Holy Family Hospital, New Delhi, India
| | - Naparla Chitti Babu
- Department of Radiology, Srinivas Institute of Medical Sciences & Research Centre, Mukka, Mangalore, India
| | - Rajesh Botchu
- Department of Musculoskeletal Radiology, The Royal Orthopedic Hospital, Bristol Road South, Northfield, Birmingham, UK.
| |
Collapse
|
31
|
Kleebayoon A, Wiwanitkit V. ChatGPT-generated articles and human-written articles: correspondence. Skeletal Radiol 2023; 52:2493. [PMID: 37566150 DOI: 10.1007/s00256-023-04417-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Revised: 07/20/2023] [Accepted: 07/24/2023] [Indexed: 08/12/2023]
Affiliation(s)
| | - Viroj Wiwanitkit
- Dr DY Patil Vidhyapeeth, Pune, India
- Joesph Ayobabalola University, Ikeji-Arakeji, Nigeria
| |
Collapse
|
32
|
Ariyaratne S, Iyengar KP, Botchu R. Author response to: Comment on: Will collaborative publishing with ChatGPT drive academic writing in the future? Br J Surg 2023; 110:1894. [PMID: 37702575 DOI: 10.1093/bjs/znad295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2023] [Accepted: 09/05/2023] [Indexed: 09/14/2023]
Affiliation(s)
- Sisith Ariyaratne
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, UK
| | - Karthikeyan P Iyengar
- Department of Orthopaedics, Southport and Ormskirk Hospitals, Mersey and West Lancashire Teaching NHS Trust, Southport, UK
| | - Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, UK
| |
Collapse
|
33
|
Sharun K, Banu SA, Pawde AM, Kumar R, Akash S, Dhama K, Pal A. ChatGPT and artificial hallucinations in stem cell research: assessing the accuracy of generated references - a preliminary study. Ann Med Surg (Lond) 2023; 85:5275-5278. [PMID: 37811040 PMCID: PMC10553015 DOI: 10.1097/ms9.0000000000001228] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2023] [Accepted: 08/12/2023] [Indexed: 10/10/2023] Open
Abstract
Stem cell research has the transformative potential to revolutionize medicine. Language models like ChatGPT, which use artificial intelligence (AI) and natural language processing, generate human-like text that can aid researchers. However, it is vital to ensure the accuracy and reliability of AI-generated references. This study assesses Chat Generative Pre-Trained Transformer (ChatGPT)'s utility in stem cell research and evaluates the accuracy of its references. Of the 86 references analyzed, 15.12% were fabricated and 9.30% were erroneous. These errors were due to limitations such as no real-time internet access and reliance on preexisting data. Artificial hallucinations were also observed, where the text seems plausible but deviates from fact. Monitoring, diverse training, and expanding knowledge cut-off can help to reduce fabricated references and hallucinations. Researchers must verify references and consider the limitations of AI models. Further research is needed to enhance the accuracy of such language models. Despite these challenges, ChatGPT has the potential to be a valuable tool for stem cell research. It can help researchers to stay up-to-date on the latest developments in the field and to find relevant information.
Collapse
Affiliation(s)
| | | | | | | | - Shopnil Akash
- Department of Pharmacy, Faculty of Allied Health Science, Daffodil International University, Daffodil Smart City, Ashulia, Savar, Dhaka, Bangladesh
| | - Kuldeep Dhama
- Division of Pathology, ICAR-Indian Veterinary Research Institute, Izatnagar, Bareilly, Uttar Pradesh, India
| | | |
Collapse
|
34
|
Saad A, Iyengar KP, Kurisunkal V, Botchu R. Assessing ChatGPT's ability to pass the FRCS orthopaedic part A exam: A critical analysis. Surgeon 2023; 21:263-266. [PMID: 37517980 DOI: 10.1016/j.surge.2023.07.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2023] [Accepted: 07/03/2023] [Indexed: 08/01/2023]
Abstract
AI technology has made significant advancements in recent years, with the notable development of ChatGPT in November 2022. Users have observed evidence of deductive reasoning, logical thinking, and coherent thought in ChatGPT's responses. This study aimed to determine whether ChatGPT is capable of passing the orthopaedic Fellowship of the Royal College of Surgeons (FRCS Orth) Part A exam. METHODS To assess ChatGPT-4's ability to pass the FRCS Orth Part A exam, a study was conducted using 240 mock FRCS Orth Part A questions. The study evaluated the accuracy of ChatGPT's answers and the response time for each question. Descriptive statistics were employed to analyse the chatbot's performance. RESULTS The evaluation revealed that ChatGPT-4 achieved an overall score of 67.5% on Part A of the exam. However, ChatGPT-4 did not meet the overall pass mark required for the FRCS Orth Part A exam. CONCLUSION This study demonstrates that ChatGPT was unable to pass the FRCS Orthopaedic examination. Several factors contributed to this outcome, including the lack of critical or higher-order thinking abilities, limited clinical expertise, and the inability to meet the rigorous requirements of the exam.
Collapse
Affiliation(s)
- Ahmed Saad
- Department of Orthopedics, Royal Orthopedic Hospital, Birmingham, UK.
| | - Karthikeyan P Iyengar
- Department of Orthopedics, Southport and Ormskirk Hospital NHS Trust, Southport, UK.
| | - Vineet Kurisunkal
- Department of Orthopedic Oncology, Royal Orthopedic Hospital, Birmingham, UK.
| | - Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK.
| |
Collapse
|
35
|
Ariyaratne S, Iyengar KP, Botchu R. Will collaborative publishing with ChatGPT drive academic writing in the future? Br J Surg 2023; 110:1213-1214. [PMID: 37368994 DOI: 10.1093/bjs/znad198] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2023] [Revised: 04/28/2023] [Accepted: 05/26/2023] [Indexed: 06/29/2023]
Affiliation(s)
- Sisith Ariyaratne
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, UK
| | | | - Rajesh Botchu
- Department of Musculoskeletal Radiology, Royal Orthopaedic Hospital, Birmingham, UK
| |
Collapse
|
36
|
Bhattacharyya M, Miller VM, Bhattacharyya D, Miller LE. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus 2023; 15:e39238. [PMID: 37337480 PMCID: PMC10277170 DOI: 10.7759/cureus.39238] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/18/2023] [Indexed: 06/21/2023] Open
Abstract
Background The availability of large language models such as Chat Generative Pre-trained Transformer (ChatGPT, OpenAI) has enabled individuals from diverse backgrounds to access medical information. However, concerns exist about the accuracy of ChatGPT responses and the references used to generate medical content. Methods This observational study investigated the authenticity and accuracy of references in medical articles generated by ChatGPT. ChatGPT-3.5 generated 30 short medical papers, each with at least three references, based on standardized prompts encompassing various topics and therapeutic areas. Reference authenticity and accuracy were verified by searching Medline, Google Scholar, and the Directory of Open Access Journals. The authenticity and accuracy of individual ChatGPT-generated reference elements were also determined. Results Overall, 115 references were generated by ChatGPT, with a mean of 3.8±1.1 per paper. Among these references, 47% were fabricated, 46% were authentic but inaccurate, and only 7% were authentic and accurate. The likelihood of fabricated references significantly differed based on prompt variations; yet the frequency of authentic and accurate references remained low in all cases. Among the seven components evaluated for each reference, an incorrect PMID number was most common, listed in 93% of papers. Incorrect volume (64%), page numbers (64%), and year of publication (60%) were the next most frequent errors. The mean number of inaccurate components was 4.3±2.8 out of seven per reference. Conclusions The findings of this study emphasize the need for caution when seeking medical information on ChatGPT since most of the references provided were found to be fabricated or inaccurate. Individuals are advised to verify medical information from reliable sources and avoid relying solely on artificial intelligence-generated content.
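One programmatic way to perform the kind of reference verification described above is to query PubMed's public E-utilities API and compare the returned record against the citation under test. The sketch below is illustrative only: the study itself relied on searches of Medline, Google Scholar, and the Directory of Open Access Journals, and the claimed title here is hypothetical.

```python
# Checking whether a cited PMID resolves to a real PubMed record whose title
# matches the citation, via the public NCBI E-utilities esummary endpoint.
# The claimed title below is hypothetical; the study's own verification used
# manual searches of Medline, Google Scholar, and DOAJ.
import requests

def pubmed_title_for_pmid(pmid):
    """Return the article title PubMed holds for this PMID, or None if absent."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    resp = requests.get(url, params={"db": "pubmed", "id": pmid, "retmode": "json"},
                        timeout=10)
    resp.raise_for_status()
    record = resp.json().get("result", {}).get(pmid, {})
    return record.get("title")  # missing or None when the PMID does not exist

claimed_title = "A hypothetical ChatGPT-generated reference title"
actual_title = pubmed_title_for_pmid("37337480")
print("PubMed title:", actual_title)
print("Title matches claim:", bool(actual_title) and claimed_title.lower() == actual_title.lower())
```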
Collapse
|