1
Iglesias G, Talavera E, Troya J, Díaz-Álvarez A, García-Remesal M. Artificial intelligence model for tumoral clinical decision support systems. Computer Methods and Programs in Biomedicine 2024; 253:108228. [PMID: 38810378] [DOI: 10.1016/j.cmpb.2024.108228]
Abstract
BACKGROUND AND OBJECTIVE Comparative diagnosis in brain tumor evaluation makes it possible to use a medical center's available information to compare similar cases when a new patient is evaluated. By leveraging Artificial Intelligence models, the proposed system is able to retrieve the most similar brain tumor cases for a given query. The primary objective is to enhance the diagnostic process by generating more accurate representations of medical images, with a particular focus on patient-specific normal features and pathologies. A key distinction from previous models lies in its ability to produce enriched image descriptors solely from binary information, eliminating the need for costly and difficult-to-obtain tumor segmentations. METHODS The proposed model uses Artificial Intelligence to detect patient features and recommend the most similar cases from a database. The system not only suggests similar cases but also balances the representation of healthy and abnormal features in its design, which both aids clinicians in their decision-making and encourages generalization: future research could apply the system to other areas of medical diagnosis with almost no changes. RESULTS We conducted a comparative analysis of our approach against similar studies. The proposed architecture obtains a Dice coefficient of 0.474 in both tumoral and healthy regions of the patients, outperforming previous literature. Our proposed model excels at extracting and combining anatomical and pathological features from brain magnetic resonance (MR) images, achieving state-of-the-art results while relying on less expensive label information, which substantially reduces the overall cost of the training process. Our findings highlight significant potential for improving the efficiency and accuracy of comparative diagnostics and the treatment of tumoral pathologies. CONCLUSIONS This paper provides substantial grounds for further exploration of the broader applicability and optimization of the proposed architecture to enhance clinical decision-making. The novel approach presented in this work marks a significant advancement in Artificial Intelligence-assisted image retrieval for medical diagnosis and promises to reduce costs and improve the quality of patient care by using Artificial Intelligence as a support tool rather than a black-box system.
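Editor's note: the Dice coefficient reported in the results is a standard overlap measure between a predicted region and a reference region. A minimal sketch of the usual computation on binary masks (toy arrays, not the authors' code or data):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    total = pred.sum() + ref.sum()
    if total == 0:  # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / total

# Toy 4x4 masks that half-overlap: Dice = 2*4 / (8+8) = 0.5
pred = np.array([[1, 1, 0, 0]] * 4)
ref = np.array([[0, 1, 1, 0]] * 4)
print(dice_coefficient(pred, ref))  # 0.5
```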
Affiliation(s)
- Guillermo Iglesias
- Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain.
- Edgar Talavera
- Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain.
- Jesús Troya
- Infanta Leonor University Hospital, Madrid, Spain
- Alberto Díaz-Álvarez
- Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain.
- Miguel García-Remesal
- Biomedical Informatics Group, Departamento de Inteligencia Artificial, Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Spain.
2
Hassanipour S, Nayak S, Bozorgi A, Keivanlou MH, Dave T, Alotaibi A, Joukar F, Mellatdoust P, Bakhshi A, Kuriyakose D, Polisetty LD, Chimpiri M, Amini-Salehi E. The Ability of ChatGPT in Paraphrasing Texts and Reducing Plagiarism: A Descriptive Analysis. JMIR Medical Education 2024; 10:e53308. [PMID: 38989841] [DOI: 10.2196/53308]
Abstract
Background The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out. Objective This study investigates the level of plagiarism in text paraphrased by this chatbot. Methods Three texts of varying lengths were presented to ChatGPT, which was instructed to paraphrase them using five different prompts. In the second stage of the study, the texts were divided into separate paragraphs, and ChatGPT was asked to paraphrase each paragraph individually. In the third stage, ChatGPT was asked to paraphrase the texts it had previously generated. Results The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference -0.51, 95% CI -0.54 to -0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference -0.06, 95% CI -0.08 to -0.03; P<.001). The number of paragraphs in the texts showed a noteworthy association with the percentage of plagiarism, with single-paragraph texts exhibiting the lowest plagiarism rate (P<.001). Conclusions Although ChatGPT notably reduces plagiarism within texts, the remaining levels of plagiarism are still relatively high, so researchers should exercise caution when incorporating this chatbot into their work.
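Editor's note: the mean differences with 95% CIs above are standard paired comparisons of plagiarism rates before and after paraphrasing. A minimal sketch of such a paired analysis (the rates below are hypothetical, not the study's data):

```python
import numpy as np
from scipy import stats

# Hypothetical paired plagiarism rates before and after ChatGPT paraphrasing.
before = np.array([0.92, 0.88, 0.95, 0.90, 0.85, 0.97])
after = np.array([0.42, 0.39, 0.50, 0.44, 0.38, 0.45])

diff = after - before
mean_diff = diff.mean()
# 95% CI for the mean paired difference from the t distribution.
ci_low, ci_high = stats.t.interval(0.95, df=diff.size - 1,
                                   loc=mean_diff, scale=stats.sem(diff))
t_stat, p_value = stats.ttest_rel(after, before)
print(f"mean difference {mean_diff:.2f}, "
      f"95% CI {ci_low:.2f} to {ci_high:.2f}, P={p_value:.2g}")
```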
Affiliation(s)
- Soheil Hassanipour
- Gastrointestinal and Liver Diseases Research Center, Guilan University of Medical Sciences, Rasht, Iran
- Sandeep Nayak
- Department of Internal Medicine, Yale New Haven Health Bridgeport Hospital, Bridgeport, CT, United States
- Ali Bozorgi
- Tehran Heart Center, Tehran University of Medical Sciences, Tehran, Iran
- Tirth Dave
- Department of Internal Medicine, Bukovinian State Medical University, Chernivtsi, Ukraine
- Farahnaz Joukar
- Gastrointestinal and Liver Diseases Research Center, Guilan University of Medical Sciences, Rasht, Iran
- Parinaz Mellatdoust
- Dipartimento di Elettronica Informazione Bioingegneria, Politecnico di Milano, Milan, Italy
- Arash Bakhshi
- Gastrointestinal and Liver Diseases Research Center, Guilan University of Medical Sciences, Rasht, Iran
- Dona Kuriyakose
- Department of Internal Medicine, St. Joseph's Mission Hospital, Anchal, Kollam District, Kerala, India
- Lakshmi D Polisetty
- Department of Internal Medicine, Yale New Haven Health Bridgeport Hospital, Bridgeport, CT, United States
- Ehsan Amini-Salehi
- Gastrointestinal and Liver Diseases Research Center, Guilan University of Medical Sciences, Rasht, Iran
3
Crouzet A, Lopez N, Riss Yaw B, Lepelletier Y, Demange L. The Millennia-Long Development of Drugs Associated with the 80-Year-Old Artificial Intelligence Story: The Therapeutic Big Bang? Molecules 2024; 29:2716. [PMID: 38930784] [PMCID: PMC11206022] [DOI: 10.3390/molecules29122716]
Abstract
The journey of drug discovery (DD) has evolved from ancient practices to modern technology-driven approaches, with Artificial Intelligence (AI) emerging as a pivotal force in streamlining and accelerating the process. Despite the vital importance of DD, it faces challenges such as high costs and lengthy timelines. This review examines the historical progression and current market of DD alongside the development and integration of AI technologies. We analyse the challenges encountered in applying AI to DD, focusing on drug design and protein-protein interactions. The discussion is enriched by models illustrating how AI can be applied in DD. Three case studies are highlighted to demonstrate the successful application of AI in DD, including the discovery of a novel class of antibiotics and a small-molecule inhibitor that has progressed to phase II clinical trials. These cases underscore the potential of AI to identify new drug candidates and optimise the development process. The convergence of DD and AI embodies a transformative shift in the field, offering a path to overcome traditional obstacles. By leveraging AI, the future of DD promises enhanced efficiency and novel breakthroughs, heralding a new era of medical innovation, even though there is still a long way to go.
Affiliation(s)
- Aurore Crouzet
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- Nicolas Lopez
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- ENOES, 62 Rue de Miromesnil, 75008 Paris, France
- Unité Mixte de Recherche «Institut de Physique Théorique (IPhT)» CEA-CNRS, UMR 3681, Bat 774, Route de l’Orme des Merisiers, 91191 St Aubin-Gif-sur-Yvette, France
- Benjamin Riss Yaw
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
- Yves Lepelletier
- W-MedPhys, 128 Rue la Boétie, 75008 Paris, France
- Université Paris Cité, Imagine Institute, 24 Boulevard Montparnasse, 75015 Paris, France
- INSERM UMR 1163, Laboratory of Cellular and Molecular Basis of Normal Hematopoiesis and Hematological Disorders: Therapeutical Implications, 24 Boulevard Montparnasse, 75015 Paris, France
- Luc Demange
- UMR 8038 CNRS CiTCoM, Team PNAS, Faculté de Pharmacie, Université Paris Cité, 4 Avenue de l’Observatoire, 75006 Paris, France
4
Perrot O, Schirmann A, Vidart A, Guillot-Tantay C, Izard V, Lebret T, Boillot B, Mesnard B, Lebacle C, Madec FX. Chatbots vs andrologists: Testing 25 clinical cases. The French Journal of Urology 2024; 34:102636. [PMID: 38599321] [DOI: 10.1016/j.fjurol.2024.102636]
Abstract
OBJECTIVE AI-derived language models are booming, and their place in medicine is undefined. The aim of our study is to compare responses to andrology clinical cases between chatbots and andrologists, to assess the reliability of these technologies. MATERIALS AND METHODS We analyzed the responses of 32 experts, 18 residents, and three chatbots (ChatGPT v3.5, v4, and Bard) to 25 andrology clinical cases. Responses were assessed on a Likert scale ranging from 0 to 2 for each question (0 = false or no response; 1 = partially correct response; 2 = correct response), based on the latest national or, in their absence, international recommendations. We compared the averages obtained across all cases by the different groups. RESULTS Experts obtained a higher mean score (m=11.0/12.4, σ=1.4) than ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.6475), ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.0062), and Bard (m=7.2/12.4, σ=3.3, p<0.0001). Residents obtained a mean score (m=9.4/12.4, σ=1.7) higher than Bard (m=7.2/12.4, σ=3.3, p=0.0053) but lower than ChatGPT v3.5 (m=9.5/12.4, σ=2.1, p=0.8393), ChatGPT v4 (m=10.7/12.4, σ=2.2, p=0.0183), and experts (m=11.0/12.4, σ=1.4, p=0.0009). ChatGPT v4 (m=10.7, σ=2.2) performed better than ChatGPT v3.5 (m=9.5, σ=2.1, p=0.0476) and Bard (m=7.2, σ=3.3, p<0.0001). CONCLUSION The use of chatbots in medicine could be relevant. More studies are needed before they can be integrated into clinical practice. LEVEL OF EVIDENCE: 4
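Editor's note: the abstract compares group mean scores but does not name the statistical test; a Welch t test is one common choice for such comparisons. A sketch under that assumption, with made-up scores:

```python
import numpy as np
from scipy import stats

# Hypothetical per-case scores (each case scored 0-2 per question and summed);
# illustrative values only, not the study's data.
experts = np.array([11.2, 10.8, 12.0, 9.6, 11.5, 10.9, 11.8])
chatgpt_v4 = np.array([10.1, 9.4, 11.9, 8.8, 11.0, 9.9, 11.2])

# Welch's t test compares group means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(experts, chatgpt_v4, equal_var=False)
print(f"experts {experts.mean():.1f} vs ChatGPT v4 {chatgpt_v4.mean():.1f}, "
      f"P={p_value:.4f}")
```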
Affiliation(s)
- Cedric Lebacle
- Urology Department, Kremlin-Bicêtre Hospital, Le Kremlin-Bicêtre, France
5
Griot M, Hemptinne C, Vanderdonckt J, Yuksel D. Impact of high-quality, mixed-domain data on the performance of medical language models. J Am Med Inform Assoc 2024:ocae120. [PMID: 38781312] [DOI: 10.1093/jamia/ocae120]
Abstract
OBJECTIVE To optimize the training strategy of large language models for medical applications, focusing on creating clinically relevant systems that efficiently integrate into healthcare settings, while ensuring high standards of accuracy and reliability. MATERIALS AND METHODS We curated a comprehensive collection of high-quality, domain-specific data and used it to train several models, each with different subsets of this data. These models were rigorously evaluated against standard medical benchmarks, such as the USMLE, to measure their performance. Furthermore, for a thorough effectiveness assessment, they were compared with other state-of-the-art medical models of comparable size. RESULTS The models trained with a mix of high-quality, domain-specific, and general data showed superior performance over those trained on larger, less clinically relevant datasets (P < .001). Our 7-billion-parameter model Med5 scores 60.5% on MedQA, outperforming the previous best of 49.3% from comparable models, and becomes the first of its size to achieve a passing score on the USMLE. Additionally, this model retained its proficiency in general domain tasks, comparable to state-of-the-art general domain models of similar size. DISCUSSION Our findings underscore the importance of integrating high-quality, domain-specific data in training large language models for medical purposes. The balanced approach between specialized and general data significantly enhances the model's clinical relevance and performance. CONCLUSION This study sets a new standard in medical language models, proving that a strategically trained, smaller model can outperform larger ones in clinical relevance and general proficiency, highlighting the importance of data quality and expert curation in generative artificial intelligence for healthcare applications.
Affiliation(s)
- Maxime Griot
- Institute of NeuroScience, Université catholique de Louvain, Brussels, 1200, Belgium
- Louvain Research Institute in Management and Organizations, Université catholique de Louvain, Louvain-la-Neuve, 1348, Belgium
- Coralie Hemptinne
- Ophthalmology, Cliniques Universitaires Saint-Luc, Brussels, 1200, Belgium
- Jean Vanderdonckt
- Louvain Research Institute in Management and Organizations, Université catholique de Louvain, Louvain-la-Neuve, 1348, Belgium
- Demet Yuksel
- Institute of NeuroScience, Université catholique de Louvain, Brussels, 1200, Belgium
- Medical Information Department, Cliniques Universitaires Saint-Luc, Brussels, 1200, Belgium
6
Tripathi S, Sukumaran R, Cook TS. Efficient healthcare with large language models: optimizing clinical workflow and enhancing patient care. J Am Med Inform Assoc 2024; 31:1436-1440. [PMID: 38273739] [PMCID: PMC11105142] [DOI: 10.1093/jamia/ocad258]
Abstract
PURPOSE This article explores the potential of large language models (LLMs) to automate administrative tasks in healthcare, alleviating the burden on clinicians caused by electronic medical records. POTENTIAL LLMs offer opportunities in clinical documentation, prior authorization, patient education, and access to care. They can personalize patient scheduling, improve documentation accuracy, streamline insurance prior authorization, increase patient engagement, and address barriers to healthcare access. CAUTION However, integrating LLMs requires careful attention to security and privacy concerns, protecting patient data, and complying with regulations like the Health Insurance Portability and Accountability Act (HIPAA). It is crucial to acknowledge that LLMs should supplement, not replace, the human connection and care provided by healthcare professionals. CONCLUSION By prudently utilizing LLMs alongside human expertise, healthcare organizations can improve patient care and outcomes. Implementation should be approached with caution and consideration to ensure the safe and effective use of LLMs in the clinical setting.
Affiliation(s)
- Satvik Tripathi
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Rithvik Sukumaran
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Tessa S Cook
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
7
Kaneda Y, Tayuinosho A, Tomoyose R, Takita M, Hamaki T, Tanimoto T, Ozaki A. Evaluating ChatGPT's effectiveness and tendencies in Japanese internal medicine. J Eval Clin Pract 2024. [PMID: 38764369] [DOI: 10.1111/jep.14011]
Abstract
INTRODUCTION ChatGPT, a large-scale language model, is a notable example of AI's potential in health care. However, its effectiveness in clinical settings, especially when compared with human physicians, is not fully understood. This study evaluates ChatGPT's capabilities and limitations in answering questions for Japanese internal medicine specialists, aiming to clarify its accuracy and its tendencies in both correct and incorrect responses. METHODS We evaluated ChatGPT's answers to four sets of self-training questions for internal medicine specialists in Japan from 2020 to 2023. We ran three trials for each set to assess overall accuracy and performance on nonimage questions. We then categorized the questions into two groups: those ChatGPT consistently answered correctly (Confirmed Correct Answer, CCA) and those it consistently answered incorrectly (Confirmed Incorrect Answer, CIA). For these groups, we calculated average accuracy rates and 95% confidence intervals based on the actual performance of internal medicine physicians on each question and tested the statistical significance of the difference between the two groups. The same process was applied to the subset of nonimage CCA and CIA questions. RESULTS ChatGPT's overall accuracy rate was 59.05%, increasing to 65.76% for nonimage questions. For 24.87% of the questions, answers varied between correct and incorrect across the three trials. Despite surpassing the passing threshold for nonimage questions, ChatGPT's accuracy was lower than that of human specialists. There was a significant difference in accuracy between the CCA and CIA groups, with ChatGPT mirroring human physicians' patterns in responding to different question types. CONCLUSION This study underscores ChatGPT's potential utility and limitations in internal medicine. While effective in some aspects, its dependence on question type and context suggests that it should supplement, not replace, professional medical judgment. Further research is needed to integrate Artificial Intelligence tools like ChatGPT more effectively into specialized medical practices.
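Editor's note: the CCA/CIA grouping described in the methods reduces to checking answer consistency across the three trials. A minimal sketch (hypothetical per-question results, not the study's data):

```python
# Hypothetical correctness of ChatGPT's answer to each question over three trials.
trials = {
    "Q1": [True, True, True],     # always correct
    "Q2": [False, False, False],  # always incorrect
    "Q3": [True, False, True],    # inconsistent across trials
}

cca = [q for q, r in trials.items() if all(r)]      # Confirmed Correct Answer
cia = [q for q, r in trials.items() if not any(r)]  # Confirmed Incorrect Answer
mixed = [q for q, r in trials.items() if any(r) and not all(r)]
print(cca, cia, mixed)  # ['Q1'] ['Q2'] ['Q3']
```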
Affiliation(s)
- Yudai Kaneda
- School of Medicine, Hokkaido University, Hokkaido, Japan
- Rika Tomoyose
- School of Medicine, Hokkaido University, Hokkaido, Japan
- Morihito Takita
- Department of Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Tachikawa, Tachikawa, Japan
- Tamae Hamaki
- Department of Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Shinjuku, Tokyo, Japan
- Tetsuya Tanimoto
- Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic, Kawasaki, Kanagawa, Japan
- Akihiko Ozaki
- Department of Breast Surgery, Jyoban Hospital of Tokiwa Foundation, Iwaki, Fukushima, Japan
8
Fournier A, Fallet C, Sadeghipour F, Perrottet N. Assessing the applicability and appropriateness of ChatGPT in answering clinical pharmacy questions. Annales Pharmaceutiques Françaises 2024; 82:507-513. [PMID: 37992892] [DOI: 10.1016/j.pharma.2023.11.001]
Abstract
OBJECTIVES Clinical pharmacists rely on different scientific references to ensure appropriate, safe, and cost-effective drug use. Tools based on artificial intelligence (AI) such as ChatGPT (Generative Pre-trained Transformer) could offer valuable support. The objective of this study was to assess ChatGPT's capacity to correctly respond to clinical pharmacy questions asked by healthcare professionals in our university hospital. MATERIAL AND METHODS ChatGPT's capacity to respond correctly to the last 100 consecutive questions recorded in our clinical pharmacy database was assessed. Questions were copied from our FileMaker Pro database and pasted into the online platform of the March 14 version of ChatGPT. The generated answers were then copied verbatim into an Excel file. Two blinded clinical pharmacists reviewed all the questions and the answers given by the software; in case of disagreement, a third blinded pharmacist adjudicated. RESULTS Questions about documentation (n=36) and mode of drug administration (n=30) predominated. Among the 69 applicable questions, the rate of correct answers varied from 30% to 57.1% depending on question type, with an overall rate of 44.9%. Of the inappropriate answers (n=38), 20 were incorrect, 18 gave no answer, and 8 were incomplete, with 8 answers falling into two categories. In no case did ChatGPT provide a better answer than the pharmacists. CONCLUSIONS ChatGPT demonstrated mixed performance in answering clinical pharmacy questions. It should not replace human expertise, as a high rate of inappropriate answers was observed. Future studies should focus on optimizing ChatGPT for specific clinical pharmacy questions and explore the potential benefits and limitations of integrating this technology into clinical practice.
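Editor's note: the dual-review workflow in the methods (two blinded raters, a third deciding disagreements) can be expressed as a simple rule. A sketch with hypothetical ratings:

```python
def adjudicate(rater_a: str, rater_b: str, rater_c: str) -> str:
    """Consensus rating: the third blinded rater decides disagreements."""
    return rater_a if rater_a == rater_b else rater_c

# Hypothetical ratings of one ChatGPT answer by the three pharmacists.
print(adjudicate("correct", "correct", "incomplete"))     # correct
print(adjudicate("correct", "incomplete", "incomplete"))  # incomplete
```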
Affiliation(s)
- A Fournier
- Service of Pharmacy, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland
- C Fallet
- Service of Pharmacy, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland
- F Sadeghipour
- Service of Pharmacy, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland; School of Pharmaceutical Sciences, University of Geneva, University of Lausanne, Geneva, Switzerland; Center for Research and Innovation in Clinical Pharmaceutical Sciences, Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
- N Perrottet
- Service of Pharmacy, Centre Hospitalier Universitaire Vaudois (CHUV), Lausanne, Switzerland; School of Pharmaceutical Sciences, University of Geneva, University of Lausanne, Geneva, Switzerland.
9
Shrestha N, Shen Z, Zaidat B, Duey AH, Tang JE, Ahmed W, Hoang T, Restrepo Mejia M, Rajjoub R, Markowitz JS, Kim JS, Cho SK. Performance of ChatGPT on NASS Clinical Guidelines for the Diagnosis and Treatment of Low Back Pain: A Comparison Study. Spine (Phila Pa 1976) 2024; 49:640-651. [PMID: 38213186] [DOI: 10.1097/brs.0000000000004915]
Abstract
STUDY DESIGN Comparative analysis. OBJECTIVE To evaluate the ability of Chat Generative Pre-trained Transformer (ChatGPT) to predict appropriate clinical recommendations based on the most recent clinical guidelines for the diagnosis and treatment of low back pain. BACKGROUND Low back pain is a very common and often debilitating condition that affects many people globally. ChatGPT is an artificial intelligence model that may be able to generate recommendations for low back pain. MATERIALS AND METHODS Using the North American Spine Society Evidence-Based Clinical Guidelines as the gold standard, 82 clinical questions relating to low back pain were entered into ChatGPT (GPT-3.5) independently. For each question, we recorded ChatGPT's answer, then used a point-answer system (the point being the guideline recommendation and the answer being ChatGPT's response) and asked ChatGPT whether the point was mentioned in the answer to assess accuracy. This accuracy assessment was repeated for each question by guideline category with one change: ChatGPT was first prompted to answer as an experienced orthopedic surgeon. A two-sample proportion z test was used to assess differences between the pre-prompt and post-prompt scenarios with alpha=0.05. RESULTS ChatGPT's response was accurate 65% of the time (72% post-prompt, P=0.41) for guidelines with clinical recommendations, 46% (58% post-prompt, P=0.11) for guidelines with insufficient or conflicting data, and 49% (16% post-prompt, P=0.003) for guidelines with no adequate study to address the clinical question. For guidelines with insufficient or conflicting data, 44% (25% post-prompt, P=0.01) of ChatGPT responses wrongly suggested that sufficient evidence existed. CONCLUSION ChatGPT was able to produce sufficient clinical guideline recommendations for low back pain, with overall improvement when initially prompted. However, it tended to wrongly suggest evidence and often failed to mention, especially post-prompt, when there is not enough evidence to give an accurate recommendation.
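Editor's note: the two-sample proportion z test named in the methods can be reproduced with standard tooling; statsmodels' proportions_ztest is one implementation. A sketch with hypothetical pre- and post-prompt counts (not the study's raw data):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts of accurate responses before and after the
# "experienced orthopedic surgeon" prompt.
correct = [53, 59]  # accurate responses pre- and post-prompt
asked = [82, 82]    # questions asked in each scenario

z_stat, p_value = proportions_ztest(correct, asked)
print(f"z={z_stat:.2f}, P={p_value:.3f}")  # compare P against alpha=0.05
```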
Affiliation(s)
- Nancy Shrestha
- Chicago Medical School at Rosalind Franklin University, North Chicago, IL
- Bashar Zaidat
- Icahn School of Medicine at Mount Sinai, New York, NY
- Akiro H Duey
- Icahn School of Medicine at Mount Sinai, New York, NY
- Justin E Tang
- Icahn School of Medicine at Mount Sinai, New York, NY
- Wasil Ahmed
- Icahn School of Medicine at Mount Sinai, New York, NY
- Timothy Hoang
- Icahn School of Medicine at Mount Sinai, New York, NY
- Rami Rajjoub
- Icahn School of Medicine at Mount Sinai, New York, NY
- Jun S Kim
- Department of Orthopedics, Mount Sinai Health System, New York, NY
- Samuel K Cho
- Department of Orthopedics, Mount Sinai Health System, New York, NY
10
Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye (Lond) 2024; 38:1252-1261. [PMID: 38172581] [PMCID: PMC11076576] [DOI: 10.1038/s41433-023-02915-z]
Abstract
ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and generating differential diagnosis lists. There are current limitations to this technology, including the propensity of LLMs to "hallucinate," or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges of incorporating LLMs into research without allowing "AI plagiarism" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been released in the past few years. We discuss recent literature evaluating the role of these language models in medicine, with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are emerging rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.
Affiliation(s)
- Nikita Kedia
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Joshua Ong
- Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Jay Chhablani
- Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
11
Wang C, Ong J, Wang C, Ong H, Cheng R, Ong D. Potential for GPT Technology to Optimize Future Clinical Decision-Making Using Retrieval-Augmented Generation. Ann Biomed Eng 2024; 52:1115-1118. [PMID: 37530906] [DOI: 10.1007/s10439-023-03327-6]
Abstract
Advancements in artificial intelligence (AI) provide many helpful tools for healthcare, one of which is AI chatbots that use natural language processing to create humanlike, conversational dialog. These chatbots have general cognitive skills and are able to engage with clinicians and patients to discuss patients' health conditions and what they may be at risk for. While chatbot engines have access to a wide range of medical texts and research papers, they currently provide high-level, generic responses and are limited in their ability to provide diagnostic guidance and clinical advice to patients on an individual level. This essay discusses the use of retrieval-augmented generation (RAG), which can be used to improve the specificity of user-entered prompts and thereby enhance the detail of AI chatbot responses. By embedding recent clinical data and trusted medical sources, such as clinical guidelines, into the chatbot models, AI chatbots can provide more patient-specific guidance, faster diagnoses and treatment recommendations, and greater improvement of patient outcomes.
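Editor's note: retrieval-augmented generation, as described above, retrieves trusted passages and prepends them to the user's prompt before the model answers. A minimal sketch; the embed() helper and the guideline snippets are hypothetical placeholders for a real embedding model and document store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

# Hypothetical trusted snippets (e.g., clinical guideline excerpts).
snippets = [
    "Guideline A: first-line therapy for condition X is ...",
    "Guideline B: imaging is indicated when symptoms persist beyond ...",
]
index = {s: embed(s) for s in snippets}

def retrieve(query: str, k: int = 1) -> list:
    """Rank snippets by cosine similarity to the query embedding."""
    q = embed(query)
    score = lambda v: float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
    return sorted(index, key=lambda s: score(index[s]), reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the chatbot answers from trusted sources."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When is imaging indicated for condition X?"))
```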
Affiliation(s)
- Calvin Wang
- College of Medicine - Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ, 08901, USA.
- Joshua Ong
- Michigan Medicine, University of Michigan, Ann Arbor, MI, USA
- Chara Wang
- Biotechnology High School, Freehold, NJ, USA
- Hannah Ong
- College of Medicine, The Ohio State University, Columbus, OH, USA
- Rebekah Cheng
- Department of Physical Therapy, Virginia Commonwealth University, Richmond, VA, USA
- Dennis Ong
- Amazon Web Services, Amazon, Seattle, WA, USA
12
Templin T, Perez MW, Sylvia S, Leek J, Sinnott-Armstrong N. Addressing 6 challenges in generative AI for digital health: A scoping review. PLOS Digital Health 2024; 3:e0000503. [PMID: 38781686] [PMCID: PMC11115971] [DOI: 10.1371/journal.pdig.0000503]
Abstract
Generative artificial intelligence (AI) can exhibit biases, compromise data privacy, be misled by prompts crafted as adversarial attacks, and produce hallucinations. Despite generative AI's potential for many applications in digital health, practitioners must understand these tools and their limitations. This scoping review pays particular attention to the challenges of generative AI technologies in medical settings and surveys potential solutions. Using PubMed, we identified 120 articles published by March 2024 that reference and evaluate generative AI in medicine, from which we synthesized themes and suggestions for future work. After providing general background on generative AI, we focus on collecting and presenting six challenges that are key for digital health practitioners, along with specific measures that can be taken to mitigate them. Overall, bias, privacy, hallucination, and regulatory compliance were frequently considered, while other concerns around generative AI, such as overreliance on text models, adversarial misprompting, and jailbreaking, were not commonly evaluated in the current literature.
Affiliation(s)
- Tara Templin
- Department of Health Policy and Management, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Monika W. Perez
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Sean Sylvia
- Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Department of Health Policy and Management, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Sheps Center for Health Services Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Jeff Leek
- Biostatistics Program, Fred Hutchinson Cancer Center, Seattle, Washington, United States of America
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Nasa Sinnott-Armstrong
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Herbold Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, Washington, United States of America
13
Thunström AO, Carlsen HK, Ali L, Larson T, Hellström A, Steingrimsson S. Usability Comparison Among Healthy Participants of an Anthropomorphic Digital Human and a Text-Based Chatbot as a Responder to Questions on Mental Health: Randomized Controlled Trial. JMIR Hum Factors 2024; 11:e54581. [PMID: 38683664] [DOI: 10.2196/54581]
Abstract
BACKGROUND The use of chatbots in mental health support has increased exponentially in recent years, with studies showing that they may be effective in treating mental health problems. More recently, the use of visual avatars called digital humans has been introduced. Digital humans can use facial expressions as another dimension in human-computer interactions. It is important to study the difference in emotional response and usability preferences between text-based chatbots and digital humans for interacting with mental health services. OBJECTIVE This study aims to explore to what extent a digital human interface and a text-only chatbot interface differed in usability when tested by healthy participants, using BETSY (Behavior, Emotion, Therapy System, and You), which uses 2 distinct interfaces: a digital human with anthropomorphic features and a text-only user interface. We also set out to explore how chatbot-generated conversations on mental health (specific to each interface) affected self-reported feelings and biometrics. METHODS We explored to what extent a digital human with anthropomorphic features differed from a traditional text-only chatbot regarding perception of usability through the System Usability Scale, emotional reactions through electroencephalography, and feelings of closeness. Healthy participants (n=45) were randomized to 2 groups that used either a digital human with anthropomorphic features (n=25) or a text-only chatbot with no such features (n=20). The groups were compared by linear regression analysis and t tests. RESULTS No differences were observed between the text-only and digital human groups regarding demographic features. The mean System Usability Scale score was 75.34 (SD 10.01; range 57-90) for the text-only chatbot versus 64.80 (SD 14.14; range 40-90) for the digital human interface. Both groups scored their respective chatbot interfaces as average or above average in usability. Women were more likely to report feeling annoyed by BETSY. CONCLUSIONS The text-only chatbot was perceived as significantly more user-friendly than the digital human, although there were no significant differences in electroencephalography measurements. Male participants exhibited lower levels of annoyance with both interfaces, contrary to previously reported findings.
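Editor's note: the System Usability Scale scores reported above follow the standard SUS computation: ten items rated 1-5, odd-numbered items contributing (score - 1), even-numbered items contributing (5 - score), and the sum scaled by 2.5 to a 0-100 range. A sketch with a hypothetical response set:

```python
def sus_score(responses: list) -> float:
    """Standard SUS scoring for 10 items each rated 1-5."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)  # even 0-based index = odd item
                for i, r in enumerate(responses))
    return total * 2.5  # scale the 0-40 raw range to 0-100

# Hypothetical participant strongly favoring the interface on every item.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```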
Affiliation(s)
- Almira Osmanovic Thunström
- Region Västra Götaland, Psychiatric Department, Sahlgrenska University Hospital, Gothenburg, Sweden
- Section of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Hanne Krage Carlsen
- Section of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Region Västra Götaland, Centre of Registers, Gothenburg, Sweden
- Lilas Ali
- Region Västra Götaland, Psychiatric Department, Sahlgrenska University Hospital, Gothenburg, Sweden
- Institute of Health Care Sciences, Centre for Person-Centred Care, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Centre for Person-Centred Care, University of Gothenburg, Gothenburg, Sweden
- Tomas Larson
- Region Västra Götaland, Psychiatric Department, Sahlgrenska University Hospital, Gothenburg, Sweden
- Section of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Andreas Hellström
- Department of Technology Management and Economics, Chalmers University of Technology, Gothenburg, Sweden
- Steinn Steingrimsson
- Region Västra Götaland, Psychiatric Department, Sahlgrenska University Hospital, Gothenburg, Sweden
- Section of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
14
Hudon A, Kiepura B, Pelletier M, Phan V. Using ChatGPT in Psychiatry to Design Script Concordance Tests in Undergraduate Medical Education: Mixed Methods Study. JMIR Medical Education 2024; 10:e54067. [PMID: 38596832] [PMCID: PMC11007379] [DOI: 10.2196/54067]
Abstract
Background Undergraduate medical education offers a wide range of learning opportunities delivered through various teaching-learning modalities. A clinical scenario is frequently used as a modality, followed by multiple-choice or open-ended questions, among other learning and teaching methods. As such, script concordance tests (SCTs) can be used to promote a higher level of clinical reasoning. Recent technological developments have made generative artificial intelligence (AI)-based systems such as ChatGPT (OpenAI) available to assist clinician-educators in creating instructional materials. Objective The main objective of this project is to explore how SCTs generated by ChatGPT compare with SCTs produced by clinical experts on 3 major elements: the scenario (stem), clinical questions, and expert opinion. Methods This mixed methods study compared 3 ChatGPT-generated SCTs with 3 expert-created SCTs using a predefined framework. Clinician-educators and psychiatry residents involved in undergraduate medical education in Quebec, Canada, evaluated the 6 SCTs via a web-based survey on 3 criteria: the scenario, clinical questions, and expert opinion. They were also asked to describe the strengths and weaknesses of the SCTs. Results A total of 102 respondents assessed the SCTs. There were no significant distinctions between the 2 types of SCTs concerning the scenario (P=.84), clinical questions (P=.99), and expert opinion (P=.07) as rated by the respondents; indeed, respondents struggled to differentiate between ChatGPT- and expert-generated SCTs. ChatGPT showed promise in expediting SCT design, aligning well with Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition criteria, albeit with a tendency toward caricatured scenarios and simplistic content. Conclusions This study is the first to focus on AI-supported SCT design at a time when medicine is changing swiftly and AI-based technologies are expanding even faster. It suggests that ChatGPT can be a valuable tool for creating educational materials, though further validation is essential to ensure educational efficacy and accuracy.
Affiliation(s)
- Alexandre Hudon
- Department of Psychiatry and Addictology, University of Montreal, Montreal, QC, Canada
- Barnabé Kiepura
- Department of Psychiatry and Addictology, University of Montreal, Montreal, QC, Canada
- Véronique Phan
- Department of Pediatrics, Université de Montréal, Montreal, QC, Canada
15
Shorey S, Mattar C, Pereira TLB, Choolani M. A scoping review of ChatGPT's role in healthcare education and research. Nurse Education Today 2024; 135:106121. [PMID: 38340639] [DOI: 10.1016/j.nedt.2024.106121]
Abstract
OBJECTIVES To examine and consolidate literature regarding the advantages and disadvantages of utilizing ChatGPT in healthcare education and research. DESIGN/METHODS We searched seven electronic databases (PubMed/Medline, CINAHL, Embase, PsycINFO, Scopus, ProQuest Dissertations and Theses Global, and Web of Science) from November 2022 until September 2023. This scoping review adhered to Arksey and O'Malley's framework and followed the reporting guidelines outlined in the PRISMA-ScR checklist. For analysis, we employed Thomas and Harden's thematic synthesis framework. RESULTS A total of 100 studies were included. An overarching theme, "Forging the Future: Bridging Theory and Integration of ChatGPT," emerged, accompanied by two main themes, (1) Enhancing Healthcare Education, Research, and Writing with ChatGPT and (2) Controversies and Concerns about ChatGPT in Healthcare Education, Research, and Writing, along with seven subthemes. CONCLUSIONS Our review underscores the importance of acknowledging legitimate concerns related to the potential misuse of ChatGPT, such as "ChatGPT hallucinations," its limited understanding of specialized healthcare knowledge, its impact on teaching methods and assessments, confidentiality and security risks, and the controversial practice of crediting it as a co-author on scientific papers. It also recognizes the urgency of establishing timely guidelines and regulations, along with the active engagement of relevant stakeholders, to ensure the responsible and safe implementation of ChatGPT's capabilities. We advocate for cross-verification techniques to enhance the precision and reliability of generated content, the adaptation of higher-education curricula to incorporate ChatGPT's potential, educators' familiarization with the technology to improve their literacy and teaching approaches, and the development of innovative methods to detect ChatGPT usage. Furthermore, data protection measures should be prioritized when employing ChatGPT, and transparent reporting is crucial when integrating ChatGPT into academic writing.
Affiliation(s)
- Shefaly Shorey
- Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
- Citra Mattar
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynaecology, National University Health Systems, Singapore; Department of Obstetrics and Gynaecology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Travis Lanz-Brian Pereira
- Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Mahesh Choolani
- Division of Maternal Fetal Medicine, Department of Obstetrics and Gynaecology, National University Health Systems, Singapore; Department of Obstetrics and Gynaecology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
16
Valdez D, Bunnell A, Lim SY, Sadowski P, Shepherd JA. Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists. J Clin Densitom 2024; 27:101480. [PMID: 38401238] [DOI: 10.1016/j.jocd.2024.101480]
Abstract
BACKGROUND Artificial intelligence (AI) large language models (LLMs) such as ChatGPT have demonstrated the ability to pass standardized exams. These models are not trained for a specific task but instead trained to predict sequences of text from large corpora of documents sourced from the internet. It has been shown that even models trained on this general task can pass exams in a variety of domain-specific fields, including the United States Medical Licensing Examination. We asked whether large language models would perform as well on much narrower subdomain tests designed for medical specialists. Furthermore, we wanted to better understand how progressive generations of GPT (generative pre-trained transformer) models may be evolving in the completeness and sophistication of their responses even while their training remains general. In this study, we evaluated the ability of two versions of GPT (GPT-3 and GPT-4) to pass the certification exam that physicians working as osteoporosis specialists take to become certified clinical densitometrists (CCDs). The CCD exam is scored on a range of 150 to 400, with 300 required to pass. METHODS A 100-question multiple-choice practice exam was obtained from a third-party exam preparation website that mimics the accredited certification tests given by the ISCD (International Society for Clinical Densitometry). The exam was administered to two versions of GPT, the free version (GPT Playground) and ChatGPT+, which are based on GPT-3 and GPT-4, respectively (OpenAI, San Francisco, CA). The systems were prompted with the exam questions verbatim. If a response was purely textual and did not specify which of the multiple-choice answers to select, the authors matched the text to the closest answer. Each exam was graded, and an estimated ISCD score was provided by the exam website. In addition, each response was evaluated by a rheumatologist CCD and ranked for accuracy using a 5-level scale. The two GPT versions were compared in terms of response accuracy and length. RESULTS The average response length was 11.6±19 words for GPT-3 and 50.0±43.6 words for GPT-4. GPT-3 answered 62 questions correctly, resulting in a failing ISCD score of 289, whereas GPT-4 answered 82 questions correctly, with a passing score of 342. GPT-3 scored highest on the "Overview of Low Bone Mass and Osteoporosis" category (72% correct), while GPT-4 scored well above 80% accuracy on all categories except "Imaging Technology in Bone Health" (65% correct). Regarding subjective accuracy, GPT-3 answered 23 questions with nonsensical or totally wrong responses, while GPT-4 had no responses in that category. CONCLUSION If this had been an actual certification exam, GPT-4 would now have a CCD suffix to its name, even though it was trained only on general internet knowledge. Clearly, more goes into physician training than can be captured by this exam. Nevertheless, GPT algorithms may prove to be valuable physician aids in the diagnosis and monitoring of osteoporosis and other diseases.
Affiliation(s)
- Dustin Valdez
- University of Hawaii at Manoa, Honolulu, HI, USA; University of Hawaii Cancer Center, Honolulu, HI, USA.
- Arianna Bunnell
- University of Hawaii at Manoa, Honolulu, HI, USA; University of Hawaii Cancer Center, Honolulu, HI, USA
- Sian Y Lim
- Hawai'i Pacific Health Medical Group, Hawai'i Pacific Health, Honolulu, HI, USA
17
Shahin MH, Barth A, Podichetty JT, Liu Q, Goyal N, Jin JY, Ouellet D. Artificial Intelligence: From Buzzword to Useful Tool in Clinical Pharmacology. Clin Pharmacol Ther 2024; 115:698-709. [PMID: 37881133] [DOI: 10.1002/cpt.3083]
Abstract
The advent of artificial intelligence (AI) in clinical pharmacology and drug development is akin to the dawning of a new era. Previously dismissed as mere technological hype, these approaches have emerged as promising tools in different domains, including health care, demonstrating their potential to empower clinical pharmacology decision making, revolutionize the drug development landscape, and advance patient care. Although challenges remain, the remarkable progress already made signals that the leap from hype to reality is well underway, and AI's promise to offer clinical pharmacology new tools and possibilities for optimizing patient care is gradually coming to fruition. This review dives into the burgeoning world of AI and machine learning (ML), showcasing different applications of AI in clinical pharmacology and the impact of successful AI/ML implementation on drug development and regulatory decisions. This review also highlights recommendations for areas of opportunity in clinical pharmacology, including data analysis (e.g., handling large data sets, screening to identify important covariates, and optimizing patient populations) and efficiencies (e.g., automation, translation, literature curation, and training). Realizing the benefits of AI in drug development and understanding its value will lead to the successful integration of AI tools into our clinical pharmacology and pharmacometrics armamentarium.
Affiliation(s)
- Mohamed H Shahin
- Clinical Pharmacology and Bioanalytics, Pfizer Inc., Groton, Connecticut, USA
- Aline Barth
- Clinical Pharmacology and Bioanalytics, Pfizer Inc., Groton, Connecticut, USA
- Qi Liu
- Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Navin Goyal
- Clinical Pharmacology and Pharmacometrics, Janssen Research and Development, LLC., Spring House, Pennsylvania, USA
- Jin Y Jin
- Department of Clinical Pharmacology, Genentech, South San Francisco, California, USA
- Daniele Ouellet
- Clinical Pharmacology and Pharmacometrics, Janssen Research and Development, LLC., Spring House, Pennsylvania, USA
18
Temperley HC, O'Sullivan NJ, Mac Curtain BM, Corr A, Meaney JF, Kelly ME, Brennan I. Current applications and future potential of ChatGPT in radiology: A systematic review. J Med Imaging Radiat Oncol 2024; 68:257-264. [PMID: 38243605] [DOI: 10.1111/1754-9485.13621]
Abstract
This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE, and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement, and collaboration, were assessed, and the limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature, and a total of 551 ChatGPT (versions 3.0 to 4.0) assessment events were included. When generating academic papers, ChatGPT produced inaccurate data 80% of the time. When asked questions about common interventional radiology procedures, its answers contained entirely incorrect information 45% of the time. ChatGPT answered US board-style questions better when lower-order thinking was required (P=0.002). Accuracy on imaging questions improved between ChatGPT 3.5 and 4.0, from 61% to 85% (P=0.009). ChatGPT had an average translational ability score of 4.27/5 on a Likert scale for CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While its promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.
Affiliation(s)
- Hugo C Temperley
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Alison Corr
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- James F Meaney
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Michael E Kelly
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Ian Brennan
- Department of Radiology, St. James's Hospital, Dublin, Ireland
19
Yalamanchili A, Sengupta B, Song J, Lim S, Thomas TO, Mittal BB, Abazeed ME, Teo PT. Quality of Large Language Model Responses to Radiation Oncology Patient Care Questions. JAMA Netw Open 2024; 7:e244630. [PMID: 38564215] [PMCID: PMC10988356] [DOI: 10.1001/jamanetworkopen.2024.4630]
Abstract
Importance Artificial intelligence (AI) large language models (LLMs) demonstrate potential in simulating human-like dialogue. Their efficacy in accurate patient-clinician communication within radiation oncology has yet to be explored. Objective To determine an LLM's quality of responses to radiation oncology patient care questions using both domain-specific expertise and domain-agnostic metrics. Design, Setting, and Participants This cross-sectional study retrieved questions and answers from websites (accessed February 1 to March 20, 2023) affiliated with the National Cancer Institute and the Radiological Society of North America. These questions were used as queries for an AI LLM, ChatGPT version 3.5 (accessed February 20 to April 20, 2023), to prompt LLM-generated responses. Three radiation oncologists and 3 radiation physicists ranked the LLM-generated responses for relative factual correctness, relative completeness, and relative conciseness compared with online expert answers. Statistical analysis was performed from July to October 2023. Main Outcomes and Measures The LLM's responses were ranked by experts using domain-specific metrics such as relative correctness, conciseness, completeness, and potential harm compared with online expert answers on a 5-point Likert scale. Domain-agnostic metrics encompassing cosine similarity scores, readability scores, word count, lexicon, and syllable counts were computed as independent quality checks for LLM-generated responses. Results Of the 115 radiation oncology questions retrieved from 4 professional society websites, the LLM performed the same or better in 108 responses (94%) for relative correctness, 89 responses (77%) for completeness, and 105 responses (91%) for conciseness compared with expert answers. Only 2 LLM responses were ranked as having potential harm. The mean (SD) readability consensus score for expert answers was 10.63 (3.17) vs 13.64 (2.22) for LLM answers (P < .001), indicating 10th grade and college reading levels, respectively. The mean (SD) number of syllables was 327.35 (277.15) for expert vs 376.21 (107.89) for LLM answers (P = .07), the mean (SD) word count was 226.33 (191.92) for expert vs 246.26 (69.36) for LLM answers (P = .27), and the mean (SD) lexicon score was 200.15 (171.28) for expert vs 219.10 (61.59) for LLM answers (P = .24). Conclusions and Relevance In this cross-sectional study, the LLM generated accurate, comprehensive, and concise responses with minimal risk of harm, using language similar to human experts but at a higher reading level. These findings suggest the LLM's potential, with some retraining, as a valuable resource for patient queries in radiation oncology and other medical fields.
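Editor's note: the domain-agnostic checks above (cosine similarity, readability, word, lexicon, and syllable counts) can be computed with standard libraries. A sketch using TF-IDF cosine similarity and the textstat package, one possible tooling since the abstract does not specify the study's implementation (answer texts are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import textstat

# Hypothetical expert and LLM answers to the same patient question.
expert = "Radiation therapy uses high-energy beams to destroy cancer cells."
llm = "Radiotherapy applies focused high-energy radiation to kill tumor cells."

# Cosine similarity over TF-IDF vectors of the two answers.
tfidf = TfidfVectorizer().fit_transform([expert, llm])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"cosine similarity: {similarity:.2f}")
print(f"readability consensus: {textstat.text_standard(llm)}")
print(f"word count: {textstat.lexicon_count(llm)}")
print(f"syllable count: {textstat.syllable_count(llm)}")
```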
Affiliation(s)
- Amulya Yalamanchili, Bishwambhar Sengupta, Joshua Song, Sara Lim, Tarita O. Thomas, Bharat B. Mittal, Mohamed E. Abazeed, P. Troy Teo
- Robert H. Lurie Comprehensive Cancer Center, Department of Radiation Oncology, Northwestern Memorial Hospital, Northwestern University Feinberg School of Medicine, Chicago, Illinois
20
Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Medical Education 2024; 24:354. [PMID: 38553693 PMCID: PMC10981304 DOI: 10.1186/s12909-024-05239-y]
Abstract
BACKGROUND Writing multiple choice questions (MCQs) for medical exams is challenging: it requires extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs. METHODS The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool, and the review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. RESULTS Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT 3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models, and another compared LLM-generated questions with those written by humans. All studies reported some faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify. Two studies were at high risk of bias. CONCLUSIONS LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed; until then, LLMs may serve as a supplementary tool for writing medical examinations.
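As an illustration of the kind of LLM-based question generation the included studies performed, here is a hypothetical sketch using the openai Python client; the model name and prompt wording are assumptions, not details taken from any included study.

```python
# Hypothetical sketch of prompting an LLM to draft a medical MCQ, in the
# spirit of the studies reviewed above. Prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Write one multiple-choice question for a medical licensing exam on the "
    "management of type 2 diabetes. Provide four options (A-D), mark the "
    "correct answer, and give a one-sentence explanation."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
# As the review's findings on faulty questions underline, any generated item
# still needs expert review before use in a real exam.
```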
Affiliation(s)
- Yaara Artsi
- Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel
- Vera Sorin
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- Tel-Aviv University School of Medicine, Tel Aviv, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
- Eli Konen
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- Tel-Aviv University School of Medicine, Tel Aviv, Israel
- Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
21
Berry CE, Fazilat AZ, Lavin C, Lintel H, Cole N, Stingl CS, Valencia C, Morgan AG, Momeni A, Wan DC. Both Patients and Plastic Surgeons Prefer Artificial Intelligence-Generated Microsurgical Information. J Reconstr Microsurg 2024. [PMID: 38382637 DOI: 10.1055/a-2273-4163]
Abstract
BACKGROUND With the growing relevance of artificial intelligence (AI)-based patient-facing information, microsurgery-specific online information provided by professional organizations was compared with that of ChatGPT (Chat Generative Pre-Trained Transformer) and assessed for accuracy, comprehensiveness, clarity, and readability. METHODS Six plastic and reconstructive surgeons blindly assessed responses to 10 microsurgery-related medical questions written either by the American Society of Reconstructive Microsurgery (ASRM) or ChatGPT for accuracy, comprehensiveness, and clarity. Surgeons were asked to choose which source provided the overall highest-quality microsurgical patient-facing information. Additionally, 30 individuals with no medical background (ages: 18-81, μ = 49.8) were asked to state a preference when blindly comparing the materials. Readability scores were calculated using the following seven readability formulas: Flesch-Kincaid Grade Level, Flesch-Kincaid Readability Ease, Gunning Fog Index, Simple Measure of Gobbledygook Index, Coleman-Liau Index, Linsear Write Formula, and Automated Readability Index. Statistical analysis of microsurgery-specific online sources was conducted utilizing paired t-tests. RESULTS Statistically significant differences in comprehensiveness and clarity were seen in favor of ChatGPT. Surgeons blindly chose ChatGPT as the source providing the overall highest-quality microsurgical patient-facing information 70.7% of the time, and nonmedical individuals selected the AI-generated materials 55.9% of the time. Neither ChatGPT- nor ASRM-generated materials were found to contain inaccuracies. Readability scores for both ChatGPT and ASRM materials exceeded recommended levels for patient proficiency across all seven readability formulas, with the AI-based material scored as more complex. CONCLUSION AI-generated patient-facing materials were preferred by surgeons in terms of comprehensiveness and clarity when blindly compared with online material provided by ASRM. The studied AI-generated material was not found to contain inaccuracies, and surgeons and nonmedical individuals consistently indicated an overall preference for it. A readability analysis suggested that materials sourced from both ChatGPT and ASRM surpassed recommended reading levels across all seven readability scores.
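For reference, the Flesch-Kincaid Grade Level named above is computed as 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. A self-contained sketch follows; the naive vowel-group syllable counter is an assumption, so scores will only approximate those of dedicated readability tools.

```python
# Simplified implementation of the Flesch-Kincaid Grade Level, one of the
# readability formulas listed above. The syllable counter is a naive
# vowel-group heuristic, so results are approximate.
import re

def count_syllables(word: str) -> int:
    # Count groups of consecutive vowels as syllables; drop a silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Microsurgery reconnects very small blood vessels and nerves "
          "under a microscope. Recovery often takes several weeks.")
print(round(flesch_kincaid_grade(sample), 1))
```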
Affiliation(s)
- Charlotte E Berry, Alexander Z Fazilat, Christopher Lavin, Hendrik Lintel, Naomi Cole, Cybil S Stingl, Caleb Valencia, Annah G Morgan, Arash Momeni, Derrick C Wan
- Department of Surgery, Division of Plastic and Reconstructive Surgery, Hagey Laboratory for Pediatric Regenerative Medicine, Stanford University School of Medicine, Stanford, California
22
Cocci A, Pezzoli M, Lo Re M, Russo GI, Asmundo MG, Fode M, Cacciamani G, Cimino S, Minervini A, Durukan E. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis 2024; 27:103-108. [PMID: 37516804 DOI: 10.1038/s41391-023-00705-y]
Abstract
BACKGROUND The proportion of health-related searches on the internet is continuously growing. ChatGPT, a natural language processing (NLP) tool created by OpenAI, has been gaining increasing user attention and can potentially be used as a source of information on health concerns. This study aims to analyze the quality and appropriateness of ChatGPT's responses to urology case studies compared with those of a urologist. METHODS Data from 100 patient case studies, comprising patient demographics, medical history, and urologic complaints, were sequentially inputted into ChatGPT, one by one. A question was posed to determine the most likely diagnosis, suggested examinations, and treatment options. The responses generated by ChatGPT were compared with those provided by a board-certified urologist who was blinded to ChatGPT's responses, and were graded on a 5-point Likert scale using accuracy, comprehensiveness, and clarity as the criteria for appropriateness. The quality of information was graded based on section 2 of the DISCERN tool, and readability was assessed using the Flesch Reading Ease (FRE) and Flesch-Kincaid Reading Grade Level (FKGL) formulas. RESULTS Overall, 52% of all responses were deemed appropriate. ChatGPT provided more appropriate responses for non-oncology conditions (58.5%) than for oncology (52.6%) and emergency urology cases (11.1%) (p = 0.03). The median score on the DISCERN tool was 15 (IQR = 5.3), corresponding to a quality rating of poor. The ChatGPT responses demonstrated a college graduate reading level, as indicated by a median FRE score of 18 (IQR = 21) and a median FKGL score of 15.8 (IQR = 3). CONCLUSIONS ChatGPT serves as an interactive tool for providing medical information online, offering the possibility of enhancing health outcomes and patient satisfaction. Nevertheless, the insufficient appropriateness and poor quality of its responses to urology cases emphasize the importance of thoroughly evaluating NLP-generated outputs before relying on them for health-related concerns.
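As a small illustration of the summary statistics reported above, the following sketch computes a median and interquartile range (IQR) for a set of DISCERN section 2 scores; the score values are invented.

```python
# Minimal sketch of the median/IQR summary used above for DISCERN section 2
# scores. The scores below are hypothetical, for illustration only.
import numpy as np

discern_scores = np.array([12, 14, 15, 15, 16, 17, 18, 20, 13, 15])  # hypothetical

median = np.median(discern_scores)
q1, q3 = np.percentile(discern_scores, [25, 75])
print(f"median = {median}, IQR = {q3 - q1}")
```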
Affiliation(s)
- Andrea Cocci
- Urology Section, University of Florence, Florence, Italy
- Marta Pezzoli
- Urology Section, University of Florence, Florence, Italy
- Mattia Lo Re
- Urology Section, University of Florence, Florence, Italy
- Mikkel Fode
- Department of Clinical Medicine, University of Copenhagen, Copenhagen, Denmark
- Department of Urology, Copenhagen University Hospital, Herlev and Gentofte Hospital, Copenhagen, Denmark
- Giovanni Cacciamani
- Institute of Urology, Keck School of Medicine, University of Southern California (USC), Los Angeles, CA, USA
- Emil Durukan
- Department of Urology, Copenhagen University Hospital, Herlev and Gentofte Hospital, Copenhagen, Denmark
23
Hill JE, Harris C, Clegg A. Methods for using Bing's AI-powered search engine for data extraction for a systematic review. Res Synth Methods 2024; 15:347-353. [PMID: 38066713 DOI: 10.1002/jrsm.1689]
Abstract
Data extraction is a time-consuming and resource-intensive task in the systematic review process. Natural language processing (NLP) artificial intelligence (AI) techniques have the potential to automate data extraction, saving time and resources, accelerating the review process, and enhancing the quality and reliability of extracted data. In this paper, we propose a method for using Bing AI and Microsoft Edge as a second reviewer to verify and enhance data items first extracted by a single human reviewer. We describe a worked example of the steps involved in instructing the Bing AI Chat tool to extract study characteristics from a PDF document into a table so that they can be compared with manually extracted data. We show that this technique may provide an additional verification process for data extraction where resources are limited or for novice reviewers. However, it should not be seen as a replacement for established and validated double independent data extraction methods without further evaluation and verification. Use of AI techniques for data extraction in systematic reviews should be transparently and accurately described in reports. Future research should focus on the accuracy, efficiency, completeness, and user experience of using Bing AI for data extraction compared with traditional methods using two or more independent reviewers.
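A hypothetical example of the kind of extraction instruction the paper describes follows; the wording is illustrative and is not the authors' actual prompt.

```python
# Hypothetical example of an instruction for AI-assisted data extraction, in
# the spirit of the worked example described above. The exact prompt used by
# the authors is not reproduced here.
extraction_prompt = """
From the attached PDF of a randomized trial, extract the following study
characteristics into a two-column table (item | value): first author, year,
country, study design, sample size, population, intervention, comparator,
and primary outcome. If an item is not reported, write 'not reported'.
"""
print(extraction_prompt)
```

The extracted table can then be compared item by item against the single human reviewer's extraction, with disagreements resolved against the source PDF.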
Affiliation(s)
- James Edward Hill, Catherine Harris, Andrew Clegg
- Synthesis, Economic Evaluation and Decision Science (SEEDS) Group, University of Central Lancashire, Preston, UK
24
Shiraishi M, Lee H, Kanayama K, Moriwaki Y, Okazaki M. Appropriateness of Artificial Intelligence Chatbots in Diabetic Foot Ulcer Management. Int J Low Extr Wound 2024. [PMID: 38419470 DOI: 10.1177/15347346241236811]
Abstract
Type 2 diabetes is a significant global health concern. It often causes diabetic foot ulcers (DFUs), which affect millions of people and increase amputation and mortality rates. Despite existing guidelines, the complexity of DFU treatment makes clinical decisions challenging. Large language models such as Chat Generative Pre-trained Transformer (ChatGPT), which are adept at natural language processing, have emerged as valuable resources in the medical field; however, concerns about the accuracy and reliability of the information they provide remain. We aimed to assess the accuracy of various artificial intelligence (AI) chatbots, including ChatGPT, in providing information on DFUs based on established guidelines. Seven AI chatbots were asked clinical questions (CQs) based on the DFU guidelines. Their responses were analyzed for accuracy in answering the CQs, grade of recommendation, level of evidence, and agreement with the reference, including verification of the authenticity of the references provided by the chatbots. The AI chatbots showed a mean accuracy of 91.2% in answers to the CQs, with discrepancies noted in grade of recommendation and level of evidence. Claude-2 outperformed the other chatbots in the proportion of verified references (99.6%), whereas ChatGPT had the lowest rate of reference authenticity (66.3%). This study highlights the potential of AI chatbots as tools for disseminating medical information and demonstrates their high accuracy in answering CQs related to DFUs. However, the variability in accuracy across chatbots and problems such as AI hallucinations necessitate cautious use and further optimization for medical applications. This study underscores the evolving role of AI in healthcare and the importance of refining these technologies for effective use in clinical decision-making and patient education.
Affiliation(s)
- Makoto Shiraishi, Haesu Lee, Koji Kanayama, Yuta Moriwaki, Mutsumi Okazaki
- Department of Plastic and Reconstructive Surgery, The University of Tokyo Hospital, Tokyo, Japan
25
Abi-Rafeh J, Xu HH, Kazan R, Tevlin R, Furnas H. Large Language Models and Artificial Intelligence: A Primer for Plastic Surgeons on the Demonstrated and Potential Applications, Promises, and Limitations of ChatGPT. Aesthet Surg J 2024; 44:329-343. [PMID: 37562022 DOI: 10.1093/asj/sjad260]
Abstract
BACKGROUND The rapidly evolving field of artificial intelligence (AI) holds great potential for plastic surgeons. ChatGPT, a recently released AI large language model (LLM), promises applications across many disciplines, including healthcare. OBJECTIVES The aim of this article was to provide a primer for plastic surgeons on AI, LLMs, and ChatGPT, including an analysis of currently demonstrated and proposed clinical applications. METHODS A systematic review was performed identifying medical and surgical literature on ChatGPT's proposed clinical applications. Variables assessed included applications investigated, command tasks provided, user input information, AI-emulated human skills, output validation, and reported limitations. RESULTS The analysis included 175 articles reporting on 13 plastic surgery applications and 116 additional clinical applications, categorized by field and purpose. Thirty-four applications within plastic surgery are thus proposed, with relevance to different target audiences, including attending plastic surgeons (n = 17, 50%), trainees/educators (n = 8, 24%), researchers/scholars (n = 7, 21%), and patients (n = 2, 6%). The 15 identified limitations of ChatGPT were categorized by training data, algorithm, and ethical considerations. CONCLUSIONS Widespread use of ChatGPT in plastic surgery will depend on rigorous research of proposed applications to validate performance and address limitations. This systematic review aims to guide research, development, and regulation to safely adopt AI in plastic surgery.
26
Barlas T, Altinova AE, Akturk M, Toruner FB. Credibility of ChatGPT in the assessment of obesity in type 2 diabetes according to the guidelines. Int J Obes (Lond) 2024; 48:271-275. [PMID: 37951982 DOI: 10.1038/s41366-023-01410-5]
Abstract
BACKGROUND The Chat Generative Pre-trained Transformer (ChatGPT) allows students, researchers, and patients in the medical field to access information easily and has gained considerable attention. We aimed to evaluate the credibility of ChatGPT against the guidelines for the assessment of obesity in type 2 diabetes (T2D), one of the major health concerns of this century. MATERIALS AND METHODS In this cross-sectional non-human-subject study, experienced endocrinologists posed 20 questions to ChatGPT, organized into subsections covering the assessment of obesity and different treatment options, according to the American Diabetes Association and American Association of Clinical Endocrinology guidelines. The responses of ChatGPT were classified into four categories: compatible, compatible but insufficient, partially incompatible, and incompatible with the guidelines. RESULTS ChatGPT demonstrated a systematic approach to answering questions and recommended consulting a healthcare provider for personalized advice based on each patient's specific health needs and circumstances. The compatibility of ChatGPT with the guidelines was 100% for the assessment of obesity in T2D; however, it was lower in the therapy sections, which covered nutritional, medical, and surgical approaches to weight loss. Furthermore, ChatGPT required additional prompts for responses evaluated as "compatible but insufficient" in order to provide all the information in the guidelines. CONCLUSION The assessment and management of obesity in T2D are highly individualized. Despite ChatGPT's comprehensive and understandable responses, it should not be used as a substitute for healthcare professionals' patient-centered approach.
Affiliation(s)
- Tugba Barlas, Alev Eroglu Altinova, Mujde Akturk, Fusun Balos Toruner
- Department of Endocrinology and Metabolism, Gazi University Faculty of Medicine, Ankara, Turkey
27
Gengatharan D, Saggi SS, Bin Abd Razak HR. Pre-operative Planning of High Tibial Osteotomy With ChatGPT: Are We There Yet? Cureus 2024; 16:e54858. [PMID: 38533173 PMCID: PMC10964394 DOI: 10.7759/cureus.54858]
Abstract
INTRODUCTION ChatGPT (Chat Generative Pre-trained Transformer), developed by OpenAI (San Francisco, CA, USA), has gained attention in the medical field. It has the potential to enhance and simplify tasks such as preoperative planning in orthopedic surgery. We aimed to test ChatGPT's accuracy in measuring the angle of correction for high tibial osteotomy for cases planned and performed at a tertiary teaching hospital in Singapore. MATERIALS AND METHODS Peri-operative angular parameters from 114 consecutive patients who underwent medial opening wedge high tibial osteotomy (MOWHTO) were used to query ChatGPT 3.0. First, ChatGPT 3.0 was asked what information it required to plan a MOWHTO. Based on its response, the pre-operative medial proximal tibial angle (MPTA) and joint line congruence angle (JLCA) were provided. ChatGPT 3.0 then responded with its recommended angle of correction, which was compared against the surgical correction manually planned by our fellowship-trained surgeon. A root mean square analysis was then performed to compare ChatGPT 3.0 with manual planning. RESULTS The root mean square error (RMSE) of ChatGPT 3.0 in predicting the correction angle in MOWHTO was 2.96, suggesting a very poor model fit. CONCLUSION Although ChatGPT 3.0 represents a significant breakthrough in large language models with extensive capabilities, it is not currently optimized to effectively perform complex pre-operative planning in orthopedic surgery, specifically in the context of MOWHTO. Further refinement and consideration of procedure-specific factors are necessary to enhance its accuracy and suitability for such applications.
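The root mean square error comparison described above can be sketched as follows; the angle values are invented for illustration.

```python
# Minimal sketch of the RMSE comparison between ChatGPT-recommended and
# surgeon-planned correction angles. The angle values are hypothetical.
import numpy as np

surgeon_angle = np.array([8.0, 10.5, 9.0, 12.0, 7.5])    # hypothetical, degrees
chatgpt_angle = np.array([10.0, 13.5, 6.5, 14.5, 10.0])  # hypothetical, degrees

rmse = np.sqrt(np.mean((chatgpt_angle - surgeon_angle) ** 2))
print(f"RMSE = {rmse:.2f} degrees")
```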
Affiliation(s)
- Hamid Rahmatullah Bin Abd Razak
- Musculoskeletal Sciences, Duke-NUS Medical School, Singapore, SGP
- Orthopaedic Surgery, Sengkang General Hospital, Singapore, SGP
28
Aliyeva A, Sari E, Alaskarov E, Nasirov R. Enhancing Postoperative Cochlear Implant Care With ChatGPT-4: A Study on Artificial Intelligence (AI)-Assisted Patient Education and Support. Cureus 2024; 16:e53897. [PMID: 38465158 PMCID: PMC10924891 DOI: 10.7759/cureus.53897]
Abstract
BACKGROUND Cochlear implantation is a critical surgical intervention for patients with severe hearing loss. Postoperative care is essential for successful rehabilitation, yet access to timely medical advice can be challenging, especially in remote or resource-limited settings. Integrating advanced artificial intelligence (AI) tools like Chat Generative Pre-trained Transformer (ChatGPT)-4 into post-surgical care could bridge the gap in patient education and support. AIM This study aimed to assess the effectiveness of ChatGPT-4 as a supplementary information resource for postoperative cochlear implant patients. The focus was on evaluating the AI chatbot's ability to provide accurate, clear, and relevant information, particularly in scenarios where access to healthcare professionals is limited. MATERIALS AND METHODS Five common postoperative questions related to cochlear implant care were posed to ChatGPT-4. The AI chatbot's responses were analyzed for accuracy, response time, clarity, and relevance. The aim was to determine whether ChatGPT-4 could serve as a reliable source of information for patients, especially when they cannot reach the hospital or their specialists. RESULTS ChatGPT-4 provided responses aligned with current medical guidelines, demonstrating accuracy and relevance. The AI chatbot responded to each query within seconds, indicating its potential as a timely resource, and its responses were clear and understandable, making complex medical information accessible to non-medical audiences. These findings suggest that ChatGPT-4 could effectively supplement traditional patient education and provide valuable support in postoperative care. CONCLUSION ChatGPT-4 has significant potential as a supportive tool for cochlear implant patients after surgery. While it cannot replace professional medical advice, it can provide immediate, accessible, and understandable information, which is particularly beneficial when professional advice is not immediately available. This underscores the utility of AI in enhancing patient care and supporting rehabilitation after cochlear implantation.
Affiliation(s)
- Aynur Aliyeva
- Otorhinolaryngology-Head and Neck Surgery, Cincinnati Children's Hospital, Cincinnati, USA
- Elif Sari
- Otorhinolaryngology-Head and Neck Surgery, Istanbul Aydın University, VM Medikal Park Florya Hospital, Istanbul, TUR
- Elvin Alaskarov
- Otorhinolaryngology-Head and Neck Surgery, Istanbul Medipol University Health Care Practice and Research Center, Esenler Hospital, Istanbul, TUR
- Rauf Nasirov
- Neurosurgery, University of Cincinnati College of Medicine, Cincinnati, USA
29
Nguyen T. ChatGPT in Medical Education: A Precursor for Automation Bias? JMIR Medical Education 2024; 10:e50174. [PMID: 38231545 PMCID: PMC10831594 DOI: 10.2196/50174]
Abstract
Artificial intelligence (AI) in health care promises accurate and efficient results. However, AI can also be a black box, where the logic behind its results is nonrational. There are concerns about such questionable results being used in patient care. Because physicians have a duty to provide care based on their clinical judgment, in addition to their patients' values and preferences, it is crucial that they validate the results produced by AI. Yet some physicians exhibit a phenomenon known as automation bias, the assumption by the user that AI is always right. This is a dangerous mindset, as users exhibiting automation bias will not validate results, given their trust in AI systems. Several factors affect a user's susceptibility to automation bias, such as inexperience or being born in the digital age. In this editorial, I argue that these factors, together with a lack of AI education in the medical school curriculum, cause automation bias. I also explore the harms of automation bias and why prospective physicians need to be vigilant when using AI. Furthermore, it is important to consider what attitudes are being taught to students when introducing ChatGPT, which may be some students' first experience of AI before they use it in the clinical setting. Therefore, to avoid the problem of automation bias in the long term, and in addition to the necessary incorporation of AI education into the curriculum, the use of ChatGPT in medical education should be limited to certain tasks; otherwise, having no constraints on what ChatGPT may be used for could lead to automation bias.
Affiliation(s)
- Tina Nguyen
- The University of Texas Medical Branch, Galveston, TX, United States
30
Odabashian R, Bastin D, Jones G, Manzoor M, Tangestaniapour S, Assad M, Lakhani S, Odabashian M, McGee S. Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks. JMIR AI 2024; 3:e50442. [PMID: 38875575 PMCID: PMC11041475 DOI: 10.2196/50442]
Abstract
BACKGROUND ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. Its question bank consists of multiple choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain whether the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. OBJECTIVE This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to, giving a deeper understanding of the potential applications of this tool as a support for clinical decision-making. METHODS We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer defined as correct by ASCO-SEP. RESULTS Overall, ChatGPT-3.5 answered 56.1% (583/1040) of questions correctly. The program demonstrated varying levels of accuracy across cancer types and disciplines: the highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest was in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in performance across the predefined subcategories of diagnosis, treatment, and other (P=.16). CONCLUSIONS This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance on ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.
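The subcategory comparison reported above can be illustrated with a chi-square test of correct versus incorrect answers; in this sketch the per-category counts are hypothetical (only the overall total of 583/1040 correct matches the abstract).

```python
# Minimal sketch of a chi-square test across the diagnosis, treatment, and
# "other" subcategories. The per-category splits below are hypothetical; they
# only sum to the abstract's overall 583 correct out of 1040.
from scipy.stats import chi2_contingency

#              correct  incorrect
counts = [
    [150, 110],   # diagnosis (hypothetical)
    [280, 230],   # treatment (hypothetical)
    [153, 117],   # other (hypothetical)
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```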
Affiliation(s)
- Roupen Odabashian
- Department of Oncology, Barbara Ann Karmanos Cancer Institute, Wayne State University, Detroit, MI, United States
- Donald Bastin
- Department of Medicine, Division of Internal Medicine, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
- Georden Jones
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
- Malke Assad
- Department of Plastic Surgery, University of Pittsburgh Medical Center, Pittsburgh, PA, United States
- Sunita Lakhani
- Department of Medicine, Division of Internal Medicine, Jefferson Abington Hospital, Philadelphia, PA, United States
- Maritsa Odabashian
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
- The Ottawa Hospital Research Institute, Ottawa, ON, Canada
- Sharon McGee
- Department of Medicine, Division of Medical Oncology, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
- Cancer Therapeutics Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada
31
Younis HA, Eisa TAE, Nasser M, Sahib TM, Noor AA, Alyasiri OM, Salisu S, Hayder IM, Younis HA. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges. Diagnostics (Basel) 2024; 14:109. [PMID: 38201418 PMCID: PMC10802884 DOI: 10.3390/diagnostics14010109]
Abstract
Artificial intelligence (AI) has emerged as a transformative force in various sectors, including medicine and healthcare. Large language models like ChatGPT showcase AI's potential by generating human-like text from prompts, and ChatGPT's adaptability holds promise for reshaping medical practices, improving patient care, and enhancing interactions among healthcare professionals, patients, and data. In pandemic management, ChatGPT can rapidly disseminate vital information; it can also serve as a virtual assistant in surgical consultations, aid dental practices, simplify medical education, and assist in disease diagnosis. A systematic literature review using the PRISMA approach explored AI's transformative potential in healthcare, highlighting ChatGPT's versatile applications, limitations, motivations, and challenges. A total of 82 papers were categorised into eight major areas: G1, treatment and medicine; G2, buildings and equipment; G3, parts of the human body and areas of disease; G4, patients; G5, citizens; G6, cellular imaging, radiology, pulse and medical images; G7, doctors and nurses; and G8, tools, devices and administration. Balancing AI's role with human judgment remains a challenge. In conclusion, ChatGPT's diverse medical applications demonstrate its potential for innovation and make it a valuable resource for students, academics, and researchers in healthcare; this study likewise serves as a guide for those working in medicine and healthcare.
Affiliation(s)
- Hussain A. Younis
- College of Education for Women, University of Basrah, Basrah 61004, Iraq
- Maged Nasser
- Computer & Information Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar 32610, Malaysia
- Thaeer Mueen Sahib
- Kufa Technical Institute, Al-Furat Al-Awsat Technical University, Kufa 54001, Iraq
- Ameen A. Noor
- Computer Science Department, College of Education, University of Almustansirya, Baghdad 10045, Iraq
- Sani Salisu
- Department of Information Technology, Federal University Dutse, Dutse 720101, Nigeria
- Israa M. Hayder
- Qurna Technique Institute, Southern Technical University, Basrah 61016, Iraq
- Hameed AbdulKareem Younis
- Department of Cybersecurity, College of Computer Science and Information Technology, University of Basrah, Basrah 61016, Iraq
32
Munir MM, Endo Y, Ejaz A, Dillhoff M, Cloyd JM, Pawlik TM. Online artificial intelligence platforms and their applicability to gastrointestinal surgical operations. J Gastrointest Surg 2024; 28:64-69. [PMID: 38353076 DOI: 10.1016/j.gassur.2023.11.019]
Abstract
BACKGROUND The internet is a common source of health information for patients, and interactive online artificial intelligence (AI) may be a more reliable source of health-related information than traditional search engines. This study aimed to assess the quality and perceived utility of chat-based AI responses related to 3 common gastrointestinal (GI) surgical procedures. METHODS A survey of 24 questions covering general perioperative information on cholecystectomy, pancreaticoduodenectomy (PD), and colectomy was created. Each question was posed to Chat Generative Pre-trained Transformer (ChatGPT) in June 2023, and the generated responses were recorded. The quality and perceived utility of responses were independently and subjectively graded by expert respondents specific to each surgical field. Grades were classified as "poor," "fair," "good," "very good," or "excellent." RESULTS Among the 45 respondents (general surgeons [n = 13], surgical oncologists [n = 18], colorectal surgeons [n = 13], and a transplant surgeon [n = 1]), most practiced at an academic facility (95.6%). Respondents had been in practice for a mean of 12.3 years (general surgeons, 14.5 ± 7.2; surgical oncologists, 12.1 ± 8.2; colorectal surgeons, 10.2 ± 8.0) and performed a mean of 53 index operations annually (cholecystectomy, 47 ± 28; PD, 28 ± 27; colectomy, 81 ± 44). Overall, the most commonly assigned quality grade was "fair" or "good" for most responses (n = 622/1080, 57.6%). Most of the 1080 total utility grades were "fair" (n = 279, 25.8%) or "good" (n = 344, 31.9%), whereas only 129 utility grades (11.9%) were "poor." Of note, ChatGPT responses related to cholecystectomy (45.3% "very good"/"excellent" vs 18.1% "poor"/"fair") were deemed to be of better quality than AI responses about PD (18.9% "very good"/"excellent" vs 46.9% "poor"/"fair") or colectomy (31.4% "very good"/"excellent" vs 38.3% "poor"/"fair"). Overall, only 20.0% of the experts deemed ChatGPT to be an accurate source of information, whereas 15.6% found it unreliable. Moreover, roughly 1 in 3 surgeons deemed ChatGPT responses unlikely to reduce patient-physician correspondence (31.1%) or not comparable to in-person surgeon responses (35.6%). CONCLUSIONS Although a potential resource for patient education, ChatGPT responses to common GI perioperative questions were deemed to be of only modest quality and utility to patients. In addition, the relative quality of AI responses varied markedly by procedure type.
Affiliation(s)
- Muhammad Musaab Munir, Yutaka Endo, Aslam Ejaz, Mary Dillhoff, Jordan M Cloyd, Timothy M Pawlik
- Division of Surgical Oncology, Department of Surgery, The Ohio State University Wexner Medical Center and James Comprehensive Cancer Center, Columbus, Ohio, United States
33
Morales-Ramirez P, Mishek H, Dasgupta A. The Genie Is Out of the Bottle: What ChatGPT Can and Cannot Do for Medical Professionals. Obstet Gynecol 2024; 143:e1-e6. [PMID: 37944140 DOI: 10.1097/aog.0000000000005446]
Abstract
ChatGPT is a cutting-edge artificial intelligence technology that was released for public use in November 2022. Its rapid adoption has raised questions about capabilities, limitations, and risks. This article presents an overview of ChatGPT, and it highlights the current state of this technology for the medical field. The article seeks to provide a balanced perspective on what the model can and cannot do in three specific domains: clinical practice, research, and medical education. It also provides suggestions on how to optimize the use of this tool.
34
Malik S, Zaheer S. ChatGPT as an aid for pathological diagnosis of cancer. Pathol Res Pract 2024; 253:154989. [PMID: 38056135 DOI: 10.1016/j.prp.2023.154989]
Abstract
Diagnostic workup of cancer patients is highly reliant on the science of pathology, using cytopathology, histopathology, and ancillary techniques such as immunohistochemistry and molecular cytogenetics. Data processing and learning by means of artificial intelligence (AI) has become a spearhead for the advancement of medicine, and pathology and laboratory medicine are no exceptions. ChatGPT, an artificial intelligence (AI)-based chatbot recently launched by OpenAI, is currently the talk of the town, and its role in cancer diagnosis is being explored meticulously. Integrating digital slides into the pathology workflow, implementing advanced algorithms, and using computer-aided diagnostic techniques extend the frontiers of the pathologist's view beyond the microscopic slide and enable effective integration, assimilation, and utilization of knowledge beyond human limits and boundaries. Despite its numerous advantages in the pathological diagnosis of cancer, this approach comes with several challenges, including the integration of digital slides with input language parameters, problems of bias, and legal issues, which must be addressed soon so that pathologists diagnosing malignancies can keep pace with this technology rather than be left behind.
Affiliation(s)
- Shaivy Malik, Sufian Zaheer
- Department of Pathology, Vardhman Mahavir Medical College and Safdarjung Hospital, New Delhi, India
35
Adhikari K, Naik N, Hameed BZ, Raghunath SK, Somani BK. Exploring the Ethical, Legal, and Social Implications of ChatGPT in Urology. Curr Urol Rep 2024; 25:1-8. [PMID: 37735339 DOI: 10.1007/s11934-023-01185-2]
Abstract
PURPOSE OF THE REVIEW ChatGPT is programmed to generate responses based on pattern recognition. With its vast popularity and exponential growth, questions arise concerning moral issues, security, and legitimacy. In this review article, we aim to analyze the ethical and legal implications of using ChatGPT in urology and explore potential solutions to these concerns. RECENT FINDINGS There are many potential applications of ChatGPT in urology, and the extent to which it might improve healthcare may cause a profound shift in the way we deliver services to patients and in the overall healthcare system. These applications encompass diagnosis and treatment planning, clinical workflow, patient education, augmented consultations, and urological research. The ethical and legal considerations include patient autonomy and informed consent, privacy and confidentiality, bias and fairness, human oversight and accountability, trust and transparency, liability and malpractice, intellectual property rights, and the regulatory framework. The application of ChatGPT in urology has shown great potential to improve patient care and assist urologists in various aspects of clinical practice, research, and education. Complying with data security and privacy regulations and ensuring human oversight and accountability are potential solutions to these legal and ethical concerns. Overall, the benefits and risks of using ChatGPT in urology must be weighed carefully, and a cautious approach must be taken to ensure that its use aligns with human values and advances patient care ethically and responsibly.
Affiliation(s)
- Kinju Adhikari
- Department of Urology, HCG Cancer Centre, Bengaluru, India
- Nithesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Bm Zeeshan Hameed
- Department of Urology, Father Muller Medical College, Mangalore, Karnataka, India
- S K Raghunath
- Department of Urology, HCG Cancer Centre, Bengaluru, India
- Bhaskar K Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, SO16 6YD, UK
36
Huang X, Estau D, Liu X, Yu Y, Qin J, Li Z. Evaluating the performance of ChatGPT in clinical pharmacy: A comparative study of ChatGPT and clinical pharmacists. Br J Clin Pharmacol 2024; 90:232-238. [PMID: 37626010 DOI: 10.1111/bcp.15896]
Abstract
AIMS To evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT) in key domains of clinical pharmacy practice, including prescription review, patient medication education, adverse drug reaction (ADR) recognition, ADR causality assessment, and drug counselling. METHODS Questions and clinical pharmacists' answers were collected from real clinical cases and clinical pharmacist competency assessments. ChatGPT's responses were generated by inputting the same questions into the 'New Chat' box of the ChatGPT Mar 23 Version. Five licensed clinical pharmacists independently rated these answers on a scale of 0 (completely incorrect) to 10 (completely correct). The mean scores of ChatGPT and the clinical pharmacists were compared using a paired 2-tailed Student's t-test, and the text content of the answers was descriptively summarized. RESULTS The quantitative results indicated that ChatGPT was excellent in drug counselling (ChatGPT: 8.77 vs. clinical pharmacist: 9.50, P = .0791) and weak in prescription review (5.23 vs. 9.90, P = .0089), patient medication education (6.20 vs. 9.07, P = .0032), ADR recognition (5.07 vs. 9.70, P = .0483), and ADR causality assessment (4.03 vs. 9.73, P = .023). The capabilities and limitations of ChatGPT in clinical pharmacy practice were summarized based on the completeness and accuracy of the answers: ChatGPT showed robust retrieval, information integration, and dialogue capabilities, but lacked medicine-specific datasets and the ability to handle advanced reasoning and complex instructions. CONCLUSIONS While ChatGPT holds promise in clinical pharmacy practice as a supplementary tool, its ability to handle complex problems needs further improvement and refinement.
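The paired 2-tailed t-test used above can be sketched as follows; the per-question ratings are invented for illustration.

```python
# Minimal sketch of the paired 2-tailed t-test comparing mean expert ratings
# of ChatGPT's answers with those of clinical pharmacists on the same
# questions. The rating values below are hypothetical.
from scipy.stats import ttest_rel

chatgpt_scores = [5.0, 5.5, 4.8, 5.6, 5.2]      # hypothetical ratings per question
pharmacist_scores = [9.8, 9.9, 10.0, 9.9, 9.9]  # hypothetical ratings per question

t_stat, p_value = ttest_rel(chatgpt_scores, pharmacist_scores)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```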
Affiliation(s)
- Xiaoru Huang
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Dannya Estau
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Xuening Liu
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Yang Yu
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Jiguang Qin
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Zijian Li
- Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Department of Pharmaceutical Management and Clinical Pharmacy, College of Pharmacy, Peking University, Beijing, China
- Department of Cardiology and Institute of Vascular Medicine, Peking University Third Hospital, Beijing Key Laboratory of Cardiovascular Receptors Research, Key Laboratory of Cardiovascular Molecular Biology and Regulatory Peptides, Ministry of Health, State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University, Beijing, China
37
Liao W, Liu Z, Dai H, Xu S, Wu Z, Zhang Y, Huang X, Zhu D, Cai H, Li Q, Liu T, Li X. Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study. JMIR Medical Education 2023; 9:e48904. [PMID: 38153785 PMCID: PMC10784984 DOI: 10.2196/48904]
Abstract
BACKGROUND Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. OBJECTIVE This study is among the first on responsible artificial intelligence-generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT, and on designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. METHODS We first constructed a suite of data sets containing medical texts written by human experts and texts generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts of speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. RESULTS Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminologies rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers (BERT)-based model effectively detected medical texts generated by ChatGPT, with an F1 score exceeding 95%. CONCLUSIONS Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts differed from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine.
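As a simplified stand-in for the detection workflow described above: the paper used a BERT-based classifier, but the same train-and-predict loop can be illustrated with a TF-IDF plus logistic regression baseline; the texts and labels below are invented.

```python
# Simplified stand-in for the ChatGPT-text detection task. The paper used a
# BERT-based model; this sketch swaps in a TF-IDF + logistic regression
# baseline to illustrate the workflow. Texts and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Patient presented with acute chest pain radiating to the left arm.",
    "The patient may experience discomfort; it is important to consult a doctor.",
    "CT showed a 3 cm lesion in the right upper lobe; biopsy recommended.",
    "Maintaining a healthy lifestyle is generally beneficial for recovery.",
]
labels = [0, 1, 0, 1]  # 0 = human-written, 1 = ChatGPT-generated (hypothetical)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Follow-up imaging in six months is advised for the nodule."]))
```

A real evaluation would train on a large labeled corpus and report an F1 score on a held-out split, as the study did for its BERT-based model.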
Affiliation(s)
- Wenxiong Liao
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Zhengliang Liu
- School of Computing, University of Georgia, Athens, GA, United States
- Haixing Dai
- School of Computing, University of Georgia, Athens, GA, United States
- Shaochen Xu
- School of Computing, University of Georgia, Athens, GA, United States
- Zihao Wu
- School of Computing, University of Georgia, Athens, GA, United States
- Yiyang Zhang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Xiaoke Huang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Dajiang Zhu
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, United States
- Hongmin Cai
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Quanzheng Li
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Tianming Liu
- School of Computing, University of Georgia, Athens, GA, United States
- Xiang Li
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
38
Ferreira RM. New evidence-based practice: Artificial intelligence as a barrier breaker. World J Methodol 2023; 13:384-389. [PMID: 38229944 PMCID: PMC10789101 DOI: 10.5662/wjm.v13.i5.384]
Abstract
The concept of evidence-based practice has persisted over several years and remains a cornerstone of clinical practice, representing the gold standard for optimal patient care. However, despite widespread recognition of its significance, its practical application faces various challenges and barriers, including a lack of skills in interpreting studies, limited resources, time constraints, linguistic competencies, and more. Recently, we have witnessed the emergence of a groundbreaking technological revolution known as artificial intelligence. Although artificial intelligence has become increasingly integrated into our daily lives, some reluctance persists among certain segments of the public. This article explores the potential of artificial intelligence as a solution to some of the main barriers encountered in the application of evidence-based practice. It highlights how artificial intelligence can assist in staying updated with the latest evidence, enhancing clinical decision-making, addressing patient misinformation, and mitigating time constraints in clinical practice. The integration of artificial intelligence into evidence-based practice has the potential to revolutionize healthcare, leading to more precise diagnoses, personalized treatment plans, and improved doctor-patient interactions. This proposed synergy between evidence-based practice and artificial intelligence may necessitate adjustments to the core concept of evidence-based practice, heralding a new era in healthcare.
Affiliation(s)
- Ricardo Maia Ferreira, Department of Sports and Exercise, Polytechnic Institute of Maia (N2i), Maia 4475-690, Porto, Portugal; Department of Physiotherapy, Polytechnic Institute of Coimbra, Coimbra Health School, Coimbra 3046-854, Coimbra, Portugal; Department of Physiotherapy, Polytechnic Institute of Castelo Branco, Dr. Lopes Dias Health School, Castelo Branco 6000-767, Castelo Branco, Portugal; Sport Physical Activity and Health Research & Innovation Center, Polytechnic Institute of Viana do Castelo, Melgaço 4960-320, Viana do Castelo, Portugal
39
Ebrahimian M, Behnam B, Ghayebi N, Sobhrakhshankhah E. ChatGPT in Iranian medical licensing examination: evaluating the diagnostic accuracy and decision-making capabilities of an AI-based model. BMJ Health Care Inform 2023; 30:e100815. [PMID: 38081765] [PMCID: PMC10729145] [DOI: 10.1136/bmjhci-2023-100815]
Abstract
INTRODUCTION Large language models such as ChatGPT have gained popularity for their ability to generate comprehensive responses to human queries. In the field of medicine, ChatGPT has shown promise in applications ranging from diagnostics to decision-making. However, its performance in medical examinations and its comparison to random guessing have not been extensively studied. METHODS This study aimed to evaluate the performance of ChatGPT in the preinternship examination, a comprehensive medical assessment for students in Iran. The examination consisted of 200 multiple-choice questions categorised into basic science evaluation, diagnosis and decision-making. GPT-4 was used, and the questions were translated into English. A statistical analysis was conducted to assess the performance of ChatGPT and to compare it with a random test group. RESULTS ChatGPT performed exceptionally well, answering 68.5% of the questions correctly and significantly surpassing the pass mark of 45%. It exhibited superior performance in decision-making and successfully passed all specialties. ChatGPT's performance was also significantly higher than that of the random test group, demonstrating its ability to provide more accurate responses and reasoning. CONCLUSION This study highlights the potential of ChatGPT in medical licensing examinations and its advantage over random guessing. However, it is important to note that ChatGPT still falls short of human physicians in terms of diagnostic accuracy and decision-making capabilities. Caution should be exercised when using ChatGPT, and its results should be verified by human experts to ensure patient safety and avoid potential errors in the medical field.
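To make the "better than random guessing" comparison concrete, the sketch below runs a one-sided binomial test of 137/200 correct (68.5%) against chance. The 0.25 chance level assumes four answer options per question, which the abstract does not state, so treat this as an illustrative assumption rather than the paper's actual analysis.

```python
# Illustrative binomial test of ChatGPT's score against random guessing.
# Chance level 0.25 (four options per question) is an assumption.
from scipy.stats import binomtest

n_questions = 200
n_correct = round(0.685 * n_questions)   # 137, per the reported 68.5%
result = binomtest(n_correct, n_questions, p=0.25, alternative="greater")
print(f"correct: {n_correct}/{n_questions}, p-value vs. chance: {result.pvalue:.3g}")
```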
Affiliation(s)
- Manoochehr Ebrahimian, Pediatric Surgery Research Center, Research Institute for Children's Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Behdad Behnam, Gastrointestinal and Liver Disease Research Center, Iran University of Medical Sciences, Tehran, Iran
- Negin Ghayebi, School of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Elham Sobhrakhshankhah, Gastrointestinal and Liver Disease Research Center, Iran University of Medical Sciences, Tehran, Iran
40
Al-Dujaili Z, Omari S, Pillai J, Al Faraj A. Assessing the accuracy and consistency of ChatGPT in clinical pharmacy management: A preliminary analysis with clinical pharmacy experts worldwide. Res Social Adm Pharm 2023; 19:1590-1594. [PMID: 37696742] [DOI: 10.1016/j.sapharm.2023.08.012]
Abstract
BACKGROUND The ChatGPT conversation system has ushered in a revolutionary new era of information retrieval and stands as one of the fastest-growing platforms. Clinical pharmacy, as a dynamic discipline, necessitates an advanced comprehension of drugs and diseases. The process of decision-making in clinical pharmacy demands accuracy and consistency in medical information, as it directly affects patient safety. OBJECTIVE The objective was to evaluate ChatGPT's accuracy and consistency in managing pharmacotherapy cases across multiple time points. Additionally, input was gathered from global clinical pharmacy experts, and the agreement between ChatGPT's responses and those of clinical pharmacy experts worldwide was assessed. METHODS A set of 20 pharmacotherapy cases was entered into ChatGPT at three different time points. Inter-rater reliability analysis was performed to measure the accuracy of the output generated by ChatGPT at each time point, and test-retest reliability was performed to measure the consistency of the output across the three time points. Pharmacy expert performance was evaluated, and the overall results were compared. RESULTS ChatGPT achieved a hit rate of 70.83% at week 1, 79.2% at week 3, and 75% at week 5. The percent agreement between weeks 1 and 3 was 79.2%, whereas it was 87.5% between weeks 3 and 5, and 83.3% between weeks 1 and 5. In contrast, accuracy rates among clinical pharmacy experts showed considerable variation according to their geographic location. The highest agreement between clinical pharmacist responses and ChatGPT responses was observed at the last time point examined. CONCLUSIONS Overall, the analysis suggested that ChatGPT is capable of generating clinically relevant pharmaceutical information, albeit with some variation in accuracy and consistency. It should be noted that clinical pharmacy experts worldwide may provide varying degrees of accuracy depending on their expertise. This study highlights the potential of AI chatbots in clinical pharmacy.
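The consistency analysis described here can be reproduced in outline with standard agreement metrics. The sketch below computes percent agreement and Cohen's kappa between two time points; the 0/1 vectors are hypothetical stand-ins for the 20 graded cases, not the study's data.

```python
# Sketch of a test-retest consistency check: percent agreement and Cohen's
# kappa between ChatGPT's graded answers at two time points.
# The 0/1 vectors are hypothetical (1 = answer judged correct).
from sklearn.metrics import cohen_kappa_score

week1 = [1,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,0,1,1,1]
week3 = [1,1,0,1,1,1,1,1,1,0,1,1,0,1,1,1,1,1,1,1]

agreement = sum(a == b for a, b in zip(week1, week3)) / len(week1)
kappa = cohen_kappa_score(week1, week3)
print(f"percent agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")
```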
Affiliation(s)
- Zahraa Al-Dujaili, College of Pharmacy, American University of Iraq - Baghdad (AUIB), Baghdad, Iraq
- Sarah Omari, Department of Epidemiology and Population Health, American University of Beirut (AUB), Beirut, Lebanon
- Jey Pillai, College of Pharmacy, American University of Iraq - Baghdad (AUIB), Baghdad, Iraq
- Achraf Al Faraj, College of Pharmacy, American University of Iraq - Baghdad (AUIB), Baghdad, Iraq
41
Franco D'Souza R, Amanullah S, Mathew M, Surapaneni KM. Appraising the performance of ChatGPT in psychiatry using 100 clinical case vignettes. Asian J Psychiatr 2023; 89:103770. [PMID: 37812998] [DOI: 10.1016/j.ajp.2023.103770]
Abstract
BACKGROUND ChatGPT has emerged as the most advanced and rapidly developing large language chatbot system. With its immense potential, ranging from answering a simple query to cracking highly competitive medical exams, ChatGPT continues to impress scientists and researchers worldwide, giving room for more discussions regarding its utility in various fields. One such field of attention is Psychiatry. With suboptimal diagnosis and treatment, assuring mental health and well-being is a challenge in many countries, particularly developing nations. In this regard, we conducted an evaluation to assess the performance of ChatGPT 3.5 in Psychiatry using clinical cases, to provide evidence-based information regarding its implications for enhancing mental health and well-being. METHODS ChatGPT 3.5 was used in this experimental study to initiate conversations and collect responses to clinical vignettes in Psychiatry. Using 100 clinical case vignettes, the replies were assessed by expert faculty from the Department of Psychiatry. There were 100 different psychiatric illnesses represented in the cases. We recorded and assessed the initial ChatGPT 3.5 responses. The evaluation was based on the objectives of the questions posed at the conclusion of each case, which were divided into 10 categories. Grading was completed by taking the mean value of the scores provided by the evaluators, and the grades were represented in graphs and tables. RESULTS The evaluation suggests that ChatGPT 3.5 fared extremely well in Psychiatry, receiving "Grade A" ratings in 61 out of 100 cases, "Grade B" ratings in 31, and "Grade C" ratings in 8. The majority of the queries concerned management strategies, followed by diagnosis, differential diagnosis, assessment, investigation, counselling, clinical reasoning, ethical reasoning, prognosis, and request acceptance. ChatGPT 3.5 performed extremely well, especially in generating management strategies and diagnoses for different psychiatric conditions. No responses were graded "D", indicating that there were no errors in diagnosis or in the responses for clinical care; only a few discrepancies and missing details appeared in the handful of responses that received a "Grade C". CONCLUSION It is evident from our study that ChatGPT 3.5 has appreciable knowledge and interpretation skills in Psychiatry. Thus, ChatGPT 3.5 undoubtedly has the potential to transform the field of Medicine, and we emphasize its utility in Psychiatry through the findings of our study. However, for any AI model to be successful, reliability, validation of information, proper guidelines and an implementation framework are necessary.
Affiliation(s)
- Russell Franco D'Souza, Professor of Organizational Psychological Medicine, International Institute of Organisational Psychological Medicine, 71 Cleeland Street, Dandenong, Melbourne, Victoria 3175, Australia
- Shabbir Amanullah, Division of Geriatric Psychiatry, Queen's University, 752 King Street West, Postal Bag 603, Kingston, ON K7L 7X3, Canada
- Mary Mathew, Department of Pathology, Kasturba Medical College, Manipal Academy of Higher Education, Tiger Circle Road, Madhav Nagar, Manipal, Karnataka 576104, India
- Krishna Mohan Surapaneni, Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600123, Tamil Nadu, India; Departments of Medical Education, Molecular Virology, Research, Clinical Skills & Simulation, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600123, Tamil Nadu, India
42
Aliyeva A. "Bot or Not": Turing Problem in Otolaryngology. Cureus 2023; 15:e48170. [PMID: 38046723] [PMCID: PMC10693309] [DOI: 10.7759/cureus.48170]
Abstract
The aim of this article is to shed light on the evolving landscape of artificial intelligence (AI) integration in otolaryngology and its implications, with a particular focus on the ethical considerations surrounding AI applications. It also highlights the potential benefits of ChatGPT in patient management and scientific research within otolaryngology while emphasizing the necessity for ethical guidelines and validation processes. Ultimately, the article seeks to encourage a responsible and informed approach to AI adoption in otolaryngology, promoting collaboration between AI and healthcare professionals for the betterment of science and human well-being.
Affiliation(s)
- Aynur Aliyeva, Otolaryngology - Head and Neck Surgery, Cincinnati Children's Hospital Medical Center, Ohio, USA
43
Santana LADM, Gonçalo RIC, Barbosa BF, Takeshita WM, Trento CL. Authors' reply: Combining ChatGPT and machine learning: A viable alternative in oral medicine. Oral Dis 2023. [PMID: 37848339] [DOI: 10.1111/odi.14741]
Affiliation(s)
- Lucas Alves da Mota Santana, Department of Dentistry, Federal University of Sergipe (UFS), Aracaju, SE, Brazil; Department of Dentistry, Tiradentes University (UNIT), Aracaju, SE, Brazil
- Rani Iani Costa Gonçalo, Department of Dentistry, Federal University of Rio Grande do Norte (UFRN), Natal, RN, Brazil
- Wilton Mitsunari Takeshita, Department of Diagnosis and Surgery, School of Dentistry, São Paulo State University (UNESP), Araçatuba, SP, Brazil
44
Reddy A, Patel S, Barik AK, Gowda P. Role of chat-generative pre-trained transformer (ChatGPT) in anaesthesia: Merits and pitfalls. Indian J Anaesth 2023; 67:942-944. [PMID: 38044929] [PMCID: PMC10691596] [DOI: 10.4103/ija.ija_504_23]
Affiliation(s)
- Ashwini Reddy, Department of Anaesthesia and Intensive Care, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Swati Patel, Department of Anaesthesia and Intensive Care, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Amiya Kumar Barik, Department of Anaesthesia and Intensive Care, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Punith Gowda, Department of Anaesthesia and Intensive Care, Postgraduate Institute of Medical Education and Research, Chandigarh, India
45
Cocci A, Pezzoli M, Minervini A. Light and Shadow of ChatGPT: A Real Tool for Advancing Scientific Research and Medical Practice? World J Mens Health 2023; 41:751-752. [PMID: 37652659] [PMCID: PMC10523110] [DOI: 10.5534/wjmh.230102]
Affiliation(s)
- Andrea Cocci, Department of Urology, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy
- Marta Pezzoli, Department of Urology, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy
- Andrea Minervini, Department of Urology, Azienda Ospedaliera Universitaria Careggi, University of Florence, Florence, Italy
46
Gobira M, Nakayama LF, Moreira R, Andrade E, Regatieri CVS, Belfort R. Performance of ChatGPT-4 in answering questions from the Brazilian National Examination for Medical Degree Revalidation. REVISTA DA ASSOCIACAO MEDICA BRASILEIRA (1992) 2023; 69:e20230848. [PMID: 37792871] [PMCID: PMC10547492] [DOI: 10.1590/1806-9282.20230848]
Abstract
OBJECTIVE The aim of this study was to evaluate the performance of ChatGPT-4.0 in answering the 2022 Brazilian National Examination for Medical Degree Revalidation (Revalida) and its usefulness as a tool to provide feedback on the quality of the examination. METHODS Two independent physicians entered all examination questions into ChatGPT-4.0. After comparing the outputs with the test solutions, they classified the large language model's answers as adequate, inadequate, or indeterminate. In cases of disagreement, they adjudicated and reached a consensus decision on ChatGPT's accuracy. Performance across medical themes and nullified questions was compared using chi-square analysis. RESULTS In the Revalida examination, ChatGPT-4.0 answered 71 (87.7%) questions correctly and 10 (12.3%) incorrectly. There was no statistically significant difference in the proportions of correct answers among different medical themes (p=0.4886). The model had a lower accuracy of 71.4% on nullified questions, with no statistically significant difference (p=0.241) between non-nullified and nullified groups. CONCLUSION ChatGPT-4.0 showed satisfactory performance on the 2022 Brazilian National Examination for Medical Degree Revalidation. The large language model exhibited worse performance on subjective questions and public healthcare themes. The results suggest that the overall quality of the Revalida examination questions is satisfactory and corroborate the decision to nullify the annulled questions.
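The nullified versus non-nullified comparison reduces to a chi-square test on a 2x2 contingency table. In the sketch below, the split of 7 nullified questions with 5 correct is an assumption chosen to match the reported 71.4% nullified-question accuracy and the overall totals (71 correct of 81); the paper's exact counts and test options are unknown, so the resulting p-value is not expected to reproduce p=0.241.

```python
# Illustrative chi-square test comparing accuracy on nullified vs.
# non-nullified questions. Counts are assumptions consistent with the
# abstract's totals (71 correct, 10 incorrect, 81 questions).
from scipy.stats import chi2_contingency

#        correct  incorrect
table = [[66,      8],    # non-nullified (assumed: 74 questions)
         [ 5,      2]]    # nullified     (assumed:  7 questions, 5/7 = 71.4%)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```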
Affiliation(s)
- Mauro Gobira, Instituto Paulista de Estudos e Pesquisas em Oftalmologia, Vision Institute, São Paulo (SP), Brazil
- Luis Filipe Nakayama, Instituto Paulista de Estudos e Pesquisas em Oftalmologia, Vision Institute, São Paulo (SP), Brazil; Massachusetts Institute of Technology, Institute for Medical Engineering and Science, Cambridge (MA), USA
- Rodrigo Moreira, Instituto Paulista de Estudos e Pesquisas em Oftalmologia, Vision Institute, São Paulo (SP), Brazil
- Eric Andrade, Universidade Federal de São Paulo, Department of Ophthalmology, São Paulo (SP), Brazil
- Rubens Belfort, Universidade Federal de São Paulo, Department of Ophthalmology, São Paulo (SP), Brazil
47
Lai UH, Wu KS, Hsu TY, Kan JKC. Evaluating the performance of ChatGPT-4 on the United Kingdom Medical Licensing Assessment. Front Med (Lausanne) 2023; 10:1240915. [PMID: 37795422] [PMCID: PMC10547055] [DOI: 10.3389/fmed.2023.1240915]
Abstract
Introduction Recent developments in artificial intelligence large language models (LLMs), such as ChatGPT, have allowed for the understanding and generation of human-like text. Studies have found that LLMs perform well in various examinations, including law, business, and medicine. This study aims to evaluate the performance of ChatGPT on the United Kingdom Medical Licensing Assessment (UKMLA). Methods Two publicly available UKMLA papers consisting of 200 single-best-answer (SBA) questions were screened. Nine SBAs were omitted as they contained images that were not suitable for input. Each question was assigned a specialty based on the UKMLA content map published by the General Medical Council. A total of 191 SBAs were input into ChatGPT-4 across three attempts over the course of 3 weeks (once per week). Results ChatGPT scored 74.9% (143/191), 78.0% (149/191) and 75.6% (145/191) on the three attempts, respectively. The average across all three attempts was 76.3% (437/573), with a 95% confidence interval of (74.46%, 78.08%). ChatGPT answered 129 SBAs correctly and 32 SBAs incorrectly on all three attempts. Across the three attempts, ChatGPT performed well in mental health (8/9 SBAs), cancer (11/14 SBAs) and cardiovascular (10/13 SBAs), and did not perform well in clinical haematology (3/7 SBAs), endocrine and metabolic (2/5 SBAs) and gastrointestinal including liver (3/10 SBAs). Regarding response consistency, ChatGPT provided consistently correct answers in 67.5% (129/191) of SBAs, consistently incorrect answers in 12.6% (24/191), and inconsistent responses in 19.9% (38/191). Discussion and conclusion This study suggests ChatGPT performs well on the UKMLA. Performance may correlate with specialty. The ability of LLMs to answer SBAs correctly suggests that they could be utilised as a supplementary learning tool in medical education with appropriate medical educator supervision.
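The response-consistency breakdown reported above (consistently correct, consistently incorrect, inconsistent) is a simple per-question tally across attempts. The sketch below shows the computation on hypothetical 0/1 results; the real study would use one column for each of the 191 SBAs.

```python
# Classify each SBA as consistently correct, consistently incorrect, or
# inconsistent across three attempts. Toy data: one row per attempt,
# one column per SBA (1 = correct).
attempts = [
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
]

consistent_correct = consistent_incorrect = inconsistent = 0
for answers in zip(*attempts):           # iterate per question
    if all(answers):
        consistent_correct += 1
    elif not any(answers):
        consistent_incorrect += 1
    else:
        inconsistent += 1

total = len(attempts[0])
print(f"consistently correct:   {consistent_correct}/{total}")
print(f"consistently incorrect: {consistent_incorrect}/{total}")
print(f"inconsistent:           {inconsistent}/{total}")
```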
Affiliation(s)
- U Hin Lai, Sandwell and West Birmingham NHS Trust, West Bromwich, United Kingdom; Aston Medical School, Birmingham, United Kingdom
- Keng Sam Wu, Sandwell and West Birmingham NHS Trust, West Bromwich, United Kingdom; University Hospitals Birmingham NHS Trust, Birmingham, United Kingdom
- Ting-Yu Hsu, Aston Medical School, Birmingham, United Kingdom; University Hospitals Birmingham NHS Trust, Birmingham, United Kingdom
- Jessie Kai Ching Kan, Aston Medical School, Birmingham, United Kingdom; Worcestershire Acute Hospitals NHS Trust, Worcester, United Kingdom
48
Garg RK, Urs VL, Agarwal AA, Chaudhary SK, Paliwal V, Kar SK. Exploring the role of ChatGPT in patient care (diagnosis and treatment) and medical research: A systematic review. Health Promot Perspect 2023; 13:183-191. [PMID: 37808939] [PMCID: PMC10558973] [DOI: 10.34172/hpp.2023.22]
Abstract
Background ChatGPT is an artificial intelligence-based tool developed by OpenAI (California, USA). This systematic review examines the potential of ChatGPT in patient care and its role in medical research. Methods The systematic review was done according to the PRISMA guidelines. The Embase, Scopus, PubMed and Google Scholar databases were searched, along with preprint databases. Our search aimed to identify all kinds of publications, without any restrictions, on ChatGPT and its application in medical research, medical publishing and patient care. We used the search term "ChatGPT" and reviewed all kinds of publications, including original articles, reviews, editorials/commentaries, and even letters to the editor. Each selected record was analysed using ChatGPT, and the generated responses were compiled in a table. The Word table was converted into a PDF and further analysed using ChatPDF. Results We reviewed the full texts of 118 articles. ChatGPT can assist with patient enquiries, note writing, decision-making, trial enrolment, data management, decision support, research support, and patient education. But the solutions it offers are usually insufficient and contradictory, raising questions about their originality, privacy, correctness, bias, and legality. Due to its lack of human-like qualities, ChatGPT's legitimacy as an author is questioned when used for academic writing. ChatGPT-generated content raises concerns about bias and possible plagiarism. Conclusion Although it can help with patient treatment and research, there are issues with accuracy, authorship, and bias. ChatGPT can serve as a "clinical assistant" and be a help in research and scholarly writing.
Affiliation(s)
- Vijeth L Urs, Department of Neurology, King George’s Medical University, Lucknow, India
- Vimal Paliwal, Department of Neurology, Sanjay Gandhi Institute of Medical Sciences, Lucknow, India
- Sujita Kumar Kar, Department of Psychiatry, King George’s Medical University, Lucknow, India
49
Watters C, Lemanski MK. Universal skepticism of ChatGPT: a review of early literature on chat generative pre-trained transformer. Front Big Data 2023; 6:1224976. [PMID: 37680954] [PMCID: PMC10482048] [DOI: 10.3389/fdata.2023.1224976]
Abstract
ChatGPT, a new language model developed by OpenAI, has garnered significant attention in various fields since its release. This literature review provides an overview of early ChatGPT literature across multiple disciplines, exploring its applications, limitations, and ethical considerations. The review encompasses Scopus-indexed publications from November 2022 to April 2023 and includes 156 articles related to ChatGPT. The findings reveal a predominance of negative sentiment across disciplines, though subject-specific attitudes must be considered. The review highlights the implications of ChatGPT in many fields including healthcare, raising concerns about employment opportunities and ethical considerations. While ChatGPT holds promise for improved communication, further research is needed to address its capabilities and limitations. This literature review provides insights into early research on ChatGPT, informing future investigations and practical applications of chatbot technology, as well as development and usage of generative AI.
Affiliation(s)
- Casey Watters, Faculty of Law, Bond University, Gold Coast, QLD, Australia
50
Alanzi TM. Impact of ChatGPT on Teleconsultants in Healthcare: Perceptions of Healthcare Experts in Saudi Arabia. J Multidiscip Healthc 2023; 16:2309-2321. [PMID: 37601325] [PMCID: PMC10438433] [DOI: 10.2147/jmdh.s419847]
Abstract
Purpose This study aims to investigate the impact of ChatGPT on teleconsultants in managing their operations and services. Methods A qualitative approach with focus groups was adopted. A total of 54 participants with varying degrees of experience using AI tools such as ChatGPT in healthcare took part, including 11 physicians, 24 nurses, eight dieticians, six pharmacists, and five physiotherapists providing teleconsultations. Results Twelve themes reflecting positive impact were identified from the data analysis of seven focus groups: informational support, diagnostic assistance, communication, enhancing efficiency, cost and time saving, personalizing care, multilingual support, assisting in medical research, decision-making, documentation, continuing education, and enhanced team collaboration. In addition, six themes reflecting negative impact were identified: misdiagnosis and errors, issues in personalized care, ethical and legal issues, limited medical context/knowledge, communication challenges, and increased dependency. Conclusion Although ChatGPT has several advantages for teleconsultants in the healthcare sector, it is associated with ethical issues.
Affiliation(s)
- Turki M Alanzi, Health Information Management and Technology Department, College of Public Health, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia