1. Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications. Int J Med Inform 2024;188:105474. [PMID: 38733640] [DOI: 10.1016/j.ijmedinf.2024.105474]
Abstract
BACKGROUND Generative artificial intelligence (GAI) is revolutionizing healthcare with solutions for complex challenges, enhancing diagnosis, treatment, and care through new data and insights. However, its integration raises questions about applications, benefits, and challenges. Our study explores these aspects, offering an overview of GAI's applications and future prospects in healthcare. METHODS This scoping review searched Web of Science, PubMed, and Scopus. The selection of studies involved screening titles, reviewing abstracts, and examining full texts, adhering to the PRISMA-ScR guidelines throughout the process. RESULTS From 1406 articles across three databases, 109 met inclusion criteria after screening and deduplication. Nine GAI models were utilized in healthcare, with ChatGPT (n = 102, 74%), Google Bard (Gemini) (n = 16, 11%), and Microsoft Bing AI (n = 10, 7%) being the most frequently employed. A total of 24 different applications of GAI in healthcare were identified, with the most common being "offering insights and information on health conditions through answering questions" (n = 41) and "diagnosis and prediction of diseases" (n = 17). In total, 606 benefits and challenges were identified, which were condensed to 48 benefits and 61 challenges after consolidation. The predominant benefits included "providing rapid access to information and valuable insights" and "improving prediction and diagnosis accuracy", while the primary challenges comprised "generating inaccurate or fictional content", "unknown source of information and fake references for texts", and "lower accuracy in answering questions". CONCLUSION This scoping review identified the applications, benefits, and challenges of GAI in healthcare. This synthesis offers a crucial overview of GAI's potential to revolutionize healthcare, emphasizing the imperative to address its limitations.
Affiliation(s)
- Khadijeh Moulaei
- Department of Health Information Technology, School of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
- Atiye Yadegari
- Department of Pediatric Dentistry, School of Dentistry, Hamadan University of Medical Sciences, Hamadan, Iran
- Mahdi Baharestani
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Shayan Farzanbakhsh
- Network of Interdisciplinarity in Neonates and Infants (NINI), Universal Scientific Education and Research Network (USERN), Tehran, Iran
- Babak Sabet
- Department of Surgery, Faculty of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Reza Afrash
- Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran.
2. Nedbal C, Naik N, Castellani D, Gauhar V, Geraghty R, Somani BK. ChatGPT in urology practice: revolutionizing efficiency and patient care with generative artificial intelligence. Curr Opin Urol 2024;34:98-104. [PMID: 37962176] [DOI: 10.1097/mou.0000000000001151]
Abstract
PURPOSE OF REVIEW ChatGPT has emerged as a potentially useful tool for healthcare. Its role in urology is in its infancy and has much potential for research, clinical practice and patient assistance. With this narrative review, we aim to draw a picture of what is known about ChatGPT's integration in urology, alongside future promises and challenges. RECENT FINDINGS The use of ChatGPT can ease administrative work, helping urologists with note-taking and clinical documentation such as discharge summaries and clinical notes. It can improve patient engagement by increasing awareness and facilitating communication, and has been investigated especially for uro-oncological diseases. Its ability to understand human emotions makes ChatGPT an empathic and thoughtful interactive tool and source of information for urological patients and their relatives. Currently, its role in clinical diagnosis and treatment decisions is uncertain, as concerns have been raised about misinterpretation, hallucination and out-of-date information. Moreover, a mandatory regulatory process for ChatGPT in urology has yet to be established. SUMMARY ChatGPT has the potential to contribute to precision medicine and tailored practice through its quick, structured responses. However, this will depend on how well information can be obtained by asking pertinent questions and seeking appropriate responses. The key lies in being able to validate the responses, regulate the information shared and avoid misuse, so as to protect data and patient privacy. Its successful integration into mainstream urology will require educational bodies to provide guidelines or best practice recommendations for its use.
Affiliation(s)
- Carlotta Nedbal
- Department of Urology, University Hospitals Southampton, NHS Trust, Southampton, UK
- Urology Unit, Azienda Ospedaliero-Universitaria delle Marche, Polytechnic University of Marche, Ancona, Italy
- Nitesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Daniele Castellani
- Urology Unit, Azienda Ospedaliero-Universitaria delle Marche, Polytechnic University of Marche, Ancona, Italy
- Vineet Gauhar
- Department of Urology, Ng Teng Fong General Hospital, NUHS, Singapore
- Robert Geraghty
- Department of Urology, Freeman Hospital, Newcastle-upon-Tyne, UK
- Bhaskar Kumar Somani
- Department of Urology, University Hospitals Southampton, NHS Trust, Southampton, UK
3. Lee KH, Lee RW. ChatGPT's Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type. Diagnostics (Basel) 2024;14:171. [PMID: 38248048] [PMCID: PMC10814518] [DOI: 10.3390/diagnostics14020171]
Abstract
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on evaluating ChatGPT's performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and ChatGPT's answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed, and the answers were similarly categorized. The study utilized Cohen's kappa coefficient for assessing interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward MRI questions, with over 85% classified as correct. However, its performance varied significantly across multiple-choice questions, with accuracy rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring deeper understanding and context. In conclusion, this study critically evaluates the accuracy of ChatGPT in addressing questions related to Magnetic Resonance Imaging (MRI), highlighting its potential and limitations in the healthcare sector, particularly in radiology. Our findings demonstrate that ChatGPT, while proficient in responding to straightforward MRI-related questions, exhibits variability in its ability to accurately answer complex multiple-choice questions that require more profound, specialized knowledge of MRI. This discrepancy underscores the nuanced role AI can play in medical education and healthcare decision-making, necessitating a balanced approach to its application.
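The interobserver agreement described above is Cohen's kappa for two raters assigning each ChatGPT answer to one of three categories. A minimal Python sketch of that calculation follows; the rater labels are invented placeholders, not data from the study:

```python
# Minimal sketch: Cohen's kappa for two independent raters who each classified
# ChatGPT's answers as "correct", "partially correct", or "incorrect".
# The label lists are illustrative placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["correct", "correct", "partially correct", "incorrect", "correct"]
rater_2 = ["correct", "partially correct", "partially correct", "incorrect", "correct"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # by convention, ~0.61-0.80 is read as substantial agreement
```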
Affiliation(s)
- Ro-Woon Lee
- Department of Radiology, Inha University College of Medicine, Incheon 22212, Republic of Korea
4. Kollitsch L, Eredics K, Marszalek M, Rauchenwald M, Brookman-May SD, Burger M, Körner-Riffard K, May M. How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology. World J Urol 2024;42:20. [PMID: 38197996] [DOI: 10.1007/s00345-023-04749-6]
Abstract
PURPOSE This study is a comparative analysis of three Large Language Models (LLMs) evaluating their rate of correct answers (RoCA) and the reliability of generated answers on a set of urological knowledge-based questions spanning different levels of complexity. METHODS ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and improvement in RoCA. RESULTS Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed across all three LLMs. CONCLUSIONS The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, the deficiency in response reliability contributes to existing challenges related to their current utility for educational purposes.
Affiliation(s)
- Lisa Kollitsch
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Klaus Eredics
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Department of Urology, Paracelsus Medical University, Salzburg, Austria
- Martin Marszalek
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Michael Rauchenwald
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- European Board of Urology, Arnhem, The Netherlands
- Sabine D Brookman-May
- Department of Urology, University of Munich, LMU, Munich, Germany
- Johnson and Johnson Innovative Medicine, Research and Development, Spring House, PA, USA
- Maximilian Burger
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Katharina Körner-Riffard
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Matthias May
- Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany.
5. Lee KH, Lee RW, Kwon YE. Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT. Diagnostics (Basel) 2023;14:90. [PMID: 38201398] [PMCID: PMC10795741] [DOI: 10.3390/diagnostics14010090]
Abstract
This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of the two technologies in chest X-ray reading and explore their potential applications in the medical imaging diagnosis domain. The study methodology consisted of randomly selecting 2000 chest X-ray images from a single institution's patient database, and two radiologists evaluated the readings provided by KARA-CXR and ChatGPT. The study used five qualitative factors to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher diagnostic accuracy compared to ChatGPT. In the 'Acceptable' accuracy category, KARA-CXR was rated at 70.50% and 68.00% by two observers, while ChatGPT achieved 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with KARA at 0.74 and GPT4 at 0.73. For 'False Findings', KARA-CXR scored 68.00% and 68.50%, while ChatGPT scored 37.00% for both observers, with high interobserver agreements of 0.96 for KARA and 0.97 for GPT4. In 'Location Inaccuracy' and 'Hallucinations', KARA-CXR outperformed ChatGPT with significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, which is significantly higher than ChatGPT's 38%. The interobserver agreement was high for KARA (0.91) and moderate to high for GPT4 (0.85) in the hallucination category. In conclusion, this study demonstrates the potential of AI and large-scale language models in medical imaging and diagnostics. It also shows that in the chest X-ray domain, KARA-CXR has relatively higher accuracy than ChatGPT.
Affiliation(s)
- Ro Woon Lee
- Department of Radiology, College of Medicine, Inha University, Incheon 22212, Republic of Korea
6. Spitale G, Schneider G, Germani F, Biller-Andorno N. Exploring the role of AI in classifying, analyzing, and generating case reports on assisted suicide cases: feasibility and ethical implications. Front Artif Intell 2023;6:1328865. [PMID: 38164497] [PMCID: PMC10757918] [DOI: 10.3389/frai.2023.1328865]
Abstract
This paper presents a study on the use of AI models for the classification of case reports on assisted suicide procedures. The database of the five Dutch regional bioethics committees was scraped to collect the 72 case reports available in English. We trained several AI models for classification according to the categories defined by the Dutch Termination of Life on Request and Assisted Suicide (Review Procedures) Act. We also conducted a related project to fine-tune an OpenAI GPT-3.5-turbo large language model for generating new fictional but plausible cases. As AI is increasingly being used for judgement, it is possible to imagine an application in decision-making regarding assisted suicide. Here we explore two arising questions: feasibility and ethics, with the aim of contributing to a critical assessment of the potential role of AI in decision-making in highly sensitive areas.
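For readers unfamiliar with the fine-tuning step mentioned above, the sketch below shows how a chat-formatted training file could be submitted to OpenAI's fine-tuning endpoint with the Python SDK; the file name, example record, and surrounding workflow are illustrative assumptions, not the authors' actual pipeline:

```python
# Illustrative sketch: creating a gpt-3.5-turbo fine-tuning job with the OpenAI
# Python SDK (v1.x). "cases.jsonl" is a hypothetical file of chat-formatted
# training examples, not the dataset used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of cases.jsonl would hold one training example, e.g.:
# {"messages": [{"role": "system", "content": "You draft plausible case reports."},
#               {"role": "user", "content": "Generate a case report."},
#               {"role": "assistant", "content": "<text of an existing case report>"}]}
training_file = client.files.create(file=open("cases.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)  # poll the job until it reports "succeeded"
```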
Affiliation(s)
- Giovanni Spitale
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
- Gerold Schneider
- Department of Computational Linguistics, University of Zurich, Zürich, Switzerland
- Federico Germani
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
- Nikola Biller-Andorno
- Institute of Biomedical Ethics and History of Medicine, University of Zurich, Zürich, Switzerland
7. Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus 2023;15:e50369. [PMID: 38213361] [PMCID: PMC10782219] [DOI: 10.7759/cureus.50369]
Abstract
Purpose This study aims to evaluate the performance of three large language models (LLMs), the Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used. These questions were categorized by question type and category. McNemar's test compared the correct response rates between two LLMs, while Fisher's exact test evaluated the performance of LLMs in each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5. The differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy for dentistry questions compared to other types of questions (p<0.01). Conclusions GPT-4 achieved the highest overall score in the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
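The paired and categorical comparisons reported above can be illustrated with standard statistical tooling. The sketch below runs McNemar's test on paired per-question outcomes for two models and Fisher's exact test on a 2 x 2 table of correct/incorrect counts; all counts are invented placeholders, not the study's data:

```python
# Illustrative sketch: McNemar's test for two models answering the same MCQs,
# plus Fisher's exact test on correct/incorrect counts across two question
# categories. All counts are placeholders, not the study's data.
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import fisher_exact

#                    model B correct   model B wrong
# model A correct          90                20
# model A wrong            10                65
paired_table = [[90, 20],
                [10, 65]]
result = mcnemar(paired_table, exact=True)  # exact binomial test on the discordant pairs (20 vs. 10)
print(f"McNemar p-value: {result.pvalue:.3f}")

# Correct vs. incorrect counts for one model in two question categories
odds_ratio, p_value = fisher_exact([[29, 21],
                                    [23, 27]])
print(f"Fisher's exact p-value: {p_value:.3f}")
```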
Affiliation(s)
- Satomi Ohta
- Dentistry, Dentist of Mama and Kodomo, Kobe, JPN
8. Koga S. Exploring the pitfalls of large language models: Inconsistency and inaccuracy in answering pathology board examination-style questions. Pathol Int 2023;73:618-620. [PMID: 37818818] [DOI: 10.1111/pin.13382]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
9. Meo SA, Al-Khlaiwi T, AbuKhalaf AA, Meo AS, Klonoff DC. The Scientific Knowledge of Bard and ChatGPT in Endocrinology, Diabetes, and Diabetes Technology: Multiple-Choice Questions Examination-Based Performance. J Diabetes Sci Technol 2023:19322968231203987. [PMID: 37798960] [DOI: 10.1177/19322968231203987]
Abstract
BACKGROUND The present study aimed to investigate the knowledge level of Bard and ChatGPT in the areas of endocrinology, diabetes, and diabetes technology through a multiple-choice question (MCQ) examination format. METHODS Initially, a 100-MCQ bank was established based on MCQs in endocrinology, diabetes, and diabetes technology. The MCQs were created from physiology and medical textbooks and from academic examination pools in the areas of endocrinology, diabetes, and diabetes technology. The study team members analyzed the MCQ contents to ensure that they were related to endocrinology, diabetes, and diabetes technology. The number of MCQs from endocrinology was 50, and the number from diabetes and diabetes technology was also 50. The knowledge level of Google's Bard and ChatGPT was assessed with an MCQ-based examination. RESULTS In the endocrinology examination section, ChatGPT obtained 29 marks (correct responses) of 50 (58%), and Bard obtained the same score of 29 of 50 (58%). However, in the diabetes technology examination section, ChatGPT obtained 23 marks of 50 (46%), and Bard obtained 20 marks of 50 (40%). Overall, in the entire examination, ChatGPT obtained 52 marks of 100 (52%), and Bard obtained 49 marks of 100 (49%). ChatGPT obtained slightly more marks than Bard; however, neither ChatGPT nor Bard achieved a satisfactory score of at least 60% in endocrinology or diabetes/diabetes technology. CONCLUSIONS The overall MCQ-based performance of ChatGPT was slightly better than that of Google's Bard. However, neither ChatGPT nor Bard achieved appropriate scores in endocrinology and diabetes/diabetes technology. The study indicates that Bard and ChatGPT have the potential to assist medical students and faculty in academic medical education settings, but both artificial intelligence tools need more up-to-date information in the fields of endocrinology, diabetes, and diabetes technology.
Affiliation(s)
- Sultan Ayoub Meo
- Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Thamir Al-Khlaiwi
- Department of Physiology, College of Medicine, King Saud University, Riyadh, Saudi Arabia
- Anusha Sultan Meo
- The School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, UK
- David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, San Mateo, CA, USA