1. Cascella M, Semeraro F, Montomoli J, Bellini V, Piazza O, Bignami E. The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives. J Med Syst 2024; 48:22. PMID: 38366043; PMCID: PMC10873461; DOI: 10.1007/s10916-024-02045-3. [Review]
Abstract
Within the domain of Natural Language Processing (NLP), Large Language Models (LLMs) are sophisticated models engineered to comprehend, generate, and manipulate human-like text on an extensive scale. They are transformer-based deep learning architectures, obtained by scaling up model size, pretraining corpora, and computational resources. The potential healthcare applications of these models primarily involve chatbots and interaction systems for clinical documentation management and medical literature summarization (Biomedical NLP). The challenge in this field lies in research into applications for diagnostic and clinical decision support, as well as patient triage. LLMs can therefore be used for multiple tasks within patient care, research, and education. Throughout 2023, there was an escalation in the release of LLMs, some of which are applicable in the healthcare domain. This remarkable output is largely the effect of the customization of pre-trained models for applications like chatbots, virtual assistants, or any system requiring human-like conversational engagement. As healthcare professionals, we recognize the imperative to stay at the forefront of knowledge. However, keeping abreast of the rapid evolution of this technology is practically unattainable and, above all, understanding its potential applications and limitations remains a subject of ongoing debate. Consequently, this article aims to provide a succinct overview of recently released LLMs, emphasizing their potential use in the field of medicine. Perspectives for a more extensive range of safe and effective applications are also discussed. The upcoming evolutionary leap involves the transition from an AI-powered model primarily designed for answering medical questions to a more versatile and practical tool for healthcare providers, such as generalist biomedical AI systems for multimodal, calibrated decision-making processes.
On the other hand, the development of more accurate virtual clinical partners could enhance patient engagement by offering personalized support and improving chronic disease management.
2. Hosseini M, Horbach SPJM. Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review. Res Integr Peer Rev 2023; 8:4. PMID: 37198671; PMCID: PMC10191680; DOI: 10.1186/s41073-023-00133-5.
Abstract
BACKGROUND The emergence of systems based on large language models (LLMs) such as OpenAI's ChatGPT has created a range of discussions in scholarly circles. Since LLMs generate grammatically correct and mostly relevant (yet sometimes outright wrong, irrelevant or biased) outputs in response to provided prompts, using them in various writing tasks including writing peer review reports could result in improved productivity. Given the significance of peer reviews in the existing scholarly publication landscape, exploring challenges and opportunities of using LLMs in peer review seems urgent. After the generation of the first scholarly outputs with LLMs, we anticipate that peer review reports too would be generated with the help of these systems. However, there are currently no guidelines on how these systems should be used in review tasks. METHODS To investigate the potential impact of using LLMs on the peer review process, we used five core themes within discussions about peer review suggested by Tennant and Ross-Hellauer. These include 1) reviewers' role, 2) editors' role, 3) functions and quality of peer reviews, 4) reproducibility, and 5) the social and epistemic functions of peer reviews. We provide a small-scale exploration of ChatGPT's performance regarding identified issues. RESULTS LLMs have the potential to substantially alter the role of both peer reviewers and editors. Through supporting both actors in efficiently writing constructive reports or decision letters, LLMs can facilitate higher quality review and address issues of review shortage. However, the fundamental opacity of LLMs' training data, inner workings, data handling, and development processes raise concerns about potential biases, confidentiality and the reproducibility of review reports. 
Additionally, as editorial work has a prominent function in defining and shaping epistemic communities, as well as negotiating normative frameworks within such communities, partly outsourcing this work to LLMs might have unforeseen consequences for social and epistemic relations within academia. Regarding performance, we identified major enhancements in a short period and expect LLMs to continue developing. CONCLUSIONS We believe that LLMs are likely to have a profound impact on academia and scholarly communication. While potentially beneficial to the scholarly communication system, many uncertainties remain and their use is not without risks. In particular, concerns about the amplification of existing biases and inequalities in access to appropriate infrastructure warrant further attention. For the moment, we recommend that if LLMs are used to write scholarly reviews and decision letters, reviewers and editors should disclose their use and accept full responsibility for data security and confidentiality, and their reports' accuracy, tone, reasoning and originality.
3. Dunn C, Hunter J, Steffes W, Whitney Z, Foss M, Mammino J, Leavitt A, Hawkins SD, Dane A, Yungmann M, Nathoo R. AI-derived Dermatology Case Reports are Indistinguishable from Those Written by Humans: A Single Blinded Observer Study. J Am Acad Dermatol 2023: S0190-9622(23)00587-X. PMID: 37054810; DOI: 10.1016/j.jaad.2023.04.005.
4. Varghese J, Chapiro J. ChatGPT: The transformative influence of generative AI on science and healthcare. J Hepatol 2024; 80:977-980. PMID: 37544516; DOI: 10.1016/j.jhep.2023.07.028.
Abstract
In an age where technology is evolving at a sometimes incomprehensibly rapid pace, the liver community must adjust and learn to embrace breakthroughs with an open mind in order to benefit from potentially transformative influences on our science and practice. The Journal of Hepatology has responded to novel developments in artificial intelligence (AI) by recruiting experts in the field to serve on the Editorial Board. Publications introducing novel AI technology are no longer uncommon in our journal and are among the most highly debated and possibly practice-changing papers across a broad range of scientific disciplines, united by their focus on liver disease. As AI is rapidly evolving, this expert paper will focus on educating our readership on large language models and their possible impact on our research practice and clinical outlook, outlining both challenges and opportunities in the field. "To improve is to change; to be perfect is to change often." - Winston S. Churchill.
5. Nerella S, Bandyopadhyay S, Zhang J, Contreras M, Siegel S, Bumin A, Silva B, Sena J, Shickel B, Bihorac A, Khezeli K, Rashidi P. Transformers and large language models in healthcare: A review. Artif Intell Med 2024; 154:102900. PMID: 38878555; PMCID: PMC11638972; DOI: 10.1016/j.artmed.2024.102900. [Review]
Abstract
With Artificial Intelligence (AI) increasingly permeating various aspects of society, including healthcare, the adoption of the Transformer neural network architecture is rapidly changing many applications. The Transformer is a type of deep learning architecture initially developed to solve general-purpose Natural Language Processing (NLP) tasks, and it has subsequently been adapted in many fields, including healthcare. In this survey paper, we provide an overview of how this architecture has been adopted to analyze various forms of healthcare data, including clinical NLP, medical imaging, structured Electronic Health Records (EHR), social media, bio-physiological signals, and biomolecular sequences. We also include articles that used the transformer architecture for generating surgical instructions and predicting adverse outcomes after surgery, under the umbrella of critical care. Under diverse settings, these models have been used for clinical diagnosis, report generation, data reconstruction, and drug/protein synthesis. Finally, we discuss the benefits and limitations of using transformers in healthcare and examine issues such as computational cost, model interpretability, fairness, alignment with human values, ethical implications, and environmental impact.
6. Tie X, Shin M, Pirasteh A, Ibrahim N, Huemann Z, Castellino SM, Kelly KM, Garrett J, Hu J, Cho SY, Bradshaw TJ. Personalized Impression Generation for PET Reports Using Large Language Models. J Imaging Inform Med 2024; 37:471-488. PMID: 38308070; PMCID: PMC11031527; DOI: 10.1007/s10278-024-00985-3.
Abstract
Large language models (LLMs) have shown promise in accelerating radiology reporting by summarizing clinical findings into impressions. However, automatic impression generation for whole-body PET reports presents unique challenges and has received little attention. Our study aimed to evaluate whether LLMs can create clinically useful impressions for PET reporting. To this end, we fine-tuned twelve open-source language models on a corpus of 37,370 retrospective PET reports collected from our institution. All models were trained using the teacher-forcing algorithm, with the report findings and patient information as input and the original clinical impressions as reference. An extra input token encoded the reading physician's identity, allowing models to learn physician-specific reporting styles. To compare the performances of different models, we computed various automatic evaluation metrics and benchmarked them against physician preferences, ultimately selecting PEGASUS as the top LLM. To evaluate its clinical utility, three nuclear medicine physicians assessed the PEGASUS-generated impressions and original clinical impressions across 6 quality dimensions (3-point scales) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. When physicians assessed LLM impressions generated in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08/5. On average, physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P = 0.41). In summary, our study demonstrated that personalized impressions generated by PEGASUS were clinically useful in most cases, highlighting its potential to expedite PET reporting by automatically drafting impressions.
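The physician-identity conditioning described above can be sketched as follows. This is a minimal illustration of prepending a special reader token to the model input; the token format, field names, and example strings are illustrative assumptions, not the authors' actual preprocessing:

```python
def build_model_input(findings: str, patient_info: str, physician_id: str) -> str:
    """Prepend a physician-identity token so a sequence-to-sequence model can
    learn reader-specific impression styles (hypothetical token format)."""
    reader_token = f"<physician_{physician_id}>"  # extra input token encoding the reader
    return f"{reader_token} {patient_info} {findings}"

# Toy example: the same findings conditioned on two different readers yield
# two distinct training inputs, letting the model associate style with reader.
inp_a = build_model_input("FDG-avid node in station 4R.", "68M, lymphoma staging.", "A")
inp_b = build_model_input("FDG-avid node in station 4R.", "68M, lymphoma staging.", "B")
print(inp_a.startswith("<physician_A>"))  # True
```

At inference time, selecting the token of the requesting physician would then steer the generated impression toward that reader's dictation style.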
7. Inam M, Sheikh S, Minhas AMK, Vaughan EM, Krittanawong C, Samad Z, Lavie CJ, Khoja A, D'Cruze M, Slipczuk L, Alarakhiya F, Naseem A, Haider AH, Virani SS. A review of top cardiology and cardiovascular medicine journal guidelines regarding the use of generative artificial intelligence tools in scientific writing. Curr Probl Cardiol 2024; 49:102387. PMID: 38185435; DOI: 10.1016/j.cpcardiol.2024.102387. [Review]
Abstract
BACKGROUND Generative Artificial Intelligence (AI) tools have experienced rapid development over the last decade and are gaining increasing popularity as assistive models in academic writing. However, the ability of AI to generate reliable and accurate research articles is a topic of debate. Major scientific journals have issued policies regarding the contribution of AI tools in scientific writing. METHODS We conducted a review of the author and peer reviewer guidelines of the top 25 Cardiology and Cardiovascular Medicine journals as per the 2023 SCImago rankings. Data were obtained through reviewing journal websites and directly emailing the editorial offices. Descriptive data regarding journal characteristics were coded in SPSS. Subgroup analyses of the journal guidelines were conducted based on the publishing company policies. RESULTS Our analysis revealed that all scientific journals in our study permitted the documented use of AI in scientific writing, with certain limitations, as per ICMJE recommendations. We found that AI tools cannot be included in the authorship or be used for image generation, and that all authors are required to assume full responsibility for their submitted and published work. The use of generative AI tools in the peer review process is strictly prohibited. CONCLUSION Guidelines regarding the use of generative AI in scientific writing are standardized, detailed, and unanimously followed by all journals in our study, according to the recommendations set forth by international forums. It is imperative to ensure that these policies are carefully followed and updated to maintain scientific integrity.
8. Manolitsis I, Feretzakis G, Tzelves L, Kalles D, Katsimperis S, Angelopoulos P, Anastasiou A, Koutsouris D, Kosmidis T, Verykios VS, Skolarikos A, Varkarakis I. Training ChatGPT Models in Assisting Urologists in Daily Practice. Stud Health Technol Inform 2023; 305:576-579. PMID: 37387096; DOI: 10.3233/shti230562.
Abstract
Artificial Intelligence (AI) has shown the ability to enhance the accuracy and efficiency of physicians. ChatGPT is an AI chatbot that can interact with humans through text over the internet. It is trained with machine learning algorithms on large datasets. In this study, we compare the performance of a GPT-3.5 Turbo model, accessed through the ChatGPT API and custom-trained for this study, against a general model in assisting urologists to obtain accurate, valid medical information. The API was accessed through a Python script developed specifically for this study, based on the 2023 EAU guidelines in PDF format. This custom-trained model provides doctors with more precise, prompt answers about specific urologic subjects, thus helping them ultimately provide better patient care.
9. Shi Y, Ren P, Wang J, Han B, ValizadehAslani T, Agbavor F, Zhang Y, Hu M, Zhao L, Liang H. Leveraging GPT-4 for food effect summarization to enhance product-specific guidance development via iterative prompting. J Biomed Inform 2023; 148:104533. PMID: 37918623; DOI: 10.1016/j.jbi.2023.104533.
Abstract
Food effect summarization from a New Drug Application (NDA) is an essential component of product-specific guidance (PSG) development and assessment, which provides the basis of recommendations for fasting and fed bioequivalence studies to guide the pharmaceutical industry in developing generic drug products. However, manual summarization of food effects from extensive drug application review documents is time-consuming. Therefore, there is a need to develop automated methods to generate food effect summaries. Recent advances in natural language processing (NLP), particularly large language models (LLMs) such as ChatGPT and GPT-4, have demonstrated great potential in improving the effectiveness of automated text summarization, but their accuracy in summarizing food effects for PSG assessment remains unclear. In this study, we introduce a simple yet effective approach, iterative prompting, which allows one to interact with ChatGPT or GPT-4 more effectively and efficiently through multi-turn interaction. Specifically, we propose a three-turn iterative prompting approach to food effect summarization in which keyword-focused and length-controlled prompts are respectively provided in consecutive turns to refine the quality of the generated summary. We conduct a series of extensive evaluations, ranging from automated metrics to FDA professionals and even evaluation by GPT-4, on 100 NDA review documents selected over the past five years. We observe that the summary quality is progressively improved throughout the iterative prompting process. Moreover, we find that GPT-4 performs better than ChatGPT, as evaluated by FDA professionals (43% vs. 12%) and GPT-4 (64% vs. 35%). Importantly, all the FDA professionals unanimously rated 85% of the summaries generated by GPT-4 as factually consistent with the golden reference summary, a finding further supported by GPT-4's consistency rating of 72%.
Taken together, these results strongly suggest a great potential for GPT-4 to draft food effect summaries that could be reviewed by FDA professionals, thereby improving the efficiency of the PSG assessment cycle and promoting generic drug product development.
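The three-turn pattern described in this abstract can be sketched as a conversation builder; the prompt wording, the `<model summary>` stand-in replies, and the keyword list are illustrative assumptions, not the authors' actual prompts, and in practice each turn would call a real ChatGPT/GPT-4 endpoint:

```python
def iterative_prompt_messages(document: str, keywords: list[str], max_words: int) -> list[dict]:
    """Build a three-turn conversation: an initial summary request, then a
    keyword-focused refinement, then a length-controlled refinement."""
    turns = [
        f"Summarize the food effect information in this document:\n{document}",
        "Refine the summary to focus on these keywords: " + ", ".join(keywords),
        f"Shorten the summary to at most {max_words} words.",
    ]
    messages = []
    for turn in turns:
        messages.append({"role": "user", "content": turn})
        # In a real run, the model's reply to each turn would be appended here
        # before issuing the next refinement; a placeholder reply stands in.
        messages.append({"role": "assistant", "content": "<model summary>"})
    return messages

msgs = iterative_prompt_messages("NDA review text ...", ["fasted", "fed", "AUC"], 120)
print(len(msgs))  # 6: three user turns, each followed by a model reply
```

The point of the multi-turn structure is that each refinement sees the model's previous attempt, so quality can improve turn over turn rather than relying on one monolithic prompt.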
10. Laios A, Theophilou G, De Jong D, Kalampokis E. The Future of AI in Ovarian Cancer Research: The Large Language Models Perspective. Cancer Control 2023; 30:10732748231197915. PMID: 37624621; PMCID: PMC10467259; DOI: 10.1177/10732748231197915. [Editorial]
Abstract
Conversational large language model (LLM)-based chatbots utilize neural networks to process natural language. By generating highly sophisticated outputs from contextual input text, they revolutionize access to further learning, leading to the development of new skills and personalized interactions. Although they were not developed to provide healthcare, their potential to address biomedical issues remains largely unexplored. Healthcare digitalization and documentation of electronic health records are now developing into standard practice. Tools such as LLMs that facilitate clinical review of unstructured data can derive clinically meaningful insights for ovarian cancer, a heterogeneous but devastating disease. Compared to standard approaches, they have the capacity to condense results and optimize analysis time. To help accelerate research in biomedical language processing and improve the validity of scientific writing, task-specific and domain-specific language models may be required. In turn, we propose a bespoke, proprietary ovarian cancer-specific natural language model trained solely on in-domain text, whereas transfer learning drifts away from pretrained language models to fine-tune task-specific models for all possible downstream applications. This venture will be fueled by the abundance of unstructured text information in electronic health records, resulting in ovarian cancer research ultimately reaching its linguistic home.
11. Scott AJS, McCuaig F, Lim V, Watkins W, Wang J, Strachan G. Revolutionizing Nurse Practitioner Training: Integrating Virtual Reality and Large Language Models for Enhanced Clinical Education. Stud Health Technol Inform 2024; 315:671-672. PMID: 39049375; DOI: 10.3233/shti240272.
Abstract
This project introduces an innovative virtual reality (VR) training program for student Nurse Practitioners, incorporating advanced 3D modeling, animation, and Large Language Models (LLMs). Designed to simulate realistic patient interactions, the program aims to improve communication, history taking, and clinical decision-making skills in a controlled, authentic setting. This abstract outlines the methods, results, and potential impact of this cutting-edge educational tool on nursing education.
12. Wang X, Ye H, Zhang S, Yang M, Wang X. Evaluation of the Performance of Three Large Language Models in Clinical Decision Support: A Comparative Study Based on Actual Cases. J Med Syst 2025; 49:23. PMID: 39948214; DOI: 10.1007/s10916-025-02152-9. [Comparative Study]
Abstract
BACKGROUND Generative large language models (LLMs) are increasingly integrated into the medical field. However, their actual efficacy in clinical decision-making remains partially unexplored. This study aimed to assess the performance of the three LLMs, ChatGPT-4, Gemini, and Med-Go, in the domain of professional medicine when confronted with actual clinical cases. METHODS This study involved 134 clinical cases spanning nine medical disciplines. Each LLM was required to provide suggestions for diagnosis, diagnostic criteria, differential diagnosis, examination and treatment for every case. Responses were scored by two experts using a predefined rubric. RESULTS In overall performance among the models, Med-Go achieved the highest median score (37.5, IQR 31.9-41.5), while Gemini recorded the lowest (33.0, IQR 25.5-36.6), showing significant statistical difference among the three LLMs (p < 0.001). Analysis revealed that responses related to differential diagnosis were the weakest, while those pertaining to treatment recommendations were the strongest. Med-Go displayed notable performance advantages in gastroenterology, nephrology, and neurology. CONCLUSIONS The findings show that all three LLMs achieved over 60% of the maximum possible score, indicating their potential applicability in clinical practice. However, inaccuracies that could lead to adverse decisions underscore the need for caution in their application. Med-Go's superior performance highlights the benefits of incorporating specialized medical knowledge into LLMs training. It is anticipated that further development and refinement of medical LLMs will enhance their precision and safety in clinical use.
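The median-with-IQR summary used above to compare models can be sketched in a few lines of standard-library Python; the per-case scores below are made-up illustrative values, not the study's data:

```python
import statistics

def median_iqr(scores: list[float]) -> tuple[float, float, float]:
    """Summarize per-case rubric scores as median with interquartile range,
    the format used to report overall model performance."""
    q1, med, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return med, q1, q3

# Toy rubric scores for one model across ten cases (illustrative only)
scores = [28, 31, 33, 35, 36, 38, 39, 40, 42, 44]
med, q1, q3 = median_iqr(scores)
print(f"median {med}, IQR {q1}-{q3}")  # median 37.0, IQR 33.5-39.75
```

Reporting the median and IQR rather than the mean is a reasonable choice here because rubric scores across heterogeneous clinical cases are often skewed.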
13. Hueso M, Álvarez R, Marí D, Ribas-Ripoll V, Lekadir K, Vellido A. Is generative artificial intelligence the next step toward a personalized hemodialysis? Rev Invest Clin 2023; 75:309-317. PMID: 37734067; DOI: 10.24875/ric.23000162.
Abstract
Artificial intelligence (AI) generative models, driven by the integration of AI and natural language processing technologies, such as OpenAI's chatbot built on a generative pre-trained transformer large language model (LLM), are receiving much public attention and have the potential to transform personalized medicine. Dialysis patients are highly dependent on technology, and their treatment generates a challengingly large volume of data that has to be analyzed for knowledge extraction. We argue that, by integrating the data acquired from hemodialysis treatments with the powerful conversational capabilities of LLMs, nephrologists could personalize treatments adapted to patients' lifestyles and preferences. We also argue that this new conversational AI, integrated with a personalized patient-computer interface, will enhance patients' engagement and self-care by providing them with a more personalized experience. However, generative AI models require continuous and accurate updates of data and expert supervision, and must address potential biases and limitations. Dialysis patients can also benefit from other emerging technologies, such as Digital Twins, with which patients' care can also be addressed from a personalized medicine perspective. In this paper, we review LLMs' potential strengths in terms of their contribution to personalized medicine and, in particular, their potential impact and limitations in nephrology. Nephrologists' collaboration with AI academia and companies, to develop algorithms and models that are more transparent, understandable, and trustworthy, will be crucial for the next generation of dialysis patients. The combination of technology, patient-specific data, and AI should contribute to creating a more personalized and interactive dialysis process, improving patients' quality of life.
14. Murthy AB, Palaniappan V, Radhakrishnan S, Rajaa S, Karthikeyan K. A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology. Indian Dermatol Online J 2025; 16:241-247. PMID: 40125046; PMCID: PMC11927985; DOI: 10.4103/idoj.idoj_221_24.
Abstract
Background With the growing interest in generative artificial intelligence (AI), the scientific community is witnessing the vast utility of large language models (LLMs) with chat interfaces such as ChatGPT and Microsoft Bing Chat in the medical field and research. This study aimed to investigate the accuracy of ChatGPT and Microsoft Bing Chat to answer questions on Dermatology, Venereology, and Leprosy, the frequency of artificial hallucinations, and to compare their performance with human respondents. Aim and Objectives The primary objective of the study was to compare the knowledge and interpretation abilities of LLMs (ChatGPT v3.5 and Microsoft Bing Chat) with human respondents (12 final-year postgraduates) and the secondary objective was to assess the incidence of artificial hallucinations with 60 questions prepared by the authors, including multiple choice questions (MCQs), fill-in-the-blanks and scenario-based questions. Materials and Methods The authors accessed two commercially available large language models (LLMs) with chat interfaces namely ChatGPT version 3.5 (OpenAI; San Francisco, CA) and Microsoft Bing Chat from August 10th to August 23rd, 2023. Results In our testing set of 60 questions, Bing Chat outperformed ChatGPT and human respondents with a mean correct response score of 46.9 ± 0.7. The mean correct responses by ChatGPT and human respondents were 35.9 ± 0.5 and 25.8 ± 11.0, respectively. The overall accuracy of human respondents, ChatGPT and Bing Chat was observed to be 43%, 59.8%, and 78.2%, respectively. Of the MCQs, fill-in-the-blanks, and scenario-based questions, Bing Chat had the highest accuracy in all types of questions with statistical significance (P < 0.001 by ANOVA test). Topic-wise assessment of the performance of LLMs showed that Bing Chat performed better in all topics except vascular disorders, inflammatory disorders, and leprosy. 
Bing Chat performed better in answering easy and medium-difficulty questions, with accuracies of 85.7% and 78%, respectively. In comparison, ChatGPT performed well on hard questions, with an accuracy of 55%, with statistical significance (P < 0.001 by ANOVA test). The mean number of questions answered by the human respondents among the 10 questions with multiple correct responses was 3 ± 1.4. The accuracy of the LLMs in answering questions with multiple correct responses was assessed by employing two prompts. ChatGPT and Bing Chat could answer 3.1 ± 0.3 and 4 ± 0 questions, respectively, without prompting. On evaluating the logical reasoning ability of the LLMs, it was found that ChatGPT gave logical reasoning in 47 ± 0.4 questions and Bing Chat in 53.9 ± 0.5 questions, irrespective of the correctness of the responses. ChatGPT exhibited artificial hallucination in 4 questions, even with 12 repeated inputs, which was not observed in Bing Chat. Limitations Variability in respondent accuracy, a small question set, and exclusion of newer AI models and image-based assessments. Conclusion This study showed an overall better performance of LLMs compared to human respondents. However, the LLMs were less accurate than respondents in topics like inflammatory disorders and leprosy. Proper regulations concerning the use of LLMs are the need of the hour to avoid potential misuse.
15. Zelin C, Chung WK, Jeanne M, Zhang G, Weng C. Rare disease diagnosis using knowledge guided retrieval augmentation for ChatGPT. J Biomed Inform 2024; 157:104702. PMID: 39084480; PMCID: PMC11402564; DOI: 10.1016/j.jbi.2024.104702.
Abstract
Although rare diseases individually have a low prevalence, they collectively affect nearly 400 million individuals around the world. On average, it takes five years to reach an accurate rare disease diagnosis, but many patients remain undiagnosed or misdiagnosed. As machine learning technologies have been used to aid diagnostics in the past, this study aims to test ChatGPT's suitability for rare disease diagnostic support with the enhancement provided by Retrieval Augmented Generation (RAG). RareDxGPT, our enhanced ChatGPT model, supplies ChatGPT with information about 717 rare diseases from an external knowledge resource, the RareDis Corpus, through RAG. In RareDxGPT, when a query is entered, the three documents most relevant to the query in the RareDis Corpus are retrieved. Along with the query, they are returned to ChatGPT to provide a diagnosis. Additionally, phenotypes for thirty different diseases were extracted from free text in PubMed Case Reports. They were each entered with three different prompt types: "prompt", "prompt + explanation", and "prompt + role play". The accuracy of ChatGPT and RareDxGPT with each prompt was then measured. With "prompt", RareDxGPT had 40% accuracy, while ChatGPT 3.5 got 37% of the cases correct. With "prompt + explanation", RareDxGPT had 43% accuracy, while ChatGPT 3.5 got 23% of the cases correct. With "prompt + role play", RareDxGPT had 40% accuracy, while ChatGPT 3.5 got 23% of the cases correct. To conclude, ChatGPT, especially when supplied with extra domain-specific knowledge, demonstrates early potential for rare disease diagnosis with adjustments.
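The retrieval step, pulling the three most relevant corpus documents into the prompt before asking for a diagnosis, can be sketched with a toy term-overlap ranker standing in for a real retriever; the document names and texts below are invented for illustration and are not from the RareDis Corpus:

```python
def top_k_documents(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Rank documents by simple term overlap with the query and return the
    k best matches (a toy stand-in for the retrieval component of RAG)."""
    q_terms = set(query.lower().split())
    scores = {
        name: len(q_terms & set(text.lower().split()))
        for name, text in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented mini-corpus of disease descriptions
corpus = {
    "disease_a": "progressive muscle weakness and cardiomyopathy",
    "disease_b": "recurrent fevers and abdominal pain",
    "disease_c": "muscle weakness with early respiratory failure",
    "disease_d": "skeletal dysplasia and short stature",
}
query = "patient with muscle weakness and cardiomyopathy"
hits = top_k_documents(query, corpus)
# The retrieved documents are then prepended to the query for the model
augmented = "Candidate disease descriptions:\n" + \
    "\n".join(corpus[h] for h in hits) + f"\n\nQuery: {query}"
```

A production retriever would use embeddings rather than raw term overlap, but the prompt-assembly pattern, retrieved context first, then the query, is the same.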
research-article

16
Tripathi S, Mutter L, Muppuri M, Dheer S, Garza-Frias E, Awan K, Jha A, Dezube M, Tabari A, Bizzo BC, Dreyer KJ, Bridge CP, Daye D. PRECISE framework: Enhanced radiology reporting with GPT for improved readability, reliability, and patient-centered care. Eur J Radiol 2025; 187:112124. [PMID: 40286532 DOI: 10.1016/j.ejrad.2025.112124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2024] [Revised: 04/03/2025] [Accepted: 04/16/2025] [Indexed: 04/29/2025]
Abstract
BACKGROUND The PRECISE framework (Patient-Focused Radiology Reports with Enhanced Clarity and Informative Summaries for Effective Communication) leverages GPT-4 to create patient-friendly summaries of radiology reports at a sixth-grade reading level. PURPOSE To evaluate the effectiveness of the PRECISE framework in improving the readability, reliability, and understandability of radiology reports. We hypothesized that the PRECISE framework improves readability and patient understanding compared to the original reports. MATERIALS AND METHODS The framework was assessed using 500 chest X-ray reports. Readability was evaluated with the Flesch Reading Ease, Gunning Fog Index, and Automated Readability Index (ARI). Reliability was gauged by clinical volunteers, and understandability by non-medical volunteers. Statistical analyses, including t-tests, regression analyses, and Mann-Whitney U tests, were conducted to determine the significance of the differences in readability scores between the original and PRECISE-generated reports. RESULTS Readability scores improved significantly: the mean Flesch Reading Ease score increased from 38.28 to 80.82 (p < 0.001), the Gunning Fog Index decreased from 13.04 to 6.99 (p < 0.001), and the ARI improved from 13.33 to 5.86 (p < 0.001). Clinical volunteers rated 95% of the summaries reliable, and non-medical volunteers rated 97% of the PRECISE-generated summaries fully understandable. CONCLUSION The PRECISE approach shows promise for enhancing patient understanding and communication without adding significant burden to radiologists, fostering patient engagement in healthcare decision-making. It represents a pivotal step towards more inclusive and patient-centered care delivery.
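The three readability indices reported above are standard published formulas and can be computed directly; a minimal sketch follows. The syllable counter is a rough vowel-group heuristic, so the scores only approximate those of dedicated readability tools.

```python
# Minimal implementations of the three readability indices named above.
# The formulas are the standard ones; syllables are estimated by counting
# vowel groups, which is approximate for English.
import re

def _words(text: str) -> list[str]:
    return re.findall(r"[A-Za-z]+", text)

def _sentences(text: str) -> int:
    return max(1, len(re.findall(r"[.!?]+", text)))

def _syllables(word: str) -> int:
    # Count vowel groups; every word has at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    w, s = _words(text), _sentences(text)
    syl = sum(_syllables(x) for x in w)
    return 206.835 - 1.015 * (len(w) / s) - 84.6 * (syl / len(w))

def gunning_fog(text: str) -> float:
    w, s = _words(text), _sentences(text)
    complex_words = sum(1 for x in w if _syllables(x) >= 3)
    return 0.4 * ((len(w) / s) + 100 * complex_words / len(w))

def automated_readability_index(text: str) -> float:
    w, s = _words(text), _sentences(text)
    chars = sum(len(x) for x in w)
    return 4.71 * (chars / len(w)) + 0.5 * (len(w) / s) - 21.43
```

Higher Flesch Reading Ease means easier text, while lower Gunning Fog and ARI mean easier text, which is why the study reports the first rising and the other two falling.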
17
Bitaraf E, Jafarpour M. AI-Powered Affiliation Insights: LLM-Based Bibliometric Study of European Medical Informatics Conferences. Stud Health Technol Inform 2025; 327:823-827. [PMID: 40380582 DOI: 10.3233/shti250474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2025]
Abstract
This study employs Large Language Models (LLMs) to analyze bibliometric data from European Medical Informatics conferences from 1996 to 2024. By enhancing traditional methods with LLM-based techniques, the researchers significantly improved affiliation extraction accuracy. The analysis reveals trends in publication volume, author impact, and institutional collaborations across Europe. Key findings include the identification of leading contributors, visualization of collaboration networks, and mapping of geographical and institutional centers of excellence. The study highlights the potential of LLMs in bibliometric analysis, offering deeper insights into research trends and collaborations while addressing challenges in data standardization and computational resources.
18
Delourme S, Redjdal A, Bouaud J, Seroussi B. Measured Performance and Healthcare Professional Perception of Large Language Models Used as Clinical Decision Support Systems: A Scoping Review. Stud Health Technol Inform 2024; 316:841-845. [PMID: 39176924 DOI: 10.3233/shti240543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2024]
Abstract
The healthcare sector confronts challenges from overloaded tumor board meetings, reduced discussion durations, and care quality concerns, necessitating innovative solutions. Clinical Decision Support Systems (CDSSs) have the potential to help clinicians reduce the cancer burden, but they remain poorly used in clinical practice. The emergence of OpenAI's ChatGPT in 2022 has prompted the evaluation of Large Language Models (LLMs) as potential CDSSs for diagnosis and therapeutic management. We conducted a scoping review to evaluate the utility of LLMs like ChatGPT as CDSSs in several medical specialties, particularly oncology, and compared users' perception of LLMs with the measured performance of these systems.
Scoping Review

19
Wang J. Deep Learning in Hematology: From Molecules to Patients. Clin Hematol Int 2024; 6:19-42. [PMID: 39417017 PMCID: PMC11477942 DOI: 10.46989/001c.124131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Accepted: 06/29/2024] [Indexed: 10/19/2024] Open
Abstract
Deep learning (DL), a subfield of machine learning, has made remarkable strides across various aspects of medicine. This review examines DL's applications in hematology, spanning from molecular insights to patient care. The review begins by providing a straightforward introduction to the basics of DL tailored for those without prior knowledge, touching on essential concepts, principal architectures, and prevalent training methods. It then discusses the applications of DL in hematology, concentrating on elucidating the models' architecture, their applications, performance metrics, and inherent limitations. For example, at the molecular level, DL has improved the analysis of multi-omics data and protein structure prediction. For cells and tissues, DL enables the automation of cytomorphology analysis, interpretation of flow cytometry data, and diagnosis from whole slide images. At the patient level, DL's utility extends to analyzing curated clinical data, electronic health records, and clinical notes through large language models. While DL has shown promising results in various hematology applications, challenges remain in model generalizability and explainability. Moreover, the integration of novel DL architectures into hematology has been relatively slow in comparison to that in other medical fields.
Review

20
Ghorbian M, Ghobaei-Arani M, Ghorbian S. Transforming breast cancer diagnosis and treatment with large language models: A comprehensive survey. Methods 2025; 239:85-110. [PMID: 40199412 DOI: 10.1016/j.ymeth.2025.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2025] [Revised: 03/24/2025] [Accepted: 04/01/2025] [Indexed: 04/10/2025] Open
Abstract
Breast cancer (BrCa), one of the most prevalent forms of cancer in women, poses many challenges for treatment and diagnosis due to its complex biological mechanisms. Early and accurate diagnosis plays a fundamental role in improving survival rates, but the limitations of existing imaging methods and of clinical data interpretation often prevent optimal results. Large Language Models (LLMs), built on advanced architectures such as transformers, have brought a significant revolution in data processing and medical decision-making. By analyzing large volumes of medical and clinical data, these models enable early diagnosis by identifying patterns in images and medical records, and provide personalized treatment strategies by integrating genetic markers and clinical guidelines. Despite their transformative potential, the use of these models in BrCa management faces challenges, such as data sensitivity, algorithm transparency, ethical considerations, and model compatibility with the specifics of medical applications, that must be addressed to achieve reliable results. This survey systematically reviews the role of LLMs in BrCa diagnosis and treatment. The findings indicate that applying LLMs has resulted in significant improvements across BrCa management, including a 35% increase in the Efficiency of Diagnosis and BrCa Treatment (EDBC), a 30% enhancement in the System's Clinical Trust and Reliability (SCTR), and a 20% improvement in the quality of patient education and information (IPEI). Ultimately, this study demonstrates the importance of LLMs in advancing precision medicine for BrCa and paves the way for effective patient-centered care solutions.
Review

21
Kondo T, Okamoto M, Kondo Y. Pilot Study on Using Large Language Models for Educational Resource Development in Japanese Radiological Technologist Exams. Med Sci Educ 2025; 35:919-927. [PMID: 40353040 PMCID: PMC12059199 DOI: 10.1007/s40670-024-02251-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/29/2024] [Indexed: 05/14/2025]
Abstract
In this study, we explored the potential application of large language models (LLMs) to the development of educational resources for medical licensure exams in non-English-speaking contexts, focusing on the Japanese Radiological Technologist National Exam. We categorized multiple-choice questions into image-based, calculation, and textual types. We generated explanatory texts using Copilot, an LLM integrated with Microsoft Bing, and assessed their quality on a 0-4-point scale. LLMs achieved high performance for textual questions, which demonstrated their strong capability to process specialized content. However, we identified challenges in generating accurate formulas and performing calculations for calculation questions, as well as in interpreting complex medical images in image-based questions. To address these issues, we suggest using LLMs with programming functionalities for calculations and using keyword-based prompts for medical image interpretation. The findings highlight the active role of educators in managing LLM-supported learning environments, particularly by validating outputs and providing supplementary guidance to ensure accuracy. Furthermore, the rapid evolution of LLM technology necessitates continuous adaptation of utilization strategies to align with their advancing capabilities. In this study, we underscored the potential of LLMs to enhance educational practices in non-English-speaking regions, while addressing critical challenges to improve their reliability and utility.
research-article

22
Graf L, Ritzi A, Schoeler LM. Delirium Identification from Nursing Reports Using Large Language Models. Stud Health Technol Inform 2025; 327:886-887. [PMID: 40380600 DOI: 10.3233/shti250492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2025]
Abstract
This study investigates large language models for delirium detection from nursing reports, comparing keyword matching, prompting, and finetuning. Using a manually labelled dataset from the University Hospital Freiburg, Germany, we tested Llama3 and Phi3 models. Both prompting and finetuning were effective, with finetuning Phi3 (3.8B) achieving the highest accuracy (90.24%) and AUROC (96.07%), significantly outperforming other methods.
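The keyword-matching baseline the study compares against can be sketched as follows; the term list is a hypothetical English stand-in for the German clinical vocabulary actual Freiburg nursing reports would contain.

```python
# Illustrative keyword-matching baseline for delirium detection in nursing
# reports. The terms below are hypothetical examples, not the study's list.
DELIRIUM_TERMS = [
    "delirium", "delirious", "disoriented", "confusion",
    "agitated", "hallucinat",  # prefix covers hallucinating/hallucinations
]

def flags_delirium(report: str) -> bool:
    """True if any delirium-related term occurs in the nursing report."""
    text = report.lower()
    return any(term in text for term in DELIRIUM_TERMS)
```

Surface matching like this cannot handle negation ("no signs of confusion") or paraphrase, which is consistent with the abstract's finding that a fine-tuned Phi3 significantly outperformed the other methods.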
23
Shahid F, Hsu MH, Chang YC, Jian WS. Using Generative AI to Extract Structured Information from Free Text Pathology Reports. J Med Syst 2025; 49:36. [PMID: 40080229 PMCID: PMC11906504 DOI: 10.1007/s10916-025-02167-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Accepted: 03/03/2025] [Indexed: 03/15/2025]
Abstract
Manually converting free-text pathology reports into structured reports is time-consuming and prone to errors. This study demonstrates the transformative potential of generative AI in automating the analysis of free-text pathology reports. Employing the ChatGPT Large Language Model within a Streamlit web application, we automated the extraction and structuring of information from 33 unstructured breast cancer pathology reports from Taipei Medical University Hospital. Achieving a 99.61% accuracy rate, the AI system notably reduced processing time compared to traditional methods, underscoring the efficacy of AI in converting unstructured medical text into structured data and its potential to enhance the efficiency and reliability of medical text analysis. However, this study is limited to breast cancer pathology reports and used data from hospitals associated with a single institution. In the future, we plan to incrementally expand the scope of this research to pathology reports for other cancer types and to conduct external validation to further substantiate the robustness and generalizability of the proposed system. The outcomes of this study affirm that generative AI can significantly transform the handling of pathology reports, promising substantial advancements in biomedical research by facilitating the structured analysis of complex medical data.
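A pipeline like the one described stands or falls on validating the model's output before it enters a structured database. The sketch below shows that post-processing step under the assumption that the model is instructed to reply with JSON; the field names are illustrative and the ChatGPT call itself is stubbed out.

```python
# Hedged sketch of post-processing for LLM-based structured extraction:
# parse the model's JSON reply and verify the required schema is present.
# REQUIRED_FIELDS is a hypothetical schema, not the study's actual one.
import json

REQUIRED_FIELDS = {"tumor_type", "grade", "er_status", "pr_status", "her2_status"}

def parse_structured_report(llm_reply: str) -> dict:
    """Parse an LLM JSON reply and verify every required field is present."""
    record = json.loads(llm_reply)  # raises JSONDecodeError on malformed output
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"LLM reply missing fields: {sorted(missing)}")
    return record
```

Rejecting incomplete or malformed replies at this boundary is one plausible way a system like this keeps its reported accuracy honest, since silent gaps never reach the structured dataset.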
research-article

24
Górska A, Tacconelli E. Towards Autonomous Living Meta-Analyses: A Framework for Automation of Systematic Review and Meta-Analyses. Stud Health Technol Inform 2024; 316:378-382. [PMID: 39176757 DOI: 10.3233/shti240427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2024]
Abstract
Systematic reviews and meta-analyses are a staple of evidence-based medicine and an obligatory step in developing guideline and recommendation documents. The process is formalized, aiming to extract and summarize knowledge from published work while grading and accounting for the quality of the included studies, but it is very laborious and time-consuming. Meta-analyses are therefore rarely updated and seldom kept living, and their utility decreases with time. Here, we present a framework that integrates large language models and natural language processing techniques, applied to a previously published systematic review and meta-analysis of the diagnostic test accuracy of point-of-care tests. We show that the framework can automate the screening step of existing meta-analyses with minimal cost to quality and, to a large extent, the extraction step, while maintaining the strict nature of the systematic review process.
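The screening step the framework automates amounts to applying the review's inclusion and exclusion criteria to each newly published abstract. The sketch below uses fixed keyword criteria purely for illustration; the framework itself relies on LLM-based classification rather than term lists.

```python
# Toy sketch of automated abstract screening for a living meta-analysis:
# include an abstract only if all required terms appear and no exclusion
# term does. The criteria passed in are illustrative placeholders.
def screen_abstract(abstract: str,
                    must_mention: list[str],
                    exclude_if: list[str]) -> bool:
    """Apply simple inclusion/exclusion criteria to one abstract."""
    text = abstract.lower()
    if any(term in text for term in exclude_if):
        return False
    return all(term in text for term in must_mention)
```

In a living setup, a job like this would run over each new literature feed, so only abstracts passing the filter reach human reviewers or the extraction step.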
25
Shoham OB, Rappoport N. MedConceptsQA: Open source medical concepts QA benchmark. Comput Biol Med 2024; 182:109089. [PMID: 39276611 DOI: 10.1016/j.compbiomed.2024.109089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 08/28/2024] [Accepted: 08/29/2024] [Indexed: 09/17/2024]
Abstract
BACKGROUND Clinical data often include both standardized medical codes and natural language text, highlighting the need for clinical Large Language Models to understand these codes and their differences. We introduce a benchmark for evaluating how well various Large Language Models understand medical codes. METHODS We present MedConceptsQA, a dedicated open-source benchmark for medical concepts question answering. The benchmark comprises questions about medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We evaluate various Large Language Models on the benchmark. RESULTS Most of the pre-trained clinical Large Language Models achieved accuracy close to random guessing on this benchmark, despite being pre-trained on medical data. GPT-4, however, achieves an absolute average improvement of 9-11% (9% for few-shot learning and 11% for zero-shot learning) over Llama3-OpenBioLLM-70B, the best-performing clinical Large Language Model. CONCLUSION Our benchmark serves as a valuable resource for evaluating the abilities of Large Language Models to interpret medical codes and distinguish between medical concepts. We demonstrate that most current state-of-the-art clinical Large Language Models achieve near-random performance, whereas GPT-3.5, GPT-4, and Llama3-70B outperform these clinical models despite pre-training that was not primarily focused on the medical domain. The benchmark is available at https://huggingface.co/datasets/ofir408/MedConceptsQA.
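The headline comparison in the abstract, model accuracy against the random-guess floor of a multiple-choice benchmark, reduces to a few lines. The items and the four-option assumption below are illustrative, not drawn from MedConceptsQA itself.

```python
# Minimal sketch of multiple-choice benchmark scoring: compute accuracy
# over gold answers and compare it to the uniform random-guess baseline
# (1/n_options, i.e. 25% for four answer choices).
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of items where the predicted choice matches the gold choice."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def beats_random(acc: float, n_options: int = 4) -> bool:
    """True if accuracy exceeds the uniform random-guess baseline."""
    return acc > 1.0 / n_options
```

By this yardstick, a model whose accuracy hovers at the 1/n_options line, as the abstract reports for most clinical LLMs, has learned nothing measurable about the codes being asked about.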