1. Wang Y, Zuo J, Duan C, Peng H, Huang J, Zhao L, Zhang L, Dong Z. Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing. Comput Struct Biotechnol J 2024;23:843-858. [PMID: 38352937] [PMCID: PMC10861960] [DOI: 10.1016/j.csbj.2024.01.014]
Abstract
Cerebral cavernous malformation (CCM) is a polygenic disease with intricate genetic interactions contributing quantitatively to pathogenesis across multiple factors. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing wealth of genetic data related to mutations. Furthermore, numerous other molecules associated with CCM have been unearthed. However, tackling such massive volumes of unstructured data remained challenging until the advent of advanced large language models. In this study, we developed an automated analytical pipeline specialized in single nucleotide variant (SNV)-related biomedical text analysis, called BRLM. To facilitate this, BioBERT was employed to vectorize the rich information of SNVs, while a deep residual network was used to discriminate the classes of the SNVs. BRLM was initially constructed on mutations from 12 different types of TCGA cancers, achieving an accuracy exceeding 99%. It was further examined on CCM mutations in familial sequencing data analysis, highlighting the upstream master regulator gene fibroblast growth factor 1 (FGF1). With multi-omics characterization and validation of biological function, FGF1 was shown to play a significant role in the development of CCMs, demonstrating the effectiveness of our model. The BRLM web server is available at http://1.117.230.196.
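For readers who want a concrete picture of the two-stage design this abstract describes, the following is a minimal sketch: BioBERT vectorizes variant-related text and a small residual network classifies it. The checkpoint name is the public dmis-lab BioBERT release, but the block count, hidden size, and 12-way output are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Stage 1: vectorize variant-related text with BioBERT.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] vectors, shape (batch, 768)

# Stage 2: a small residual network over the embeddings (sizes illustrative).
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.net(x))  # skip connection

class VariantClassifier(nn.Module):
    def __init__(self, dim=768, n_classes=12):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(4)])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        return self.head(self.blocks(x))

logits = VariantClassifier()(embed(["KRIT1 c.1255C>T introduces a premature stop codon."]))
```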
Affiliation(s)
- Yiqi Wang
- College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China
- Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Jinmei Zuo
- Physical Examination Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Chao Duan
- College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China
- Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Hao Peng
- Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Department of Neurosurgery, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Jia Huang
- The Second Clinical Medical College, Lanzhou University, No. 222, South Tianshui Road, Lanzhou 730030, Gansu, China
- Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Li Zhang
- Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Department of Neurosurgery, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
- Zhiqiang Dong
- College of Biomedicine and Health, College of Life Science and Technology, Huazhong Agricultural University, No. 1, Shizishan Street, Wuhan 430070, Hubei, China
- Center for Neurological Disease Research, Taihe Hospital, Hubei University of Medicine, No. 32, Renmin South Road, Shiyan 442000, Hubei, China
2. Tailor PD, Dalvin LA, Chen JJ, Iezzi R, Olsen TW, Scruggs BA, Barkmeier AJ, Bakri SJ, Ryan EH, Tang PH, Parke DW, Belin PJ, Sridhar J, Xu D, Kuriyan AE, Yonekawa Y, Starr MR. A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone. Ophthalmol Sci 2024;4:100485. [PMID: 38660460] [PMCID: PMC11041826] [DOI: 10.1016/j.xops.2024.100485]
Abstract
Objective To assess the quality, empathy, and safety of expert-edited large language model (LLM) responses, human expert-created responses, and LLM responses to common retina patient questions. Design Randomized, masked, multicenter study. Participants Twenty-one common retina patient questions were randomly assigned among 13 retina specialists. Methods Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the experts who had not written a response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content). Main Outcome Measures Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type. Results There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (P < 0.001, P < 0.001) between the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert placed 4 out of 7 for quality and 6 out of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses performed better than expert-created responses. There were time savings for an expert-edited LLM response versus an expert-created response (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129). Conclusions In this randomized, masked, multicenter study, LLM responses were comparable with experts in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.
Affiliation(s)
- John J. Chen
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Raymond Iezzi
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Sophie J. Bakri
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Edwin H. Ryan
- Retina Consultants of Minnesota, Edina, Minnesota
- Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Peter H. Tang
- Retina Consultants of Minnesota, Edina, Minnesota
- Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- D. Wilkin Parke
- Retina Consultants of Minnesota, Edina, Minnesota
- Department of Ophthalmology & Visual Neurosciences, University of Minnesota Medical School, Minneapolis, Minnesota
- Jayanth Sridhar
- Olive View Medical Center, University of California Los Angeles, Los Angeles, California
- David Xu
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ajay E. Kuriyan
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
- Yoshihiro Yonekawa
- Wills Eye Hospital, Mid Atlantic Retina, Thomas Jefferson University, Philadelphia, Pennsylvania
3. Tobler S. Smart grading: A generative AI-based tool for knowledge-grounded answer evaluation in educational assessments. MethodsX 2024;12:102531. [PMID: 38204981] [PMCID: PMC10776976] [DOI: 10.1016/j.mex.2023.102531]
Abstract
Evaluating text-based answers obtained in educational settings or behavioral studies is time-consuming and resource-intensive. Applying novel artificial intelligence tools such as ChatGPT might support the process. Still, currently available implementations do not allow for automated and case-specific evaluation of large numbers of student answers. To counter this limitation, we developed flexible software and a user-friendly web application that enable researchers and educators to use cutting-edge artificial intelligence technologies by providing an interface that combines large language models with options to specify questions of interest, sample solutions, and evaluation instructions for automated answer scoring. We validated the method in an empirical study and found high reliability between the software's scores and expert ratings. Hence, the present software constitutes a valuable tool to facilitate and enhance text-based answer evaluation.
• Generative AI-enhanced software for customizable, case-specific, and automated grading of large numbers of text-based answers.
• Open-source software and web application for direct implementation and adaptation.
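As a rough illustration of the interface described above, the sketch below assembles a question, sample solution, and evaluation instructions into one grading prompt for a chat model. It is a sketch under assumptions: the model name is arbitrary, the call follows the current openai-python client, and the tool's actual prompts and backend are not shown in this listing.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_answer(question: str, sample_solution: str, rubric: str, answer: str) -> str:
    """Score one free-text student answer against a sample solution and rubric."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Sample solution:\n{sample_solution}\n\n"
        f"Evaluation instructions:\n{rubric}\n\n"
        f"Student answer:\n{answer}\n\n"
        "Return a score from 0 to 2 and a one-sentence justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as deterministic as possible
    )
    return resp.choices[0].message.content
```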
4.
Abstract
The launch of OpenAI's chatbot, ChatGPT, has generated a lot of attention and discussion among professionals in several fields. Many concerns and challenges have been raised by researchers from various fields, particularly in relation to the harm that using these tools for medical diagnosis and treatment recommendations can cause. In addition, it has been debated whether ChatGPT is dependable, efficient, and helpful for clinicians and medical professionals. Therefore, in this study, we assess ChatGPT's effectiveness in providing mental health support, particularly for issues related to anxiety and depression, based on the chatbot's responses and cross-questioning. The findings indicate that there are significant inconsistencies and that ChatGPT's reliability is low in this specific domain. As a result, caution must be exercised when using ChatGPT as a complementary mental health resource.
Affiliation(s)
- Faiza Farhat
- Section of Parasitology, Department of Zoology, Aligarh Muslim University, Aligarh, UP, 202002, India.
5. Wang C, Ong J, Wang C, Ong H, Cheng R, Ong D. Potential for GPT Technology to Optimize Future Clinical Decision-Making Using Retrieval-Augmented Generation. Ann Biomed Eng 2024;52:1115-1118. [PMID: 37530906] [DOI: 10.1007/s10439-023-03327-6]
Abstract
Advancements in artificial intelligence (AI) provide many helpful tools for healthcare, one of which includes AI chatbots that use natural language processing to create humanlike, conversational dialog. These chatbots have general cognitive skills and are able to engage with clinicians and patients to discuss patients' health conditions and what they may be at risk for. While chatbot engines have access to a wide range of medical texts and research papers, they currently provide high-level, generic responses and are limited in their ability to provide diagnostic guidance and clinical advice to patients on an individual level. The essay discusses the use of retrieval-augmented generation (RAG), which can be used to improve the specificity of user-entered prompts and thereby enhance the detail in AI chatbot responses. By embedding more recent clinical data and trusted medical sources, such as clinical guidelines, into the chatbot models, AI chatbots can provide more patient-specific guidance, faster diagnoses and treatment recommendations, and greater improvement of patient outcomes.
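The retrieval step the essay describes can be pictured with a minimal sketch: rank trusted snippets against the user's question and prepend the best matches to the prompt. The guideline texts and TF-IDF retriever here are illustrative stand-ins; production systems typically use dense embeddings and a vector store.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for trusted sources such as clinical guidelines.
guidelines = [
    "Guideline: first-line therapy for uncomplicated hypertension is ...",
    "Guideline: statin therapy is recommended when 10-year ASCVD risk ...",
]

vectorizer = TfidfVectorizer().fit(guidelines)
doc_matrix = vectorizer.transform(guidelines)

def build_rag_prompt(question: str, k: int = 1) -> str:
    """Retrieve the k most similar snippets and prepend them to the prompt."""
    sims = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    top = sims.argsort()[::-1][:k]
    context = "\n".join(guidelines[i] for i in top)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

print(build_rag_prompt("When should a statin be started?"))
```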
Affiliation(s)
- Calvin Wang
- College of Medicine - Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ, 08901, USA.
- Joshua Ong
- Michigan Medicine, University of Michigan, Ann Arbor, MI, USA
- Chara Wang
- Biotechnology High School, Freehold, NJ, USA
- Hannah Ong
- College of Medicine, The Ohio State University, Columbus, OH, USA
- Rebekah Cheng
- Department of Physical Therapy, Virginia Commonwealth University, Richmond, VA, USA
- Dennis Ong
- Amazon Web Services, Amazon, Seattle, WA, USA
6. Jiang S, Evans-Yamamoto D, Bersenev D, Palaniappan SK, Yachie-Kinoshita A. ProtoCode: Leveraging large language models (LLMs) for automated generation of machine-readable PCR protocols from scientific publications. SLAS Technol 2024;29:100134. [PMID: 38670311] [DOI: 10.1016/j.slast.2024.100134]
Abstract
Protocol standardization and sharing are crucial for reproducibility in the life sciences. In spite of numerous efforts toward standardized protocol description, adherence to these standards in the literature remains largely inconsistent. Curation of protocols is especially challenging due to the labor-intensive process, requiring expert domain knowledge of each experimental procedure. Recent advancements in Large Language Models (LLMs) offer a promising solution for interpreting and curating knowledge from complex scientific literature. In this work, we develop ProtoCode, a tool leveraging fine-tuned LLMs to curate protocols into intermediate representation formats that are interpretable by both human and machine interfaces. Our proof-of-concept, focused on polymerase chain reaction (PCR) protocols, retrieves information from PCR protocols with an accuracy ranging from 69% to 100% depending on the information content. For all tested protocols, we demonstrate that ProtoCode successfully converts literature-based protocols into correct operational files for multiple thermal cycler systems. In conclusion, ProtoCode can alleviate the labor-intensive curation and standardization of life science protocols, enhancing research reproducibility by providing a reliable, automated means to process and standardize protocols. ProtoCode is freely available as a web server at https://curation.taxila.io/ProtoCode/.
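The intermediate representation itself is not reproduced in this listing, but a machine-readable PCR protocol of the kind described might look like the hypothetical JSON below, holding exactly the fields a converter would need before emitting a vendor-specific cycler file. The field names are assumptions for illustration.

```python
import json

# Hypothetical intermediate representation of a PCR protocol (field names assumed).
protocol = {
    "polymerase": "Taq",
    "reaction_volume_ul": 25,
    "steps": [
        {"name": "initial_denaturation", "temp_c": 95, "seconds": 180},
        {"name": "denaturation", "temp_c": 95, "seconds": 30},
        {"name": "annealing", "temp_c": 58, "seconds": 30},
        {"name": "extension", "temp_c": 72, "seconds": 60},
        {"name": "final_extension", "temp_c": 72, "seconds": 300},
    ],
    # Repeat the denaturation-to-extension block for 35 cycles.
    "cycles": {"from_step": "denaturation", "to_step": "extension", "count": 35},
}

# Serialize for downstream conversion into thermal cycler operational files.
print(json.dumps(protocol, indent=2))
```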
Affiliation(s)
- Shuo Jiang
- SBX BioSciences, Inc. 1600 - 925 West Georgia Street, Vancouver, BC, V6C 3L2, Canada
- Daniel Evans-Yamamoto
- The Systems Biology Institute, Saisei Ikedayama Bldg., 5-10-25, Higashi Gotanda, Shinagawa-ku, Tokyo, 141-0022, Japan
- Dennis Bersenev
- SBX BioSciences, Inc. 1600 - 925 West Georgia Street, Vancouver, BC, V6C 3L2, Canada
- Sucheendra K Palaniappan
- The Systems Biology Institute, Saisei Ikedayama Bldg., 5-10-25, Higashi Gotanda, Shinagawa-ku, Tokyo, 141-0022, Japan.
- Ayako Yachie-Kinoshita
- SBX BioSciences, Inc. 1600 - 925 West Georgia Street, Vancouver, BC, V6C 3L2, Canada; The Systems Biology Institute, Saisei Ikedayama Bldg., 5-10-25, Higashi Gotanda, Shinagawa-ku, Tokyo, 141-0022, Japan.
7. Tsai CY, Hsieh SJ, Huang HH, Deng JH, Huang YY, Cheng PY. Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings. World J Urol 2024;42:250. [PMID: 38652322] [DOI: 10.1007/s00345-024-04957-8]
Abstract
PURPOSE To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan urology board examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty management tactics to minimize score penalties from incorrect responses across 12 urology domains. METHODS 450 multiple-choice questions from the TUBE (2020-2022) were presented to the two models. Three urologists assessed the correctness and consistency of each response. Accuracy quantifies correct answers; consistency assesses the logic and coherence of explanations relative to total responses, alongside a penalty-reduction experiment with prompt variations. Univariate logistic regression was applied for subgroup comparison. RESULTS ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (33.8%, OR = 2.68, 95% CI [2.05-3.52]). It could have passed the TUBE written exams on accuracy alone but failed on the final score due to penalties. ChatGPT-4 displayed a declining accuracy trend over time. Variability in accuracy across the 12 urological domains was noted, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency towards overconfidence, which may hinder medical decision-making. CONCLUSIONS ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancement of urology-specific AI tools.
Affiliation(s)
- Chung-You Tsai
- Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
- Shang-Ju Hsieh
- Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan
- Hung-Hsiang Huang
- Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan
- Juinn-Horng Deng
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
- Yi-You Huang
- Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan
- Pai-Yu Cheng
- Divisions of Urology, Department of Surgery, Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City, 220, Taiwan.
- Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan.
8. Ye G. De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning. J Comput Aided Mol Des 2024;38:20. [PMID: 38647700] [PMCID: PMC11035455] [DOI: 10.1007/s10822-024-00559-z]
Abstract
In recent years, generative machine learning algorithms have been successful in designing innovative drug-like molecules. SMILES is a sequence-like language used in most effective drug design models. Due to the data's sequential structure, models such as recurrent neural networks and transformers can design pharmacological compounds with optimized efficacy. Large language models have advanced recently, but their implications for drug design have not yet been explored. Although one study successfully pre-trained a large chemistry model (LCM), its application to specific tasks in drug discovery remained unknown. In this study, the drug design task is modeled as a causal language modeling problem. Thus, a procedure of reward modeling, supervised fine-tuning, and proximal policy optimization was used to transfer the LCM to drug design, similar to OpenAI's ChatGPT and InstructGPT procedures. By combining the SMILES sequence with chemical descriptors, the novel efficacy evaluation model exceeded the performance reported in previous studies. After proximal policy optimization, the drug design model generated molecules of which 99.2% had efficacy pIC50 > 7 towards the amyloid precursor protein, with 100% of the generated molecules being valid and novel. This demonstrates the applicability of LCMs in drug discovery, with benefits including lower data consumption during fine-tuning. The applicability of LCMs to drug discovery opens the door for larger studies involving reinforcement learning from human feedback, where chemists provide feedback to LCMs to generate higher-quality molecules. LCMs' ability to design similar molecules from datasets paves the way for more accessible, non-patented alternatives to drug molecules.
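A generate-then-validate loop is common to SMILES language models of this kind; the sketch below samples candidates from a causal LM and keeps only chemically valid strings. The checkpoint name is a placeholder, and the RDKit validity check merely stands in for the paper's full efficacy scoring.

```python
from rdkit import Chem
from transformers import pipeline

# Placeholder checkpoint for a causal LM fine-tuned on SMILES strings.
generator = pipeline("text-generation", model="my-org/smiles-gpt")

def sample_valid_smiles(prompt: str = "CC", n: int = 50) -> list[str]:
    """Sample candidate molecules and keep only the chemically valid ones."""
    outputs = generator(prompt, num_return_sequences=n, max_new_tokens=64, do_sample=True)
    candidates = {o["generated_text"].strip() for o in outputs}
    # MolFromSmiles returns None for strings that are not valid molecules.
    return [s for s in candidates if Chem.MolFromSmiles(s) is not None]
```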
Affiliation(s)
- Gavin Ye
- Columbia Grammar & Preparatory School, New York, NY, USA.
9. Kawahara T, Sumi Y. GPT-4/4V's performance on the Japanese National Medical Licensing Examination. Med Teach 2024:1-8. [PMID: 38648547] [DOI: 10.1080/0142159x.2024.2342545]
Abstract
BACKGROUND Recent advances in Artificial Intelligence (AI) are changing the medical world, and AI will likely replace many of the tasks performed by medical professionals. The overall clinical ability of AI has so far been evaluated by its ability to answer text-based national medical examinations. This study uniquely assesses the performance of OpenAI's ChatGPT on the entire Japanese National Medical Licensing Examination (NMLE), including questions with images, illustrations, and pictures. METHODS We obtained the questions of the past six years of the NMLE (112th to 117th) from the Japanese Ministry of Health, Labour and Welfare website and converted them to JavaScript Object Notation (JSON) format. We created an application programming interface (API) to output answers using GPT-4 for questions without images and GPT-4V(ision) or the GPT-4 console for questions with images. RESULTS Image questions accounted for 723/2400 (30.1%) of questions over the past six years. In all years, GPT-4/4V exceeded the minimum score an examinee must achieve. In total, over the six years, the percentage of correct answers was 665/905 (73.5%) for basic medical knowledge questions, 1143/1531 (74.7%) for clinical knowledge questions, and 497/723 (68.7%) for image questions. CONCLUSIONS Regarding medical knowledge, GPT-4/4V met the minimum criteria regardless of whether the questions included images, illustrations, or pictures. Our study sheds light on the potential utility of AI in medical education.
Affiliation(s)
- Tomoki Kawahara
- Department of Clinical Information Applied Sciences, Tokyo Medical and Dental University, Tokyo, Japan
- Yuki Sumi
- Department of Clinical Information Applied Sciences, Tokyo Medical and Dental University, Tokyo, Japan
10. Wang A, Liu C, Yang J, Weng C. Fine-tuning Large Language Models for Rare Disease Concept Normalization. bioRxiv 2024:2023.12.28.573586. [PMID: 38234802] [PMCID: PMC10793431] [DOI: 10.1101/2023.12.28.573586]
Abstract
Objective We aim to develop a novel method for rare disease concept normalization by fine-tuning Llama 2, an open-source large language model (LLM), using a domain-specific corpus sourced from the Human Phenotype Ontology (HPO). Methods We developed an in-house template-based script to generate two corpora for fine-tuning. The first (NAME) contains standardized HPO names, sourced from the HPO vocabularies, along with their corresponding identifiers. The second (NAME+SYN) includes the HPO names and half of each concept's synonyms, as well as the identifiers. We then fine-tuned Llama 2 (Llama2-7B) on each sentence set and conducted an evaluation using a range of sentence prompts and various phenotype terms. Results When the phenotype terms for normalization were included in the fine-tuning corpora, both models demonstrated nearly perfect performance, averaging over 99% accuracy. In comparison, ChatGPT-3.5 had only ~20% accuracy in identifying HPO IDs for phenotype terms. When single-character typos were introduced into the phenotype terms, the accuracy of NAME and NAME+SYN dropped to 10.2% and 36.1%, respectively, but increased to 61.8% (NAME+SYN) with additional typo-specific fine-tuning. For terms sourced from the HPO vocabularies as unseen synonyms, the NAME model achieved 11.2% accuracy, while the NAME+SYN model achieved 92.7% accuracy. Conclusion Our fine-tuned models demonstrate the ability to normalize phenotype terms unseen in the fine-tuning corpus, including misspellings, synonyms, terms from other ontologies, and lay terms. Our approach provides a solution for using LLMs to identify named medical entities in clinical narratives while successfully normalizing them to standard concepts in a controlled vocabulary.
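A hedged sketch of the template-based corpus construction described in the Methods: each HPO name (and, for NAME+SYN, half of each concept's synonyms) is slotted into sentences paired with its identifier. The terms and templates below are illustrative, not the authors' script.

```python
import json
import random

# Toy (name, HPO ID, synonyms) triples standing in for terms parsed from the HPO.
hpo_terms = [
    ("Seizure", "HP:0001250", ["Epileptic seizure", "Seizures"]),
    ("Hypotonia", "HP:0001252", ["Low muscle tone", "Muscle hypotonia"]),
]

templates = [
    "The phenotype term {term} corresponds to {hpo_id}.",
    "{term} is normalized to the HPO identifier {hpo_id}.",
]

def build_corpus(include_synonyms: bool = False) -> list[dict]:
    """Emit fine-tuning sentences for the NAME or NAME+SYN setting."""
    rows = []
    for name, hpo_id, synonyms in hpo_terms:
        forms = [name]
        if include_synonyms:  # NAME+SYN adds half of each concept's synonyms
            forms += synonyms[: max(1, len(synonyms) // 2)]
        for form in forms:
            rows.append({"text": random.choice(templates).format(term=form, hpo_id=hpo_id)})
    return rows

print(json.dumps(build_corpus(include_synonyms=True), indent=2))
```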
Affiliation(s)
- Andy Wang
- Peddie School, Hightstown, NJ, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Jingye Yang
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
11. Yu Z, Peng C, Yang X, Dang C, Adekkanattu P, Gopal Patra B, Peng Y, Pathak J, Wilson DL, Chang CY, Lo-Ciganic WH, George TJ, Hogan WR, Guo Y, Bian J, Wu Y. Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias. J Biomed Inform 2024;153:104642. [PMID: 38621641] [DOI: 10.1016/j.jbi.2024.104642]
Abstract
OBJECTIVE To develop a natural language processing (NLP) package to extract social determinants of health (SDoH) from clinical narratives, examine bias across race and gender groups, test the generalizability of SDoH extraction across disease groups, and examine the population-level extraction ratio. METHODS We developed SDoH corpora using clinical notes identified at the University of Florida (UF) Health. We systematically compared 7 transformer-based large language models (LLMs) and developed an open-source package, SODA (i.e., SOcial DeterminAnts), to facilitate SDoH extraction from clinical narratives. We examined the performance and potential bias of SODA for different race and gender groups, tested the generalizability of SODA using two disease domains, cancer and opioid use, and explored strategies for improvement. We applied SODA to extract 19 categories of SDoH from the breast (n = 7,971), lung (n = 11,804), and colorectal cancer (n = 6,240) cohorts to assess the patient-level extraction ratio and examine differences among race and gender groups. RESULTS We developed an SDoH corpus using 629 clinical notes of cancer patients with annotations of 13,193 SDoH concepts/attributes from 19 categories of SDoH, and another cross-disease validation corpus using 200 notes from opioid use patients with 4,342 SDoH concepts/attributes. We compared 7 transformer models; the GatorTron model achieved the best mean average strict/lenient F1 scores of 0.9122 and 0.9367 for SDoH concept extraction and 0.9584 and 0.9593 for linking attributes to SDoH concepts. There was a small performance gap (∼4%) between Males and Females, but a large performance gap (>16%) among race groups. The performance dropped when we applied the cancer SDoH model to the opioid cohort; fine-tuning using a smaller opioid SDoH corpus improved the performance. The extraction ratio varied in the three cancer cohorts: 10 SDoH could be extracted from over 70% of cancer patients, but 9 SDoH could be extracted from less than 70% of cancer patients. Individuals from the White and Black groups had a higher extraction ratio than other minority race groups. CONCLUSIONS Our SODA package achieved good performance in extracting 19 categories of SDoH from clinical narratives. The SODA package with pre-trained transformer models is available at https://github.com/uf-hobi-informatics-lab/SODA_Docker.
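For orientation, applying a released transformer NER model to a note reduces to a few lines with the transformers pipeline; the checkpoint name below is a placeholder, and the actual SODA models and usage are documented at the GitHub link above.

```python
from transformers import pipeline

# Placeholder checkpoint; see the SODA repository for the released models.
ner = pipeline("token-classification", model="my-org/sdoh-ner",
               aggregation_strategy="simple")

note = "Patient is a retired welder, lives alone, and smokes 1 pack per day."
for entity in ner(note):
    # Each hit carries the SDoH label, the matched text span, and a confidence score.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```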
Affiliation(s)
- Zehao Yu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cheng Peng
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Chong Dang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Prakash Adekkanattu
- Information Technologies and Services, Weill Cornell Medicine, New York, NY, USA
- Braja Gopal Patra
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Jyotishman Pathak
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Debbie L Wilson
- Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Ching-Yuan Chang
- Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Wei-Hsuan Lo-Ciganic
- Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Thomas J George
- Division of Hematology & Oncology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- William R Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yi Guo
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
12. Kamihara T, Tabuchi M, Omura T, Suzuki Y, Aritake T, Hirashiki A, Kokubo M, Shimizu A. Evolution of a Large Language Model for Preoperative Assessment Based on the Japanese Circulation Society 2022 Guideline on Perioperative Cardiovascular Assessment and Management for Non-Cardiac Surgery. Circ Rep 2024;6:142-148. [PMID: 38606418] [PMCID: PMC11004031] [DOI: 10.1253/circrep.cr-24-0019]
Abstract
Background: The Japanese Circulation Society 2022 Guideline on Perioperative Cardiovascular Assessment and Management for Non-Cardiac Surgery standardizes preoperative cardiovascular assessments. The present study investigated the efficacy of a large language model (LLM) in providing accurate responses meeting the JCS 2022 Guideline. Methods and Results: Data on consultation requests, physicians' cardiovascular records, and patients' response content were analyzed. Virtual scenarios were created using real-world clinical data, and a LLM was then consulted for such scenarios. Conclusions: Google BARD could accurately provide responses in accordance with the JCS 2022 Guideline in low-risk cases. Google Gemini has significantly improved its accuracy in intermediate- and high-risk cases.
Affiliation(s)
- Takahiro Kamihara
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
- Masanori Tabuchi
- Department of Nursing, National Center for Geriatrics and Gerontology, Obu, Japan
- Takuya Omura
- Department of Metabolism, National Center for Geriatrics and Gerontology, Obu, Japan
- Yumi Suzuki
- Department of Surgery, National Center for Geriatrics and Gerontology, Obu, Japan
- Tsukasa Aritake
- Department of Surgery, National Center for Geriatrics and Gerontology, Obu, Japan
- Akihiro Hirashiki
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
- Manabu Kokubo
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
- Atsuya Shimizu
- Department of Cardiology, National Center for Geriatrics and Gerontology, Obu, Japan
13. Itelman E, Golovchiner G, Barsheshet A, Kornowski R, Erez A. Balancing innovation and professionalism: The emerging role of AI-powered chatbots in medical consultation. Heart Rhythm 2024:S1547-5271(24)02327-0. [PMID: 38588991] [DOI: 10.1016/j.hrthm.2024.04.010]
Affiliation(s)
- Edward Itelman
- Cardiology Division, Rabin Medical Center, Petah Tikva, Israel.
- Alon Barsheshet
- Cardiology Division, Rabin Medical Center, Petah Tikva, Israel
- Ran Kornowski
- Cardiology Division, Rabin Medical Center, Petah Tikva, Israel
- Aharon Erez
- Cardiology Division, Rabin Medical Center, Petah Tikva, Israel
14. Gui H, Rezaei SJ, Schlessinger D, Weed J, Lester J, Wongvibulsin S, Mitchell D, Ko J, Rotemberg V, Lee I, Daneshjou R. Dermatologists' Perspectives and Usage of Large Language Models in Practice: An Exploratory Survey. J Invest Dermatol 2024:S0022-202X(24)00270-7. [PMID: 38582369] [DOI: 10.1016/j.jid.2024.03.028]
Affiliation(s)
- Haiwen Gui
- Department of Dermatology, Stanford University, Redwood City, California, USA.
- Shawheen J Rezaei
- Department of Dermatology, Stanford University, Redwood City, California, USA
- Daniel Schlessinger
- Division of Dermatology, Department of Medicine, Washington University School of Medicine, St. Louis, Missouri, USA
- Jason Weed
- The Ronald O. Perelman Department of Dermatology, NYU Grossman School of Medicine, New York, New York, USA
- Jenna Lester
- Department of Dermatology, University of California San Francisco, San Francisco, California, USA
- Shannon Wongvibulsin
- Division of Dermatology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, USA
- Dom Mitchell
- Department of Dermatology, Stanford University, Redwood City, California, USA
- Justin Ko
- Department of Dermatology, Stanford University, Redwood City, California, USA
- Veronica Rotemberg
- Dermatology Service, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Ivy Lee
- Pasadena Premier Dermatology, Pasadena, California, USA
- Roxana Daneshjou
- Department of Dermatology, Stanford University, Redwood City, California, USA
15. Zhang S, Liau ZQG, Tan KLM, Chua WL. Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement. Knee Surg Relat Res 2024;36:15. [PMID: 38566254] [PMCID: PMC10986046] [DOI: 10.1186/s43019-024-00218-5]
Abstract
BACKGROUND Chat Generative Pretrained Transformer (ChatGPT), a generative artificial intelligence chatbot, may have broad applications in healthcare delivery and patient education due to its ability to provide human-like responses to a wide range of patient queries. However, there is limited evidence regarding its ability to provide reliable and useful information on orthopaedic procedures. This study seeks to evaluate the accuracy and relevance of responses provided by ChatGPT to frequently asked questions (FAQs) regarding total knee replacement (TKR). METHODS A list of 50 clinically-relevant FAQs regarding TKR was collated. Each question was individually entered as a prompt to ChatGPT (version 3.5), and the first response generated was recorded. Responses were then reviewed by two independent orthopaedic surgeons and graded on a Likert scale for their factual accuracy and relevance. These responses were then classified into accurate versus inaccurate and relevant versus irrelevant responses using preset thresholds on the Likert scale. RESULTS Most responses were accurate, while all responses were relevant. Of the 50 FAQs, 44/50 (88%) of ChatGPT responses were classified as accurate, achieving a mean Likert grade of 4.6/5 for factual accuracy. On the other hand, 50/50 (100%) of responses were classified as relevant, achieving a mean Likert grade of 4.9/5 for relevance. CONCLUSION ChatGPT performed well in providing accurate and relevant responses to FAQs regarding TKR, demonstrating great potential as a tool for patient education. However, it is not infallible and can occasionally provide inaccurate medical information. Patients and clinicians intending to utilize this technology should be mindful of its limitations and ensure adequate supervision and verification of information provided.
Affiliation(s)
- Siyuan Zhang
- Department of Orthopaedic Surgery, National University Health System, Level 11, NUHS Tower Block, 1E Kent Ridge Road, Singapore, 119228, Singapore.
- Zi Qiang Glen Liau
- Department of Orthopaedic Surgery, National University Health System, Level 11, NUHS Tower Block, 1E Kent Ridge Road, Singapore, 119228, Singapore
- Kian Loong Melvin Tan
- Department of Orthopaedic Surgery, National University Health System, Level 11, NUHS Tower Block, 1E Kent Ridge Road, Singapore, 119228, Singapore
- Wei Liang Chua
- Department of Orthopaedic Surgery, National University Health System, Level 11, NUHS Tower Block, 1E Kent Ridge Road, Singapore, 119228, Singapore
16. Zhenzhu L, Jingfeng Z, Wei Z, Jianjun Z, Yinshui X. GPT-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation. Sci Rep 2024;14:7626. [PMID: 38561445] [PMCID: PMC10985066] [DOI: 10.1038/s41598-024-58514-9]
Abstract
This study explored the application of generative pre-trained transformer (GPT) agents based on medical guidelines, using large language model (LLM) technology, for traumatic brain injury (TBI) rehabilitation-related questions. To assess the effectiveness of multiple agents (GPT-agents) created using GPT-4, a comparison was conducted using direct GPT-4 as the control group (GPT-4). The GPT-agents comprised multiple agents with distinct functions, including "Medical Guideline Classification", "Question Retrieval", "Matching Evaluation", "Intelligent Question Answering (QA)", and "Results Evaluation and Source Citation". Brain rehabilitation questions were selected from a doctor-patient Q&A database for assessment. The primary endpoint was a better answer. The secondary endpoints were accuracy, completeness, explainability, and empathy. Thirty questions were answered; overall, the GPT-agents took substantially longer and used more words to respond than GPT-4 (time: 54.05 vs. 9.66 s; words: 371 vs. 57). However, the GPT-agents provided superior answers in more cases than GPT-4 (66.7 vs. 33.3%). The GPT-agents surpassed GPT-4 in the accuracy evaluation (3.8 ± 1.02 vs. 3.2 ± 0.96, p = 0.0234). No difference in completeness was found (2.0 ± 0.87 vs. 1.7 ± 0.79, p = 0.213). However, in terms of explainability (2.79 ± 0.45 vs. 2.07 ± 0.52, p < 0.001) and empathy (2.63 ± 0.57 vs. 1.08 ± 0.51, p < 0.001), the GPT-agents performed notably better. Based on medical guidelines, GPT-agents enhanced the accuracy and empathy of responses to TBI rehabilitation questions. This study provides guideline references and demonstrates improved clinical explainability. However, further validation through multicenter trials in a clinical setting is necessary. This study offers practical insights and establishes groundwork for the potential theoretical integration of LLM agents into medicine.
Affiliation(s)
- Li Zhenzhu
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Department of Neurosurgery, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, 315211, China
- Zhang Jingfeng
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Zhou Wei
- Department of Neurosurgery, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Zheng Jianjun
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China.
- Xia Yinshui
- Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, 315211, China.
17. Kim H, Kim P, Joo I, Kim JH, Park CM, Yoon SH. ChatGPT Vision for Radiological Interpretation: An Investigation Using Medical School Radiology Examinations. Korean J Radiol 2024;25:403-406. [PMID: 38528699] [PMCID: PMC10973733] [DOI: 10.3348/kjr.2024.0017]
Affiliation(s)
- Hyungjin Kim
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
- Paul Kim
- Graduate School of Education, Stanford University, Stanford, CA, USA
- Ijin Joo
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
- Jung Hoon Kim
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
- Chang Min Park
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea
- Soon Ho Yoon
- Department of Radiology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Republic of Korea.
18. Zhang K, Wang S, Jia N, Zhao L, Han C, Li L. Integrating visual large language model and reasoning chain for driver behavior analysis and risk assessment. Accid Anal Prev 2024;198:107497. [PMID: 38330547] [DOI: 10.1016/j.aap.2024.107497]
Abstract
Driver behavior is a critical factor in driving safety, making the development of sophisticated distraction classification methods essential. Our study presents a Distracted Driving Classification (DDC) approach utilizing a visual Large Language Model (LLM), named the Distracted Driving Language Model (DDLM). The DDLM introduces whole-body human pose estimation to isolate and analyze key postural features-head, right hand, and left hand-for precise behavior classification and better interpretability. Recognizing the inherent limitations of LLMs, particularly their lack of logical reasoning abilities, we have integrated a reasoning chain framework within the DDLM, allowing it to generate clear, reasoned explanations for its assessments. Tailored specifically with relevant data, the DDLM demonstrates enhanced performance, providing detailed, context-aware evaluations of driver behaviors and corresponding risk levels. Notably outperforming standard models in both zero-shot and few-shot learning scenarios, as evidenced by tests on the 100-Driver dataset, the DDLM stands out as an advanced tool that promises significant contributions to driving safety by accurately detecting and analyzing driving distractions.
Affiliation(s)
- Kunpeng Zhang
- College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China; Department of Automation, Tsinghua University, Beijing 100084, China
- Shipu Wang
- College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China
- Ning Jia
- College of Management and Economics, Tianjin University, Tianjin 300072, China
- Liang Zhao
- College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China.
- Chunyang Han
- Department of Automation, Tsinghua University, Beijing 100084, China; Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650500, China.
- Li Li
- Department of Automation, Tsinghua University, Beijing 100084, China
19. Kinoshita M, Komasaka M, Tanaka K. ChatGPT's performance on JSA-certified anesthesiologist exam. J Anesth 2024;38:282-283. [PMID: 37902835] [DOI: 10.1007/s00540-023-03275-4]
Affiliation(s)
- Michiko Kinoshita
- Department of Anesthesiology, Tokushima University Hospital, 2-50-1 Kuramoto-cho, Tokushima-shi, Tokushima, 770-8503, Japan.
- Mizuki Komasaka
- Department of Anesthesiology, Tokushima University Hospital, 2-50-1 Kuramoto-cho, Tokushima-shi, Tokushima, 770-8503, Japan
- Katsuya Tanaka
- Department of Anesthesiology, Tokushima University Hospital, 2-50-1 Kuramoto-cho, Tokushima-shi, Tokushima, 770-8503, Japan
20. Gu Z, He X, Yu P, Jia W, Yang X, Peng G, Hu P, Chen S, Chen H, Lin Y. Automatic quantitative stroke severity assessment based on Chinese clinical named entity recognition with domain-adaptive pre-trained large language model. Artif Intell Med 2024;150:102822. [PMID: 38553162] [DOI: 10.1016/j.artmed.2024.102822]
Abstract
BACKGROUND Stroke is a prevalent disease with a significant global impact. Effective assessment of stroke severity is vital for an accurate diagnosis, appropriate treatment, and optimal clinical outcomes. The National Institutes of Health Stroke Scale (NIHSS) is a widely used scale for quantitatively assessing stroke severity. However, the current manual scoring of NIHSS is labor-intensive, time-consuming, and sometimes unreliable. Applying artificial intelligence (AI) techniques to automate the quantitative assessment of stroke on vast amounts of electronic health records (EHRs) has attracted much interest. OBJECTIVE This study aims to develop an automatic, quantitative stroke severity assessment framework through automating the entire NIHSS scoring process on Chinese clinical EHRs. METHODS Our approach consists of two major parts: Chinese clinical named entity recognition (CNER) with a domain-adaptive pre-trained large language model (LLM) and automated NIHSS scoring. To build a high-performing CNER model, we first construct a stroke-specific, densely annotated dataset "Chinese Stroke Clinical Records" (CSCR) from EHRs provided by our partner hospital, based on a stroke ontology that defines semantically related entities for stroke assessment. We then pre-train a Chinese clinical LLM coined "CliRoberta" through domain-adaptive transfer learning and construct a deep learning-based CNER model that can accurately extract entities directly from Chinese EHRs. Finally, an automated, end-to-end NIHSS scoring pipeline is proposed by mapping the extracted entities to relevant NIHSS items and values, to quantitatively assess the stroke severity. RESULTS Results obtained on a benchmark dataset CCKS2019 and our newly created CSCR dataset demonstrate the superior performance of our domain-adaptive pre-trained LLM and the CNER model, compared with the existing benchmark LLMs and CNER models. The high F1 score of 0.990 ensures the reliability of our model in accurately extracting the entities for the subsequent automatic NIHSS scoring. Subsequently, our automated, end-to-end NIHSS scoring approach achieved excellent inter-rater agreement (0.823) and intraclass consistency (0.986) with the ground truth and significantly reduced the processing time from minutes to a few seconds. CONCLUSION Our proposed automatic and quantitative framework for assessing stroke severity demonstrates exceptional performance and reliability through directly scoring the NIHSS from diagnostic notes in Chinese clinical EHRs. Moreover, this study also contributes a new clinical dataset, a pre-trained clinical LLM, and an effective deep learning-based CNER model. The deployment of these advanced algorithms can improve the accuracy and efficiency of clinical assessment, and help improve the quality, affordability and productivity of healthcare services.
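The final scoring step can be pictured as a lookup from extracted (category, value) pairs to NIHSS items and points, as in this simplified sketch; the rules shown are a hypothetical subset, while the real pipeline covers every scale item as defined by the paper's stroke ontology.

```python
# Illustrative rules mapping extracted entities to NIHSS items (subset only).
NIHSS_RULES = {
    ("loc", "alert"): ("1a_loc", 0),
    ("loc", "drowsy"): ("1a_loc", 1),
    ("motor_arm", "no_drift"): ("5_motor_arm", 0),
    ("motor_arm", "drift"): ("5_motor_arm", 1),
}

def score_nihss(entities):
    """Sum item scores for (category, value) pairs produced by the NER model."""
    items = {}
    for category, value in entities:
        rule = NIHSS_RULES.get((category, value))
        if rule:
            item, points = rule
            items[item] = points  # last mention wins; real systems need conflict handling
    return sum(items.values()), items

total, detail = score_nihss([("loc", "drowsy"), ("motor_arm", "drift")])
print(total, detail)  # 2 {'1a_loc': 1, '5_motor_arm': 1}
```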
Affiliation(s)
- Zhanzhong Gu
- School of Electrical and Data Engineering, University of Technology Sydney, NSW, 2007, Australia.
- Xiangjian He
- School of Electrical and Data Engineering, University of Technology Sydney, NSW, 2007, Australia; School of Computer Science, University of Nottingham Ningbo China, Ningbo, China
- Ping Yu
- School of Computing and Information Technology, University of Wollongong, NSW, 2522, Australia
- Wenjing Jia
- School of Electrical and Data Engineering, University of Technology Sydney, NSW, 2007, Australia
- Xiguang Yang
- School of Electrical and Data Engineering, University of Technology Sydney, NSW, 2007, Australia
- Gang Peng
- Intergenepharm Pty Ltd, Sydney, NSW, 2000, Australia
- Penghui Hu
- Department of Oncology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Shiyan Chen
- Department of Neurology, The First Affiliated Hospital of Fujian Medical University, Fuzhou, China
- Hongjie Chen
- Department of Traditional Chinese Medicine, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
- Yiguang Lin
- Department of Traditional Chinese Medicine, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China; Department of Immuno-Oncology, The First Affiliated Hospital of Guangdong Pharmaceutical University, China; School of Life Sciences, University of Technology Sydney, NSW, 2007, Australia
21. Bernstorff M, Vistisen ST, Enevoldsen KC. Natural language processing for electronic health records in anaesthesiology: an introduction to clinicians with recommendations and pitfalls. J Clin Monit Comput 2024;38:241-245. [PMID: 38310589] [PMCID: PMC10995065] [DOI: 10.1007/s10877-024-01128-3]
Affiliation(s)
- Martin Bernstorff
- Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Center for Humanities Computing, Aarhus University, Jens Chr. Skous Vej 4, Aarhus N, 8200, Denmark
- Simon Tilma Vistisen
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark.
- Department of Anaesthesiology and Intensive Care, Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, Aarhus N, 8200, Denmark.
- Kenneth C Enevoldsen
- Department of Affective Disorders, Aarhus University Hospital - Psychiatry, Aarhus, Denmark
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Center for Humanities Computing, Aarhus University, Jens Chr. Skous Vej 4, Aarhus N, 8200, Denmark
- Quantitative Genomics Group, Aarhus University, Aarhus N, Denmark
Collapse
|
22
|
Peng C, Yang X, Smith KE, Yu Z, Chen A, Bian J, Wu Y. Model tuning or prompt tuning? A study of large language models for clinical concept and relation extraction. J Biomed Inform 2024; 153:104630. [PMID: 38548007 DOI: 10.1016/j.jbi.2024.104630]
Abstract
OBJECTIVE To develop a soft prompt-based learning architecture for large language models (LLMs), examine prompt tuning with frozen and unfrozen LLMs, and assess their abilities in transfer learning and few-shot learning. METHODS We developed a soft prompt-based learning architecture and compared 4 strategies: (1) fine-tuning without prompts; (2) hard prompting with unfrozen LLMs; (3) soft prompting with unfrozen LLMs; and (4) soft prompting with frozen LLMs. We evaluated GatorTron, a clinical LLM with up to 8.9 billion parameters, and compared it with 4 existing transformer models for clinical concept and relation extraction on 2 benchmark datasets for adverse drug events and social determinants of health (SDoH). We also evaluated few-shot learning ability and generalizability for cross-institution applications. RESULTS AND CONCLUSION When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6%-3.1% and 1.2%-2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2%-2% and 0.6%-11.7%, respectively. When LLMs are frozen, small LLMs lag far behind unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen ones. Soft prompting with a frozen GatorTron-8.9B model achieved the best performance in cross-institution evaluation. We demonstrate that (1) machines can learn soft prompts better than hard prompts composed by humans, (2) frozen LLMs have good few-shot learning ability and generalizability for cross-institution applications, (3) frozen LLMs reduce computing cost to 2.5%-6% of that of previous methods using unfrozen LLMs, and (4) frozen LLMs require large models (e.g., over several billion parameters) for good performance.
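The core idea of soft prompting, prepending trainable continuous embeddings to a frozen model, can be sketched as follows; the encoder, dimensions, and prompt length are placeholders, not the GatorTron setup.

```python
# Minimal sketch of soft prompting with a frozen transformer (strategy 4 above).
# The wrapped encoder and its dimensions are illustrative; GatorTron is not used here.
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    def __init__(self, encoder, n_prompt_tokens=20, hidden_size=768):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the LLM
            p.requires_grad = False
        # The only trainable parameters: continuous "soft prompt" embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds, attention_mask):
        b = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompt, input_embeds], dim=1)     # prepend the soft prompt
        prompt_mask = torch.ones(b, prompt.size(1),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.encoder(inputs_embeds=x, attention_mask=mask)
```

During training only `soft_prompt` receives gradients, which is what makes the frozen-LLM strategy so much cheaper than full fine-tuning.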
Affiliation(s)
- Cheng Peng
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Zehao Yu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Aokun Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.

23
Koga S, Du W. ChatGPT's limited accuracy in generating anatomical images for medical education. Skeletal Radiol 2024. [PMID: 38506966 DOI: 10.1007/s00256-024-04655-x]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA, 19104, USA.
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA, 19104, USA

24
Suzuki R, Arita T. An evolutionary model of personality traits related to cooperative behavior using a large language model. Sci Rep 2024; 14:5989. [PMID: 38503778 PMCID: PMC10951268 DOI: 10.1038/s41598-024-55903-y]
Abstract
This study aims to demonstrate that large language models (LLMs) can empower research on the evolution of human behavior grounded in evolutionary game theory, using an evolutionary model in which instructing LLMs with high-level psychological and cognitive character descriptions enables the simulation of human behavioral choices in game-theoretic scenarios. As a first step toward this objective, this paper proposes an evolutionary model of personality traits related to cooperative behavior using a large language model. In the model, linguistic descriptions of personality traits related to cooperative behavior are used as genes. The deterministic strategies extracted from the LLM, which make behavioral decisions based on these personality traits, are used as behavioral traits. The population evolves through selection based on average payoff and through mutation of genes, performed by asking the LLM to slightly modify the parent gene toward more cooperative or more selfish behavior. Through experiments and analyses, we show that such a model can indeed exhibit the evolution of cooperative behavior based on diverse, higher-order representations of personality traits. We also observed repeated invasions of cooperative and selfish personality traits through changes in the expression of those traits. The words that emerged in the evolved genes reflected the behavioral tendencies of their associated personalities semantically, thereby influencing individual behavior and, consequently, the evolutionary dynamics.
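The evolutionary loop the abstract describes can be sketched with the LLM calls stubbed out; the `ask_llm_*` functions, payoff matrix, and personality strings below are illustrative stand-ins, not the paper's prompts or parameters.

```python
# Sketch of the evolutionary loop, with the two LLM calls stubbed out.
import random

def ask_llm_strategy(personality: str) -> str:
    """Placeholder: extract a deterministic strategy ('C' or 'S') from a personality."""
    return "C" if "cooperat" in personality.lower() else "S"

def ask_llm_mutate(personality: str, direction: str) -> str:
    """Placeholder: ask the LLM to nudge the description toward a direction."""
    return personality + f" (slightly more {direction})"

# Prisoner's-dilemma-style payoffs for (my move, opponent's move).
PAYOFF = {("C", "C"): 3, ("C", "S"): 0, ("S", "C"): 5, ("S", "S"): 1}

population = ["a cooperative, trusting person", "a selfish, opportunistic person"] * 5

for generation in range(10):
    strategies = [ask_llm_strategy(p) for p in population]
    # Average payoff of each individual against the whole population.
    fitness = [sum(PAYOFF[(s, t)] for t in strategies) / len(strategies)
               for s in strategies]
    # Fitness-proportional selection, then mutation via the (stubbed) LLM.
    parents = random.choices(population, weights=fitness, k=len(population))
    population = [ask_llm_mutate(p, random.choice(["cooperative", "selfish"]))
                  for p in parents]

print(population[0])
```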
Affiliation(s)
- Reiji Suzuki
- Graduate School of Informatics, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan.
- Takaya Arita
- Graduate School of Informatics, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan

25
Mat Q, Briganti G, Maniaci A, Lelubre C. Will ChatGPT soon replace otolaryngologists? Eur Arch Otorhinolaryngol 2024. [PMID: 38438614 DOI: 10.1007/s00405-024-08543-x]
Affiliation(s)
- Quentin Mat
- Department of Otorhinolaryngology, C.H.U. Charleroi, Chaussée de Bruxelles 140, 6042, Charleroi, Belgium.
- Faculty of Medicine and Pharmacy, University of Mons (UMons), Mons, Belgium.
- Giovanni Briganti
- Faculty of Medicine and Pharmacy, University of Mons (UMons), Mons, Belgium
- Department of Clinical Science, Faculty of Medicine, University of Liège, Quartier Hôpital, Avenue Hippocrate 13, 4000, Liege, Belgium
- Faculty of Medicine, Université Libre de Bruxelles, Route de Lennik 808, 1070, Brussels, Belgium
- Antonino Maniaci
- Faculty of Medicine and Surgery, University of Enna "Kore", Enna, Italy
- Christophe Lelubre
- Faculty of Medicine and Pharmacy, University of Mons (UMons), Mons, Belgium
- Department of Internal Medicine, C.H.U. Charleroi, Charleroi, Belgium

26
Hu D, Liu B, Zhu X, Lu X, Wu N. Zero-shot information extraction from radiological reports using ChatGPT. Int J Med Inform 2024; 183:105321. [PMID: 38157785 DOI: 10.1016/j.ijmedinf.2023.105321]
Abstract
INTRODUCTION Electronic health records contain an enormous amount of valuable information recorded in free text. Information extraction is the strategy for transforming free text into structured data, but some of its components require annotated data to tune, which has become a bottleneck. Large language models achieve good performance on various downstream NLP tasks without parameter tuning, making them a possible way to extract information in a zero-shot manner. METHODS In this study, we explore whether the most popular large language model, ChatGPT, can extract information from radiological reports. We first design prompt templates for the information of interest in the CT reports. We then generate prompts by combining the templates with the CT reports as inputs to ChatGPT and obtain the responses. A post-processing module is developed to transform the responses into structured extraction results. In addition, we add prior medical knowledge to the prompt template to reduce erroneous extraction results, and we examine the consistency of the extraction results. RESULTS We conducted experiments with 847 real CT reports. The results indicate that ChatGPT can achieve performance competitive with the baseline information extraction system on some extraction tasks, such as tumor location and tumor long and short diameters. Adding prior medical knowledge to the prompt template yields significant improvements on the tasks concerning tumor spiculation and lobulation, but the tasks concerning tumor density and lymph node status do not improve. CONCLUSION ChatGPT can achieve competitive information extraction from radiological reports in a zero-shot manner. Adding prior medical knowledge as instructions can further improve performance on some extraction tasks but may lead to worse performance on some complex extraction tasks.
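A minimal sketch of the prompt-template-plus-post-processing idea follows, assuming a placeholder `chat` function in place of the ChatGPT API; the template wording and extracted fields are invented, not the paper's.

```python
# Sketch of zero-shot extraction from a CT report via a prompt template.
# `chat` is a stand-in for an LLM API call and returns a canned reply here.
import json

def chat(prompt: str) -> str:
    """Placeholder for the LLM call."""
    return ('{"tumor_location": "left upper lobe", '
            '"long_diameter_mm": 23, "short_diameter_mm": 15}')

TEMPLATE = (
    "You are reading a chest CT report. Extract the following fields and "
    "answer strictly in JSON with keys tumor_location, long_diameter_mm, "
    "short_diameter_mm. Use null when a field is not mentioned.\n"
    "Report:\n{report}"
)

def extract(report: str) -> dict:
    response = chat(TEMPLATE.format(report=report))
    try:
        return json.loads(response)   # post-processing into structured data
    except json.JSONDecodeError:
        return {}                     # fall back when the reply is malformed

print(extract("A 23 x 15 mm spiculated nodule in the left upper lobe ..."))
```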
Affiliation(s)
- Danqing Hu
- Zhejiang Lab, Hangzhou, 311121, Zhejiang, China.
- Bing Liu
- Department of Thoracic Surgery II, Peking University Cancer Hospital and Institute, Beijing, 100142, China
- Xiaofeng Zhu
- Zhejiang Lab, Hangzhou, 311121, Zhejiang, China.
- Xudong Lu
- College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, 310027, Zhejiang, China
- Nan Wu
- Department of Thoracic Surgery II, Peking University Cancer Hospital and Institute, Beijing, 100142, China.

27
Kim K, Cho K, Jang R, Kyung S, Lee S, Ham S, Choi E, Hong GS, Kim N. Updated Primer on Generative Artificial Intelligence and Large Language Models in Medical Imaging for Medical Professionals. Korean J Radiol 2024; 25:224-242. [PMID: 38413108 PMCID: PMC10912493 DOI: 10.3348/kjr.2023.0818]
Abstract
The emergence of Chat Generative Pre-trained Transformer (ChatGPT), a chatbot developed by OpenAI, has garnered interest in the application of generative artificial intelligence (AI) models in the medical field. This review summarizes different generative AI models and their potential applications in medicine, and explores the evolving landscape of generative adversarial networks and diffusion models since the introduction of generative AI. These models have made valuable contributions to the field of radiology. This review also explores the significance of synthetic data in addressing privacy concerns and in augmenting data diversity and quality within the medical domain, and it emphasizes the role of inversion in the investigation of generative models, outlining an approach to replicate this process. We provide an overview of large language models, such as GPT and bidirectional encoder representations from transformers (BERT), focusing on prominent representatives, and discuss recent initiatives involving language-vision models in radiology, including the Large Language and Vision Assistant for Biomedicine (LLaVA-Med), to illustrate their practical application. This comprehensive review offers insights into the wide-ranging applications of generative AI models in clinical research and emphasizes their transformative potential.
Affiliation(s)
- Kiduk Kim
- Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Kyungjin Cho
- Department of Biomedical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Sunggu Kyung
- Department of Biomedical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Soyoung Lee
- Department of Biomedical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Sungwon Ham
- Healthcare Readiness Institute for Unified Korea, Korea University Ansan Hospital, Korea University College of Medicine, Ansan, Republic of Korea
- Edward Choi
- Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
- Gil-Sun Hong
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.
- Namkug Kim
- Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea.

28
Liu P, Ren Y, Tao J, Ren Z. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Comput Biol Med 2024; 171:108073. [PMID: 38359660 DOI: 10.1016/j.compbiomed.2024.108073]
Abstract
Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates graph, image, and text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule-generation validity compared with the baselines. With its any-to-language molecular translation strategy, our model has the potential to perform further downstream tasks, such as compound name recognition and chemical reaction prediction.
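The modality-alignment idea behind GIT-Former can be caricatured as projecting each modality into one normalized latent space; the encoders, dimensions, and similarity computation below are invented placeholders, not the GIT-Former architecture.

```python
# Toy sketch of aligning graph, image, and text features in one latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAligner(nn.Module):
    """Project per-modality features into one shared, L2-normalized latent space."""
    def __init__(self, dims=None, latent=256):
        super().__init__()
        dims = dims or {"graph": 128, "image": 512, "text": 768}  # assumed sizes
        self.proj = nn.ModuleDict({m: nn.Linear(d, latent) for m, d in dims.items()})

    def forward(self, features):
        return {m: F.normalize(self.proj[m](x), dim=-1) for m, x in features.items()}

aligner = ModalityAligner()
z = aligner({"graph": torch.randn(4, 128),
             "image": torch.randn(4, 512),
             "text": torch.randn(4, 768)})
sim = z["graph"] @ z["text"].T   # contrastive-style pairwise similarities
print(sim.shape)                 # torch.Size([4, 4])
```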
Affiliation(s)
- Pengfei Liu
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China; School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, Guangdong Province, China
- Yiming Ren
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China
- Jun Tao
- School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510006, Guangdong Province, China
- Zhixiang Ren
- Peng Cheng Laboratory, Shenzhen, 518055, Guangdong Province, China.

29
Dang R, Hanba C. A large language model's assessment of methodology reporting in head and neck surgery. Am J Otolaryngol 2024; 45:104145. [PMID: 38103488 DOI: 10.1016/j.amjoto.2023.104145]
Abstract
OBJECTIVE The aim of this study was to assess the ability of a large language model, ChatGPT 3.5, to appraise the quality of scientific methodology reporting in head and neck-specific scientific literature. METHODS The authors asked ChatGPT 3.5 to create a grading system for the scientific reporting of research methods. The language model produced a system with a maximum of 60 points, with individual scores for study design and description, data collection and measurement, statistical analysis, ethical considerations, and overall clarity and transparency. Twenty articles were selected at random from the American Head and Neck Society's (AHNS) fellowship curriculum 2.0, and each 'Methods' section was input into ChatGPT 3.5 for scoring. Analysis of variance (ANOVA) was performed between the different scoring categories, followed by a post hoc Tukey HSD test. RESULTS Of the twenty articles assessed, eight were categorized as very good and nine as good based on cumulative score. The lowest mean score was observed for the statistical analysis category (mean = 0.49, SD = 0.02). ANOVA showed a significant difference between the means of the different scoring categories, F(4, 95) = 13.4, p ≤ 0.05. On the post hoc Tukey HSD test, mean scores for the data collection (mean = 0.58, SD = 0.06) and statistical analysis (mean = 0.49, SD = 0.02) categories were significantly lower than those of the other categories. CONCLUSION This article showcases the feasibility of employing a large language model such as ChatGPT 3.5 to assess the methods sections of head and neck academic writing. LEVEL OF EVIDENCE: 4
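The reported tests can be reproduced in Python as follows; the score arrays are fabricated for illustration, not the study's data.

```python
# One-way ANOVA across scoring categories, followed by a post hoc Tukey HSD test.
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import numpy as np

rng = np.random.default_rng(0)
categories = ["design", "data_collection", "statistics", "ethics", "clarity"]
# 20 per-article scores per category (illustrative values, not the study's data).
scores = {c: rng.normal(loc, 0.05, 20)
          for c, loc in zip(categories, [0.80, 0.58, 0.49, 0.85, 0.90])}

f_stat, p = f_oneway(*scores.values())      # ANOVA across the five categories
print(f"F = {f_stat:.1f}, p = {p:.3g}")

values = np.concatenate(list(scores.values()))
groups = np.repeat(categories, 20)
print(pairwise_tukeyhsd(values, groups))    # pairwise category comparisons
```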
Affiliation(s)
- Rushil Dang
- Maxillofacial Oncology and Reconstructive Surgery, Department of Oral and Maxillofacial surgery, Boston Medical Center, Boston, MA, USA
- Curtis Hanba
- Department of Head and Neck Surgery, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.

30
Wei Q, Yao Z, Cui Y, Wei B, Jin Z, Xu X. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inform 2024; 151:104620. [PMID: 38462064 DOI: 10.1016/j.jbi.2024.104620]
Abstract
OBJECTIVE Large language models (LLMs) such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in answering medical questions and to provide direction for future research. METHODS An extensive literature search was conducted on June 15, 2023, across ten medical databases. The keyword used was "ChatGPT," without restrictions on publication type, language, or date. Studies evaluating ChatGPT's performance in answering medical questions were included. Exclusions comprised review articles, comments, patents, non-medical evaluations of ChatGPT, and preprint studies. Data were extracted on general study characteristics, question sources, conversation processes, assessment metrics, and the performance of ChatGPT. An evaluation framework for LLMs in medical inquiries was proposed by integrating insights from the selected literature. This study is registered with PROSPERO, CRD42023456327. RESULTS A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. ChatGPT displayed an overall pooled accuracy of 56% (95% CI: 51%-60%, I² = 87%) in addressing medical queries. However, the studies varied in question source, question-asking process, and evaluation metrics. As per our proposed evaluation framework, many studies failed to report methodological details such as the date of inquiry, the version of ChatGPT, and inter-rater consistency. CONCLUSION This review reveals ChatGPT's potential in addressing medical inquiries, but the heterogeneity of study designs and insufficient reporting may affect the reliability of the results. Our proposed evaluation framework provides insights for future study design and transparent reporting of LLMs responding to medical questions.
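A pooled proportion with an I² estimate, as reported above, is typically obtained with a random-effects model; a minimal DerSimonian-Laird sketch on the logit scale is shown below, with invented per-study counts rather than the review's data.

```python
# Random-effects pooling of accuracy proportions (DerSimonian-Laird, logit scale).
import numpy as np

events = np.array([50, 120, 30, 80])     # correct answers per study (invented)
totals = np.array([100, 180, 70, 150])   # questions per study (invented)

p = events / totals
y = np.log(p / (1 - p))                  # logit-transformed proportions
v = 1 / events + 1 / (totals - events)   # approximate within-study variances

w = 1 / v                                # fixed-effect weights
mu_fixed = np.sum(w * y) / w.sum()
q = np.sum(w * (y - mu_fixed) ** 2)      # Cochran's Q
df = len(y) - 1
tau2 = max(0.0, (q - df) / (w.sum() - np.sum(w**2) / w.sum()))  # DL tau^2
i2 = max(0.0, (q - df) / q) * 100        # I^2 heterogeneity

w_star = 1 / (v + tau2)                  # random-effects weights
mu = np.sum(w_star * y) / w_star.sum()
pooled = 1 / (1 + np.exp(-mu))           # back-transform to a proportion
print(f"pooled accuracy = {pooled:.2f}, I^2 = {i2:.0f}%")
```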
Affiliation(s)
- Qiuhong Wei
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China; Children Nutrition Research Center, Children's Hospital of Chongqing Medical University, Chongqing, China; National Clinical Research Center for Child Health and Disorders, Ministry of Education Key Laboratory of Child Development and Disorders, China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing Key Laboratory of Child Neurodevelopment and Cognitive Disorders, Chongqing, China
- Zhengxiong Yao
- Department of Neurology, Children's Hospital of Chongqing Medical University, Chongqing, China
- Ying Cui
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Bo Wei
- Department of Global Statistics and Data Science, BeiGene USA Inc., San Mateo, CA, USA
- Zhezhen Jin
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
- Ximing Xu
- Big Data Center for Children's Medical Care, Children's Hospital of Chongqing Medical University, Chongqing, China

31
Li S, Guo Z, Zang X. Advancing the Production of Clinical Medical Devices Through ChatGPT. Ann Biomed Eng 2024; 52:441-445. [PMID: 37369944 DOI: 10.1007/s10439-023-03300-3]
Abstract
As a recently popular large language model, Chat Generative Pre-trained Transformer (ChatGPT) is highly valued in the field of clinical medicine. Because the potential impact of ChatGPT on the manufacturing side of clinical medical devices is not yet well understood, we aim to fill this gap in this article. We elucidate the classification of medical devices and explore the positive contributions of ChatGPT to various aspects of medical device design, optimization, and improvement. However, limitations such as the potential for misinterpretation of user intent, the lack of personal experience, and the need for human supervision should be taken into consideration. Striking a balance between ChatGPT and human expertise can ensure the safety, quality, and compliance of medical devices. This work contributes to the advancement of ChatGPT in the medical device manufacturing industry and highlights the synergistic relationship between artificial intelligence and human involvement in healthcare.
Affiliation(s)
- Siqi Li
- Advanced Research Center, GD Midea Equipment Co., Ltd, Foshan, 528000, China
- Zheng Guo
- Orthopedics Department of The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Foshan, 528042, China.
- Xuehui Zang
- Orthopedics Department of The Sixth Affiliated Hospital, School of Medicine, South China University of Technology, Foshan, 528042, China.

32
Pal S, Bhattacharya M, Lee SS, Chakraborty C. A Domain-Specific Next-Generation Large Language Model (LLM) or ChatGPT is Required for Biomedical Engineering and Research. Ann Biomed Eng 2024; 52:451-454. [PMID: 37428337 DOI: 10.1007/s10439-023-03306-x]
Abstract
Large language models such as ChatGPT have recently gained extensive media coverage, and at the same time the use of ChatGPT has increased drastically. Biomedical researchers, engineers, and clinicians have shown significant interest and have started using it because of its diverse applications, especially in the biomedical field. However, ChatGPT has been found to sometimes provide incorrect or only partly correct information, and it is unable to give the most recent information. We therefore urgently advocate a domain-specific, next-generation chatbot for biomedical engineering and research that provides error-free, more accurate, and up-to-date information. Such a domain-specific chatbot could perform diverse functions in biomedical engineering, from supporting innovation to designing medical devices. If a biomedical domain-specific chatbot is produced, this domain-specific, artificial intelligence-enabled tool will revolutionize biomedical engineering and research.
Affiliation(s)
- Soumen Pal
- School of Mechanical Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, 632014, India
- Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha, 756020, India
- Sang-Soo Lee
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do, 24252, Republic of Korea
- Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal, 700126, India.

33
Shen SA, Perez-Heydrich CA, Xie DX, Nellis JC. ChatGPT vs. web search for patient questions: what does ChatGPT do better? Eur Arch Otorhinolaryngol 2024. [PMID: 38416195 DOI: 10.1007/s00405-024-08524-0]
Abstract
PURPOSE Chat Generative Pretrained Transformer (ChatGPT) has the potential to significantly impact how patients acquire medical information online. Here, we characterize the readability and appropriateness of ChatGPT responses to a range of patient questions compared with results from traditional web searches. METHODS Patient questions related to the published Clinical Practice Guidelines of the American Academy of Otolaryngology-Head and Neck Surgery were sourced from existing online posts. Questions were categorized using a modified Rothwell classification system into (1) fact, (2) policy, and (3) diagnosis and recommendations, and were queried using ChatGPT and traditional web search. All results were evaluated for readability (Flesch Reading Ease and Flesch-Kincaid Grade Level) and understandability (Patient Education Materials Assessment Tool, PEMAT). Accuracy was assessed by two blinded clinical evaluators using a three-point ordinal scale. RESULTS 54 questions were organized into fact (37.0%), policy (37.0%), and diagnosis (25.8%). The average readability of ChatGPT responses was lower than that of traditional web search (FRE: 42.3 ± 13.1 vs. 55.6 ± 10.5, p < 0.001), while PEMAT understandability was equivalent (93.8% vs. 93.5%, p = 0.17). ChatGPT scored higher than web search for questions in the 'Diagnosis' category (p < 0.01); there was no difference for questions categorized as 'Fact' (p = 0.15) or 'Policy' (p = 0.22). Additional prompting improved ChatGPT response readability (FRE 55.6 ± 13.6, p < 0.01). CONCLUSIONS ChatGPT outperforms web search in answering patient questions related to symptom-based diagnoses and is equivalent in providing medical facts and established policy. Appropriate prompting can further improve readability while maintaining accuracy. Further patient education is needed to relay the benefits and limitations of this technology as a source of medical information.
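The Flesch Reading Ease metric used above is a simple formula over sentence, word, and syllable counts; a rough sketch follows, with a crude vowel-group syllable heuristic rather than a validated counter.

```python
# Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
# The syllable counter below is a rough approximation for illustration only.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

print(round(flesch_reading_ease(
    "Ear infections are common. See a doctor if pain persists."), 1))
```

Higher scores mean easier text, which is why the web-search results (FRE ≈ 55.6) read more easily than the default ChatGPT responses (FRE ≈ 42.3).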
Affiliation(s)
- Sarek A Shen
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA.
- Deborah X Xie
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA
- Jason C Nellis
- Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins School of Medicine, 601 North Caroline Street, Baltimore, MD, 21287, USA

34
Reese JT, Danis D, Caufield JH, Groza T, Casiraghi E, Valentini G, Mungall CJ, Robinson PN. On the limitations of large language models in clinical diagnosis. medRxiv 2024 (preprint). [PMID: 37503093 PMCID: PMC10370243 DOI: 10.1101/2023.07.13.23292613]
Abstract
Objective Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.
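The programmatic, PHI-free prompt construction the authors describe can be sketched as simple templating over extracted ontology terms; the wording and example terms below are illustrative, not the authors' prompts.

```python
# Sketch of building a PHI-free diagnostic prompt from structured ontology terms.
def build_prompt(phenotypes, comorbidities, treatments, labs):
    lines = ["A patient presents with the following structured findings."]
    for title, terms in [("Phenotypic abnormalities", phenotypes),
                         ("Comorbidities", comorbidities),
                         ("Treatments", treatments),
                         ("Laboratory tests", labs)]:
        if terms:
            lines.append(f"{title}: " + "; ".join(terms))
    lines.append("List the most likely differential diagnoses, ranked.")
    return "\n".join(lines)

print(build_prompt(
    phenotypes=["Proteinuria (HP:0000093)", "Hypertension (HP:0000822)"],
    comorbidities=["Type 2 diabetes mellitus"],
    treatments=["ACE inhibitor"],
    labs=["Elevated serum creatinine"],
))
```

Because only ontology terms (not narrative text) leave the firewall, the prompt is free of protected health information by construction, which is the key design point of the method.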
Affiliation(s)
- Justin T Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Daniel Danis
- The Jackson Laboratory for Genomic Medicine, Farmington CT, 06032, USA
- J Harry Caufield
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Tudor Groza
- Rare Care Centre, Perth Children’s Hospital, Perth, WA 6009, Australia
- Telethon Kids Institute, Perth, WA 6009, Australia
- Elena Casiraghi
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
- Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
- ELLIS-European Laboratory for Learning and Intelligent Systems
- Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington CT, 06032, USA
- Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA

35
Sood A, Mansoor N, Memmi C, Lynch M, Lynch J. Generative pretrained transformer-4, an artificial intelligence text predictive model, has a high capability for passing novel written radiology exam questions. Int J Comput Assist Radiol Surg 2024. [PMID: 38381363 DOI: 10.1007/s11548-024-03071-9]
Abstract
PURPOSE AI image interpretation through convolutional neural networks shows increasing capability within radiology. These models have achieved impressive performance on specific tasks within controlled settings but possess inherent limitations, such as the inability to consider clinical context. We assess the ability of large language models (LLMs), in the context of radiology specialty exams, to determine whether they can evaluate relevant clinical information. METHODS A database of questions was created from official sample questions, author-written questions, and textbook questions based on the Royal College of Radiologists (United Kingdom) FRCR 2A and American Board of Radiology (ABR) Certifying examinations. The questions were input into the Generative Pretrained Transformer (GPT) versions 3 and 4, with prompting to answer the questions. RESULTS One thousand and seventy-two questions were evaluated by GPT-3 and GPT-4; 495 (46.2%) were for the FRCR 2A and 577 (53.8%) for the ABR exam. There were 890 single-best-answer (SBA) questions and 182 true/false questions. GPT-4 was correct on 629/890 (70.7%) SBA questions and 151/182 (83.0%) true/false questions, with no degradation on author-written questions. GPT-4 performed significantly better than GPT-3, which selected the correct answer on 282/890 (31.7%) SBA questions and 111/182 (61.0%) true/false questions. The performance of GPT-4 was similar across both examinations for all categories of question. CONCLUSION The newest generation of LLMs, GPT-4, demonstrates high capability in answering radiology exam questions and marked improvement over GPT-3, suggesting that further gains in accuracy are possible. Further research is needed to explore the clinical applicability of these AI models in real-world settings.
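The GPT-3 vs. GPT-4 gap on the SBA questions can be checked with a chi-square test on the counts reported above, as in this sketch.

```python
# Chi-square test comparing GPT-4 and GPT-3 on the 890 SBA questions.
from scipy.stats import chi2_contingency

#              correct   incorrect
table = [[629, 890 - 629],   # GPT-4 (70.7% correct)
         [282, 890 - 282]]   # GPT-3 (31.7% correct)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```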
Affiliation(s)
- Avnish Sood
- King's College London, Strand, London, WC2R 2LS, UK
- Nina Mansoor
- Department of Neuroradiology, Kings College Hospital, Denmark Hill, London, SE59RS, UK
- Caroline Memmi
- Imperial College London, Exhibition Road, London, SW7 2AZ, UK
- Magnus Lynch
- King's College London Centre for Stem Cells and Regenerative Medicine, Guy's Hospital, Great Maze Pond, London, UK
- St John's Institute of Dermatology, King's College London, London, UK
- Jeremy Lynch
- Department of Neuroradiology, Kings College Hospital, Denmark Hill, London, SE59RS, UK.

36
Hu Y, Hu Z, Liu W, Gao A, Wen S, Liu S, Lin Z. Exploring the potential of ChatGPT as an adjunct for generating diagnosis based on chief complaint and cone beam CT radiologic findings. BMC Med Inform Decis Mak 2024; 24:55. [PMID: 38374067 PMCID: PMC10875853 DOI: 10.1186/s12911-024-02445-y]
Abstract
AIM This study aimed to assess the performance of OpenAI's ChatGPT in generating diagnoses based on chief complaint and cone beam computed tomography (CBCT) radiologic findings. MATERIALS AND METHODS 102 CBCT reports (48 with dental diseases (DD) and 54 with neoplastic/cystic diseases (N/CD)) were collected. ChatGPT was provided with the chief complaint and the CBCT radiologic findings, and its diagnostic outputs were scored on a five-point Likert scale. For diagnostic accuracy, scoring was based on the accuracy of the chief complaint-related diagnosis and of chief complaint-unrelated diagnoses (1-5 points); for diagnostic completeness, scoring was based on how many accurate diagnoses were included in ChatGPT's output for a case (1-5 points); for text quality, scoring was based on how many text errors were included in ChatGPT's output for a case (1-5 points). For the 54 N/CD cases, the consistency of the diagnoses generated by ChatGPT with the pathological diagnosis was also calculated, and the composition of the text errors in ChatGPT's outputs was evaluated. RESULTS After subjective rating by expert reviewers on a five-point Likert scale, the final scores for diagnostic accuracy, diagnostic completeness, and text quality were 3.7, 4.5, and 4.6 for the 102 cases. For diagnostic accuracy, ChatGPT performed significantly better on N/CD (3.8/5) than on DD (3.6/5). For the 54 N/CD cases, 21 (38.9%) had a first diagnosis completely consistent with the pathological diagnosis. No text errors were observed in 88.7% of all 390 text items. CONCLUSION ChatGPT shows potential for generating radiographic diagnoses based on chief complaint and radiologic findings. However, its performance varied with task complexity, necessitating professional oversight due to a certain error rate.
Affiliation(s)
- Yanni Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Ziyang Hu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Department of Stomatology, Shenzhen Longhua District Central Hospital, Shenzhen, People's Republic of China
- Wenjing Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Antian Gao
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Shanhui Wen
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Shu Liu
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China
- Zitong Lin
- Department of Dentomaxillofacial Radiology, Nanjing Stomatological Hospital, Affiliated Hospital of Medical School, Institute of Stomatology, Nanjing University, Nanjing, Jiangsu, People's Republic of China.

37
Guthrie E, Levy D, Del Carmen G. The Operating and Anesthetic Reference Assistant (OARA): A fine-tuned large language model for resident teaching. Am J Surg 2024. [PMID: 38365551 DOI: 10.1016/j.amjsurg.2024.02.016]
Abstract
OBJECTIVE This study aimed to fine-tune a large language model (LLM) for domain-specific text generation in surgical and anesthesia residency education. SUMMARY BACKGROUND DATA With growing interest in artificial intelligence (AI) for medical training, the potential of LLMs to transform residency education is explored. METHODS The 7-billion-parameter base model "Vicuna v1.5" was trained on 266,342 lines of text from 821 peer-reviewed documents. We evaluated the model with 150 surgical or anesthesia queries and assessed accuracy, token count, and inference speed across various reasoning tasks. Tests of significance were conducted using ANOVA and chi-square analysis. RESULTS Our model achieved 65.3% accuracy, excelling in surgical case-based tasks. We found no significant difference in accuracy between knowledge domains (P = 0.081), though longer responses showed poorer accuracy, with significant variation in accuracy by output length (P = 0.002). CONCLUSIONS LLMs show potential for enhancing residency education. Our model's efficiency and task-specific accuracy highlight this promise, though its limited parameter count diminishes the accuracy of longer responses. Our findings showcase how AI may be integrated effectively within future residency training.
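A skeleton of domain-specific causal-LM fine-tuning in this spirit is sketched below with Hugging Face tooling; the checkpoint name, corpus file, and hyperparameters are assumptions, not the paper's training configuration.

```python
# Sketch of fine-tuning a causal LM on a domain corpus (OARA-style).
# "lmsys/vicuna-7b-v1.5" and "domain_corpus.txt" are assumed placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "lmsys/vicuna-7b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA-family tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="oara", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice a 7B model needs parameter-efficient methods (e.g., LoRA) or multi-GPU hardware to train at reasonable cost; the skeleton above shows only the data-to-Trainer flow.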
Affiliation(s)
- Estefania Guthrie
- McGovern Medical School at the University of Texas Health Science Center at Houston, Houston, TX, USA
- Dominique Levy
- McGovern Medical School at the University of Texas Health Science Center at Houston, Houston, TX, USA
- Gabriel Del Carmen
- McGovern Medical School at the University of Texas Health Science Center at Houston, Houston, TX, USA.

38
Cai ZR, Chen ML, Kim J, Novoa RA, Barnes LA, Beam A, Linos E. Assessment of Correctness, Content Omission, and Risk of Harm in Large Language Model Responses to Dermatology Continuing Medical Education Questions. J Invest Dermatol 2024. [PMID: 38310972 DOI: 10.1016/j.jid.2024.01.015]
Affiliation(s)
- Zhuo Ran Cai
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA; Department of Dermatology, Medical School, Université de Montréal, Montreal, Canada
- Michael L Chen
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA; Center for Digital Health, Stanford University School of Medicine, Stanford, California, USA
- Jiyeong Kim
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA; Center for Digital Health, Stanford University School of Medicine, Stanford, California, USA
- Roberto A Novoa
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA; Department of Pathology, Stanford University School of Medicine, Stanford, California, USA
- Leandra A Barnes
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA
- Andrew Beam
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Eleni Linos
- Department of Dermatology, Stanford University School of Medicine, Stanford, California, USA; Center for Digital Health, Stanford University School of Medicine, Stanford, California, USA.

39
Nakaura T, Yoshida N, Kobayashi N, Shiraishi K, Nagayama Y, Uetani H, Kidoh M, Hokamura M, Funama Y, Hirai T. Preliminary assessment of automated radiology report generation with generative pre-trained transformers: comparing results to radiologist-generated reports. Jpn J Radiol 2024; 42:190-200. [PMID: 37713022 PMCID: PMC10811038 DOI: 10.1007/s11604-023-01487-y]
Abstract
PURPOSE In this preliminary study, we aimed to evaluate the potential of the generative pre-trained transformer (GPT) series for generating radiology reports from concise imaging findings and to compare its performance with radiologist-generated reports. METHODS This retrospective study involved 28 patients who underwent computed tomography (CT) scans and had a diagnosed disease with typical imaging findings. Radiology reports were generated using GPT-2, GPT-3.5, and GPT-4 based on the patient's age, gender, disease site, and imaging findings. We calculated the top-1 accuracy, top-5 accuracy, and mean average precision (MAP) of the differential diagnoses for GPT-2, GPT-3.5, GPT-4, and radiologists. Two board-certified radiologists evaluated the grammar and readability, image findings, impression, differential diagnosis, and overall quality of all reports using a 4-point scale. RESULTS Top-1 and top-5 accuracies for the differential diagnoses were highest for radiologists, followed by GPT-4, GPT-3.5, and GPT-2, in that order (top-1: 1.00, 0.54, 0.54, and 0.21, respectively; top-5: 1.00, 0.96, 0.89, and 0.54, respectively). There were no significant differences in the qualitative scores for grammar and readability, image findings, and overall quality between radiologists and GPT-3.5 or GPT-4 (p > 0.05). However, the qualitative scores of the GPT series for impression and differential diagnosis were significantly lower than those of radiologists (p < 0.05). CONCLUSIONS Our preliminary study suggests that GPT-3.5 and GPT-4 have the potential to generate radiology reports with high readability and reasonable image findings from very short keywords; however, concerns persist regarding the accuracy of impressions and differential diagnoses, thereby requiring verification by radiologists.
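The top-k accuracy and MAP metrics reported here are straightforward to compute from ranked differential-diagnosis lists; a sketch with invented example data follows.

```python
# Top-k accuracy and mean average precision over ranked differential lists.
def topk_accuracy(ranked_lists, truths, k):
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

def mean_average_precision(ranked_lists, truths):
    # With one correct diagnosis per case, AP reduces to 1/rank (0 if absent).
    ap = [1 / (ranked.index(truth) + 1) if truth in ranked else 0.0
          for ranked, truth in zip(ranked_lists, truths)]
    return sum(ap) / len(ap)

preds = [["pneumonia", "tuberculosis"], ["sarcoidosis", "lymphoma"]]  # invented
truth = ["pneumonia", "lymphoma"]
print(topk_accuracy(preds, truth, k=1))        # 0.5
print(mean_average_precision(preds, truth))    # (1 + 0.5) / 2 = 0.75
```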
Affiliation(s)
- Takeshi Nakaura
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan.
- Naofumi Yoshida
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Naoki Kobayashi
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Kaori Shiraishi
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Yasunori Nagayama
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Hiroyuki Uetani
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Masafumi Kidoh
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Masamichi Hokamura
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan
- Yoshinori Funama
- Department of Medical Physics, Faculty of Life Sciences, Kumamoto University, Honjo 1-1-1, Kumamoto, 860-8556, Japan
- Toshinori Hirai
- Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, 1-1-1 Honjo, Chuo-ku, Kumamoto-shi, Kumamoto, 860-8556, Japan

40
Zhou Y, Moon C, Szatkowski J, Moore D, Stevens J. Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop Surg Traumatol 2024; 34:927-955. [PMID: 37776392 PMCID: PMC10858115 DOI: 10.1007/s00590-023-03742-4]
Abstract
PURPOSE The integration of artificial intelligence (AI) tools, such as ChatGPT, in clinical medicine and medical education has gained significant attention due to their potential to support decision-making and improve patient care. However, there is a need to evaluate the benefits and limitations of these tools in specific clinical scenarios. METHODS This study used a case study approach within the field of orthopaedic surgery. A clinical case report featuring a 53-year-old male with a femoral neck fracture was used as the basis for evaluation. ChatGPT, a large language model, was asked to respond to clinical questions related to the case. The responses generated by ChatGPT were evaluated qualitatively, considering their relevance, justification, and alignment with the responses of real clinicians. Alternative dialogue protocols were also employed to assess the impact of additional prompts and contextual information on ChatGPT responses. RESULTS ChatGPT generally provided clinically appropriate responses to the questions posed in the clinical case report. However, the level of justification and explanation varied across the generated responses. Occasionally, clinically inappropriate responses and inconsistencies were observed in the generated responses across different dialogue protocols and on separate days. CONCLUSIONS The findings of this study highlight both the potential and limitations of using ChatGPT in clinical practice. While ChatGPT demonstrated the ability to provide relevant clinical information, the lack of consistent justification and occasional clinically inappropriate responses raise concerns about its reliability. These results underscore the importance of careful consideration and validation when using AI tools in healthcare. Further research and clinician training are necessary to effectively integrate AI tools like ChatGPT, ensuring their safe and reliable use in clinical decision-making.
Affiliation(s)
- Yushy Zhou
- Department of Surgery, The University of Melbourne, St. Vincent's Hospital Melbourne, 29 Regent Street, Clinical Sciences Block Level 2, Melbourne, VIC, 3010, Australia.
- Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Australia.
- Charles Moon
- Department of Orthopaedic Surgery, Cedars-Sinai Medical Centre, Los Angeles, CA, USA
- Jan Szatkowski
- Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, IN, USA
- Derek Moore
- Santa Barbara Orthopedic Associates, Santa Barbara, CA, USA
- Jarrad Stevens
- Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Australia

41
Kim S, Lee CK, Kim SS. Large Language Models: A Guide for Radiologists. Korean J Radiol 2024; 25:126-133. [PMID: 38288895 PMCID: PMC10831297 DOI: 10.3348/kjr.2023.0997]
Abstract
Large language models (LLMs) have revolutionized the global landscape of technology beyond natural language processing. Owing to their extensive pre-training on vast datasets, contemporary LLMs can handle tasks ranging from general functionalities to domain-specific areas, such as radiology, without additional fine-tuning. General-purpose chatbots based on LLMs can optimize the efficiency of radiologists in terms of their professional work and research endeavors. Importantly, these LLMs are on a trajectory of rapid evolution, wherein challenges such as "hallucination," high training cost, and efficiency issues are addressed, along with the inclusion of multimodal inputs. In this review, we aim to offer conceptual knowledge and actionable guidance to radiologists interested in utilizing LLMs through a succinct overview of the topic and a summary of radiology-specific aspects, from the beginning to potential future directions.
Affiliation(s)
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul, Republic of Korea
- Choong-Kun Lee
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
- Seung-Seob Kim
- Department of Radiology and Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea.

42
King MR, Abdulrahman AM, Petrovic MI, Poley PL, Hall SP, Kulapatana S, Lamantia ZE. Incorporation of ChatGPT and Other Large Language Models into a Graduate Level Computational Bioengineering Course. Cell Mol Bioeng 2024; 17:1-6. [PMID: 38435794 PMCID: PMC10902225 DOI: 10.1007/s12195-024-00793-3]
Abstract
The remarkable capabilities of generative artificial intelligence and large language models (LLMs) such as ChatGPT have delighted users around the world. Educators have regarded these tools as either a cause for great concern, an opportunity to educate students on cutting-edge technology, or often some combination of the two. Throughout the Fall 2023 semester, we explored the use of ChatGPT (and Bard, among other LLMs) in a graduate level numerical and statistical methods course for PhD-level bioengineers. In this article we share examples of this ChatGPT content, our observations on what worked best in our course, and speculate on how bioengineering students may be best served by this technology in the future.
Collapse
Affiliation(s)
- Michael R. King: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA
- Adam M. Abdulrahman: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA; Medical Scientist Training Program, Vanderbilt University School of Medicine, Nashville, TN, USA
- Mark I. Petrovic: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA; Medical Scientist Training Program, Vanderbilt University School of Medicine, Nashville, TN, USA
- Patricia L. Poley: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA
- Sarah P. Hall: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA
- Surat Kulapatana: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA; Department of Physiology, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok 10700, Thailand
- Zachary E. Lamantia: Department of Biomedical Engineering, Vanderbilt University, Nashville, TN 37235, USA
43
Sahin MC, Sozer A, Kuzucu P, Turkmen T, Sahin MB, Sozer E, Tufek OY, Nernekli K, Emmez H, Celtikci E. Beyond human in neurosurgical exams: ChatGPT's success in the Turkish neurosurgical society proficiency board exams. Comput Biol Med 2024; 169:107807. PMID: 38091727; DOI: 10.1016/j.compbiomed.2023.107807.
Abstract
Chat Generative Pre-Trained Transformer (ChatGPT) is a sophisticated natural language model that employs advanced deep learning techniques and is trained on extensive datasets to produce human-like conversational responses to user inputs. In this study, ChatGPT's performance on the Turkish Neurosurgical Society Proficiency Board Exams (TNSPBE) was compared with that of the actual candidates who took the exams; the types of questions it answered incorrectly were identified, the quality of its responses was assessed, and its performance was evaluated by question difficulty. For ranking purposes, the scores of all 260 candidates were recalculated according to the exams they took and the questions included in those exams. Across a total of 523 questions, the candidates' average score was 62.02 ± 0.61, compared with 78.77 for ChatGPT. We concluded that, in addition to ChatGPT's higher rate of correct responses, its performance improved as question clarity increased (clarity ratings 1.5, 2.0, 2.5, and 3.0), regardless of question difficulty; among the human candidates, no such improvement with increasing clarity was observed.
Affiliation(s)
- Mustafa Caglar Sahin: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey
- Alperen Sozer: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey
- Pelin Kuzucu: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey
- Tolga Turkmen: Ministry of Health Dortyol State Hospital, Department of Neurosurgery, Hatay, Turkey
- Merve Buke Sahin: Ministry of Health Etimesgut District Health Directorate, Department of Public Health, Ankara, Turkey
- Ekin Sozer: Gazi University, Directorate of Health Culture and Sports, Ankara, Turkey
- Ozan Yavuz Tufek: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey
- Kerem Nernekli: Stanford University Medical School, Department of Radiology, Stanford, CA, USA
- Hakan Emmez: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey
- Emrah Celtikci: Gazi University Faculty of Medicine, Department of Neurosurgery, Ankara, Turkey; Gazi University Artificial Intelligence Center, Ankara, Turkey
44
Liao Z, Wang J, Shi Z, Lu L, Tabata H. Revolutionary Potential of ChatGPT in Constructing Intelligent Clinical Decision Support Systems. Ann Biomed Eng 2024; 52:125-129. PMID: 37332008; DOI: 10.1007/s10439-023-03288-w.
Abstract
Recently, Chat Generative Pre-trained Transformer (ChatGPT) has been recognized as a promising clinical decision support system (CDSS) in the medical field owing to its advanced text analysis capabilities and interactive design. However, ChatGPT primarily focuses on learning text semantics rather than on learning complex data structures and conducting real-time data analysis, tasks that typically necessitate intelligent CDSS built on specialized machine learning algorithms. Although ChatGPT cannot directly execute specific algorithms, it can aid in algorithm design for intelligent CDSS at the textual level. In this study, besides discussing the types of CDSS and their relationship with ChatGPT, we mainly investigate the benefits and drawbacks of employing ChatGPT as an auxiliary design tool for intelligent CDSS. Our findings indicate that, in collaboration with human expertise, ChatGPT has the potential to revolutionize the development of robust and effective intelligent CDSS.
Affiliation(s)
- Zhiqiang Liao: Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan
- Jian Wang: Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan 250012, People's Republic of China
- Zhuozheng Shi: Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan
- Lintao Lu: Department of Orthopaedics, Qilu Hospital of Shandong University, Jinan 250012, People's Republic of China; Department of Orthopaedics, Qilu Hospital of Shandong University Dezhou Hospital, Dezhou 253000, People's Republic of China
- Hitoshi Tabata: Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan; Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan
45
Rahad K, Martin K, Amugo I, Ferguson S, Curtis A, Davis A, Gangula P, Wang Q. ChatGPT to Enhance Learning in Dental Education at a Historically Black Medical College. Dent Res Oral Health 2024; 7:8-14. PMID: 38404561; PMCID: PMC10887427; DOI: 10.26502/droh.0069.
Abstract
The recent rise of powerful large language model (LLM)-based AI tools, exemplified by ChatGPT and Bard, poses a great challenge to contemporary dental education. At the same time, it offers a unique resource that can complement today's teaching and learning, where existing widely available learning resources have often fallen short. As LLM tools will profoundly shape both the clinical and educational aspects of dentistry, the didactic curricula, which rely primarily on lecture-based courses in which instructors impart knowledge through presentations and discussions, urgently need to be updated. In this paper, we used the dental course materials, syllabi, and textbooks currently adopted in the School of Dentistry (SOD) at Meharry Medical College to assess the potential utility and effectiveness of ChatGPT in dental education. For assessment, we collected the chatbot's responses to questions as well as students' interactions with it. Our results showed that ChatGPT can assist in dental essay writing and generate relevant content for dental students, among other benefits. The limitations of ChatGPT are also discussed in the paper.
Affiliation(s)
- Khandoker Rahad: Department of Computer Science & Data Science, School of Applied Computational Sciences, Meharry Medical College, Nashville, TN, USA
- Kianna Martin: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Ihunna Amugo: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Shania Ferguson: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Angela Curtis: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Anniya Davis: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Pandu Gangula: Department of ODS & Research, School of Dentistry, Meharry Medical College, Nashville, TN, USA
- Qingguo Wang: Department of Computer Science & Data Science, School of Applied Computational Sciences, Meharry Medical College, Nashville, TN, USA
46
Gravina AG, Pellegrino R, Cipullo M, Palladino G, Imperio G, Ventura A, Auletta S, Ciamarra P, Federico A. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients' questions? An evidence-controlled analysis. World J Gastroenterol 2024; 30:17-33. PMID: 38293321; PMCID: PMC10823903; DOI: 10.3748/wjg.v30.i1.17.
Abstract
Artificial intelligence is increasingly entering everyday healthcare. Large language model (LLM) systems such as Chat Generative Pre-trained Transformer (ChatGPT) have become potentially accessible to everyone, including patients with inflammatory bowel diseases (IBD). However, significant ethical issues and pitfalls exist in innovative LLM tools, and the hype generated by such systems may lead to unwarranted patient trust in them. It is therefore necessary to understand whether LLMs (trendy ones, such as ChatGPT) can produce plausible medical information (MI) for patients. This review examined ChatGPT's potential to provide MI regarding questions commonly addressed by patients with IBD to their gastroenterologists. Review of the outputs showed that the tool has some attractive potential but also significant limitations: its information can be outdated or insufficiently detailed, and in some cases it is inaccurate. Further studies and refinement of ChatGPT, possibly aligning its outputs with the leading medical evidence provided by reliable databases, are needed.
Affiliation(s)
- Antonietta Gerarda Gravina, Raffaele Pellegrino, Marina Cipullo, Giovanna Palladino, Giuseppe Imperio, Andrea Ventura, Salvatore Auletta, Paola Ciamarra, and Alessandro Federico (all authors): Division of Hepatogastroenterology, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Naples 80138, Italy
47
Woo B, Huynh T, Tang A, Bui N, Nguyen G, Tam W. Transforming nursing with large language models: from concept to practice. Eur J Cardiovasc Nurs 2024:zvad120. PMID: 38178303; DOI: 10.1093/eurjcn/zvad120.
Abstract
Large language models (LLMs) such as ChatGPT have emerged as potential game-changers in nursing, aiding in patient education, diagnostic assistance, treatment recommendations, and administrative task efficiency. While these advancements signal promising strides in healthcare, integrating LLMs is not without challenges, particularly artificial intelligence hallucination and data privacy concerns. Methodologies such as prompt engineering, temperature adjustments, model fine-tuning, and local deployment are proposed to refine the accuracy of LLMs and ensure data security. While LLMs offer transformative potential, it is imperative to acknowledge that they cannot substitute for the intricate expertise of human professionals in the clinical field, advocating for a synergistic approach to patient care.
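To make the accuracy-refinement levers named in this abstract concrete, the following is a minimal sketch of local deployment with temperature adjustment, assuming the Hugging Face transformers library is installed; the gpt2 checkpoint and the prompt are stand-ins chosen purely for illustration, not anything used or endorsed by the article.

```python
# A minimal sketch of local deployment with temperature control, assuming the
# Hugging Face `transformers` library; `gpt2` is a stand-in checkpoint chosen
# for illustration, not a clinical model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Patient education point: after cardiac surgery, wound care involves"

# Lower temperature -> more deterministic, conservative completions;
# higher temperature -> more varied (and riskier) completions.
for temperature in (0.2, 1.0):
    out = generator(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=40,
        num_return_sequences=1,
    )
    print(f"T={temperature}: {out[0]['generated_text']}")
```

Lower temperatures concentrate probability mass on the most likely tokens, one pragmatic way to curb hallucinated variation in patient-facing text, although it does not by itself guarantee factual accuracy.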
Affiliation(s)
- Brigitte Woo: Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Tom Huynh: School of Science, Engineering and Technology, RMIT University, 702 Nguyen Van Linh Blvd., District 7, Ho Chi Minh City 756000, Vietnam
- Arthur Tang: School of Science, Engineering and Technology, RMIT University, 702 Nguyen Van Linh Blvd., District 7, Ho Chi Minh City 756000, Vietnam
- Nhat Bui: School of Science, Engineering and Technology, RMIT University, 702 Nguyen Van Linh Blvd., District 7, Ho Chi Minh City 756000, Vietnam
- Giang Nguyen: School of Science, Engineering and Technology, RMIT University, 702 Nguyen Van Linh Blvd., District 7, Ho Chi Minh City 756000, Vietnam
- Wilson Tam: Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
48
Scquizzato T, Semeraro F, Swindell P, Simpson R, Angelini M, Gazzato A, Sajjad U, Bignami EG, Landoni G, Keeble TR, Mion M. Testing ChatGPT ability to answer laypeople questions about cardiac arrest and cardiopulmonary resuscitation. Resuscitation 2024; 194:110077. PMID: 38081504; DOI: 10.1016/j.resuscitation.2023.110077.
Abstract
INTRODUCTION Cardiac arrest leaves witnesses, survivors, and their relatives with a multitude of questions. When a young person or a public figure is affected, interest in cardiac arrest and cardiopulmonary resuscitation (CPR) increases. ChatGPT allows everyone to obtain human-like responses on any topic. Given the risk of accessing incorrect information, we assessed ChatGPT's accuracy in answering laypeople's questions about cardiac arrest and CPR. METHODS We co-produced a list of 40 questions with members of Sudden Cardiac Arrest UK covering all aspects of cardiac arrest and CPR. The answers provided by ChatGPT to each question were evaluated by professionals for accuracy, by professionals and laypeople for relevance, clarity, comprehensiveness, and overall value on a scale from 1 (poor) to 5 (excellent), and for readability. RESULTS ChatGPT's answers received an overall positive evaluation (4.3 ± 0.7) from 14 professionals and 16 laypeople. Clarity (4.4 ± 0.6), relevance (4.3 ± 0.6), accuracy (4.0 ± 0.6), and comprehensiveness (4.2 ± 0.7) were also rated highly. Professionals, however, rated overall value (4.0 ± 0.5 vs 4.6 ± 0.7; p = 0.02) and comprehensiveness (3.9 ± 0.6 vs 4.5 ± 0.7; p = 0.02) lower than laypeople did. CPR-related answers consistently received lower scores across all parameters from both professionals and laypeople. Readability was 'difficult' (median Flesch reading ease score of 34 [IQR 26-42]). CONCLUSIONS ChatGPT provided largely accurate, relevant, and comprehensive answers to questions about cardiac arrest commonly asked by survivors, their relatives, and lay rescuers, except for CPR-related answers, which received the lowest scores. Large language models will play a significant role in future healthcare, and the healthcare-related content they generate should be monitored.
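The readability finding above is easy to reproduce: the Flesch reading ease score is 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word), and scores in the low 30s are conventionally labelled 'difficult'. Below is a minimal, self-contained sketch; the syllable counter is a rough vowel-group heuristic introduced only for this illustration, so its scores will deviate slightly from dedicated readability tools.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (illustrative only)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Polysyllabic medical vocabulary drives scores down toward the 'difficult' band.
sample = ("Cardiopulmonary resuscitation restores spontaneous circulation. "
          "Defibrillation terminates ventricular fibrillation.")
print(round(flesch_reading_ease(sample), 1))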
Affiliation(s)
- Tommaso Scquizzato: Department of Anesthesia and Intensive Care, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Federico Semeraro: Department of Anaesthesia, Intensive Care and Emergency Medical Services, Ospedale Maggiore, Bologna, Italy
- Rupert Simpson: Essex Cardiothoracic Centre, Mid and South Essex NHS Foundation Trust, Basildon, United Kingdom; Medical Technology Research Centre, Anglia Ruskin School of Medicine, Chelmsford, United Kingdom
- Matteo Angelini: Department of Anesthesia and Intensive Care, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Arianna Gazzato: Department of Anesthesia and Intensive Care, IRCCS San Raffaele Scientific Institute, Milan, Italy
- Uzma Sajjad: Essex Cardiothoracic Centre, Mid and South Essex NHS Foundation Trust, Basildon, United Kingdom; Medical Technology Research Centre, Anglia Ruskin School of Medicine, Chelmsford, United Kingdom
- Elena G Bignami: Anesthesiology, Critical Care and Pain Medicine Division, Department of Medicine and Surgery, University of Parma, Parma, Italy
- Giovanni Landoni: Department of Anesthesia and Intensive Care, IRCCS San Raffaele Scientific Institute, Milan, Italy; School of Medicine, Vita-Salute San Raffaele University, Milan, Italy
- Thomas R Keeble: Essex Cardiothoracic Centre, Mid and South Essex NHS Foundation Trust, Basildon, United Kingdom; Medical Technology Research Centre, Anglia Ruskin School of Medicine, Chelmsford, United Kingdom
- Marco Mion: Essex Cardiothoracic Centre, Mid and South Essex NHS Foundation Trust, Basildon, United Kingdom; Medical Technology Research Centre, Anglia Ruskin School of Medicine, Chelmsford, United Kingdom
49
Wei WI, Leung CLK, Tang A, McNeil EB, Wong SYS, Kwok KO. Extracting symptoms from free-text responses using ChatGPT among COVID-19 cases in Hong Kong. Clin Microbiol Infect 2024; 30:142.e1-142.e3. PMID: 37949111; DOI: 10.1016/j.cmi.2023.11.002.
Abstract
OBJECTIVES To investigate the feasibility and performance of Chat Generative Pretrained Transformer (ChatGPT) in converting symptom narratives into structured symptom labels. METHODS We extracted symptoms from 300 deidentified symptom narratives of COVID-19 patients using a computer-based matching algorithm (the standard) and using prompt engineering in ChatGPT. Common symptoms were those with a prevalence >10% according to the standard; less common symptoms were those with a prevalence of 2-10%. The precision of ChatGPT was compared with the standard using sensitivity and specificity with 95% exact binomial CIs (95% binCIs). ChatGPT was prompted both without examples (zero-shot prompting) and with examples (few-shot prompting). RESULTS In zero-shot prompting, GPT-4 achieved high specificity (0.947 [95% binCI: 0.894-0.978] to 1.000 [95% binCI: 0.965-0.988, 1.000]) for all symptoms, high sensitivity for common symptoms (0.853 [95% binCI: 0.689-0.950] to 1.000 [95% binCI: 0.951-1.000]), and moderate sensitivity for less common symptoms (0.200 [95% binCI: 0.043-0.481] to 1.000 [95% binCI: 0.590-0.815, 1.000]). Few-shot prompting increased both sensitivity and specificity. GPT-4 outperformed GPT-3.5 in response accuracy and labelling consistency. DISCUSSION This work substantiates ChatGPT's role as a research tool in medical fields. Its performance in converting symptom narratives to structured symptom labels was encouraging, saving time and effort in compiling task-specific training data. It could accelerate free-text data compilation and synthesis in future disease outbreaks and improve the accuracy of symptom checkers. Focused prompt engineering that addresses ambiguous descriptions would benefit medical research further.
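For readers who want to reproduce this kind of pipeline, the sketch below illustrates the two ideas the abstract names: zero-shot versus few-shot prompt construction for symptom labelling, and exact (Clopper-Pearson) binomial CIs for sensitivity and specificity. It is a hedged illustration only; the prompt wording, symptom list, and counts are invented for the example, and the statsmodels call is one standard way to obtain exact binomial CIs, not necessarily the authors' implementation.

```python
# Illustrative only: prompt templates, symptom list, and counts are hypothetical.
from statsmodels.stats.proportion import proportion_confint

SYMPTOMS = ["fever", "cough", "sore throat"]  # hypothetical label set

def zero_shot_prompt(narrative: str) -> str:
    # No examples: the model must infer the labelling task from instructions alone.
    return (f"Label which of {SYMPTOMS} are present in this narrative. "
            f"Answer with a comma-separated list.\nNarrative: {narrative}")

def few_shot_prompt(narrative: str) -> str:
    # A worked example is prepended to steer the output format and recall.
    example = ("Narrative: 'hot all night, throat hurts'\n"
               "Labels: fever, sore throat\n")
    return example + zero_shot_prompt(narrative)

def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int, alpha: float = 0.05):
    # Clopper-Pearson exact intervals are method='beta' in statsmodels.
    sens_ci = proportion_confint(tp, tp + fn, alpha=alpha, method="beta")
    spec_ci = proportion_confint(tn, tn + fp, alpha=alpha, method="beta")
    return (tp / (tp + fn), sens_ci), (tn / (tn + fp), spec_ci)

print(zero_shot_prompt("coughing for two days, no fever"))
print(sensitivity_specificity(tp=29, fn=5, tn=260, fp=6))  # made-up counts
```

The few-shot variant simply prepends worked examples to the zero-shot instruction, which is consistent with the abstract's report that examples improved both sensitivity and specificity.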
Affiliation(s)
- Wan In Wei: JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
- Cyrus Lap Kwan Leung: JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
- Arthur Tang: Department of Information Technology, School of Science, Engineering and Technology, RMIT University, Vietnam
- Edward Braddon McNeil: JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
- Samuel Yeung Shan Wong: JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
- Kin On Kwok: JC School of Public Health and Primary Care, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Hong Kong Institute of Asia-Pacific Studies, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London, United Kingdom
50
Knoedler S, Sofo G, Kern B, Frank K, Cotofana S, von Isenburg S, Könneker S, Mazzarone F, Dorafshar AH, Knoedler L, Alfertshofer M. Modern Machiavelli? The illusion of ChatGPT-generated patient reviews in plastic and aesthetic surgery based on 9000 review classifications. J Plast Reconstr Aesthet Surg 2024; 88:99-108. PMID: 37972444; DOI: 10.1016/j.bjps.2023.10.119.
Abstract
BACKGROUND Online patient reviews are crucial in guiding individuals who seek plastic surgery, but artificial chatbots pose a threat of disseminating fake reviews. This study aimed to compare real patient feedback with ChatGPT-generated reviews for the top five US plastic surgery procedures. METHODS Thirty real patient reviews on rhinoplasty, blepharoplasty, facelift, liposuction, and breast augmentation were collected from RealSelf and used as templates for ChatGPT to generate matching patient reviews. Prolific users (n = 30) assessed 150 pairs of reviews to identify human-written and artificial intelligence (AI)-generated reviews. The reviews were further assessed using AI content detector software (Copyleaks AI). RESULTS Among the 9000 classification tasks, 64.3% and 35.7% of reviews were classified as authentic and fake, respectively. On average, the author (human versus machine) was correctly identified in 59.6% of cases, and this poor classification performance was consistent across all procedures. Participants with prior aesthetic treatment showed poorer classification performance than those without (p < 0.05). The mean character count of human-written reviews was significantly higher than that of AI-generated reviews (p < 0.001), with a significant correlation between character count and participants' accuracy rate (p < 0.001). The emotional timbre of the reviews also differed significantly, with "happiness" more prevalent in human-written reviews (p < 0.001) and "disappointment" more prevalent in AI-generated reviews (p = 0.005). Copyleaks AI correctly classified 96.7% of human-written and 69.3% of ChatGPT-generated reviews. CONCLUSION ChatGPT convincingly replicates authentic patient reviews, even deceiving commercial AI detection software. Analyzing emotional tone and review length can help differentiate real from fake reviews, underscoring the need to educate both patients and physicians to prevent misinformation and mistrust.
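The headline figure that raters identified the author correctly in only 59.6% of cases invites an obvious follow-up: is that reliably above the 50% chance level? A quick back-of-the-envelope check, under the simplifying assumption that the 9000 classifications were independent (which the study's repeated-measures design does not strictly satisfy), is a one-sample exact binomial test:

```python
# Back-of-the-envelope check, assuming (simplistically) 9000 independent
# classifications; the study's clustered design would call for a more
# careful analysis in practice.
from scipy.stats import binomtest

n_total = 9000
n_correct = round(0.596 * n_total)  # 5364 correct author identifications

result = binomtest(n_correct, n_total, p=0.5)  # two-sided test against chance
print(f"Observed accuracy: {n_correct / n_total:.3f}")
print(f"Exact binomial p-value vs chance (50%): {result.pvalue:.2e}")
print("95% exact CI:", result.proportion_ci(confidence_level=0.95, method="exact"))
```

Under this idealization the accuracy is statistically distinguishable from chance, yet practically it remains poor, which is precisely the paper's point: barely-better-than-coin-flip detection is not a usable safeguard.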
Affiliation(s)
- Samuel Knoedler: Department of Plastic Surgery and Hand Surgery, Klinikum Rechts der Isar, Technical University of Munich, Munich, Germany; Division of Plastic Surgery, Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Instituto Ivo Pitanguy, Hospital Santa Casa de Misericórdia, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil
- Giuseppe Sofo: Instituto Ivo Pitanguy, Hospital Santa Casa de Misericórdia, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil
- Barbara Kern: Department of Plastic Surgery, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin and Berlin Institute of Health, Berlin, Germany
- Sebastian Cotofana: Centre for Cutaneous Research, Blizard Institute, Queen Mary University of London, London, UK; Department of Dermatology, Erasmus Hospital, Rotterdam, the Netherlands
- Sarah von Isenburg: Private Practice, Plastische Chirurgie München Dres. Neuhann-Lorenz & v. Isenburg, Munich, Germany
- Sören Könneker: Department of Plastic Surgery and Hand Surgery, University Hospital Zürich, Zurich, Switzerland
- Francesco Mazzarone: Instituto Ivo Pitanguy, Hospital Santa Casa de Misericórdia, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil
- Amir H Dorafshar: Division of Plastic and Reconstructive Surgery, Rush University Medical Center, Chicago, IL, USA
- Leonard Knoedler: Division of Plastic and Reconstructive Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Michael Alfertshofer: Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians-University Munich, Munich, Germany