1
Giuffrè M, Kresevic S, You K, Dupont J, Huebner J, Grimshaw AA, Shung DL. Systematic review: The use of large language models as medical chatbots in digestive diseases. Aliment Pharmacol Ther 2024; 60:144-166. [PMID: 38798194] [DOI: 10.1111/apt.18058]
Abstract
BACKGROUND Interest in large language models (LLMs), such as OpenAI's ChatGPT, has grown across multiple specialties as a source of patient-facing medical advice and provider-facing clinical decision support. The accuracy of LLM responses to gastroenterology and hepatology-related questions is unknown. AIMS To evaluate the accuracy and potential safety implications of LLMs in answering questions on the diagnosis, management and treatment of gastroenterological and hepatological conditions. METHODS We conducted a systematic literature search including Cochrane Library, Google Scholar, Ovid Embase, Ovid MEDLINE, PubMed, Scopus and the Web of Science Core Collection to identify relevant articles published from inception until January 28, 2024, using a combination of keywords and controlled vocabulary for LLMs and gastroenterology or hepatology. Accuracy was defined as the percentage of entirely correct answers. RESULTS Among the 1671 reports screened, we identified 33 full-text articles on using LLMs in gastroenterology and hepatology and included 18 in the final analysis. Question-answering accuracy varied across model versions: for example, it ranged from 6.4% to 45.5% with ChatGPT-3.5 and from 40% to 91.4% with ChatGPT-4. In addition, the absence of standardised methodology and reporting metrics for studies involving LLMs places all the studies at a high risk of bias and does not allow for the generalisation of single-study results. CONCLUSIONS Current general-purpose LLMs have unacceptably low accuracy on clinical gastroenterology and hepatology tasks; incorrect information or triage recommendations could cause adverse patient safety events, overburden healthcare systems or delay necessary care.
Affiliation(s)
- Mauro Giuffrè
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, New Haven, Connecticut, USA
- Department of Medical, Surgical and Health Sciences, University of Trieste, Trieste, Italy
- Simone Kresevic
- Department of Engineering and Architecture, University of Trieste, Trieste, Italy
- Kisung You
- Department of Mathematics at Baruch College, City University of New York, New York, New York, USA
- Johannes Dupont
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, New Haven, Connecticut, USA
- Jack Huebner
- Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Alyssa Ann Grimshaw
- Harvey Cushing/John Hay Whitney Medical Library, Yale University, New Haven, Connecticut, USA
- Dennis Legen Shung
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, New Haven, Connecticut, USA
2
Tu SP, Garcia B, Zhu X, Sewell D, Mishra V, Matin K, Dow A. Patient care in complex sociotechnological ecosystems and learning health systems. Learn Health Syst 2024; 8:e10427. [PMID: 38883874] [PMCID: PMC11176594] [DOI: 10.1002/lrh2.10427]
Abstract
The learning health system (LHS) model was proposed to provide real-time, bi-directional flow of learning using data captured in health information technology systems to deliver rapid learning in healthcare delivery. As highlighted by the landmark National Academy of Medicine report "Crossing the Quality Chasm," the U.S. healthcare delivery industry represents complex adaptive systems, and there is an urgent need to develop innovative methods to identify efficient team structures by harnessing real-world care delivery data found in the electronic health record (EHR). We offer a discussion surrounding the complexities of team communication and how solutions may be guided by theories such as the Multiteam System (MTS) framework and the Multitheoretical Multilevel Framework of Communication Networks. To advance healthcare delivery science and promote LHSs, our team has been building a new line of research using EHR data to study MTS in the complex real world of cancer care delivery. We are developing new network metrics to study MTSs and will be analyzing the impact of EHR communication network structures on patient outcomes. As this research leads to patient care delivery interventions/tools, healthcare leaders and healthcare professionals can effectively use health IT data to implement the most evidence-based collaboration approaches in order to achieve the optimal LHS and patient outcomes.
Affiliation(s)
- Shin-Ping Tu
- Department of Internal Medicine, University of California, Davis, Sacramento, California, USA
- Brittany Garcia
- Department of Internal Medicine, University of California, Davis, Sacramento, California, USA
- Xi Zhu
- Department of Health Policy and Management, University of California, Los Angeles, Los Angeles, California, USA
- Daniel Sewell
- Department of Biostatistics, University of Iowa, Iowa City, Iowa, USA
- Vimal Mishra
- Department of Internal Medicine, University of California, Davis, Sacramento, California, USA
- Khalid Matin
- Department of Internal Medicine, Virginia Commonwealth University, Richmond, Virginia, USA
- Alan Dow
- Department of Internal Medicine, Virginia Commonwealth University, Richmond, Virginia, USA
3
Bailey H. Advances in emergency management of the critically ill and injured. Curr Opin Crit Care 2024; 30:193-194. [PMID: 38690951] [DOI: 10.1097/mcc.0000000000001153]
Affiliation(s)
- Heatherlee Bailey
- Department of Emergency Medicine, VA Medical Center, 508 Fulton Street, Durham
4
Bockarie MJ, Ansumana R, Machingaidze SG, de Souza DK, Fatoma P, Zumla A, Lee SS. Transformative potential of artificial intelligence on health care and research in Africa. Int J Infect Dis 2024; 143:107011. [PMID: 38490638] [DOI: 10.1016/j.ijid.2024.107011]
Affiliation(s)
- Moses J Bockarie
- College of Medical Sciences, Njala University, Bo, Sierra Leone; International Society for Infectious Diseases, Brookline, MA, USA.
- Rashid Ansumana
- College of Medical Sciences, Njala University, Bo, Sierra Leone; School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, USA
- Dziedzom K de Souza
- Department of Parasitology and Department of Clinical Pathology, Noguchi Memorial Institute for Medical Research, College of Health Sciences, University of Ghana, Accra, Ghana
- Patrick Fatoma
- College of Medical Sciences, Njala University, Bo, Sierra Leone
- Alimuddin Zumla
- Department of Infection, Division of Infection and Immunity, University College London; NIHR Biomedical Research Centre, UCL Hospitals NHS Foundation Trust, London, UK
- Shui-Shan Lee
- International Society for Infectious Diseases, Brookline, MA, USA; S.H. Ho Research Centre for Infectious Diseases, The Chinese University of Hong Kong, Shatin, Hong Kong
5
Giuffrè M, Kresevic S, Pugliese N, You K, Shung DL. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int 2024. [PMID: 38819632] [DOI: 10.1111/liv.15974]
Abstract
Large Language Models (LLMs) are transformer-based neural networks with billions of parameters trained on very large text corpora from diverse sources. LLMs have the potential to improve healthcare due to their capability to parse complex concepts and generate context-based responses. The interest in LLMs has not spared digestive disease academics, who have mainly investigated foundational LLM accuracy, which ranges from 25% to 90% and is influenced by the lack of standardized rules to report methodologies and results for LLM-oriented research. In addition, a critical issue is the absence of a universally accepted definition of accuracy, varying from binary to scalar interpretations, often tied to grader expertise without reference to clinical guidelines. We address strategies and challenges to increase accuracy. In particular, LLMs can be infused with domain knowledge using Retrieval Augmented Generation (RAG) or Supervised Fine-Tuning (SFT) with reinforcement learning from human feedback (RLHF). RAG faces challenges with in-context window limits and accurate information retrieval from the provided context. SFT, a deeper adaptation method, is computationally demanding and requires specialized knowledge. LLMs may increase patient quality of care across the field of digestive diseases, where physicians are often engaged in screening, treatment and surveillance for a broad range of pathologies for which in-context learning or SFT with RLHF could improve clinical decision-making and patient outcomes. However, despite their potential, the safe deployment of LLMs in healthcare still needs to overcome hurdles in accuracy, suggesting a need for strategies that integrate human feedback with advanced model training.
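As a concrete illustration of the RAG pattern described in this abstract, the following is a minimal, self-contained sketch: a toy lexical retriever stands in for embedding-based search, and the assembled prompt would then be passed to an LLM. The knowledge-base sentences, function names, and overlap-scoring scheme are all illustrative assumptions, not components of any system evaluated in the cited studies.

```python
import re

# Hypothetical guideline snippets standing in for a curated domain knowledge base;
# these sentences are illustrative placeholders, not drawn from any real guideline.
KNOWLEDGE_BASE = [
    "Colonoscopy surveillance intervals depend on polyp size, number, and histology.",
    "Hepatocellular carcinoma screening uses ultrasound every six months in cirrhosis.",
    "Helicobacter pylori eradication should be confirmed with a urea breath test.",
]

def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, kb, top_k=1):
    """Rank knowledge-base snippets by word overlap with the query, keep top_k."""
    q = tokenize(query)
    return sorted(kb, key=lambda s: len(q & tokenize(s)), reverse=True)[:top_k]

def build_prompt(query, kb):
    """Assemble the augmented prompt: retrieved context plus the user question."""
    context = "\n".join(retrieve(query, kb))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

prompt = build_prompt("How often is ultrasound screening done in cirrhosis?", KNOWLEDGE_BASE)
```

The sketch makes the abstract's two challenges concrete: retrieval quality (the overlap score may surface the wrong snippet) and the context window (only `top_k` snippets fit into the prompt).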
Affiliation(s)
- Mauro Giuffrè
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, New Haven, Connecticut, USA
- Department of Medical, Surgical, and Health Sciences, University of Trieste, Trieste, Italy
- Simone Kresevic
- Department of Engineering and Architecture, University of Trieste, Trieste, Italy
- Nicola Pugliese
- Division of Internal Medicine and Hepatology, Department of Gastroenterology, IRCCS Humanitas Research Hospital, Rozzano, Italy
- Department of Biomedical Sciences, Humanitas University, Pieve Emanuele, Italy
- Kisung You
- Department of Mathematics, Baruch College, City University of New York, New York, New York, USA
- Dennis L Shung
- Department of Internal Medicine (Digestive Diseases), Yale School of Medicine, New Haven, Connecticut, USA
6
Murugan M, Yuan B, Venner E, Ballantyne CM, Robinson KM, Coons JC, Wang L, Empey PE, Gibbs RA. Empowering personalized pharmacogenomics with generative AI solutions. J Am Med Inform Assoc 2024; 31:1356-1366. [PMID: 38447590] [PMCID: PMC11105140] [DOI: 10.1093/jamia/ocae039]
Abstract
OBJECTIVE This study evaluates an AI assistant developed using OpenAI's GPT-4 for interpreting pharmacogenomic (PGx) testing results, aiming to improve decision-making and knowledge sharing in clinical genetics and to enhance patient care with equitable access. MATERIALS AND METHODS The AI assistant employs retrieval-augmented generation (RAG), which combines retrieval and generative techniques, by harnessing a knowledge base (KB) that comprises data from the Clinical Pharmacogenetics Implementation Consortium (CPIC). It uses context-aware GPT-4 to generate tailored responses to user queries from this KB, further refined through prompt engineering and guardrails. RESULTS Evaluated against a specialized PGx question catalog, the AI assistant showed high efficacy in addressing user queries. Compared with OpenAI's ChatGPT 3.5, it demonstrated better performance, especially in provider-specific queries requiring specialized data and citations. Key areas for improvement include enhancing accuracy, relevancy, and representative language in responses. DISCUSSION The integration of context-aware GPT-4 with RAG significantly enhanced the AI assistant's utility. RAG's ability to incorporate domain-specific CPIC data, including recent literature, proved beneficial. Challenges persist, such as the need for specialized genetic/PGx models to improve accuracy and relevancy and addressing ethical, regulatory, and safety concerns. CONCLUSION This study underscores generative AI's potential for transforming healthcare provider support and patient accessibility to complex pharmacogenomic information. While careful implementation of large language models like GPT-4 is necessary, it is clear that they can substantially improve understanding of pharmacogenomic data. With further development, these tools could augment healthcare expertise, provider productivity, and the delivery of equitable, patient-centered healthcare services.
Affiliation(s)
- Mullai Murugan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Bo Yuan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
- Eric Venner
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
- Christie M Ballantyne
- Sections of Cardiology and Cardiovascular Research, Department of Medicine, Baylor College of Medicine, Houston, TX, United States
- James C Coons
- School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, United States
- Department of Pharmacy, UPMC Presbyterian-Shadyside Hospital, Pittsburgh, PA, United States
- Liwen Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Philip E Empey
- School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, United States
- Institute for Precision Medicine, UPMC/University of Pittsburgh, Pittsburgh, PA, United States
- Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
7
Jindal JA, Lungren MP, Shah NH. Ensuring useful adoption of generative artificial intelligence in healthcare. J Am Med Inform Assoc 2024; 31:1441-1444. [PMID: 38452298] [PMCID: PMC11105148] [DOI: 10.1093/jamia/ocae043]
Abstract
OBJECTIVES This article aims to examine how generative artificial intelligence (AI) can be adopted with the most value in health systems, in response to the Executive Order on AI. MATERIALS AND METHODS We reviewed how technology has historically been deployed in healthcare and evaluated recent deployments of both traditional AI and generative AI (GenAI) with a lens on value. RESULTS Traditional AI and GenAI differ in their capabilities and current modes of deployment, which has implications for their value in health systems. DISCUSSION Traditional AI, when applied top-down within a framework, can realize value in healthcare. GenAI applied top-down has unclear short-term value, but encouraging bottom-up adoption has the potential to provide more benefit to health systems and patients. CONCLUSION GenAI in healthcare can provide the most value for patients when health systems adapt culturally to grow with this new technology and its adoption patterns.
Affiliation(s)
- Jenelle A Jindal
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, United States
- Matthew P Lungren
- Health and Life Sciences, Microsoft Corporation, Redmond, WA 98052, United States
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, United States
- Department of Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, United States
- Nigam H Shah
- Department of Medicine, Stanford School of Medicine, Stanford, CA 94304, United States
- Clinical Excellence Research Center, Stanford School of Medicine, Stanford, CA 94304, United States
- Technology and Digital Solutions, Stanford Health Care, Palo Alto, CA 94304, United States
8
Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024; 8:e53985. [PMID: 38758588] [PMCID: PMC11143391] [DOI: 10.2196/53985]
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom checkup at the index visit and whose diagnosis was confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the inclusion of the final diagnosis in a list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases.
Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 172 (45.1%), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
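The year-over-year comparison in this abstract can be reproduced with a hand-rolled Pearson chi-square test on the reported counts (97/219, 32/72, 43/90). This is a sketch using only the published figures; the study itself presumably used a statistics package.

```python
# Pearson chi-square test of accuracy across the three study years,
# using the counts reported in the abstract above.
correct = [97, 32, 43]
totals = [219, 72, 90]
incorrect = [t - c for t, c in zip(totals, correct)]

grand_total = sum(totals)
p_correct = sum(correct) / grand_total  # pooled accuracy, 172/381 ~ 0.451

chi2 = 0.0
for c, i, t in zip(correct, incorrect, totals):
    for observed, expected in ((c, t * p_correct), (i, t * (1 - p_correct))):
        chi2 += (observed - expected) ** 2 / expected

# With df = (3-1)*(2-1) = 2, the 5% critical value is 5.99; the statistic here
# (~0.33) falls far below it, consistent with the reported P=.85.
```

The tiny statistic confirms the abstract's conclusion: yearly accuracies of 44.3%, 44.4%, and 47.7% are statistically indistinguishable at this sample size.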
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
9
Bridges JM. Computerized diagnostic decision support systems - a comparative performance study of Isabel Pro vs. ChatGPT4. Diagnosis (Berl) 2024; 0:dx-2024-0033. [PMID: 38709491] [DOI: 10.1515/dx-2024-0033]
Abstract
OBJECTIVES Validate the diagnostic accuracy of the artificial intelligence large language model ChatGPT4 by comparing diagnosis lists produced by ChatGPT4 to Isabel Pro. METHODS This study used 201 cases, comparing ChatGPT4 to Isabel Pro. System inputs were identical. Mean Reciprocal Rank (MRR) compares the rank of the correct diagnosis between systems. Isabel Pro ranks by the frequency with which the symptoms appear in the reference dataset; the mechanism ChatGPT4 uses to rank the diagnoses is unknown. Differences were assessed with a Wilcoxon Signed Rank Sum test. RESULTS Both systems produced comprehensive differential diagnosis lists. Isabel Pro's list appears immediately upon submission, while ChatGPT4 takes several minutes. Isabel Pro produced 175 (87.1%) correct diagnoses and ChatGPT4 165 (82.1%). The MRR was 0.428 (rank 2.31) for ChatGPT4 and 0.389 (rank 2.57) for Isabel Pro, roughly an average rank of three for each. ChatGPT4 outperformed on Recall at Rank 1, 5, and 10, with Isabel Pro outperforming at 20, 30, and 40. The Wilcoxon Signed Rank Sum test indicated that the sample size was inadequate to conclude that the systems are equivalent. ChatGPT4 fabricated citations and DOIs, producing 145 correct references (87.9%) but only 52 correct DOIs (31.5%). CONCLUSIONS This study validates the promise of clinical diagnostic decision support systems, including the large language model form of artificial intelligence (AI). Until the issue of hallucinated references and, perhaps, diagnoses is resolved in favor of absolute accuracy, clinicians will make cautious use of large language model systems in diagnosis, if at all.
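The two metrics used in this comparison, Mean Reciprocal Rank and Recall at Rank k, can be written out in a few lines. The rank list below is hypothetical, not the study's data; `None` marks a case where the correct diagnosis never appears.

```python
def mean_reciprocal_rank(ranks):
    """MRR = average of 1/rank over all cases, counting misses as 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of cases whose correct diagnosis appears within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical positions of the correct diagnosis in five differential lists.
ranks = [1, 2, 5, None, 3]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/5 + 0 + 1/3) / 5
```

Note that the reciprocal of an MRR behaves like a harmonic-style average rank: 1/0.428 is roughly 2.3, consistent with the "rank 2.31" reported for ChatGPT4 above.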
Affiliation(s)
- Joe M Bridges
- D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, USA
10
Zumla A, Hui DS. Keeping global focus on the many challenges of respiratory tract infections. Curr Opin Pulm Med 2024; 30:201-203. [PMID: 38517136] [DOI: 10.1097/mcp.0000000000001066]
Affiliation(s)
- Alimuddin Zumla
- Centre for Clinical Microbiology, Division of Infection and Immunity, University College London
- NIHR Biomedical Research Centre, UCL Hospitals NHS Foundation Trust, London, UK
- David S Hui
- Department of Medicine & Therapeutics and SH Ho Research Centre for Infectious Diseases, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
11
Sloss EA, Abdul S, Aboagyewah MA, Beebe A, Kendle K, Marshall K, Rosenbloom ST, Rossetti S, Grigg A, Smith KD, Mishuris RG. Toward Alleviating Clinician Documentation Burden: A Scoping Review of Burden Reduction Efforts. Appl Clin Inform 2024; 15:446-455. [PMID: 38839063] [PMCID: PMC11152769] [DOI: 10.1055/s-0044-1787007]
Abstract
BACKGROUND Studies have shown that documentation burden experienced by clinicians may lead to less direct patient care, increased errors, and job dissatisfaction. Implementing effective strategies within health care systems to mitigate documentation burden can result in improved clinician satisfaction and more time spent with patients. However, there is a gap in the literature regarding evidence-based interventions to reduce documentation burden. OBJECTIVES The objective of this review was to identify and comprehensively summarize the state of the science related to documentation burden reduction efforts. METHODS Following Joanna Briggs Institute Manual for Evidence Synthesis and Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines, we conducted a comprehensive search of multiple databases, including PubMed, Medline, Embase, CINAHL Complete, Scopus, and Web of Science. Additionally, we searched gray literature and used Google Scholar to ensure a thorough review. Two reviewers independently screened titles and abstracts, followed by full-text review, with a third reviewer resolving any discrepancies. Data extraction was performed and a table of evidence was created. RESULTS A total of 34 articles were included in the review, published between 2016 and 2022, with a majority focusing on the United States. The efforts described can be categorized into medical scribes, workflow improvements, educational interventions, user-driven approaches, technology-based solutions, combination approaches, and other strategies. The outcomes of these efforts often resulted in improvements in documentation time, workflow efficiency, provider satisfaction, and patient interactions. CONCLUSION This scoping review provides a comprehensive summary of health system documentation burden reduction efforts. The positive outcomes reported in the literature emphasize the potential effectiveness of these efforts. 
However, more research is needed to identify universally applicable best practices, and considerations should be given to the transfer of burden among members of the health care team, quality of education, clinician involvement, and evaluation methods.
Affiliation(s)
- Elizabeth A. Sloss
- Division of Health Systems and Community Based Care, College of Nursing, University of Utah, Utah, United States
- Shawna Abdul
- John D. Dingell VA Medical Center, Detroit, Michigan, United States
- Mayfair A. Aboagyewah
- Case Management, Mount Sinai Health System, MSH Main Campus, New York, New York, United States
- Alicia Beebe
- Saint Luke's Health System (MO), Kansas City, Missouri, United States
- Kathleen Kendle
- Section of Health Informatics, El Paso VA Health Care System, El Paso, Texas, United States
- Kyle Marshall
- Department of Emergency Medicine, Geisinger, Danville, Pennsylvania, United States
- S. Trent Rosenbloom
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States
- Sarah Rossetti
- Biomedical Informatics and Nursing, Columbia University Irving Medical Center, New York, New York, United States
- Aaron Grigg
- Department of Informatics, Grande Ronde Hospital, La Grande, Oregon, United States
- Kevin D. Smith
- Department of Pediatrics, University of Chicago Medicine, Chicago, Illinois, United States
- Rebecca G. Mishuris
- Digital, Mass General Brigham, Somerville, Massachusetts, United States
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States
12
Templin T, Perez MW, Sylvia S, Leek J, Sinnott-Armstrong N. Addressing 6 challenges in generative AI for digital health: A scoping review. PLOS Digit Health 2024; 3:e0000503. [PMID: 38781686] [PMCID: PMC11115971] [DOI: 10.1371/journal.pdig.0000503]
Abstract
Generative artificial intelligence (AI) can exhibit biases, compromise data privacy, misinterpret prompts that are adversarial attacks, and produce hallucinations. Despite the potential of generative AI for many applications in digital health, practitioners must understand these tools and their limitations. This scoping review pays particular attention to the challenges with generative AI technologies in medical settings and surveys potential solutions. Using PubMed, we identified a total of 120 articles published by March 2024, which reference and evaluate generative AI in medicine, from which we synthesized themes and suggestions for future work. After first discussing general background on generative AI, we focus on collecting and presenting 6 challenges key for digital health practitioners and specific measures that can be taken to mitigate these challenges. Overall, bias, privacy, hallucination, and regulatory compliance were frequently considered, while other concerns around generative AI, such as overreliance on text models, adversarial misprompting, and jailbreaking, are not commonly evaluated in the current literature.
Affiliation(s)
- Tara Templin
- Department of Health Policy and Management, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Monika W. Perez
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Sean Sylvia
- Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Department of Health Policy and Management, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Sheps Center for Health Services Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Jeff Leek
- Biostatistics Program, Fred Hutchinson Cancer Center, Seattle, Washington, United States of America
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Nasa Sinnott-Armstrong
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Herbold Computational Biology Program, Fred Hutchinson Cancer Center, Seattle, Washington, United States of America
13
Bibbins-Domingo K, Flanagin A, Christiansen S, Park H, Curfman G. 2023 Year in Review and What's Ahead at JAMA. JAMA 2024; 331:1181-1184. [PMID: 38457136] [DOI: 10.1001/jama.2024.3643]
Affiliation(s)
- Hannah Park
- Managing Director of Strategy and Planning, JAMA and the JAMA Network
14
Williams CY, Bains J, Tang T, Patel K, Lucas AN, Chen F, Miao BY, Butte AJ, Kornblith AE. Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries. medRxiv [Preprint] 2024:2024.04.03.24305088. [PMID: 38633805 PMCID: PMC11023681 DOI: 10.1101/2024.04.03.24305088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/19/2024]
Abstract
Importance Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. Objective To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and to evaluate the prevalence and type of errors across each section of the discharge summary. Design Cross-sectional study. Setting University of California, San Francisco ED. Participants We identified all adult ED visits from 2012 to 2023 with an ED clinician note and randomly selected a sample of 100 ED visits for GPT-summarization. Exposure We investigated the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary. Main Outcomes and Measures GPT-3.5-turbo- and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: 1) inaccuracy of GPT-summarized information; 2) hallucination of information; 3) omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors. Results From 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients' Physical Examination findings or History of Presenting Complaint. Conclusions and Relevance In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.
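The three-part error taxonomy used by the reviewers (inaccuracy, hallucination, omission) reduces to a simple tally once annotations are collected. The Python sketch below illustrates that tally; the per-summary flag dictionaries are a hypothetical stand-in for the study's actual review worksheets, not its published code.

```python
# Sketch: computing error prevalence across GPT-generated discharge summaries.
# The annotation format (one dict of boolean error flags per summary) is a
# hypothetical stand-in for the reviewers' worksheets.
from collections import Counter

def error_prevalence(annotations):
    """Return the fraction of error-free summaries and the per-type
    prevalence of errors across all annotated summaries."""
    n = len(annotations)
    counts = Counter()
    error_free = 0
    for flags in annotations:
        if not any(flags.values()):
            error_free += 1
        for error_type, present in flags.items():
            counts[error_type] += int(present)
    return error_free / n, {k: v / n for k, v in counts.items()}

# Toy example: 4 summaries annotated for the three error types.
sample = [
    {"inaccuracy": False, "hallucination": False, "omission": False},
    {"inaccuracy": True,  "hallucination": False, "omission": True},
    {"inaccuracy": False, "hallucination": True,  "omission": False},
    {"inaccuracy": False, "hallucination": False, "omission": True},
]
clean_rate, prevalence = error_prevalence(sample)
```

In the toy sample, one of four summaries is error-free (25%), mirroring how the study's headline figures (33% for GPT-4, 10% for GPT-3.5-turbo) would be derived from 100 annotated summaries.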
Affiliation(s)
- Jaskaran Bains
- Department of Emergency Medicine, University of California, San Francisco
- Tianyu Tang
- Department of Emergency Medicine, University of California, San Francisco
- Kishan Patel
- Department of Emergency Medicine, University of California, San Francisco
- Alexa N. Lucas
- Department of Emergency Medicine, University of California, San Francisco
- Fiona Chen
- Department of Emergency Medicine, University of California, San Francisco
- Brenda Y. Miao
- Bakar Computational Health Sciences Institute, University of California, San Francisco
- Atul J. Butte
- Bakar Computational Health Sciences Institute, University of California, San Francisco
- Aaron E. Kornblith
- Bakar Computational Health Sciences Institute, University of California, San Francisco
- Department of Emergency Medicine, University of California, San Francisco
15
Gupta RK, Pawa A. Beam me up, Scotty! Apple Vision Pro highlights how we could teleport ultrasound-guided regional anesthesia education into the future. Reg Anesth Pain Med 2024:rapm-2024-105424. [PMID: 38580337 DOI: 10.1136/rapm-2024-105424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2024] [Accepted: 03/22/2024] [Indexed: 04/07/2024]
Affiliation(s)
- Rajnish K Gupta
- Anesthesiology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Amit Pawa
- Department of Anaesthesia, Guy's & St Thomas' Hospital, London, UK
- Department of Theatres, Anaesthesia and Perioperative Medicine, Cleveland Clinic London, London, UK
16
Moingeon P, Garbay C, Dahan M, Fermont I, Benmakhlouf A, Gouyette A, Poitou P, Saint-Pierre A. [The revolution of AI in drug development]. Med Sci (Paris) 2024; 40:369-376. [PMID: 38651962 DOI: 10.1051/medsci/2024028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/25/2024] Open
Abstract
Artificial intelligence and machine learning enable the construction of predictive models, which are currently used to assist in decision-making throughout the process of drug discovery and development. These computational models can be used to represent the heterogeneity of a disease, identify therapeutic targets, design and optimize drug candidates, and evaluate the efficacy of these drugs on virtual patients or digital twins. By combining detailed patient characteristics with the prediction of potential drug-candidate properties, artificial intelligence promotes the emergence of a "computational" precision medicine, allowing for more personalized treatments, better tailored to patient specificities with the aid of such predictive models. Based on such new capabilities, a mixed reality approach to the development of new drugs is being adopted by the pharmaceutical industry, which integrates the outputs of predictive virtual models with real-world empirical studies.
17
Leopold SS. Editor's Spotlight/Take 5: How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information. Clin Orthop Relat Res 2024; 482:574-577. [PMID: 38446430 PMCID: PMC10936992 DOI: 10.1097/corr.0000000000003006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 01/23/2024] [Indexed: 03/07/2024]
Affiliation(s)
- Seth S Leopold
- Editor-in-Chief, Clinical Orthopaedics and Related Research®, Park Ridge, IL, USA
18
Zhenzhu L, Jingfeng Z, Wei Z, Jianjun Z, Yinshui X. GPT-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation. Sci Rep 2024; 14:7626. [PMID: 38561445 PMCID: PMC10985066 DOI: 10.1038/s41598-024-58514-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Accepted: 03/30/2024] [Indexed: 04/04/2024] Open
Abstract
This study explored the application of generative pre-trained transformer (GPT) agents based on medical guidelines, using large language model (LLM) technology, for traumatic brain injury (TBI) rehabilitation-related questions. To assess the effectiveness of multiple agents (GPT-agents) created using GPT-4, a comparison was conducted using direct GPT-4 as the control group (GPT-4). The GPT-agents comprised multiple agents with distinct functions, including "Medical Guideline Classification", "Question Retrieval", "Matching Evaluation", "Intelligent Question Answering (QA)", and "Results Evaluation and Source Citation". Brain rehabilitation questions were selected from the doctor-patient Q&A database for assessment. The primary endpoint was a better answer. The secondary endpoints were accuracy, completeness, explainability, and empathy. Thirty questions were answered; overall, the GPT-agents took substantially longer and used more words to respond than GPT-4 (time: 54.05 vs. 9.66 s; words: 371 vs. 57). However, the GPT-agents provided superior answers in more cases than GPT-4 (66.7% vs. 33.3%). The GPT-agents surpassed GPT-4 in the accuracy evaluation (3.8 ± 1.02 vs. 3.2 ± 0.96, p = 0.0234). No difference in incomplete answers was found (2 ± 0.87 vs. 1.7 ± 0.79, p = 0.213). However, in the explainability (2.79 ± 0.45 vs. 07 ± 0.52, p < 0.001) and empathy (2.63 ± 0.57 vs. 1.08 ± 0.51, p < 0.001) evaluations, the GPT-agents performed notably better. Based on medical guidelines, GPT-agents enhanced the accuracy and empathy of responses to TBI rehabilitation questions. This study provides guideline references and demonstrates improved clinical explainability. However, further validation through multicenter trials in a clinical setting is necessary. This study offers practical insights and establishes groundwork for the potential integration of LLM agents into medicine.
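The agent chain named in this abstract can be pictured as a sequence of specialized LLM calls, each consuming the previous agent's output. The sketch below is purely illustrative: `call_llm`, the prompts, and the guideline snippets are invented stand-ins for GPT-4 and the authors' actual guideline corpus.

```python
# Sketch of a guideline-grounded multi-agent QA pipeline in the spirit of
# the study: classify -> retrieve -> answer -> evaluate/cite. The stubbed
# call_llm and the guideline text are hypothetical, not the authors' code.

GUIDELINES = {
    "rehabilitation": "Early mobilisation is recommended once the patient is stable.",
    "imaging": "CT is the first-line modality for acute TBI.",
}

def call_llm(role: str, prompt: str) -> str:
    # Stand-in for a GPT-4 call; routes on the agent role for this demo.
    if role == "classify":
        return "rehabilitation" if "rehab" in prompt.lower() else "imaging"
    if role == "answer":
        return prompt  # echo the guideline-augmented prompt as the "answer"
    return "source: medical guideline"

def pipeline(question: str) -> dict:
    topic = call_llm("classify", question)                      # Medical Guideline Classification
    guideline = GUIDELINES[topic]                               # Question Retrieval / Matching Evaluation
    answer = call_llm("answer", f"{guideline} Q: {question}")   # Intelligent QA
    citation = call_llm("cite", answer)                         # Results Evaluation and Source Citation
    return {"topic": topic, "answer": answer, "citation": citation}

result = pipeline("When should rehab exercises begin after TBI?")
```

Grounding each answer in a retrieved guideline passage, rather than asking the base model directly, is what the study credits for the gains in accuracy and source citation, at the cost of the longer response times it reports.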
Affiliation(s)
- Li Zhenzhu
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Department of Neurosurgery, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, 315211, China
- Zhang Jingfeng
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Zhou Wei
- Department of Neurosurgery, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Zheng Jianjun
- Radiology Department, Ningbo NO.2 Hospital, Ningbo, 315211, China
- Xia Yinshui
- Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, 315211, China
19
Baumgart A, Beck G, Ghezel-Ahmadi D. [Artificial intelligence in intensive care medicine]. Med Klin Intensivmed Notfmed 2024; 119:189-198. [PMID: 38546864 DOI: 10.1007/s00063-024-01117-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Revised: 01/29/2024] [Accepted: 02/05/2024] [Indexed: 04/05/2024]
Abstract
The integration of artificial intelligence (AI) into intensive care medicine has made considerable progress in recent studies, particularly in the areas of predictive analytics, early detection of complications, and the development of decision support systems. The main challenges remain the availability and quality of data, the reduction of bias, and the need for explainable results from algorithms and models. Methods to explain these systems are essential to increase trust, understanding, and ethical awareness among healthcare professionals and patients. Proper training of healthcare professionals in AI principles, terminology, ethical considerations, and practical application is crucial for the successful use of AI. Careful assessment of the impact of AI on patient autonomy and data protection is essential for its responsible use in intensive care medicine. A balance between ethical and practical considerations must be maintained to ensure patient-centered care while complying with data protection regulations. Synergistic collaboration between clinicians, AI engineers, and regulators is critical to realizing the full potential of AI in intensive care medicine and maximizing its positive impact on patient care. Future research and development efforts should focus on improving AI models for real-time predictions, increasing the accuracy and utility of AI-based closed-loop systems, and overcoming ethical, technical, and regulatory challenges, especially in generative AI systems.
Affiliation(s)
- André Baumgart
- Zentrum für Präventivmedizin und Digitale Gesundheit, Medizinische Fakultät Mannheim der Universität Heidelberg, Theodor-Kutzer-Ufer 1-3, 68167, Mannheim, Deutschland
- Grietje Beck
- Abteilung für Anästhesiologie, Intensivmedizin und Schmerzmedizin, Universitätsmedizin Mannheim gGmbH, Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Deutschland
- David Ghezel-Ahmadi
- Abteilung für Anästhesiologie, Intensivmedizin und Schmerzmedizin, Universitätsmedizin Mannheim gGmbH, Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Deutschland
20
Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, Kim H, Moxon S, Reese JT, Haendel MA, Robinson PN, Mungall CJ. Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 2024; 40:btae104. [PMID: 38383067 PMCID: PMC10924283 DOI: 10.1093/bioinformatics/btae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 12/16/2023] [Accepted: 02/20/2024] [Indexed: 02/23/2024] Open
Abstract
MOTIVATION Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a knowledge extraction approach that relies on the ability of large language models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and to return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical-to-disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing relation extraction methods, but greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language-interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.
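The recursive, schema-driven loop that SPIRES performs can be sketched in a few lines. The real implementation is the OntoGPT package linked above; the `ask_llm` stub, the `Treatment` schema, and the canned completions below are illustrative assumptions only.

```python
# Minimal sketch of SPIRES-style schema-guided extraction: interrogate an
# LLM once per schema slot and assemble a record conforming to the schema.
# ask_llm is a hypothetical stand-in for an actual LLM call.

def ask_llm(prompt: str) -> str:
    # Canned completions keyed on the field being requested, mimicking
    # zero-shot structured answers from a real model.
    canned = {
        "label": "treatment of migraine with sumatriptan",
        "drug": "sumatriptan",
        "disease": "migraine",
    }
    for field, answer in canned.items():
        if f"{field}:" in prompt:
            return answer
    return ""

# A toy single-class schema; SPIRES supports arbitrarily nested schemas.
SCHEMA = {
    "Treatment": ["label", "drug", "disease"],
}

def extract(text: str, schema_class: str) -> dict:
    """Interrogate the LLM for each slot of the schema class and return
    a dict conforming to the schema."""
    result = {}
    for slot in SCHEMA[schema_class]:
        prompt = f"From the text below, give the value of {slot}:\n{text}"
        result[slot] = ask_llm(prompt)
    return result

record = extract("Sumatriptan is used to treat migraine.", "Treatment")
```

In the full method, each extracted value would additionally be grounded to an ontology identifier (e.g., a drug or disease vocabulary term), which is the step the abstract highlights as surpassing an LLM's native grounding ability.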
Affiliation(s)
- J Harry Caufield
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Harshad Hegde
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Vincent Emonet
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, 6200 MD Maastricht, The Netherlands
- Nomi L Harris
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Marcin P Joachimiak
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- HyeongSik Kim
- Robert Bosch LLC, Sunnyvale, CA 94085, United States
- Sierra Moxon
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Justin T Reese
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Melissa A Haendel
- Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO 80217, United States
- Christopher J Mungall
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
21
Ayers JW, Desai N, Smith DM. Regulate Artificial Intelligence in Health Care by Prioritizing Patient Outcomes. JAMA 2024; 331:639-640. [PMID: 38285467 DOI: 10.1001/jama.2024.0549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/30/2024]
Abstract
This Viewpoint argues for a shift in focus by the White House executive order on artificial intelligence from regulatory targets to patient outcomes.
Affiliation(s)
- John W Ayers
- Qualcomm Institute, University of California San Diego, La Jolla
- Altman Clinical Translational Research Institute, University of California San Diego, La Jolla
- Nimit Desai
- Qualcomm Institute, University of California San Diego, La Jolla
- School of Medicine, University of California San Diego, La Jolla
- Davey M Smith
- Altman Clinical Translational Research Institute, University of California San Diego, La Jolla
- Division of Infectious Diseases and Global Public Health, Department of Medicine, University of California San Diego, La Jolla