1
Bedmutha MS, Bascom E, Sladek KR, Tobar K, Casanova-Perez R, Andreiu A, Bhat A, Mangal S, Wood BR, Sabin J, Pratt W, Weibel N, Hartzler AL. Artificial intelligence-generated feedback on social signals in patient-provider communication: technical performance, feedback usability, and impact. JAMIA Open 2024;7:ooae106. PMID: 39430803; PMCID: PMC11488971; DOI: 10.1093/jamiaopen/ooae106.
Abstract
Objectives: Implicit bias perpetuates health care inequities and manifests in patient-provider interactions, particularly through nonverbal social cues such as dominance. We investigated the use of artificial intelligence (AI) for automated communication assessment and feedback during primary care visits to raise clinician awareness of bias in patient interactions.
Materials and Methods: We (1) assessed the technical performance of our AI models by building a machine-learning pipeline that automatically detects social signals in patient-provider interactions from 145 primary care visits, (2) engaged 24 clinicians to design usable AI-generated communication feedback for their workflow, and (3) evaluated the impact of our AI-based approach in a prospective cohort of 108 primary care visits.
Results: Findings demonstrate the feasibility of AI models to identify social signals, such as dominance, warmth, engagement, and interactivity, in nonverbal patient-provider communication. Although engaged clinicians preferred feedback delivered in personalized dashboards, they found nonverbal cues difficult to interpret, motivating social signals as an alternative feedback mechanism. Impact evaluation demonstrated fairness in all AI models, with better generalizability for provider dominance, provider engagement, and patient warmth. Stronger clinician implicit race bias was associated with less provider dominance and warmth. Although clinicians expressed overall interest in our AI approach, they recommended improvements to enhance acceptability, feasibility, and implementation in telehealth and medical education contexts.
Discussion and Conclusion: Findings demonstrate promise for AI-driven communication assessment and feedback systems focused on social signals. Future work should improve the performance of this approach, personalize models, contextualize feedback, and investigate system implementation in educational workflows. This work exemplifies a systematic, multistage approach for evaluating AI tools designed to raise clinician awareness of implicit bias and promote patient-centered, equitable health care interactions.
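The abstract describes a machine-learning pipeline that classifies social signals (for example, provider dominance) from nonverbal features of 145 visits. As a rough, hedged illustration of how such a per-visit classifier might be trained and cross-validated, the sketch below assumes pre-extracted numeric features and binary signal labels; the feature names, model choice, and placeholder data are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch only: the study's actual features and models are not
# specified here. Assumes per-visit nonverbal features (e.g., speaking-time
# ratio, interruption count, pitch variance) have already been extracted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(145, 6))       # placeholder features for 145 visits
y = rng.integers(0, 2, size=145)    # placeholder labels: signal high vs low

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUROC: {scores.mean():.2f} +/- {scores.std():.2f}")
```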
Affiliation(s)
- Manas Satish Bedmutha
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Emily Bascom
  - Department of Human Centered Design and Engineering, School of Engineering, University of Washington, Seattle, WA 98195, United States
- Kimberly R Sladek
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Kelly Tobar
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Reggie Casanova-Perez
  - Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA 98195, United States
- Alexandra Andreiu
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Amrit Bhat
  - Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA 98195, United States
- Sabrina Mangal
  - Department of Biobehavioral Nursing and Health Informatics, University of Washington School of Nursing, Seattle, WA 98195, United States
- Brian R Wood
  - Department of Medicine, School of Medicine, University of Washington, Seattle, WA 98195, United States
- Janice Sabin
  - Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA 98195, United States
- Wanda Pratt
  - Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA 98195, United States
  - Information School, University of Washington, Seattle, WA 98195, United States
- Nadir Weibel
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Andrea L Hartzler
  - Department of Biomedical Informatics and Medical Education, School of Medicine, University of Washington, Seattle, WA 98195, United States
2
Garcia-Agundez A, Schmajuk G, Yazdany J. Promises and pitfalls of artificial intelligence models in forecasting rheumatoid arthritis treatment response and outcomes. Semin Arthritis Rheum 2024:152584. PMID: 39550309; DOI: 10.1016/j.semarthrit.2024.152584.
Affiliation(s)
- Augusto Garcia-Agundez
  - Division of Rheumatology, University of California San Francisco, 2540 23rd St, San Francisco, CA 94110
- Gabriela Schmajuk
  - Division of Rheumatology, University of California San Francisco, 2540 23rd St, San Francisco, CA 94110
- Jinoos Yazdany
  - Division of Rheumatology, University of California San Francisco, 2540 23rd St, San Francisco, CA 94110
3
Fu YV, Ramachandran GK, Halwani A, McInnes BT, Xia F, Lybarger K, Yetisgen M, Uzuner Ö. CACER: Clinical concept Annotations for Cancer Events and Relations. J Am Med Inform Assoc 2024;31:2583-2594. PMID: 39225779; PMCID: PMC11491616; DOI: 10.1093/jamia/ocae231.
Abstract
OBJECTIVE: Clinical notes contain unstructured representations of patient histories, including the relationships between medical problems and prescription drugs. To investigate the relationship between cancer drugs and their associated symptom burden, we extract structured, semantic representations of medical problem and drug information from the clinical narratives of oncology notes.
MATERIALS AND METHODS: We present Clinical concept Annotations for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48 000 medical problems and drug events and 10 000 drug-problem and problem-problem relations. Leveraging CACER, we develop and evaluate transformer-based information extraction models such as Bidirectional Encoder Representations from Transformers (BERT), Fine-tuned Language Net Text-To-Text Transfer Transformer (Flan-T5), Large Language Model Meta AI (Llama3), and Generative Pre-trained Transformers-4 (GPT-4) using fine-tuning and in-context learning (ICL).
RESULTS: In event extraction, the fine-tuned BERT and Llama3 models achieved the highest performance at 88.2-88.0 F1, which is comparable to the inter-annotator agreement (IAA) of 88.4 F1. In relation extraction, the fine-tuned BERT, Flan-T5, and Llama3 achieved the highest performance at 61.8-65.3 F1. GPT-4 with ICL achieved the worst performance across both tasks.
DISCUSSION: The fine-tuned models significantly outperformed GPT-4 in ICL, highlighting the importance of annotated training data and model optimization. Furthermore, the BERT models performed similarly to Llama3. For our task, large language models offer no performance advantage over the smaller BERT models.
CONCLUSIONS: We introduce CACER, a novel corpus with fine-grained annotations for medical problems, drugs, and their relationships in clinical narratives of oncology notes. State-of-the-art transformer models achieved performance comparable to IAA for several extraction tasks.
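The F1 figures above compare model output with gold annotations at the level of extracted events and relations. As a hedged illustration of how such scores and their comparison to inter-annotator agreement are computed, the sketch below uses exact-match, span-level F1 on toy (start, end, type) tuples; CACER's actual matching criteria (for example, partial-span credit or attribute matching) may differ.

```python
# Minimal sketch of span-level F1 for extracted events; each annotation is a
# (start, end, type) tuple and only exact matches count as true positives.
def span_f1(gold: set, predicted: set) -> float:
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(12, 20, "problem"), (35, 44, "drug"), (60, 71, "problem")}
pred = {(12, 20, "problem"), (35, 44, "drug"), (80, 88, "drug")}
print(f"F1 = {span_f1(gold, pred):.2f}")  # 0.67 on this toy example
```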
Affiliation(s)
- Yujuan Velvin Fu
  - Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA 98195, United States
- Ahmad Halwani
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, United States
- Bridget T McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, United States
- Fei Xia
  - Department of Linguistics, University of Washington, Seattle, WA 98195, United States
- Kevin Lybarger
  - Department of Information Sciences and Technology, George Mason University, Fairfax, VA 22030, United States
- Meliha Yetisgen
  - Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA 98195, United States
- Özlem Uzuner
  - Department of Information Sciences and Technology, George Mason University, Fairfax, VA 22030, United States
4
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2024:2825147. PMID: 39405325; PMCID: PMC11480901; DOI: 10.1001/jama.2024.21700.
Abstract
Importance: Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.
Objective: To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.
Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.
Study Selection: Studies evaluating 1 or more LLMs in health care.
Data Extraction and Synthesis: Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.
Results: Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.
Conclusions and Relevance: Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
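The review categorized studies via keyword searches across the five components listed in the objective. A minimal sketch of such keyword-based tagging appears below; the category names are taken from the abstract, but the keyword lists, example text, and function name are illustrative assumptions rather than the reviewers' actual protocol.

```python
# Hedged sketch of keyword-based categorization of study abstracts into
# evaluation dimensions; keyword lists here are illustrative only.
CATEGORY_KEYWORDS = {
    "accuracy": ["accuracy", "correct", "exact match"],
    "fairness, bias, and toxicity": ["fairness", "bias", "toxicity"],
    "calibration and uncertainty": ["calibration", "uncertainty", "confidence"],
}

def categorize(abstract: str) -> list:
    text = abstract.lower()
    return [cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(word in text for word in words)]

example = "We evaluate diagnostic accuracy and measure demographic bias of the model."
print(categorize(example))  # ['accuracy', 'fairness, bias, and toxicity']
```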
Affiliation(s)
- Suhana Bedi
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
- Yutong Liu
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Lucy Orr-Ewing
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Dev Dash
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Sanmi Koyejo
  - Department of Computer Science, Stanford University, Stanford, California
- Alison Callahan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Jason A. Fries
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Michael Wornow
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Akshay Swaminathan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Hyo Jung Hong
  - Department of Anesthesiology, Stanford University, Stanford, California
- Mehr Kashyap
  - Stanford University School of Medicine, Stanford, California
- Akash R. Chaurasia
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Nirav R. Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Karandeep Singh
  - Digital Health Innovation, University of California San Diego Health, San Diego
- Troy Tazbaz
  - Digital Health Center of Excellence, US Food and Drug Administration, Washington, DC
- Arnold Milstein
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Michael A. Pfeffer
  - Department of Medicine, Stanford University School of Medicine, Stanford, California
- Nigam H. Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
5
Hsueh JY, Nethala D, Singh S, Hyman JA, Gelikman DG, Linehan WM, Ball MW. Exploring the Feasibility of GPT-4 as a Data Extraction Tool for Renal Surgery Operative Notes. Urol Pract 2024;11:782-789. PMID: 38913566; PMCID: PMC11335444; DOI: 10.1097/upj.0000000000000599.
Abstract
INTRODUCTION: GPT-4 is a large language model with potential for multiple applications in urology. Our study sought to evaluate GPT-4's performance in data extraction from renal surgery operative notes.
METHODS: GPT-4 was queried to extract information on laterality, surgery, approach, estimated blood loss, and ischemia time from deidentified operative notes. Match rates were determined by the number of "matched" data points between GPT-4 and human-curated extraction. Accuracy rates were calculated after manually reviewing "not matched" data points. Cohen's kappa and the intraclass correlation coefficient were used to evaluate interrater agreement and reliability.
RESULTS: Our cohort consisted of 1498 renal surgeries from 2003 to 2023. Match rates were high for laterality (94.4%), surgery (92.5%), and approach (89.4%), but lower for estimated blood loss (77.1%) and ischemia time (25.6%). GPT-4 was more accurate for estimated blood loss (90.3% vs 85.5% human curated) and similarly accurate for laterality (95.2% vs 95.3% human curated). Human-curated accuracy rates were higher for surgery (99.3% vs 93% GPT-4), approach (97.9% vs 90.8% GPT-4), and ischemia time (95.6% vs 30.7% GPT-4). Cohen's kappa was 0.96 for laterality, 0.83 for approach, and 0.71 for surgery. The intraclass correlation coefficient was 0.62 for estimated blood loss and 0.09 for ischemia time.
CONCLUSIONS: Match and accuracy rates were higher for categorical variables. GPT-4 data extraction was particularly error prone for variables with heterogeneous documentation styles. The role of a standard operative template to aid data extraction will be explored in the future. GPT-4 can be utilized as a helpful and efficient data extraction tool with manual feedback.
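The agreement statistics reported here (match rate, accuracy after manual review of mismatches, Cohen's kappa, intraclass correlation) can be illustrated on toy data. The sketch below, assuming scikit-learn is available, computes a match rate and Cohen's kappa for a hypothetical laterality field; the labels and values are illustrative and do not reproduce the study's field definitions or manual adjudication.

```python
# Hedged sketch of agreement between GPT-4 extraction and human curation for a
# single categorical field, using scikit-learn's cohen_kappa_score on toy data.
from sklearn.metrics import cohen_kappa_score

human = ["left", "right", "left", "left", "right", "left"]
gpt4  = ["left", "right", "left", "right", "right", "left"]

match_rate = sum(h == g for h, g in zip(human, gpt4)) / len(human)
kappa = cohen_kappa_score(human, gpt4)
print(f"Match rate: {match_rate:.1%}, Cohen's kappa: {kappa:.2f}")
```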
Affiliation(s)
- Jessica Y. Hsueh
  - Urologic Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
- Daniel Nethala
  - Urologic Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
- Shiva Singh
  - Radiology and Imaging Services, Clinical Center, National Institutes of Health, Bethesda, MD
- Jason A. Hyman
  - Urologic Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
- David G. Gelikman
  - Molecular Imaging Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
- W. Marston Linehan
  - Urologic Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
- Mark W. Ball
  - Urologic Oncology Branch, National Cancer Institute, National Institutes of Health, Bethesda, MD
6
Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, Bluethgen C, Pareek A, Polacin M, Reis EP, Seehofnerová A, Rohatgi N, Hosamani P, Collins W, Ahuja N, Langlotz CP, Hom J, Gatidis S, Pauly J, Chaudhari AS. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024;30:1134-1142. PMID: 38413730; PMCID: PMC11479659; DOI: 10.1038/s41591-024-02855-5.
Abstract
Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP) tasks, their effectiveness on a diverse range of clinical summarization tasks remains unproven. Here we applied adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes and doctor-patient dialogue. Quantitative assessments with syntactic, semantic and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.
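The quantitative assessment described above relies on syntactic, semantic, and conceptual NLP metrics to compare model summaries with reference summaries. As a hedged example of the syntactic end of that spectrum, the sketch below scores a toy model summary against a reference with ROUGE, assuming the open-source rouge-score package is installed (pip install rouge-score); the study's actual metric suite, data, and reader-study protocol are not reproduced here.

```python
# Illustrative sketch: ROUGE-1 and ROUGE-L between a reference summary written
# by an expert and a candidate summary produced by a model.
from rouge_score import rouge_scorer

reference = "No acute cardiopulmonary abnormality."
candidate = "No acute abnormality in the chest."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```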
Affiliation(s)
- Dave Van Veen
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
- Cara Van Uden
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Louis Blankemeier
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
- Jean-Benoit Delbrouck
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
- Asad Aali
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- Christian Bluethgen
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Diagnostic and Interventional Radiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Anuj Pareek
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Copenhagen University Hospital, Copenhagen, Denmark
- Malgorzata Polacin
  - Diagnostic and Interventional Radiology, University Hospital Zurich, University of Zurich, Zurich, Switzerland
- Eduardo Pontes Reis
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Albert Einstein Israelite Hospital, São Paulo, Brazil
- Anna Seehofnerová
  - Department of Medicine, Stanford University, Stanford, CA, USA
  - Department of Radiology, Stanford University, Stanford, CA, USA
- Nidhi Rohatgi
  - Department of Medicine, Stanford University, Stanford, CA, USA
  - Department of Neurosurgery, Stanford University, Stanford, CA, USA
- Poonam Hosamani
  - Department of Medicine, Stanford University, Stanford, CA, USA
- William Collins
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Neera Ahuja
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Curtis P Langlotz
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Department of Medicine, Stanford University, Stanford, CA, USA
  - Department of Radiology, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jason Hom
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Sergios Gatidis
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Department of Radiology, Stanford University, Stanford, CA, USA
- John Pauly
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Akshay S Chaudhari
  - Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, CA, USA
  - Department of Radiology, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Stanford Cardiovascular Institute, Stanford, CA, USA