1
Mathis WS, Zhao S, Pratt N, Weleff J, De Paoli S. Inductive thematic analysis of healthcare qualitative interviews using open-source large language models: How does it compare to traditional methods? Comput Methods Programs Biomed 2024; 255:108356. [PMID: 39067136] [DOI: 10.1016/j.cmpb.2024.108356]
Abstract
BACKGROUND Large language models (LLMs) are generative artificial intelligence systems that have ignited much interest and discussion about their utility in clinical and research settings. Despite this interest, there has been little analysis of their use in qualitative thematic analysis comparing their current ability to human coding and analysis, and no published analysis of their use on real-world protected health information. OBJECTIVE Here we fill that gap in the literature by comparing an LLM to standard human thematic analysis on real-world, semi-structured interviews of both patients and clinicians in a psychiatric setting. METHODS Using a 70-billion-parameter open-source LLM running on local hardware and advanced prompt-engineering techniques, we produced themes that summarized the full corpus of interviews in minutes. We then applied three different evaluation methods to quantify the similarity between themes produced by the LLM and those produced by humans. RESULTS These evaluations revealed similarities ranging from moderate to substantial (Jaccard similarity coefficients of 0.44-0.69), which are promising preliminary results. CONCLUSION Our study demonstrates that open-source LLMs can generate robust themes from qualitative data, achieving substantial similarity to human-generated themes. The validation of LLMs in thematic analysis, coupled with the evaluation methodologies, highlights their potential to enhance and democratize qualitative research across diverse fields.
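As a concrete illustration of the reported metric, here is a minimal sketch of the Jaccard similarity coefficient behind the 0.44-0.69 range, assuming themes have already been reduced to comparable sets of labels; the theme labels below are hypothetical and the authors' full evaluation pipeline is not reproduced.

```python
def jaccard_similarity(set_a: set[str], set_b: set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, ranging from 0 to 1."""
    if not set_a and not set_b:
        return 1.0  # convention: two empty sets are identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical theme labels after matching LLM output to human codes.
human_themes = {"access to care", "stigma", "medication side effects", "trust in clinicians"}
llm_themes = {"access to care", "stigma", "medication side effects", "cost of treatment"}

print(f"Jaccard similarity: {jaccard_similarity(human_themes, llm_themes):.2f}")  # 0.60
```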
Affiliation(s)
- Walter S Mathis
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- Sophia Zhao
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- Nicholas Pratt
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- Jeremy Weleff
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- Stefano De Paoli
- Division of Sociology, School of Business, Law and Social Sciences, Abertay University, Dundee, Scotland, United Kingdom
2
Wan P, Huang Z, Tang W, Nie Y, Pei D, Deng S, Chen J, Zhou Y, Duan H, Chen Q, Long E. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat Med 2024. [PMID: 39009780] [DOI: 10.1038/s41591-024-03148-7]
Abstract
Reception is an essential process for patients seeking medical care and a critical component of the healthcare experience. However, current communication systems rely mainly on human effort, which is both labor and knowledge intensive. A promising alternative is to leverage the capabilities of large language models (LLMs) to assist communication at medical center reception sites. Here we curated a unique dataset comprising 35,418 real-world audio-recorded conversations between outpatients and receptionist nurses from 10 reception sites across two medical centers, and used it to develop a site-specific prompt engineering chatbot (SSPEC). SSPEC resolved patient queries efficiently, addressing a higher proportion of queries within two rounds of queries and responses (Q&Rs; 68.0% ≤2 rounds) than nurse-led sessions (50.5% ≤2 rounds; P = 0.009) across administrative, triaging and primary care concerns. We then established a nurse-SSPEC collaboration model in which nurses oversee the uncertainties SSPEC encounters during real-world deployment. In a single-center randomized controlled trial involving 2,164 participants, the primary endpoint showed that the nurse-SSPEC collaboration model received higher satisfaction ratings from patients (3.91 ± 0.90 versus 3.39 ± 1.15 in the nurse group, P < 0.001). Key secondary outcomes included a reduced rate of repeated Q&Rs (3.2% versus 14.4% in the nurse group, P < 0.001), fewer negative emotions during visits (2.4% versus 7.8% in the nurse group, P < 0.001) and enhanced response quality in terms of integrity (4.37 ± 0.95 versus 3.42 ± 1.22 in the nurse group, P < 0.001), empathy (4.14 ± 0.98 versus 3.27 ± 1.22 in the nurse group, P < 0.001) and readability (3.86 ± 0.95 versus 3.71 ± 1.07 in the nurse group, P = 0.006). Overall, our study supports the feasibility of integrating LLMs into daily hospital workflow and introduces a paradigm for improving communication that benefits both patients and nurses. Chinese Clinical Trial Registry identifier: ChiCTR2300077245.
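The primary-endpoint comparison (3.91 ± 0.90 versus 3.39 ± 1.15) can be checked approximately from the summary statistics alone. The sketch below uses Welch's t-test and assumes the 2,164 participants were split evenly between arms; the authors' exact allocation and statistical test are not stated here.

```python
from scipy.stats import ttest_ind_from_stats

# Patient-satisfaction scores (mean ± SD) from the trial's primary endpoint.
# Arm sizes are an assumption: the 2,164 participants are split evenly here.
t_stat, p_value = ttest_ind_from_stats(
    mean1=3.91, std1=0.90, nobs1=1082,  # nurse-SSPEC collaboration arm
    mean2=3.39, std2=1.15, nobs2=1082,  # nurse-only arm
    equal_var=False,                    # Welch's t-test (unequal variances)
)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # p << 0.001, consistent with the report
```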
Affiliation(s)
- Peixing Wan
- State Key Laboratory of Respiratory Health and Multimorbidity, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Laboratory of Immune Cell Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Zigeng Huang
- State Key Laboratory of Respiratory Health and Multimorbidity, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Wenjun Tang
- State Key Laboratory of Respiratory Health and Multimorbidity, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yulan Nie
- Southern University of Science and Technology Yantian Hospital, Shenzhen, China
- Dajun Pei
- Renmin Hospital of Wuhan University, Wuhan, China
- Shaofen Deng
- Southern University of Science and Technology Yantian Hospital, Shenzhen, China
- Jing Chen
- Renmin Hospital of Wuhan University, Wuhan, China
- Yizhi Zhou
- Southern University of Science and Technology Yantian Hospital, Shenzhen, China
- Hongru Duan
- Southern University of Science and Technology Yantian Hospital, Shenzhen, China
- Qingyu Chen
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, USA
- Erping Long
- State Key Laboratory of Respiratory Health and Multimorbidity, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
3
Micali G, Corallo F, Pagano M, Giambò FM, Duca A, D’Aleo P, Anselmo A, Bramanti A, Garofano M, Mazzon E, Bramanti P, Cappadona I. Artificial Intelligence and Heart-Brain Connections: A Narrative Review on Algorithms Utilization in Clinical Practice. Healthcare (Basel) 2024; 12:1380. [PMID: 39057522] [PMCID: PMC11276532] [DOI: 10.3390/healthcare12141380]
Abstract
Cardiovascular and neurological diseases are major causes of mortality and morbidity worldwide. Such diseases require careful monitoring to effectively manage their progression. Artificial intelligence (AI) offers valuable tools for this purpose through its ability to analyse data and identify predictive patterns. This review evaluated the application of AI to cardiac and neurological diseases and its clinical impact on the general population. We reviewed studies on the application of AI in the neurological and cardiological fields, searching the PubMed, Web of Science, Embase and Cochrane Library databases. Of the initial 5862 studies, 23 met the inclusion criteria. These studies showed that the most commonly used algorithms in these clinical fields are Random Forest and artificial neural networks, followed by logistic regression and support vector machines. In addition, an ECG-AI algorithm based on convolutional neural networks has been developed and widely used in several studies for the detection of atrial fibrillation with good accuracy. AI has great potential to support physicians in interpretation, diagnosis, risk assessment and disease management.
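As a concrete illustration of the review's most frequently used algorithm, here is a minimal scikit-learn sketch of a Random Forest classifier on tabular features; the data are synthetic and none of the reviewed models is reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tabular clinical features (e.g., vitals, labs, ECG-derived measures).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

# AUC is a common headline metric in the clinical AI studies reviewed.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```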
Affiliation(s)
- Giuseppe Micali
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Francesco Corallo
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Maria Pagano
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Fabio Mauro Giambò
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Antonio Duca
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Piercataldo D’Aleo
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Anna Anselmo
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Alessia Bramanti
- Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
- Marina Garofano
- Department of Medicine, Surgery and Dentistry, University of Salerno, 84081 Baronissi, Italy
- Emanuela Mazzon
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Placido Bramanti
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
- Faculty of Psychology, Università degli Studi eCampus, Via Isimbardi 10, 22060 Novedrate, Italy
- Irene Cappadona
- IRCCS Centro Neurolesi Bonino-Pulejo, Via Palermo, S.S. 113, C.da Casazza, 98124 Messina, Italy
4
Meral G, Ateş S, Günay S, Öztürk A, Kuşdoğan M. Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment. Am J Emerg Med 2024; 81:146-150. [PMID: 38728938] [DOI: 10.1016/j.ajem.2024.05.001]
Abstract
INTRODUCTION The term artificial intelligence (AI) was coined in the 1950s, and the field has made significant progress up to the present day, producing numerous AI applications. GPT-4 and Gemini are two of the best known of these AI models. The Emergency Severity Index (ESI) is currently one of the most commonly used systems for effective patient triage in the emergency department. The aim of this study is to evaluate the performance of GPT-4, Gemini, and emergency medicine specialists against each other in ESI triage, and to contribute to the literature on the usability of these AI programs in emergency department triage. METHODS Our study was conducted between February 1, 2024, and February 29, 2024, with emergency medicine specialists in Turkey as well as GPT-4 and Gemini. Ten emergency medicine specialists were included; as a limitation, the participating specialists do not frequently use the ESI triage model in daily practice. In the first phase, 100 cases involving adult or trauma patients were drawn from the sample and training cases in the ESI Implementation Handbook. In the second phase, the responses given were categorized into three groups: correct triage, over-triage, and under-triage. In the third phase, the questions were categorized according to the correct triage responses. RESULTS A statistically significant difference was found between the three groups in terms of correct triage, over-triage, and under-triage (p < 0.001). GPT-4 had the highest correct-triage rate, with an average of 70.60 (±3.74), while Gemini had the highest over-triage rate, with an average of 35.2 (±2.93) (p < 0.001). The highest under-triage rate was observed among the emergency medicine specialists (32.90 (±11.83)). For ESI levels 1-2, Gemini achieved a correct-triage rate of 87.77%, GPT-4 85.11%, and the emergency medicine specialists 49.33%. CONCLUSION Our study shows that both GPT-4 and Gemini can accurately triage critical and urgent patients in ESI levels 1-2 at a high rate, and that GPT-4 was more successful in ESI triage across all patients. These results suggest that GPT-4 and Gemini could assist in accurate ESI triage of patients in emergency departments.
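The second-phase categorization follows directly from how ESI levels are ordered: level 1 is the most acute and level 5 the least. Below is a minimal sketch of that scoring rule, assuming a gold-standard ESI level per case; the (assigned, reference) pairs are hypothetical, not the study's cases.

```python
def categorize_triage(assigned: int, reference: int) -> str:
    """Compare an assigned ESI level against the gold-standard level.

    ESI runs from 1 (most urgent) to 5 (least urgent), so assigning a
    lower number than the reference constitutes over-triage.
    """
    if assigned == reference:
        return "correct"
    return "over-triage" if assigned < reference else "under-triage"

# Hypothetical (assigned, reference) pairs for one rater.
pairs = [(2, 2), (1, 2), (3, 2), (4, 4), (5, 4)]
counts = {"correct": 0, "over-triage": 0, "under-triage": 0}
for assigned, reference in pairs:
    counts[categorize_triage(assigned, reference)] += 1
print(counts)  # {'correct': 2, 'over-triage': 1, 'under-triage': 2}
```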
Affiliation(s)
- Gürbüz Meral
- Department of Emergency Medicine, Specialist in Emergency Medicine, Hitit University Çorum Erol Olçok Education and Research Hospital, Çorum, Turkey
- Serdal Ateş
- Department of Emergency Medicine, Specialist in Emergency Medicine, Hitit University Çorum Erol Olçok Education and Research Hospital, Çorum, Turkey
- Serkan Günay
- Department of Emergency Medicine, Specialist in Emergency Medicine, Hitit University Çorum Erol Olçok Education and Research Hospital, Çorum, Turkey
- Ahmet Öztürk
- Department of Emergency Medicine, Specialist in Emergency Medicine, Hitit University Çorum Erol Olçok Education and Research Hospital, Çorum, Turkey
- Mikail Kuşdoğan
- Department of Emergency Medicine, Specialist in Emergency Medicine, Hitit University Çorum Erol Olçok Education and Research Hospital, Çorum, Turkey
5
Li J, Dada A, Puladi B, Kleesiek J, Egger J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245:108013. [PMID: 38262126] [DOI: 10.1016/j.cmpb.2024.108013]
Abstract
The recent release of ChatGPT, a chatbot research project/product in natural language processing (NLP) from OpenAI, has stirred up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the 'productization' of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience of artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the status quo of ChatGPT in medical applications for general readers, healthcare professionals and NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword 'ChatGPT'. Inclusion criteria and a taxonomy are further proposed to filter the search results and categorize the selected publications. The review finds that the current release of ChatGPT achieves only moderate or 'passing' performance in a variety of tests and is unreliable for actual clinical deployment, since it is not designed for clinical applications. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction for critical clinical applications.
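The retrieval step described here, a PubMed keyword search for 'ChatGPT', can be scripted with Biopython's Entrez interface. This is a sketch of a plain keyword query; the review's search date and any filters are not reproduced, and the e-mail address is a placeholder (NCBI requires a contact address).

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder; NCBI requires a contact address

# Keyword search mirroring the review's retrieval strategy.
handle = Entrez.esearch(db="pubmed", term="ChatGPT", retmax=200)
record = Entrez.read(handle)
handle.close()

print(f"Total hits: {record['Count']}")
print("First PMIDs:", record["IdList"][:5])
```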
Affiliation(s)
- Jianning Li
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Amin Dada
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Behrus Puladi
- Institute of Medical Informatics, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany; Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
- Jens Kleesiek
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; TU Dortmund University, Department of Physics, Otto-Hahn-Straße 4, 44227 Dortmund, Germany
- Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), University Hospital Essen, University Medicine Essen, Hufelandstraße 55, 45147 Essen, Germany
6
Goh E, Bunning B, Khoong E, Gallo R, Milstein A, Centola D, Chen JH. ChatGPT Influence on Medical Decision-Making, Bias, and Equity: A Randomized Study of Clinicians Evaluating Clinical Vignettes. medRxiv 2023:2023.11.24.23298844. [PMID: 38076944] [PMCID: PMC10705632] [DOI: 10.1101/2023.11.24.23298844]
Abstract
In a randomized, pre-post intervention study, we evaluated the influence of a large language model (LLM) generative AI system on the accuracy of physician decision-making and on bias in healthcare. Fifty US-licensed physicians reviewed a video clinical vignette featuring actors of different demographics (a White male or a Black female) presenting with chest pain. Participants answered clinical questions about triage, risk, and treatment based on these vignettes, then were asked to reconsider after receiving advice generated by ChatGPT Plus (GPT-4). The primary outcome was the accuracy of clinical decisions against pre-established evidence-based guidelines. Results showed that physicians are willing to revise their initial clinical impressions given AI assistance, and that doing so significantly improved clinical decision-making accuracy in a chest pain evaluation scenario without introducing or exacerbating existing race or gender biases. A survey of the physician participants indicates that the majority expect LLM tools to play a significant role in clinical decision-making.
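The pre-post design reduces to a paired comparison of each physician's accuracy before and after seeing the LLM's advice. Below is a minimal sketch using a paired t-test on one accuracy score per physician per stage; the numbers are synthetic and the study's actual data and chosen test are not reproduced.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Synthetic per-physician accuracy (fraction of guideline-concordant answers),
# before and after reviewing GPT-4's advice; a modest improvement is simulated.
pre = rng.normal(loc=0.65, scale=0.10, size=50).clip(0, 1)
post = (pre + rng.normal(loc=0.08, scale=0.05, size=50)).clip(0, 1)

t_stat, p_value = ttest_rel(post, pre)
print(f"mean pre = {pre.mean():.2f}, mean post = {post.mean():.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.1e}")
```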
Affiliation(s)
- Ethan Goh
- Stanford Biomedical Informatics Research, Stanford University, Stanford, CA
- Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA
- Bryan Bunning
- Stanford Biomedical Informatics Research, Stanford University, Stanford, CA
- Elaine Khoong
- UCSF Center for Vulnerable Populations at San Francisco General Hospital, San Francisco, CA
- Robert Gallo
- Stanford Biomedical Informatics Research, Stanford University, Stanford, CA
- Center for Innovation to Implementation, VA Palo Alto Health Care System, Palo Alto, CA
- Arnold Milstein
- Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA
- Damon Centola
- Communication, Sociology and Engineering, University of Pennsylvania, Philadelphia, PA
- Jonathan H Chen
- Stanford Biomedical Informatics Research, Stanford University, Stanford, CA
- Division of Hospital Medicine, Stanford University, Stanford, CA
- Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA