1
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025; 333:319-328. PMID: 39405325; PMCID: PMC11480901; DOI: 10.1001/jama.2024.21700.
Abstract
Importance: Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.
Objective: To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.
Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.
Study Selection: Studies evaluating 1 or more LLMs in health care.
Data Extraction and Synthesis: Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.
Results: Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.
Conclusions and Relevance: Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
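As an illustration of the categorization step described under Data Extraction and Synthesis, the sketch below tags a study abstract with keyword matches along a few of the review's five components. The keyword lists and example abstract are hypothetical placeholders, not the review's actual coding scheme.

```python
# Illustrative keyword-based categorization of a study abstract.
# The keyword lists below are assumptions for demonstration only.
CATEGORY_KEYWORDS = {
    "health care task": ["diagnosis", "billing code", "prescription", "licensing examination"],
    "NLP/NLU task": ["question answering", "summarization", "conversational dialogue"],
    "evaluation dimension": ["accuracy", "fairness", "bias", "toxicity", "calibration"],
}

def categorize(abstract: str) -> dict[str, list[str]]:
    """Return, for each component, the keywords found in the abstract text."""
    text = abstract.lower()
    return {
        component: [kw for kw in keywords if kw in text]
        for component, keywords in CATEGORY_KEYWORDS.items()
    }

example = "We evaluate an LLM on question answering for diagnosis, reporting accuracy and bias."
print(categorize(example))
# {'health care task': ['diagnosis'], 'NLP/NLU task': ['question answering'],
#  'evaluation dimension': ['accuracy', 'bias']}
```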
Affiliation(s)
- Suhana Bedi
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
- Yutong Liu
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Lucy Orr-Ewing
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Dev Dash
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Sanmi Koyejo
  - Department of Computer Science, Stanford University, Stanford, California
- Alison Callahan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Jason A Fries
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Michael Wornow
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Akshay Swaminathan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Hyo Jung Hong
  - Department of Anesthesiology, Stanford University, Stanford, California
- Mehr Kashyap
  - Stanford University School of Medicine, Stanford, California
- Akash R Chaurasia
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Nirav R Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Karandeep Singh
  - Digital Health Innovation, University of California San Diego Health, San Diego
- Troy Tazbaz
  - Digital Health Center of Excellence, US Food and Drug Administration, Washington, DC
- Arnold Milstein
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Michael A Pfeffer
  - Department of Medicine, Stanford University School of Medicine, Stanford, California
- Nigam H Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
2
Pavone M, Palmieri L, Bizzarri N, Rosati A, Campolo F, Innocenzi C, Taliento C, Restaino S, Catena U, Vizzielli G, Akladios C, Ianieri MM, Marescaux J, Campo R, Fanfani F, Scambia G. Artificial Intelligence, the ChatGPT Large Language Model: Assessing the Accuracy of Responses to the Gynaecological Endoscopic Surgical Education and Assessment (GESEA) Level 1-2 Knowledge Tests. Facts Views Vis Obgyn 2024; 16:449-456. PMID: 39718328; DOI: 10.52054/fvvo.16.4.052.
Abstract
Background: In 2022, OpenAI launched ChatGPT 3.5, which is now widely used in medical education, training, and research. Despite its valuable use for the generation of information, concerns persist about its authenticity and accuracy. Its undisclosed information source and outdated dataset pose risks of misinformation. Although it is widely used, AI-generated text inaccuracies raise doubts about its reliability. The ethical use of such technologies is crucial to uphold scientific accuracy in research.
Objective: This study aimed to assess the accuracy of ChatGPT in completing GESEA tests 1 and 2.
Materials and Methods: The 100 multiple-choice theoretical questions from GESEA certifications 1 and 2 were presented to ChatGPT, requesting the selection of the correct answer along with an explanation. Expert gynaecologists evaluated and graded the explanations for accuracy.
Main Outcome Measures: ChatGPT showed 59% accuracy in responses, with 64% providing comprehensive explanations. It performed better on GESEA Level 1 questions (64% accuracy) than on GESEA Level 2 questions (54% accuracy).
Conclusions: ChatGPT is a versatile tool in medicine and research, offering knowledge and information and promoting evidence-based practice. Despite its widespread use, its accuracy has not yet been validated. This study found a 59% correct response rate, highlighting the need for accuracy validation and ethical use considerations. Future research should investigate ChatGPT's truthfulness in subspecialty fields such as gynaecologic oncology and compare different versions of the chatbot for continuous improvement.
What Is New? Artificial intelligence (AI) has great potential in scientific research, but the validity of its outputs remains unverified. This study evaluates the accuracy of responses generated by ChatGPT to encourage critical use of this tool.
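A minimal sketch of the accuracy tally reported above, assuming an even 50/50 split of the 100 questions across GESEA Levels 1 and 2 (the split is not stated in the abstract). The correct/incorrect vectors are placeholders chosen only to reproduce the reported 64%, 54%, and 59% figures.

```python
# Toy reconstruction of overall and per-level accuracy; data are illustrative.
def accuracy(results: list[bool]) -> float:
    """Proportion of items answered correctly."""
    return sum(results) / len(results)

level1 = [True] * 32 + [False] * 18   # assumed 50 Level 1 items, 32 correct
level2 = [True] * 27 + [False] * 23   # assumed 50 Level 2 items, 27 correct

print(f"Level 1: {accuracy(level1):.0%}")            # 64%
print(f"Level 2: {accuracy(level2):.0%}")            # 54%
print(f"Overall: {accuracy(level1 + level2):.0%}")   # 59%
```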
3
Kıyak YS, Coşkun AK, Kaymak Ş, Coşkun Ö, Budakoğlu Iİ. Can ChatGPT generate surgical multiple-choice questions comparable to those written by a surgeon? Proc AMIA Symp 2024; 38:48-52. PMID: 39712415; PMCID: PMC11657069; DOI: 10.1080/08998280.2024.2418752.
Abstract
Background: This study aimed to determine whether surgical multiple-choice questions generated by ChatGPT are comparable to those written by human experts (surgeons).
Methods: The study was conducted at a medical school and involved 112 fourth-year medical students. Based on five learning objectives in general surgery (colorectal, gastric, trauma, breast, thyroid), ChatGPT and surgeons each generated five multiple-choice questions. No changes were made to the ChatGPT-generated questions. The statistical properties of these questions, including correlations between the two groups of questions and correlations with total scores (item discrimination) in a general surgery clerkship exam, were reported.
Results: There was a significant positive correlation between the ChatGPT-generated and human-written questions for one learning objective (colorectal). More importantly, only one ChatGPT-generated question (colorectal) achieved an acceptable discrimination level, while the other four failed to achieve it. In contrast, the human-written questions showed acceptable discrimination levels.
Conclusion: While ChatGPT has the potential to generate multiple-choice questions comparable to human-written ones in specific contexts, the variability across surgical topics points to the need for human oversight and review before their use in exams. It is important to integrate artificial intelligence tools like ChatGPT with human expertise to enhance efficiency and quality.
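The item-discrimination analysis mentioned above is commonly computed as a point-biserial correlation between correctness on a single item and examinees' total exam scores. Below is a hedged sketch using simulated responses for a 112-student cohort; the 0.20 acceptability threshold is a common rule of thumb and may differ from the cutoff the authors applied.

```python
# Sketch of item discrimination via point-biserial correlation (simulated data).
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
n_students = 112                                   # cohort size reported in the study
total_scores = rng.normal(70, 10, n_students)      # simulated total exam scores
# Simulate one item whose chance of being answered correctly rises with ability.
item_correct = (rng.random(n_students) < (total_scores - 40) / 60).astype(int)

r, p = pointbiserialr(item_correct, total_scores)
print(f"discrimination r = {r:.2f}, p = {p:.3f}")
print("acceptable" if r >= 0.20 else "needs review")
```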
Affiliation(s)
- Yavuz Selim Kıyak
  - Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey
- Ali Kağan Coşkun
  - Department of General Surgery, UHS Gulhane School of Medicine, Ankara, Turkey
- Şahin Kaymak
  - Department of General Surgery, UHS Gulhane School of Medicine, Ankara, Turkey
- Özlem Coşkun
  - Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey
- Işıl İrem Budakoğlu
  - Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey
4
Lim B, Seth I, Xie Y, Kenney PS, Cuomo R, Rozen WM. Exploring the Unknown: Evaluating ChatGPT's Performance in Uncovering Novel Aspects of Plastic Surgery and Identifying Areas for Future Innovation. Aesthetic Plast Surg 2024; 48:2580-2589. PMID: 38528129; PMCID: PMC11239602; DOI: 10.1007/s00266-024-03952-z.
Abstract
BACKGROUND: Artificial intelligence (AI) has emerged as a powerful tool in various medical fields, including plastic surgery. This study aims to evaluate the performance of ChatGPT, an AI language model, in elucidating historical aspects of plastic surgery and identifying potential avenues for innovation.
METHODS: A comprehensive analysis of ChatGPT's responses to a diverse range of plastic surgery-related inquiries was performed. The quality of the AI-generated responses was assessed based on their relevance, accuracy, and novelty. Additionally, the study examined the AI's ability to recognize gaps in existing knowledge and propose innovative solutions. ChatGPT's responses were reviewed by specialist plastic surgeons with extensive research experience and analysed quantitatively with a Likert scale.
RESULTS: ChatGPT demonstrated a high degree of proficiency in addressing a wide array of plastic surgery-related topics. The AI-generated responses were found to be relevant and accurate in most cases. However, it demonstrated convergent thinking and failed to generate genuinely novel ideas to revolutionize plastic surgery. Instead, it suggested currently popular trends that show great potential for further advancement. Some of the references presented were also erroneous, as they could not be validated against the existing literature.
CONCLUSION: Although ChatGPT requires major improvements, this study highlights its potential as an effective tool for uncovering novel aspects of plastic surgery and identifying areas for future innovation. By leveraging the capabilities of AI language models, plastic surgeons may drive advancements in the field. Further studies are needed to cautiously explore the integration of AI-driven insights into clinical practice and to evaluate their impact on patient outcomes.
LEVEL OF EVIDENCE V: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors at www.springer.com/00266.
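For context on the quantitative Likert-scale analysis described in the Methods, the short sketch below summarizes expert ratings per criterion (relevance, accuracy, novelty) by mean and standard deviation. The rating values themselves are invented for illustration and are not the study's data.

```python
# Summarize hypothetical Likert ratings (1-5) per assessment criterion.
from statistics import mean, stdev

ratings = {
    "relevance": [5, 4, 5, 4, 5],
    "accuracy":  [4, 4, 3, 5, 4],
    "novelty":   [2, 3, 2, 3, 2],
}

for criterion, scores in ratings.items():
    print(f"{criterion}: mean={mean(scores):.1f}, sd={stdev(scores):.2f}")
```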
Affiliation(s)
- Bryan Lim
  - Department of Plastic Surgery, Peninsula Health, Melbourne, VIC, 3199, Australia
  - Central Clinical School, Monash University, The Alfred Centre, 99 Commercial Rd, Melbourne, VIC, 3004, Australia
- Ishith Seth
  - Department of Plastic Surgery, Peninsula Health, Melbourne, VIC, 3199, Australia
  - Central Clinical School, Monash University, The Alfred Centre, 99 Commercial Rd, Melbourne, VIC, 3004, Australia
- Yi Xie
  - Department of Plastic Surgery, Peninsula Health, Melbourne, VIC, 3199, Australia
- Peter Sinkjaer Kenney
  - Department of Plastic Surgery, Odense University Hospital, J. B. Winsløwsvej 4, 5000, Odense, Denmark
  - Department of Plastic and Breast Surgery, Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, 8200, Aarhus, Denmark
- Roberto Cuomo
  - Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, 53100, Siena, Italy
- Warren M Rozen
  - Department of Plastic Surgery, Peninsula Health, Melbourne, VIC, 3199, Australia
  - Central Clinical School, Monash University, The Alfred Centre, 99 Commercial Rd, Melbourne, VIC, 3004, Australia