1
Iscoe M, Socrates V, Gilson A, Chi L, Li H, Huang T, Kearns T, Perkins R, Khandjian L, Taylor RA. Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models. Acad Emerg Med 2024. PMID: 38567658; DOI: 10.1111/acem.14883.
Abstract
BACKGROUND: Natural language processing (NLP) tools, including recently developed large language models (LLMs), have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables not typically recorded in structured form, such as patient signs and symptoms.
OBJECTIVES: This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes.
METHODS: The study population consisted of patients aged ≥ 18 years who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific language models to perform named entity recognition: a convolutional neural network-based model (spaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level.
RESULTS: A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. The overall F1 measure for note-level symptom identification, weighted by entity frequency, was 0.84 for the spaCy model and 0.88 for the Longformer model. The F1 measure for identifying the presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the spaCy model and 0.98 (240/250 correctly classified) for the Longformer model.
CONCLUSIONS: The study demonstrated the utility of language models, and transformer-based models in particular, for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom at the note level, with variable performance for individual symptoms.
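The note-level evaluation described above can be illustrated with a short sketch. This is not the authors' code; the symptom labels and counts below are invented for illustration, but the precision/recall/F1 arithmetic matches the standard definitions the abstract refers to.

```python
# Hypothetical sketch of note-level symptom evaluation: each note is reduced
# to the set of symptom labels found in it, and precision/recall/F1 are
# computed over those sets across all notes.

def note_level_prf1(gold, pred):
    """gold, pred: lists of sets of symptom labels, one set per note."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # symptoms found in both annotation and prediction
        fp += len(p - g)   # predicted symptoms not annotated
        fn += len(g - p)   # annotated symptoms the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: three notes, invented symptom labels.
gold = [{"dysuria", "frequency"}, {"flank pain"}, set()]
pred = [{"dysuria"}, {"flank pain"}, {"urgency"}]
p, r, f1 = note_level_prf1(gold, pred)
```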
Affiliation(s)
- Mark Iscoe
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Aidan Gilson
- Yale School of Medicine, New Haven, Connecticut, USA
- Ling Chi
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Huan Li
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Thomas Huang
- Yale School of Medicine, New Haven, Connecticut, USA
- Thomas Kearns
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Rachelle Perkins
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Laura Khandjian
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- R Andrew Taylor
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
2
Safranek CW, Huang T, Wright DS, Wright CX, Socrates V, Sangal RB, Iscoe M, Chartash D, Taylor RA. Automated HEART score determination via ChatGPT: Honing a framework for iterative prompt development. J Am Coll Emerg Physicians Open 2024; 5:e13133. PMID: 38481520; PMCID: PMC10936537; DOI: 10.1002/emp2.13133.
Abstract
Objectives: This study presents a design framework to enhance the accuracy with which large language models (LLMs), like ChatGPT, can extract insights from clinical notes. We highlight this framework via prompt refinement for the automated determination of HEART (History, ECG, Age, Risk factors, Troponin) risk scores in chest pain evaluation.
Methods: We developed a pipeline for LLM prompt testing, employing stochastic repeat testing and quantifying response errors relative to physician assessment. We evaluated the pipeline for automated HEART score determination across a limited set of 24 synthetic clinical notes representing four simulated patients. To assess whether iterative prompt design could improve the LLMs' ability to extract complex clinical concepts and apply rule-based logic to translate them into HEART subscores, we monitored diagnostic performance during prompt iteration.
Results: Validation included three iterative rounds of prompt improvement for three HEART subscores, with 25 repeat trials totaling 1200 queries each for GPT-3.5 and GPT-4. For both LLMs, the rate of responses with erroneous, non-numerical subscore answers decreased from the initial to the final prompt design. Accuracy of numerical responses for HEART subscores (discrete 0-2 point scale) improved for GPT-4 from the initial to the final prompt iteration, with mean error decreasing from 0.16 to 0.10 (95% confidence interval: 0.07-0.14) points.
Conclusion: We established a framework for iterative prompt design in the clinical space. Although the results indicate potential for integrating LLMs into structured clinical note analysis, translation to real, large-scale clinical data with appropriate data privacy safeguards is needed.
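The stochastic repeat-testing idea described above can be sketched as follows. This is an illustration, not the paper's pipeline: the response-parsing rule, the stubbed query function, and the toy responses are all assumptions; only the overall shape (repeat queries, count non-numerical answers, measure error against a reference score) follows the abstract.

```python
import re
import statistics

def parse_subscore(text):
    """Extract a discrete 0-2 subscore from a model response; None if absent.
    (Parsing rule is an assumption for illustration.)"""
    m = re.search(r"\b([0-2])\b", text)
    return int(m.group(1)) if m else None

def repeat_trial(query_fn, prompt, truth, n_repeats=25):
    """Run n_repeats stochastic queries; report the non-numerical answer rate
    and the mean absolute error of numerical answers vs. the reference score."""
    abs_errors, non_numeric = [], 0
    for _ in range(n_repeats):
        score = parse_subscore(query_fn(prompt))
        if score is None:
            non_numeric += 1
        else:
            abs_errors.append(abs(score - truth))
    rate = non_numeric / n_repeats
    mae = statistics.mean(abs_errors) if abs_errors else float("nan")
    return rate, mae

# Stub standing in for an LLM call (the study used GPT-3.5 and GPT-4):
responses = iter(["Score: 2"] * 20 + ["unable to determine"] * 5)
rate, mae = repeat_trial(lambda p: next(responses), "HEART history subscore?", truth=1)
```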
Affiliation(s)
- Conrad W. Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Thomas Huang
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Donald S. Wright
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
- Catherine X. Wright
- Department of Cardiovascular Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Rohit B. Sangal
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
- Mark Iscoe
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Republic of Ireland
- R. Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, Connecticut, USA
3
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. Correction: How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2024; 10:e57594. PMID: 38412478; PMCID: PMC10933712; DOI: 10.2196/57594.
Abstract
[This corrects the article DOI: 10.2196/45312.]
Affiliation(s)
- Aidan Gilson
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Conrad W Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Thomas Huang
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Ling Chi
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Ireland
4
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. Authors' Reply to: Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. JMIR Med Educ 2023; 9:e50336. PMID: 37440299; PMCID: PMC10375396; DOI: 10.2196/50336.
Affiliation(s)
- Aidan Gilson
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Conrad W Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Thomas Huang
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Ling Chi
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Ireland
5
Socrates V, Gilson A, Lopez K, Chi L, Taylor RA, Chartash D. Predicting relations between SOAP note sections: The value of incorporating a clinical information model. J Biomed Inform 2023; 141:104360. PMID: 37061014; PMCID: PMC10197152; DOI: 10.1016/j.jbi.2023.104360.
Abstract
Physician progress notes are frequently organized into Subjective, Objective, Assessment, and Plan (SOAP) sections. The Assessment section synthesizes information recorded in the Subjective and Objective sections, and the Plan section documents tests and treatments to narrow the differential diagnosis and manage symptoms. Classifying the relationship between the Assessment and Plan sections has been suggested to provide valuable insight into clinical reasoning. In this work, we use a novel human-in-the-loop pipeline to classify the relationships between the Assessment and Plan sections of SOAP notes as part of the n2c2 2022 Track 3 Challenge. In particular, we use a clinical information model constructed from both the entailment logic expected by the Challenge and the problem-oriented medical record. This information model is used to label named entities as primary and secondary problems/symptoms, events, and complications in all four SOAP sections. We iteratively train separate named entity recognition models and use them to annotate entities in all notes/sections. We fine-tune a downstream RoBERTa-large model to classify the Assessment-Plan relationship. We evaluate multiple language model architectures, preprocessing parameters, and methods of knowledge integration, achieving a maximum macro-F1 score of 82.31%. Our initial model achieved top-2 performance during the Challenge (macro-F1: 81.52%; competitors' macro-F1 range: 74.54%-82.12%). We improved our model by incorporating post-challenge annotations of the Subjective and Objective sections, outperforming the top model from the Challenge. We also used Shapley additive explanations to investigate the extent of the language model's clinical logic under the lens of our clinical information model. We find that the model often relies on shallow heuristics and nonspecific attention when making predictions, suggesting that language model knowledge integration requires further research.
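The macro-F1 metric reported above treats every relation label equally, regardless of frequency. A minimal sketch of how such a score is computed (the relation label names below are assumptions for illustration, not necessarily the Challenge's exact label set):

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over a fixed label set."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy Assessment-Plan relation predictions (label names are illustrative):
gold = ["direct", "indirect", "neither", "direct"]
pred = ["direct", "neither", "neither", "direct"]
score = macro_f1(gold, pred, ["direct", "indirect", "neither"])
```

Because rare labels weigh as much as common ones, a model that ignores a minority relation class is penalized heavily, which is why the metric is a common choice for imbalanced relation-classification tasks like this one.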
Affiliation(s)
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George St, New Haven, 06511, USA; Department of Emergency Medicine, Yale University School of Medicine, 464 Congress Ave #260, New Haven, 06519, USA; Program of Computational Biology and Bioinformatics, Yale University, 300 George St, New Haven, 06511, USA.
- Aidan Gilson
- Department of Emergency Medicine, Yale University School of Medicine, 464 Congress Ave #260, New Haven, 06519, USA.
- Kevin Lopez
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George St, New Haven, 06511, USA; Department of Emergency Medicine, Yale University School of Medicine, 464 Congress Ave #260, New Haven, 06519, USA.
- Ling Chi
- Department of Emergency Medicine, Yale University School of Medicine, 464 Congress Ave #260, New Haven, 06519, USA.
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George St, New Haven, 06511, USA; Department of Emergency Medicine, Yale University School of Medicine, 464 Congress Ave #260, New Haven, 06519, USA.
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George St, New Haven, 06511, USA; School of Medicine, University College Dublin - National University of Ireland, Dublin, Health Sciences Centre, Belfield, Dublin 4, Ireland.
6
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023; 9:e45312. PMID: 36753318; PMCID: PMC9947764; DOI: 10.2196/45312.
Abstract
BACKGROUND: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.
OBJECTIVE: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.
METHODS: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and performance relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to that of 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.
RESULTS: Across the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT's answer selection was present in 100% of outputs on the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
CONCLUSIONS: ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above a 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context in the majority of its answers. Taken together, these findings make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.
Affiliation(s)
- Aidan Gilson
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Conrad W Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Thomas Huang
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Ling Chi
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Ireland
7
Melnick ER, Ong SY, Fong A, Socrates V, Ratwani RM, Nath B, Simonov M, Salgia A, Williams B, Marchalik D, Goldstein R, Sinsky CA. Characterizing physician EHR use with vendor derived data: a feasibility study and cross-sectional analysis. J Am Med Inform Assoc 2021; 28:1383-1392. PMID: 33822970; PMCID: PMC8279798; DOI: 10.1093/jamia/ocab011.
Abstract
OBJECTIVE: To derive 7 proposed core electronic health record (EHR) use metrics across 2 healthcare systems with different EHR vendor product installations and examine factors associated with EHR time.
MATERIALS AND METHODS: A cross-sectional analysis of ambulatory physicians' EHR use across the Yale-New Haven and MedStar Health systems was performed for August 2019 using 7 proposed core EHR use metrics normalized to 8 hours of patient scheduled time.
RESULTS: Five of the 7 proposed metrics could be measured in a population of nonteaching, exclusively ambulatory physicians. Among 573 physicians (Yale-New Haven N = 290, MedStar N = 283) in the analysis, median EHR-Time8 was 5.23 hours. Gender, additional clinical hours scheduled, and certain medical specialties were associated with EHR-Time8 after adjusting for age and health system on multivariable analysis. For every 8 hours of scheduled patient time, the model predicted the following differences in EHR time (P < .001 unless otherwise indicated): female physicians +0.58 hours; each additional clinical hour scheduled per month -0.01 hours; practicing cardiology -1.30 hours; medical subspecialties -0.89 hours (except gastroenterology, P = .002); neurology/psychiatry -2.60 hours; obstetrics/gynecology -1.88 hours; pediatrics -1.05 hours (P = .001); sports/physical medicine and rehabilitation -3.25 hours; and surgical specialties -3.65 hours.
CONCLUSIONS: For every 8 hours of scheduled patient time, ambulatory physicians spend more than 5 hours on the EHR. Physician gender, specialty, and number of clinical hours practicing are associated with differences in EHR time. While audit logs remain a powerful tool for understanding physician EHR use, additional transparency, granularity, and standardization of vendor-derived EHR use data definitions are necessary to standardize EHR use measurement.
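The normalization described above, "EHR time per 8 hours of patient scheduled time", can be sketched as a simple rescaling. The formula below is an assumption based on the stated normalization, not taken from the paper's methods, and the example numbers are invented:

```python
def ehr_time8(ehr_hours, scheduled_hours):
    """Normalize observed EHR time to an 8-hour block of scheduled patient
    time (illustrative interpretation of the EHR-Time8 metric above)."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    return ehr_hours * 8.0 / scheduled_hours

# A hypothetical physician logging 13 EHR hours across 20 scheduled hours:
t8 = ehr_time8(13.0, 20.0)
```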
Affiliation(s)
- Edward R Melnick
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Shawn Y Ong
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Allan Fong
- MedStar Health National Center for Human Factors in Healthcare, Washington, DC, USA
- Vimig Socrates
- Computational Biology and Bioinformatics, Yale School of Medicine, New Haven, Connecticut, USA
- Raj M Ratwani
- MedStar Health National Center for Human Factors in Healthcare, Washington, DC, USA
- Bidisha Nath
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Michael Simonov
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Anup Salgia
- Northeast Ohio Medical University and Cerner Corporation, Kansas City, Missouri, USA
- Brian Williams
- Northeast Medical Group, Yale-New Haven Health, Stratford, Connecticut, USA
- Richard Goldstein
- Northeast Medical Group, Yale-New Haven Health, Stratford, Connecticut, USA
- Christine A Sinsky
- Professional Satisfaction and Practice Sustainability, American Medical Association, Chicago, Illinois, USA
8
Socrates V, Gershon A, Sahoo SS. Computation of Brain Functional Connectivity Network Measures in Epilepsy: A Web-Based Platform for EEG Signal Data Processing and Analysis. Stud Health Technol Inform 2019; 264:1590-1591. PMID: 31438246; DOI: 10.3233/shti190549.
Abstract
Epilepsy is a serious neurological disorder, characterized by repeated seizures, that affects nearly 60 million individuals worldwide. Graph theoretic approaches have been developed to analyze the functional brain networks that underpin the epileptogenic network. We have developed a Web-based application that enables neuroscientists to process high-resolution stereotactic electroencephalogram (SEEG) signal data and compute various signal coupling measures through an intuitive user interface for the study of epilepsy seizure networks. Results of a systematic evaluation of this new application show that it scales with increasing volumes of signal data.
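One of the simplest signal coupling measures a platform like this might compute between two EEG channels is the Pearson correlation of their samples. A minimal sketch (the channel data below are invented; the platform described above supports multiple coupling measures, not necessarily this one):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length signal traces,
    a basic pairwise coupling measure for functional connectivity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two toy channels; ch2 is ch1 shifted by a constant, so coupling is perfect.
ch1 = [0.1, 0.5, 0.2, 0.8, 0.4]
ch2 = [0.2, 0.6, 0.3, 0.9, 0.5]
r = pearson(ch1, ch2)
```

Computing such a measure for every channel pair yields an adjacency matrix, on which the graph-theoretic network measures mentioned above can then be evaluated.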
Affiliation(s)
- Vimig Socrates
- Department of Population & Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
- Arthur Gershon
- Department of Population & Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- Satya S Sahoo
- Department of Population & Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
9
Valdez J, Kim M, Rueschman M, Socrates V, Redline S, Sahoo SS. ProvCaRe Semantic Provenance Knowledgebase: Evaluating Scientific Reproducibility of Research Studies. AMIA Annu Symp Proc 2018; 2017:1705-1714. PMID: 29854241; PMCID: PMC5977728.
Abstract
Scientific reproducibility is critical for biomedical research, as it enables us to advance science by building on previous results, helps ensure the success of increasingly expensive drug trials, and allows funding agencies to make informed decisions. However, there is a growing "crisis" of reproducibility: a recent Nature survey of more than 1500 researchers found that 70% were unable to replicate results from other research groups and more than 50% were unable to reproduce their own results. In 2016, the National Institutes of Health (NIH) announced the "Rigor and Reproducibility" guidelines to support reproducibility in biomedical research. A key component of these guidelines is the recording and analysis of "provenance" information, which describes the origin or history of data and plays a central role in ensuring scientific reproducibility. As part of the NIH Big Data to Knowledge (BD2K)-funded data provenance project, we have developed a new informatics framework called Provenance for Clinical and Healthcare Research (ProvCaRe) to extract, model, and analyze provenance information from published literature describing research studies. Using sleep medicine research studies that have made their data available through the National Sleep Research Resource (NSRR), we developed an automated pipeline to identify and extract provenance metadata from published literature, which is made available for analysis in the ProvCaRe knowledgebase. NSRR is the largest repository of sleep data, covering over 40,000 studies involving 36,000 participants, and we used 75 published articles describing 6 research studies to populate the ProvCaRe knowledgebase. We evaluated the ProvCaRe knowledgebase, comprising 28,474 "provenance triples," using hypothesis-driven queries to identify and rank research studies based on the provenance information extracted from published articles.
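A "provenance triple" is a subject-predicate-object statement about how a study's data came to be, and a knowledgebase of such triples can be queried by pattern matching. A toy sketch in that spirit (the triples, predicate names, and query interface below are illustrative assumptions, not ProvCaRe's actual schema or API):

```python
# A tiny in-memory store of illustrative provenance triples.
triples = [
    ("Study1", "usedDataFrom", "NSRR"),
    ("Study1", "hasSampleSize", "500"),
    ("Study2", "usedDataFrom", "NSRR"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching an optional (subject, predicate, object)
    pattern; None acts as a wildcard, as in triple-store pattern queries."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Hypothesis-driven query: which studies drew their data from NSRR?
nsrr_studies = {s for s, _, _ in query(predicate="usedDataFrom", obj="NSRR")}
```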
Affiliation(s)
- Joshua Valdez
- Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH
- Matthew Kim
- Department of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Harvard University, Boston, MA
- Michael Rueschman
- Department of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Harvard University, Boston, MA
- Vimig Socrates
- Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH
- Susan Redline
- Department of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Harvard University, Boston, MA
- Satya S Sahoo
- Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH