1
van de Burgt BWM, Wasylewicz ATM, Dullemond B, Jessurun NT, Grouls RJE, Bouwman RA, Korsten EHM, Egberts TCG. Development of a text mining algorithm for identifying adverse drug reactions in electronic health records. JAMIA Open 2024; 7:ooae070. [PMID: 39156048 PMCID: PMC11328534 DOI: 10.1093/jamiaopen/ooae070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2024] [Revised: 07/03/2024] [Accepted: 08/13/2024] [Indexed: 08/20/2024] Open
Abstract
Objective Adverse drug reactions (ADRs) are a significant healthcare concern. They are often documented as free text in electronic health records (EHRs), making them challenging to use in clinical decision support systems (CDSS). The study aimed to develop a text mining algorithm to identify ADRs in free text of Dutch EHRs. Materials and Methods In Phase I, our previously developed CDSS algorithm was recoded and improved upon with the same relatively large dataset of 35 000 notes (Step A), using R to identify possible ADRs with Medical Dictionary for Regulatory Activities (MedDRA) terms and the related Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) (Step B). In Phase II, 6 existing text-mining R-scripts were used to detect and present unique ADRs, and positive predictive value (PPV) and sensitivity were observed. Results In Phase IA, the recoded algorithm performed better than the previously developed CDSS algorithm, resulting in a PPV of 13% and a sensitivity of 93%. The sensitivity for serious ADRs was 95%. The algorithm identified 58 additional possible ADRs. In Phase IB, the algorithm achieved a PPV of 10%, a sensitivity of 86%, and an F-measure of 0.18. In Phase II, four R-scripts enhanced the sensitivity and PPV of the algorithm, resulting in a PPV of 70%, a sensitivity of 73%, an F-measure of 0.71, and a 63% sensitivity for serious ADRs. Discussion and Conclusion The recoded Dutch algorithm effectively identifies ADRs from free-text Dutch EHRs using R-scripts and MedDRA/SNOMED-CT. The study details its limitations, highlighting the algorithm's potential and significant improvements.
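The F-measures reported in this abstract can be related to the stated PPV and sensitivity values. The sketch below assumes the abstract's F-measure is the balanced F1 score (the harmonic mean of PPV and sensitivity), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(ppv: float, sensitivity: float) -> float:
    """Balanced F1: harmonic mean of PPV (precision) and sensitivity (recall).

    Assumption: the abstract's "F-measure" is F1; the paper may define it differently.
    """
    if ppv + sensitivity == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Values taken from the abstract (Phase IB and Phase II).
phase_ib = f_measure(0.10, 0.86)
phase_ii = f_measure(0.70, 0.73)
print(round(phase_ib, 2), round(phase_ii, 2))
```

Under this assumption, both rounded values agree with the figures quoted in the abstract (0.18 and 0.71).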
Affiliation(s)
- Britt W M van de Burgt
  - Division of Clinical Pharmacy, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
  - Division Healthcare Intelligence, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
  - Department of Electrical Engineering, Signal Processing Group, Technical University Eindhoven, 5612 AP Eindhoven, The Netherlands
- Arthur T M Wasylewicz
  - Division Healthcare Intelligence, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
- Bjorn Dullemond
  - Department of Mathematics and Computer Science, Technical University Eindhoven, 5612 AP Eindhoven, The Netherlands
- Naomi T Jessurun
  - Netherlands Pharmacovigilance Centre LAREB, 5237 MH 's-Hertogenbosch, The Netherlands
- Rene J E Grouls
  - Division of Clinical Pharmacy, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
- R Arthur Bouwman
  - Department of Electrical Engineering, Signal Processing Group, Technical University Eindhoven, 5612 AP Eindhoven, The Netherlands
  - Department of Anesthesiology, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
- Erik H M Korsten
  - Division Healthcare Intelligence, Catharina Hospital Eindhoven, 5623 EJ Eindhoven, The Netherlands
  - Department of Electrical Engineering, Signal Processing Group, Technical University Eindhoven, 5612 AP Eindhoven, The Netherlands
- Toine C G Egberts
  - Department of Clinical Pharmacy, University Medical Centre Utrecht, 3584 CX Utrecht, The Netherlands
  - Department of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, 3584 CX Utrecht, The Netherlands
2
Seinen TM, Kors JA, van Mulligen EM, Fridgeirsson EA, Verhamme KM, Rijnbeek PR. Using clinical text to refine unspecific condition codes in Dutch general practitioner EHR data. Int J Med Inform 2024; 189:105506. [PMID: 38820647 DOI: 10.1016/j.ijmedinf.2024.105506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 05/22/2024] [Accepted: 05/27/2024] [Indexed: 06/02/2024]
Abstract
OBJECTIVE Observational studies using electronic health record (EHR) databases often face challenges due to unspecific clinical codes that can obscure detailed medical information, hindering precise data analysis. In this study, we aimed to assess the feasibility of refining these unspecific condition codes into more specific codes in a Dutch general practitioner (GP) EHR database by leveraging the available clinical free text. METHODS We utilized three approaches for text classification (search queries, semi-supervised learning, and supervised learning) to improve the specificity of ten unspecific International Classification of Primary Care (ICPC-1) codes. Two text representations and three machine learning algorithms were evaluated for the (semi-)supervised models. Additionally, we measured the improvement achieved by the refinement process on all code occurrences in the database. RESULTS The classification models performed well for most codes. In general, no single classification approach consistently outperformed the others. However, there were variations in the relative performance of the classification approaches within each code and in the use of different text representations and machine learning algorithms. Class imbalance and limited training data affected the performance of the (semi-)supervised models, yet the simple search queries remained particularly effective. Ultimately, the developed models improved the specificity of over half of all the unspecific code occurrences in the database. CONCLUSIONS Our findings show the feasibility of using information from clinical text to improve the specificity of unspecific condition codes in observational healthcare databases, even with a limited range of machine-learning techniques and modest annotated training sets. Future work could investigate transfer learning, integration of structured data, alternative semi-supervised methods, and validation of models across healthcare settings. The improved level of detail enriches the interpretation of medical information and can benefit observational research and patient care.
Affiliation(s)
- Tom M Seinen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Erik M van Mulligen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Egill A Fridgeirsson
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Katia M C Verhamme
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
3
Wieland-Jorna Y, van Kooten D, Verheij RA, de Man Y, Francke AL, Oosterveld-Vlug MG. Natural language processing systems for extracting information from electronic health records about activities of daily living. A systematic review. JAMIA Open 2024; 7:ooae044. [PMID: 38798774 PMCID: PMC11126158 DOI: 10.1093/jamiaopen/ooae044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 03/21/2024] [Accepted: 05/07/2024] [Indexed: 05/29/2024] Open
Abstract
Objective Natural language processing (NLP) can enhance research on activities of daily living (ADL) by extracting structured information from unstructured electronic health record (EHR) notes. This review aims to give insight into the state of the art, usability, and performance of NLP systems that extract information on ADL from EHRs. Materials and Methods A systematic review was conducted based on searches in PubMed, Embase, CINAHL, Web of Science, and Scopus. Studies published between 2017 and 2022 were selected based on predefined eligibility criteria. Results The review identified 22 studies. Most studies (65%) used NLP for classifying unstructured EHR data on 1 or 2 ADL. Deep learning, combined with a rule-based method or machine learning, was the approach most commonly used. NLP systems varied widely in terms of pre-processing and algorithms. Common performance evaluation methods were cross-validation and train/test datasets, with F1, precision, and sensitivity as the most frequently reported evaluation metrics. Most studies reported relatively high overall scores on the evaluation metrics. Discussion NLP systems are valuable for the extraction of unstructured EHR data on ADL. However, comparing the performance of NLP systems is difficult due to the diversity of the studies and challenges related to the dataset, including restricted access to EHR data, inadequate documentation, lack of granularity, and small datasets. Conclusion This systematic review indicates that NLP is promising for deriving information on ADL from unstructured EHR notes. However, which NLP system performs best depends on the characteristics of the dataset, the research question, and the type of ADL.
Affiliation(s)
- Yvonne Wieland-Jorna
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
  - Tranzo, School of Social Sciences and Behavioural Research, Tilburg University, Tilburg, Postbus 90153, 5000 LE, The Netherlands
- Daan van Kooten
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
- Robert A Verheij
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
  - Tranzo, School of Social Sciences and Behavioural Research, Tilburg University, Tilburg, Postbus 90153, 5000 LE, The Netherlands
- Yvonne de Man
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
- Anneke L Francke
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
  - Department of Public and Occupational Health, Location Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, Postbus 7057, 1007 MB, The Netherlands
- Mariska G Oosterveld-Vlug
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, Postbus 1568, 3500 BN, The Netherlands
4
Khan L, Shahreen M, Qazi A, Jamil Ahmed Shah S, Hussain S, Chang HT. Migraine headache (MH) classification using machine learning methods with data augmentation. Sci Rep 2024; 14:5180. [PMID: 38431729 PMCID: PMC10908834 DOI: 10.1038/s41598-024-55874-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 02/28/2024] [Indexed: 03/05/2024] Open
Abstract
Migraine headache, a prevalent and intricate neurovascular disease, presents significant challenges in its clinical identification. Existing techniques that use subjective pain intensity measures are insufficiently accurate to make a reliable diagnosis. Even though headaches are a common condition with poor diagnostic specificity, they have a significant negative influence on the brain, body, and general human function. In this era of deeply intertwined health and technology, machine learning (ML) has emerged as a crucial force in transforming every aspect of healthcare. Utilizing advanced facilities, ML has shown groundbreaking achievements in developing classifiers and automatic predictors. Deep learning models, in particular, have proven effective in solving complex problems spanning computer vision and data analytics. Consequently, the integration of ML in healthcare has become vital, especially in developing countries, where limited medical resources and lack of awareness prevail and the need to forecast and categorize migraines using artificial intelligence (AI) is even more urgent. This study focuses on leveraging state-of-the-art ML algorithms, including support vector machine (SVM), K-nearest neighbors (KNN), random forest (RF), decision tree (DST), and deep neural networks (DNN), to predict and classify various types of migraines. The models were trained on a publicly available dataset, with and without data augmentation. The proposed models with data augmentation were trained to classify seven types of migraine. The results show that DNN, SVM, KNN, DST, and RF achieved accuracies of 99.66%, 94.60%, 97.10%, 88.20%, and 98.50%, respectively, with data augmentation, highlighting the transformative potential of AI in enhancing migraine diagnosis.
Affiliation(s)
- Lal Khan
  - Department of Computer Science, Ibadat International University Islamabad Pakpattan Campus, Pakpattan, Pakistan
- Moudasra Shahreen
  - Department of Computer Science, Mir Chakar Khan Rind University, Sibi, Pakistan
- Atika Qazi
  - Centre for Lifelong Learning, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei Darussalam
- Sabir Hussain
  - Department of Agriculture, Mir Chakar Khan Rind University, Sibi, Pakistan
- Hsien-Tsung Chang
  - Bachelor Program in Artificial Intelligence, Chang Gung University, Taoyuan, Taiwan
  - Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
  - Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital, Taoyuan, Taiwan
5
Tong L, Shi W, Isgut M, Zhong Y, Lais P, Gloster L, Sun J, Swain A, Giuste F, Wang MD. Integrating Multi-Omics Data With EHR for Precision Medicine Using Advanced Artificial Intelligence. IEEE Rev Biomed Eng 2024; 17:80-97. [PMID: 37824325 DOI: 10.1109/rbme.2023.3324264] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 10/14/2023]
Abstract
With the recent advancement of novel biomedical technologies such as high-throughput sequencing and wearable devices, multi-modal biomedical data ranging from multi-omics molecular data to real-time continuous bio-signals are generated at an unprecedented speed and scale every day. For the first time, these multi-modal biomedical data are able to bring precision medicine close to reality. However, due to the volume and complexity of the data, making good use of them requires major effort. Researchers and clinicians are actively developing artificial intelligence (AI) approaches for data-driven knowledge discovery and causal inference using a variety of biomedical data modalities. These AI-based approaches have demonstrated promising results in various biomedical and healthcare applications. In this review paper, we summarize the state-of-the-art AI models for integrating multi-omics data and electronic health records (EHRs) for precision medicine. We discuss the challenges and opportunities in integrating multi-omics data with EHRs and future directions. We hope this review can inspire future research and development in integrating multi-omics data with EHRs for precision medicine.
6
Lanera C, Lorenzoni G, Barbieri E, Piras G, Magge A, Weissenbacher D, Donà D, Cantarutti L, Gonzalez-Hernandez G, Giaquinto C, Gregori D. Monitoring the Epidemiology of Otitis Using Free-Text Pediatric Medical Notes: A Deep Learning Approach. J Pers Med 2023; 14:28. [PMID: 38248729 PMCID: PMC10817419 DOI: 10.3390/jpm14010028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 12/20/2023] [Accepted: 12/21/2023] [Indexed: 01/23/2024] Open
Abstract
Free-text information represents a valuable resource for epidemiological surveillance. Its unstructured nature, however, presents significant challenges in the extraction of meaningful information. This study presents a deep learning model for classifying otitis using pediatric medical records. We analyzed the Pedianet database, which includes data from January 2004 to August 2017. The model categorizes narratives from clinical record diagnoses into six types: no otitis, non-media otitis, non-acute otitis media (OM), acute OM (AOM), AOM with perforation, and recurrent AOM. Utilizing deep learning architectures, including an ensemble model, this study addressed the challenges associated with the manual classification of extensive narrative data. The performance of the model was evaluated according to a gold standard classification made by three expert clinicians. The ensemble model achieved values of 97.03, 93.97, 96.59, and 95.48 for balanced precision, balanced recall, accuracy, and balanced F1 measure, respectively. These results underscore the efficacy of using automated systems for medical diagnoses, especially in pediatric care. Our findings demonstrate the potential of deep learning in interpreting complex medical records, enhancing epidemiological surveillance and research. This approach offers significant improvements in handling large-scale medical data, ensuring accuracy and minimizing human error. The methodology is adaptable to other medical contexts, promising a new horizon in healthcare analytics.
Affiliation(s)
- Corrado Lanera
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Giulia Lorenzoni
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Elisa Barbieri
  - Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
- Gianluca Piras
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Arjun Magge
  - Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Davy Weissenbacher
  - Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniele Donà
  - Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
- Graciela Gonzalez-Hernandez
  - Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Carlo Giaquinto
  - Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
  - Società Servizi Telematici-Pedianet, 35100 Padova, Italy
- Dario Gregori
  - Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
7
Bazoge A, Morin E, Daille B, Gourraud PA. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review. JMIR Med Inform 2023; 11:e42477. [PMID: 38100200 PMCID: PMC10757232 DOI: 10.2196/42477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Revised: 01/16/2023] [Accepted: 09/07/2023] [Indexed: 12/17/2023] Open
Abstract
BACKGROUND In recent years, health data collected during the clinical care process have often been repurposed for secondary use through clinical data warehouses (CDWs), which interconnect disparate data from different sources. A large amount of information of high clinical value is stored in unstructured text format. Natural language processing (NLP), which implements algorithms that can operate on massive unstructured textual data, has the potential to structure the data and make clinical information more accessible. OBJECTIVE The aim of this review was to provide an overview of studies applying NLP to textual data from CDWs. It focuses on identifying the (1) NLP tasks applied to data from CDWs and (2) NLP methods used to tackle these tasks. METHODS This review was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched for relevant articles in 3 bibliographic databases: PubMed, Google Scholar, and ACL Anthology. We reviewed the titles and abstracts and included articles according to the following inclusion criteria: (1) focus on NLP applied to textual data from CDWs, (2) articles published between 1995 and 2021, and (3) written in English. RESULTS We identified 1353 articles, of which 194 (14.34%) met the inclusion criteria. Among all identified NLP tasks in the included papers, information extraction from clinical text (112/194, 57.7%) and the identification of patients (51/194, 26.3%) were the most frequent tasks. To address the various tasks, symbolic methods were the most common NLP methods (124/232, 53.4%), showing that some tasks can be partially achieved with classical NLP techniques, such as regular expressions or pattern matching that exploit specialized lexica, such as drug lists and terminologies. Machine learning (70/232, 30.2%) and deep learning (38/232, 16.4%) have been increasingly used in recent years, including the most recent approaches based on transformers. NLP methods were mostly applied to English language data (153/194, 78.9%). CONCLUSIONS CDWs are central to the secondary use of clinical texts for research purposes. Although the use of NLP on data from CDWs is growing, there remain challenges in this field, especially with regard to languages other than English. Clinical NLP is an effective strategy for accessing, extracting, and transforming data from CDWs. Information retrieved with NLP can assist in clinical research and have an impact on clinical practice.
Affiliation(s)
- Adrien Bazoge
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
  - Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
- Emmanuel Morin
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Béatrice Daille
  - Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Pierre-Antoine Gourraud
  - Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
  - Nantes Université, INSERM, CHU de Nantes, École Centrale Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, F-44000 Nantes, France
8
Msosa YJ, Grauslys A, Zhou Y, Wang T, Buchan I, Langan P, Foster S, Walker M, Pearson M, Folarin A, Roberts A, Maskell S, Dobson R, Kullu C, Kehoe D. Trustworthy Data and AI Environments for Clinical Prediction: Application to Crisis-Risk in People With Depression. IEEE J Biomed Health Inform 2023; 27:5588-5598. [PMID: 37669205 DOI: 10.1109/jbhi.2023.3312011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2023]
Abstract
Depression is a common mental health condition that often occurs in association with other chronic illnesses, and varies considerably in severity. Electronic Health Records (EHRs) contain rich information about a patient's medical history and can be used to train, test and maintain predictive models to support and improve patient care. This work evaluated the feasibility of implementing an environment for predicting mental health crisis among people living with depression based on both structured and unstructured EHRs. A large EHR from a mental health provider, Mersey Care, was pseudonymised and ingested into the Natural Language Processing (NLP) platform CogStack, allowing text content in binary clinical notes to be extracted. All unstructured clinical notes and summaries were semantically annotated by MedCAT and BioYODIE NLP services. Cases of crisis in patients with depression were then identified. Random forest models, gradient boosting trees, and Long Short-Term Memory (LSTM) networks, with varying feature arrangement, were trained to predict the occurrence of crisis. The results showed that all the prediction models can use a combination of structured and unstructured EHR information to predict crisis in patients with depression with good and useful accuracy. The LSTM network trained on a modified dataset containing only the 1000 most important features from the random forest model with temporality showed the best performance, with a mean AUC of 0.901 and a standard deviation of 0.006 on the training dataset and a mean AUC of 0.810 and a standard deviation of 0.01 on a hold-out test dataset. Comparing the results from the technical evaluation with the views of psychiatrists shows that there are now opportunities to refine and integrate such prediction models into pragmatic point-of-care clinical decision support tools for supporting mental healthcare delivery.
9
Eysenbach G, Kleib M, Norris C, O'Rourke HM, Montgomery C, Douma M. The Use and Structure of Emergency Nurses' Triage Narrative Data: Scoping Review. JMIR Nurs 2023; 6:e41331. [PMID: 36637881 PMCID: PMC9883744 DOI: 10.2196/41331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 11/24/2022] [Accepted: 11/28/2022] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Emergency departments use triage to ensure that patients with the highest level of acuity receive care quickly and safely. Triage is typically a nursing process that is documented as structured and unstructured (free text) data. Free-text triage narratives have been studied for specific conditions but never reviewed in a comprehensive manner. OBJECTIVE The objective of this paper was to identify and map the academic literature that examines triage narratives. The paper described the types of research conducted, identified gaps in the research, and determined where additional review may be warranted. METHODS We conducted a scoping review of unstructured triage narratives. We mapped the literature, described the use of triage narrative data, examined the information available on the form and structure of narratives, highlighted similarities among publications, and identified opportunities for future research. RESULTS We screened 18,074 studies published between 1990 and 2022 in CINAHL, MEDLINE, Embase, Cochrane, and ProQuest Central. We identified 0.53% (96/18,074) of studies that directly examined the use of triage nurses' narratives. More than 12 million visits were made to 2438 emergency departments included in the review. In total, 82% (79/96) of these studies were conducted in the United States (43/96, 45%), Australia (31/96, 32%), or Canada (5/96, 5%). Triage narratives were used for research and case identification, as input variables for predictive modeling, and for quality improvement. Overall, 31% (30/96) of the studies offered a description of the triage narrative, including a list of the keywords used (27/96, 28%) or more fulsome descriptions (such as word counts, character counts, abbreviation, etc; 7/96, 7%). We found limited use of reporting guidelines (8/96, 8%). CONCLUSIONS The breadth of the identified studies suggests that there is widespread routine collection and research use of triage narrative data. Despite the use of triage narratives as a source of data in studies, the narratives and nurses who generate them are poorly described in the literature, and data reporting is inconsistent. Additional research is needed to describe the structure of triage narratives, determine the best use of triage narratives, and improve the consistent use of triage-specific data reporting guidelines. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR2-10.1136/bmjopen-2021-055132.
Affiliation(s)
- Manal Kleib
  - Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
- Colleen Norris
  - Faculty of Nursing, University of Alberta, Edmonton, AB, Canada
- Matthew Douma
  - School of Nursing, Midwifery and Health Systems, University College Dublin, Dublin, Ireland
10
Picard CT, Kleib M, O'Rourke HM, Norris CM, Douma MJ. Emergency nurses' triage narrative data, their uses and structure: a scoping review protocol. BMJ Open 2022; 12:e055132. [PMID: 35418428 PMCID: PMC9014040 DOI: 10.1136/bmjopen-2021-055132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
INTRODUCTION The first clinical interaction most patients have in the emergency department occurs during triage. An unstructured narrative is generated during triage and is the first source of in-hospital documentation. These narratives capture the patient's reported reason for the visit and the initial assessment and offer significantly more nuanced descriptions of the patient's complaints than fixed field data. Previous research demonstrated these data are useful for predicting important clinical outcomes. Previous reviews examined these narratives in isolation or in combination with other free-text sources, but used restricted searches and are becoming outdated. Furthermore, there are no reviews focused solely on the narratives of nurses, who are the primary collectors of these data. METHODS AND ANALYSIS Using the Arksey and O'Malley scoping review framework and PRISMA-ScR reporting guidelines, we will perform structured searches of CINAHL, Ovid MEDLINE, ProQuest Central, Ovid Embase and Cochrane Library (via Wiley). Additionally, we will perform forward citation searches of all included studies. No geographical or study design exclusion criteria will be used. Studies examining disaster triage, studies published before 1990, and non-English language literature will be excluded. Data will be managed using online management tools; extracted data will be independently confirmed by a separate reviewer using prepiloted extraction forms. Cohen's kappa will be used to examine inter-rater agreement on pilot and final screening. Quantitative data will be expressed using measures of range and central tendency, counts, proportions and percentages, as appropriate. Qualitative data will be narrative summaries of the authors' primary findings. PATIENT AND PUBLIC INVOLVEMENT No patients involved. ETHICS AND DISSEMINATION No ethics approval is required. Findings will be submitted to peer-reviewed conferences and journals. Results will be disseminated using individual and institutional social media platforms.
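The protocol names Cohen's kappa as its measure of inter-rater agreement during screening. As a minimal sketch of how that statistic is computed for two reviewers, the function and the include/exclude decisions below are hypothetical illustrations, not data or code from the protocol.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten abstracts.
reviewer_1 = ["include", "include", "exclude", "exclude", "include",
              "exclude", "exclude", "exclude", "include", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "include",
              "exclude", "include", "exclude", "include", "exclude"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

In this made-up example the reviewers agree on 8 of 10 abstracts, and chance agreement is 0.52, so kappa is noticeably lower than the raw 80% agreement.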
Affiliation(s)
- Christopher Thomas Picard
  - Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada
  - Royal Alexandra Hospital, Emergency, Alberta Health Services, Edmonton, Alberta, Canada
- Manal Kleib
  - Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada
- Hannah M O'Rourke
  - Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada
- Colleen M Norris
  - Faculty of Nursing, University of Alberta, Edmonton, Alberta, Canada
  - School of Public Health, University of Alberta, Edmonton, Alberta, Canada
- Matthew J Douma
  - Department of Critical Care Medicine, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, Canada
  - School of Nursing, Midwifery and Health Systems, University College Dublin, Dublin, Ireland
11
Shuai Z, Xiaolin D, Jing Y, Yanni H, Meng C, Yuxin W, Wei Z. Comparison of different feature extraction methods for applicable automated ICD coding. BMC Med Inform Decis Mak 2022; 22:11. [PMID: 35022039 PMCID: PMC8756659 DOI: 10.1186/s12911-022-01753-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/01/2021] [Accepted: 01/04/2022] [Indexed: 01/10/2023] Open
Abstract
Background
Automated ICD coding on medical texts via machine learning has been a hot topic. Related studies from the medical field rely heavily on conventional bag-of-words (BoW) as the feature extraction method and do not commonly use more complicated methods, such as word2vec (W2V) and large pretrained models like BERT. This study aimed to uncover the most effective feature extraction methods for coding models by comparing BoW, W2V and BERT variants.
Methods
We experimented with a Chinese dataset from Fuwai Hospital, which contains 6947 records and 1532 unique ICD codes, and a public Spanish dataset, which contains 1000 records and 2557 unique ICD codes. We designed coding tasks with different code frequency thresholds (denoted as $f_s$), with a lower threshold indicating a more complex task. Using traditional classifiers, we compared BoW, W2V and BERT variants on accomplishing these coding tasks.
Results
When $f_s$ was equal to or greater than 140 for the Fuwai dataset, and 60 for the Spanish dataset, the BERT variants with the whole network fine-tuned were the best method, leading to a Micro-F1 of 93.9% for the Fuwai dataset when $f_s=200$, and a Micro-F1 of 85.41% for the Spanish dataset when $f_s=180$. When $f_s$ fell below 140 for the Fuwai dataset, and 60 for the Spanish dataset, BoW turned out to be the best, leading to a Micro-F1 of 83% for the Fuwai dataset when $f_s=20$, and a Micro-F1 of 39.1% for the Spanish dataset when $f_s=20$. Our experiments also showed that both the BERT variants and BoW possessed good interpretability, which is important for medical applications of coding models.
Conclusions
This study sheds light on building promising machine learning models for automated ICD coding by revealing the most effective feature extraction methods. Concretely, our results indicated that fine-tuning the whole network of the BERT variants was the optimal method for tasks covering only frequent codes, especially codes that represented unspecified diseases, while BoW was the best for tasks involving both frequent and infrequent codes. The frequency threshold at which the best-performing method changed differed between datasets due to factors such as language and code set.
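The abstract evaluates multi-label ICD coding with Micro-F1 on tasks restricted by a code frequency threshold $f_s$. The sketch below illustrates one plausible reading of that setup: it assumes that a code is kept when it is assigned to at least $f_s$ records, which the abstract does not state explicitly, and it uses made-up records and predictions rather than either dataset.

```python
from collections import Counter


def frequent_codes(gold, f_s):
    """Codes assigned to at least f_s records (assumed meaning of the threshold)."""
    freq = Counter(code for codes in gold for code in codes)
    return {code for code, n in freq.items() if n >= f_s}


def micro_f1(gold, predicted, kept):
    """Micro-averaged F1 over all (record, code) decisions for the kept codes."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = g & kept, p & kept  # ignore codes below the frequency threshold
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative gold and predicted code sets for four records.
gold = [{"I10", "E11"}, {"I10"}, {"I10", "J18"}, {"E11"}]
pred = [{"I10"}, {"I10", "E11"}, {"I10", "J18"}, {"E11"}]

kept = frequent_codes(gold, f_s=2)   # J18 occurs once and is dropped
score = micro_f1(gold, pred, kept)
print(sorted(kept), round(score, 3))
```

Lowering `f_s` admits rarer codes into `kept`, which is consistent with the abstract's remark that a lower threshold makes the task more complex.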
12
The prediction of hospital length of stay using unstructured data. BMC Med Inform Decis Mak 2021; 21:351. [PMID: 34922532 PMCID: PMC8684269 DOI: 10.1186/s12911-021-01722-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/24/2021] [Accepted: 12/13/2021] [Indexed: 11/10/2022] Open
Abstract
Objective This study aimed to assess the performance improvement for machine learning-based hospital length of stay (LOS) predictions when clinical signs written in text are accounted for and compared to the traditional approach of solely considering structured information such as age, gender and major ICD diagnosis.
Methods This study was an observational retrospective cohort study and analyzed patient stays with admission between 1 January and 24 September 2019. For each stay, the patient was admitted through the Emergency Department (ED) and stayed for more than two days in the subsequent service. LOS was predicted using two random forest models. The first included unstructured text extracted from electronic health records (EHRs). A word-embedding algorithm based on UMLS terminology with exact matching restricted to patient-centric affirmation sentences was used to assess the EHR data. The second model was primarily based on structured data in the form of diagnoses coded from the International Classification of Diseases, 10th Revision (ICD-10) and triage codes (CCMU/GEMSA classifications). Variables common to both models were: age, gender, zip/postal code, LOS in the ED, recent visit flag, assigned patient ward after the ED stay and short-term ED activity. Models were trained on 80% of the data and performance was evaluated by accuracy on the remaining 20% test data.
Results The model using unstructured data had a 75.0% accuracy compared to 74.1% for the model containing structured data. The two models produced a similar prediction in 86.6% of cases. In a secondary analysis restricted to intensive care patients, the accuracy of both models was also similar (76.3% vs 75.0%).
Conclusions LOS prediction using unstructured data had similar accuracy to using structured data and can be considered of use to accurately model LOS. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01722-4.
13
Automatic Prediction of Recurrence of Major Cardiovascular Events: A Text Mining Study Using Chest X-Ray Reports. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:6663884. [PMID: 34306597 PMCID: PMC8285182 DOI: 10.1155/2021/6663884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/14/2020] [Revised: 05/29/2021] [Accepted: 06/29/2021] [Indexed: 11/17/2022]
Abstract
Methods We used EHR data of patients included in the Second Manifestations of ARTerial disease (SMART) study. We propose a deep learning-based multimodal architecture for our text mining pipeline that integrates neural text representation with preprocessed clinical predictors for the prediction of recurrence of major cardiovascular events in cardiovascular patients. Text preprocessing, including cleaning and stemming, was first applied to filter out the unwanted texts from X-ray radiology reports. Thereafter, text representation methods were used to numerically represent unstructured radiology reports with vectors. Subsequently, these text representation methods were added to prediction models to assess their clinical relevance. In this step, we applied logistic regression, support vector machine (SVM), multilayer perceptron neural network, convolutional neural network, long short-term memory (LSTM), and bidirectional LSTM deep neural network (BiLSTM). Results We performed various experiments to evaluate the added value of the text in the prediction of major cardiovascular events. The two main scenarios were the integration of radiology reports (1) with classical clinical predictors and (2) with only age and sex in the case of unavailable clinical predictors. In total, data of 5603 patients were used with 5-fold cross-validation to train the models. In the first scenario, the multimodal BiLSTM (MI-BiLSTM) model achieved an area under the curve (AUC) of 84.7%, misclassification rate of 14.3%, and F1 score of 83.8%. In this scenario, the SVM model, trained on clinical variables and bag-of-words representation, achieved the lowest misclassification rate of 12.2%. In the case of unavailable clinical predictors, the MI-BiLSTM model trained on radiology reports and demographic (age and sex) variables reached an AUC, F1 score, and misclassification rate of 74.5%, 70.8%, and 20.4%, respectively. 
Conclusions Using the case study of routine care chest X-ray radiology reports, we demonstrated the clinical relevance of integrating text features and classical predictors in our text mining pipeline for cardiovascular risk prediction. The MI-BiLSTM model with word embedding representation appeared to have a desirable performance when trained on text data integrated with the clinical variables from the SMART study. Our results mined from chest X-ray reports showed that models using text data in addition to laboratory values outperform those using only known clinical predictors.
14
Leveraging electronic health record data to inform hospital resource management: A systematic data mining approach. Health Care Manag Sci 2021; 24:716-741. [PMID: 34031792 DOI: 10.1007/s10729-021-09554-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/26/2020] [Accepted: 02/02/2021] [Indexed: 10/21/2022]
Abstract
Early identification of resource needs is instrumental in promoting efficient hospital resource management. Hospital information systems, and electronic health records (EHR) in particular, collect valuable demographic and clinical patient data from the moment patients are admitted, which can help predict expected resource needs in early stages of patient episodes. To this end, this article proposes a data mining methodology to systematically obtain predictions for relevant managerial variables by leveraging structured EHR data. Specifically, these managerial variables are: i) Diagnosis categories, ii) procedure codes, iii) diagnosis-related groups (DRGs), iv) outlier episodes and v) length of stay (LOS). The proposed methodology approaches the problem in four stages: Feature set construction, feature selection, prediction model development, and model performance evaluation. We tested this approach with an EHR dataset of 5,089 inpatient episodes and compared different classification and regression models (for categorical and continuous variables, respectively), performed temporal analysis of model performance, analyzed the impact of training set homogeneity on performance and assessed the contribution of different EHR data elements for model predictive power. Overall, our results indicate that inpatient EHR data can effectively be leveraged to inform resource management on multiple perspectives. Logistic regression (combined with minimal redundancy maximum relevance feature selection) and bagged decision trees yielded best results for predicting categorical and numerical managerial variables, respectively. Furthermore, our temporal analysis indicated that, while DRG classes are more difficult to predict, several diagnosis categories, procedure codes and LOS amongst shorter-stay patients can be predicted with higher confidence in early stages of patient stay. 
Lastly, value of information analysis indicated that diagnoses, medication and structured assessment forms were the most valuable EHR data elements in predicting managerial variables of interest through a data mining approach.
15
Moldwin A, Demner-Fushman D, Goodwin TR. Empirical Findings on the Role of Structured Data, Unstructured Data, and their Combination for Automatic Clinical Phenotyping. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2021; 2021:445-454. [PMID: 34457160 PMCID: PMC8378600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The objective of this study is to explore the role of structured and unstructured data for clinical phenotyping by determining which types of clinical phenotypes are best identified using unstructured data (e.g., clinical notes), structured data (e.g., laboratory values, vital signs), or their combination across 172 clinical phenotypes. Specifically, we used laboratory and chart measurements as well as clinical notes from the MIMIC-III critical care database and trained an LSTM using features extracted from each type of data to determine which categories of phenotypes were best identified by structured data, unstructured data, or both. We observed that textual features on their own outperformed structured features for 145 (84%) of phenotypes, and that Doc2Vec was the most effective representation of unstructured data for all phenotypes. When evaluating the impact of adding textual features to systems previously relying only on structured features, we found a statistically significant (p < 0.05) increase in phenotyping performance for 51 phenotypes (primarily involving the circulatory system, injury, and poisoning), one phenotype for which textual features degraded performance (diabetes without complications), and no statistically significant change in performance with the remaining 120 phenotypes. We provide analysis on which phenotypes are best identified by each type of data and guidance on which data sources to consider for future research on phenotype identification.
Affiliation(s)
- Asher Moldwin
  - U.S. National Library of Medicine, Bethesda, MD, USA
16
Wang H. Research on the integration of library e-book borrowing history data based on big data technology. WEB INTELLIGENCE 2020. [DOI: 10.3233/web-200433] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Affiliation(s)
- Hao Wang
  - Heilongjiang Bayi Agricultural University, Daqing 163319, China
17
Feller DJ, Bear Don't Walk IV OJ, Zucker J, Yin MT, Gordon P, Elhadad N. Detecting Social and Behavioral Determinants of Health with Structured and Free-Text Clinical Data. Appl Clin Inform 2020; 11:172-181. [PMID: 32131117 DOI: 10.1055/s-0040-1702214] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Indexed: 10/24/2022] Open
Abstract
BACKGROUND Social and behavioral determinants of health (SBDH) are environmental and behavioral factors that often impede disease management and result in sexually transmitted infections. Despite their importance, SBDH are inconsistently documented in electronic health records (EHRs) and typically collected only in an unstructured format. Evidence suggests that structured data elements present in EHRs can contribute further to identify SBDH in the patient record. OBJECTIVE Explore the automated inference of both the presence of SBDH documentation and individual SBDH risk factors in patient records. Compare the relative ability of clinical notes and structured EHR data, such as laboratory measurements and diagnoses, to support inference. METHODS We attempt to infer the presence of SBDH documentation in patient records, as well as patient status of 11 SBDH, including alcohol abuse, homelessness, and sexual orientation. We compare classification performance when considering clinical notes only, structured data only, and notes and structured data together. We perform an error analysis across several SBDH risk factors. RESULTS Classification models inferring the presence of SBDH documentation achieved good performance (F1 score: 92.7-78.7; F1 considered as the primary evaluation metric). Performance was variable for models inferring patient SBDH risk status; results ranged from F1 = 82.7 for LGBT (lesbian, gay, bisexual, and transgender) status to F1 = 28.5 for intravenous drug use. Error analysis demonstrated that lexical diversity and documentation of historical SBDH status challenge inference of patient SBDH status. Three of five classifiers inferring topic-specific SBDH documentation and 10 of 11 patient SBDH status classifiers achieved highest performance when trained using both clinical notes and structured data. CONCLUSION Our findings suggest that combining clinical free-text notes and structured data provides the best approach in classifying patient SBDH status. Inferring patient SBDH status is most challenging among SBDH with low prevalence and high lexical diversity.
Affiliation(s)
- Daniel J Feller
  - Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Jason Zucker
  - Division of Infectious Diseases, Department of Internal Medicine, Columbia University Irving Medical Center, New York, New York, United States
- Michael T Yin
  - Division of Infectious Diseases, Department of Internal Medicine, Columbia University Irving Medical Center, New York, New York, United States
- Peter Gordon
  - Division of Infectious Diseases, Department of Internal Medicine, Columbia University Irving Medical Center, New York, New York, United States
- Noémie Elhadad
  - Department of Biomedical Informatics, Columbia University, New York, New York, United States
18
Ferrão JC, Oliveira MD, Janela F, Martins HMG, Gartner D. Can structured EHR data support clinical coding? A data mining approach. Health Syst (Basingstoke) 2020; 10:138-161. [PMID: 34104432 PMCID: PMC8143604 DOI: 10.1080/20476965.2020.1729666] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/25/2019] [Accepted: 10/22/2019] [Indexed: 10/24/2022] Open
Abstract
Structured data formats are gaining momentum in electronic health records and can be leveraged for decision support and research. Nevertheless, such structured data formats have not been explored for clinical coding, which is an essential process requiring significant manual workload in health organisations. This article explores the extent to which fully structured clinical data can support assignment of clinical codes to inpatient episodes, through a methodology that tackles high dimensionality issues, addresses the multi-label nature of coding and optimises model parameters. The methodology encompasses transforming raw data to define a feature set, building a data matrix representation, and testing combinations of feature selection methods with machine learning models to predict code assignment. The methodology was tested with a real hospital dataset and showed varying predictive power across codes, while demonstrating the potential of leveraging structured data to reduce workload and increase efficiency in clinical coding.
Affiliation(s)
- José Carlos Ferrão
  - CEG-IST, Centre for Management Studies of Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Mónica Duarte Oliveira
  - CEG-IST, Centre for Management Studies of Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Filipe Janela
  - Investigação, Desenvolvimento e Inovação, SIEMENS Healthineers, Amadora, Portugal
- Henrique M. G. Martins
  - Centre for Research and Creativity in Informatics (CI), Hospital Prof. Doutor Fernando Fonseca, Amadora, Portugal
19
Nguyen AN, Truran D, Kemp M, Koopman B, Conlan D, O'Dwyer J, Zhang M, Karimi S, Hassanzadeh H, Lawley MJ, Green D. Computer-Assisted Diagnostic Coding: Effectiveness of an NLP-based approach using SNOMED CT to ICD-10 mappings. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:807-816. [PMID: 30815123 PMCID: PMC6371260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Computer-assisted (diagnostic) coding (CAC) aims to improve the operational productivity and accuracy of clinical coders. The level of accuracy, especially for a wide range of complex and less prevalent clinical cases, remains an open research problem. This study investigates this problem on a broad spectrum of diagnostic codes and, in particular, investigates the effectiveness of utilising SNOMED CT for ICD-10 diagnosis coding. Hospital progress notes were used to provide the narrative-rich electronic patient records for the investigation. A natural language processing (NLP) approach using mappings between SNOMED CT and ICD-10-AM (Australian Modification) was used to guide the coding. The proposed approach achieved 54.1% sensitivity and 70.2% positive predictive value. This result was encouraging given the complexity of the task, the simplicity of the approach, and what was projected as possible from a manual diagnosis code validation study (76.3% sensitivity). The results show the potential for advanced NLP-based approaches that leverage SNOMED CT to ICD-10 mapping for hospital in-patient coding.
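The abstract describes suggesting ICD-10 codes from concepts found in notes via a concept-to-code mapping, then scoring the suggestions by sensitivity and positive predictive value. The following is a toy sketch of that idea only: the mapping table, the concept keys (placeholders, not real SNOMED CT identifiers), and the example episode are all assumptions for illustration, and the NLP step that extracts concepts is taken as given.

```python
# Hypothetical mapping table. The keys are placeholders, not real SNOMED CT IDs,
# and the table is not the mapping used in the study.
CONCEPT_TO_ICD10 = {
    "sct:hypertension": "I10",
    "sct:type-2-diabetes": "E11",
    "sct:pneumonia": "J18",
}


def suggest_codes(concepts):
    """Map extracted concepts to ICD-10 codes, skipping unmapped concepts."""
    return {CONCEPT_TO_ICD10[c] for c in concepts if c in CONCEPT_TO_ICD10}


def sensitivity_and_ppv(suggested, gold):
    """Sensitivity = TP / gold codes; PPV = TP / suggested codes."""
    tp = len(suggested & gold)
    sens = tp / len(gold) if gold else 0.0
    ppv = tp / len(suggested) if suggested else 0.0
    return sens, ppv


# Concepts assumed to come from an upstream NLP step on one episode's notes.
concepts = ["sct:hypertension", "sct:pneumonia", "sct:unmapped-finding"]
gold = {"I10", "E11"}  # codes assigned by a human coder (illustrative)

suggested = suggest_codes(concepts)
sens, ppv = sensitivity_and_ppv(suggested, gold)
print(sorted(suggested), sens, ppv)
```

Here one of two gold codes is recovered (sensitivity 0.5) and one of two suggestions is correct (PPV 0.5); a missed mention lowers sensitivity, while a mention that the coder did not code lowers PPV.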
Affiliation(s)
- Anthony N Nguyen
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Donna Truran
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Madonna Kemp
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Bevan Koopman
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- David Conlan
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- John O'Dwyer
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Ming Zhang
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Hamed Hassanzadeh
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Michael J Lawley
  - The Australian e-Health Research Centre, CSIRO, Brisbane/Sydney/Perth, Australia
- Damian Green
  - Gold Coast Hospital and Health Service, Department of Health, Queensland Government, Gold Coast, Australia
20
Ta CN, Dumontier M, Hripcsak G, Tatonetti NP, Weng C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 2018; 5:180273. [PMID: 30480666 PMCID: PMC6257042 DOI: 10.1038/sdata.2018.273] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 07/09/2018] [Accepted: 10/16/2018] [Indexed: 12/11/2022] Open
Abstract
Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013-2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and the observed-expected frequency ratio is informative of associations between clinical concepts, which is useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
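The abstract mentions an observed-expected frequency ratio for concept pairs and the exclusion of rare concepts (count ≤ 10). A minimal sketch of both ideas follows. It assumes the usual independence baseline, where the expected pair count is the product of the two single-concept counts divided by the number of patients, and it treats relative frequency as the pair count divided by one concept's count; neither formula is spelled out in the abstract, and all counts below are invented.

```python
import math

RARE_COUNT_CUTOFF = 10  # counts <= 10 are excluded, per the abstract


def is_shareable(count):
    """True if a count is above the rare-concept cutoff."""
    return count > RARE_COUNT_CUTOFF


def observed_expected_ratio(pair_count, count_x, count_y, n_patients):
    """Observed co-occurrence divided by the count expected under independence."""
    expected = count_x * count_y / n_patients
    return pair_count / expected


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the other concept."""
    return pair_count / base_count


# Invented counts for one drug-condition pair in a population of 10,000 patients.
n_patients, count_drug, count_condition, pair_count = 10_000, 200, 500, 40

if is_shareable(pair_count):
    ratio = observed_expected_ratio(pair_count, count_drug, count_condition, n_patients)
    print(f"observed/expected = {ratio:.1f}, ln = {math.log(ratio):.2f}")
    print(f"relative frequency = {relative_frequency(pair_count, count_drug):.2f}")
```

A ratio above 1 (here 4.0) means the pair co-occurs more often than independence would predict, which is the kind of signal the abstract links to pharmacovigilance and drug repurposing.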
Affiliation(s)
- Casey N. Ta
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Michel Dumontier
  - Institute of Data Science, Maastricht University, Maastricht, The Netherlands
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Nicholas P. Tatonetti
  - Department of Biomedical Informatics, Columbia University, NY, USA
  - Department of Systems Biology, Columbia University, NY, USA
  - Department of Medicine, Columbia University, NY, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, NY, USA
21
Wu H, Toti G, Morley KI, Ibrahim ZM, Folarin A, Jackson R, Kartoglu I, Agrawal A, Stringer C, Gale D, Gorrell G, Roberts A, Broadbent M, Stewart R, Dobson RJB. SemEHR: A general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J Am Med Inform Assoc 2018; 25:530-537. [PMID: 29361077 PMCID: PMC6019046 DOI: 10.1093/jamia/ocx160] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 10/01/2017] [Revised: 11/28/2017] [Accepted: 01/08/2018] [Indexed: 11/23/2022] Open
Abstract
Objective Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs. Methods SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces. Results SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe's largest providers of mental health services. In 2 Clinical Record Interactive Search-based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King's College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy. 
Conclusion Results from the multiple case studies demonstrate SemEHR's efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more and unexpected insight compared to study-oriented bespoke IE systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.
Affiliation(s)
- Honghan Wu
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  - School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
- Giulia Toti
  - National Addiction Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
- Katherine I Morley
  - National Addiction Centre, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  - Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Australia
- Zina M Ibrahim
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  - Farr Institute of Health Informatics Research, University College London, London, UK
- Amos Folarin
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  - Farr Institute of Health Informatics Research, University College London, London, UK
- Richard Jackson
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
- Asha Agrawal
  - King’s College Hospital NHS Foundation Trust, London, UK
- Clive Stringer
  - King’s College Hospital NHS Foundation Trust, London, UK
- Darren Gale
  - King’s College Hospital NHS Foundation Trust, London, UK
- Genevieve Gorrell
  - Department of Computer Science, University of Sheffield, Sheffield, UK
- Angus Roberts
  - Department of Computer Science, University of Sheffield, Sheffield, UK
- Robert Stewart
  - South London and Maudsley NHS Foundation Trust, London, UK
  - Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
- Richard JB Dobson
  - Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK
  - Farr Institute of Health Informatics Research, University College London, London, UK
22
Scheurwegs E, Sushil M, Tulkens S, Daelemans W, Luyckx K. Counting trees in Random Forests: Predicting symptom severity in psychiatric intake reports. J Biomed Inform 2017; 75S:S112-S119. [PMID: 28602906 PMCID: PMC5705466 DOI: 10.1016/j.jbi.2017.06.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 02/01/2017] [Revised: 05/31/2017] [Accepted: 06/05/2017] [Indexed: 11/29/2022]
Abstract
The CEGS N-GRID 2016 Shared Task (Filannino et al., 2017) in Clinical Natural Language Processing introduces the assignment of a severity score to a psychiatric symptom, based on a psychiatric intake report. We present a method that employs the inherent interview-like structure of the report to extract relevant information from the report and generate a representation. The representation consists of a restricted set of psychiatric concepts (and the context they occur in), identified using medical concepts defined in UMLS that are directly related to the psychiatric diagnoses present in the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV) ontology. Random Forests provides a generalization of the extracted, case-specific features in our representation. The best variant presented here scored an inverse mean absolute error (MAE) of 80.64%. A concise concept-based representation, paired with identification of concept certainty and scope (family, patient), shows a robust performance on the task.
Collapse
Affiliation(s)
- Elyne Scheurwegs
- University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium; University of Antwerp, Advanced Database Research and Modelling Research Group (ADReM), Middelheimlaan 1, B-2020 Antwerp, Belgium; Antwerp University Hospital, ICT Department, Wilrijkstraat 10, B-2650 Edegem, Belgium.
| | - Madhumita Sushil
- University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium; Antwerp University Hospital, ICT Department, Wilrijkstraat 10, B-2650 Edegem, Belgium
| | - Stéphan Tulkens
- University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium
| | - Walter Daelemans
- University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium
| | - Kim Luyckx
- Antwerp University Hospital, ICT Department, Wilrijkstraat 10, B-2650 Edegem, Belgium
| |
|
23
|
A Hybrid MCDM Model for Improving the Electronic Health Record to Better Serve Client Needs. Sustainability 2017. [DOI: 10.3390/su9101819] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 11/17/2022]
|
24
|
Scheurwegs E, Cule B, Luyckx K, Luyten L, Daelemans W. Selecting relevant features from the electronic health record for clinical code prediction. J Biomed Inform 2017; 74:92-103. [PMID: 28919106 DOI: 10.1016/j.jbi.2017.09.004] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 05/17/2017] [Revised: 09/11/2017] [Accepted: 09/12/2017] [Indexed: 11/25/2022]
Abstract
A multitude of information sources is present in the electronic health record (EHR), each of which can contain clues to automatically assign diagnosis and procedure codes. These sources, however, show information overlap and quality differences, which complicates the retrieval of these clues. Through feature selection, a denser representation with a consistent quality and less information overlap can be obtained. We introduce and compare coverage-based feature selection methods, based on confidence and information gain. These approaches were evaluated over a range of medical specialties, with seven different medical specialties for ICD-9-CM code prediction (six at the Antwerp University Hospital and one in the MIMIC-III dataset) and two different medical specialties for ICD-10-CM code prediction. Using confidence coverage to integrate all sources in an EHR shows a consistent improvement in F-measure (49.83% for diagnosis codes on average), both compared with the baseline (44.25% for diagnosis codes on average) and with using the best standalone source (44.41% for diagnosis codes on average). Confidence coverage creates a concise patient stay representation independent of a rigid framework such as UMLS, and contains easily interpretable features. Confidence coverage has several advantages over a baseline setup. In our baseline setup, feature selection was limited to a filter removing features with fewer than five total occurrences in the training set. Prediction results improved consistently when using multiple heterogeneous sources to predict clinical codes, while reducing the number of features and the processing time.
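The baseline filter mentioned in the abstract above (dropping features with fewer than five total occurrences in the training set) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_rare_features`, the representation of each patient stay as a list of feature strings, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def filter_rare_features(train_docs, min_count=5):
    """Drop features whose total count over the training set is below min_count.

    train_docs is a list of feature lists (hypothetical representation).
    Returns the set of kept features and the filtered documents.
    """
    totals = Counter()
    for doc in train_docs:
        # Count every occurrence, so a feature repeated within one
        # document contributes more than once to its total.
        totals.update(doc)
    kept = {feat for feat, n in totals.items() if n >= min_count}
    filtered = [[feat for feat in doc if feat in kept] for doc in train_docs]
    return kept, filtered


# Toy training set: "fever" occurs 5 times, "cough" 2 times, "rash" once.
train_docs = [
    ["fever", "cough", "fever"],
    ["fever", "rash"],
    ["fever", "cough", "fever"],
]
kept, filtered = filter_rare_features(train_docs, min_count=5)
print(sorted(kept))
print(filtered)
```

The threshold is applied to total occurrences rather than to the number of documents containing a feature, following the wording of the abstract; a document-frequency variant would count each feature once per document instead.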
Affiliation(s)
- Elyne Scheurwegs
- University of Antwerp, Advanced Database Research and Modelling Research Group (ADReM), Middelheimlaan 1, B-2020 Antwerp, Belgium; University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium.
| | - Boris Cule
- University of Antwerp, Advanced Database Research and Modelling Research Group (ADReM), Middelheimlaan 1, B-2020 Antwerp, Belgium
| | - Kim Luyckx
- Antwerp University Hospital, ICT Department, Wilrijkstraat 10, B-2650 Edegem, Belgium
| | - Léon Luyten
- Antwerp University Hospital, Medical Information Department, Wilrijkstraat 10, B-2650 Edegem, Belgium
| | - Walter Daelemans
- University of Antwerp, Computational Linguistics and Psycholinguistics (CLiPS) Research Center, Lange Winkelstraat 40-42, B-2000 Antwerp, Belgium
| |
|
25
|
Jannot AS, Burgun A, Thervet E, Pallet N. The Diagnosis-Wide Landscape of Hospital-Acquired AKI. Clin J Am Soc Nephrol 2017; 12:874-884. [PMID: 28495862 PMCID: PMC5460713 DOI: 10.2215/cjn.10981016] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 10/20/2016] [Accepted: 03/01/2017] [Indexed: 11/23/2022]
Abstract
BACKGROUND AND OBJECTIVES The exploration of electronic hospital records offers a unique opportunity to describe in-depth the prevalence of conditions associated with diagnoses at an unprecedented level of comprehensiveness. We used a diagnosis-wide approach, adapted from phenome-wide association studies (PheWAS), to perform an exhaustive analysis of all diagnoses associated with hospital-acquired AKI (HA-AKI) in a French urban tertiary academic hospital over a period of 10 years. DESIGN, SETTING, PARTICIPANTS, & MEASUREMENTS We retrospectively extracted all diagnoses from an i2b2 (Informatics for Integrating Biology and the Bedside) clinical data warehouse for patients who stayed in this hospital between 2006 and 2015 and had at least two plasma creatinine measurements performed during the first week of their stay. We then analyzed the association between HA-AKI and each International Classification of Diseases (ICD)-10 diagnostic category to draw a comprehensive picture of diagnoses associated with AKI. Hospital stays for 126,736 unique individuals were extracted. RESULTS Hemodynamic impairment and surgical procedures are the main factors associated with HA-AKI and five clusters of diagnoses were identified: sepsis, heart diseases, polytrauma, liver disease, and cardiovascular surgery. The ICD-10 code corresponding to AKI (N17) was recorded in 30% of the cases with HA-AKI identified, and in this situation, 20% of the diagnoses associated with HA-AKI corresponded to kidney diseases such as tubulointerstitial nephritis, necrotizing vasculitis, or myeloma cast nephropathy. Codes associated with HA-AKI that demonstrated the greatest increase in prevalence with time were related to influenza, polytrauma, and surgery of neoplasms of the genitourinary system. CONCLUSIONS Our approach, derived from PheWAS, is a valuable way to comprehensively identify and classify all of the diagnoses and clusters of diagnoses associated with HA-AKI. Our analysis delivers insights into how diagnoses associated with HA-AKI evolved over time. On the basis of ICD-10 codes, HA-AKI appears largely underestimated in this academic hospital.
Affiliation(s)
- Anne-Sophie Jannot
- Departments of Medical Informatics, Biostatistics and Public Health
- Assistance Publique Hôpitaux de Paris, Paris, France
- Paris Descartes University, Paris, France; and
- National Institute for Health and Research (INSERM) U1138, Centre de Recherche des Cordeliers, Paris, France
| | - Anita Burgun
- Departments of Medical Informatics, Biostatistics and Public Health
- Assistance Publique Hôpitaux de Paris, Paris, France
- Paris Descartes University, Paris, France; and
- National Institute for Health and Research (INSERM) U1138, Centre de Recherche des Cordeliers, Paris, France
| | - Eric Thervet
- Nephrology, and
- Assistance Publique Hôpitaux de Paris, Paris, France
- Paris Descartes University, Paris, France
| | - Nicolas Pallet
- Nephrology, and
- Clinical Chemistry, Hôpital Européen Georges Pompidou, Paris, France
- Assistance Publique Hôpitaux de Paris, Paris, France
- Paris Descartes University, Paris, France
| |
|
26
|
Assigning clinical codes with data-driven concept representation on Dutch clinical free text. J Biomed Inform 2017; 69:118-127. [DOI: 10.1016/j.jbi.2017.04.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/26/2016] [Revised: 03/06/2017] [Accepted: 04/07/2017] [Indexed: 11/21/2022]
|
27
|
Upadhyaya SG, Murphree DH, Ngufor CG, Knight AM, Cronk DJ, Cima RR, Curry TB, Pathak J, Carter RE, Kor DJ. Automated Diabetes Case Identification Using Electronic Health Record Data at a Tertiary Care Facility. Mayo Clin Proc Innov Qual Outcomes 2017; 1:100-110. [PMID: 30225406 PMCID: PMC6135013 DOI: 10.1016/j.mayocpiqo.2017.04.005] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/05/2022] Open
Abstract
Objective To develop and validate a phenotyping algorithm for the identification of patients with type 1 and type 2 diabetes mellitus (DM) preoperatively using routinely available clinical data from electronic health records. Patients and Methods We used first-order logic rules (if-then-else rules) to imply the presence or absence of DM types 1 and 2. The “if” clause of each rule is a conjunction of logical AND/OR predicates that provides evidence toward or against the presence of DM. The rule includes International Classification of Diseases, Ninth Revision, Clinical Modification diagnostic codes, outpatient prescription information, laboratory values, and positive annotation of DM in patients’ clinical notes. This study was conducted from March 2, 2015, through February 10, 2016. The performance of our rule-based approach and similar approaches proposed by other institutions was evaluated with a reference standard created by an expert reviewer and implemented for routine clinical care at an academic medical center. Results A total of 4208 surgical patients (mean age, 52 years; males, 48%) were analyzed to develop the phenotyping algorithm. Expert review identified 685 patients (16.28% of the full cohort) as having DM. Our proposed method identified 684 patients (16.25%) as having DM. The algorithm performed well (99.70% sensitivity, 99.97% specificity) and compared favorably with previous approaches. Conclusion Among patients undergoing surgery, determination of DM can be made with high accuracy using simple, computationally efficient rules. Knowledge of patients’ DM status before surgery may alter physicians’ care plan and reduce postsurgical complications. Nevertheless, future efforts are necessary to determine the effect of first-order logic rules on clinical processes and patient outcomes.
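The counts and rates reported in this abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper: it assumes the 685 expert-identified patients are the positives of the reference standard, and it searches for the integer confusion matrix whose sensitivity and specificity agree with the reported 99.70% and 99.97% to within 0.02 percentage points (an assumed tolerance to allow for rounding or truncation). All variable names are illustrative.

```python
total = 4208     # surgical patients analyzed (from the abstract)
ref_pos = 685    # DM according to expert review (from the abstract)
alg_pos = 684    # DM according to the algorithm (from the abstract)
ref_neg = total - ref_pos

reported_sens = 99.70
reported_spec = 99.97
tolerance = 0.02  # percentage points; an assumption, not from the paper

matches = []
for tp in range(alg_pos + 1):
    fp = alg_pos - tp   # algorithm positives that are not true positives
    fn = ref_pos - tp   # reference positives the algorithm missed
    tn = ref_neg - fp   # reference negatives correctly left unflagged
    if fn < 0 or tn < 0:
        continue
    sens = 100 * tp / ref_pos
    spec = 100 * tn / ref_neg
    if abs(sens - reported_sens) < tolerance and abs(spec - reported_spec) < tolerance:
        matches.append((tp, fp, fn, tn))

print(matches)
print(round(100 * ref_pos / total, 2), round(100 * alg_pos / total, 2))
```

Under these assumptions the only match is 683 true positives, 1 false positive, 2 false negatives, and 3522 true negatives, and the two cohort fractions come out at 16.28% and 16.25%, in line with the abstract.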
Key Words
- CCW, Chronic Condition Data Warehouse
- DDC, Durham Diabetes Coalition
- DM, diabetes mellitus
- EHR, electronic health record
- HbA1c of NYC, Hemoglobin A1c of New York City
- HbA1c, hemoglobin A1c
- ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification
- MICS, Mayo Integrated Clinical Systems
- NLP, natural language processing
- SUPREME-DM, Surveillance, Prevention, and Management of Diabetes Mellitus
- T1DM, type 1 diabetes mellitus
- T2DM, type 2 diabetes mellitus
- eMERGE, Electronic Medical Records and Genomics
Affiliation(s)
| | | | - Che G Ngufor
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
| | - Alison M Knight
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN
| | - Daniel J Cronk
- Department of Information Technology, Mayo Clinic, Rochester, MN
| | - Robert R Cima
- Division of Colon and Rectal Surgery, Mayo Clinic, Rochester, MN; Robert D. and Patricia E. Kern Center for Science of Health Care Delivery, Mayo Clinic, Rochester, MN
| | - Timothy B Curry
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN; Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN
| | | | - Rickey E Carter
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
| | - Daryl J Kor
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN
| |
|