1
Berge GT, Granmo OC, Tveit TO, Ruthjersen AL, Sharma J. Combining unsupervised, supervised and rule-based learning: the case of detecting patient allergies in electronic health records. BMC Med Inform Decis Mak 2023; 23:188. PMID: 37723446; PMCID: PMC10507898; DOI: 10.1186/s12911-023-02271-8.
Abstract
BACKGROUND Data mining of electronic health records (EHRs) has huge potential for improving clinical decision support and for helping healthcare deliver precision medicine. Unfortunately, the rule-based and machine learning-based approaches used for natural language processing (NLP) in healthcare today all struggle with shortcomings related to performance, efficiency, or transparency. METHODS In this paper, we address these issues by presenting a novel method for NLP that combines unsupervised learning of word embeddings, semi-supervised learning for simplified and accelerated building of clinical vocabularies and concepts, and deterministic rules for fine-grained control of information extraction. The clinical language is learnt automatically, and vocabularies, concepts, and rules supporting a variety of downstream NLP tasks can then be built with only minimal manual feature engineering and tagging from clinical experts. Together, these steps create an open processing pipeline that gradually refines the data in a transparent way, which greatly improves the interpretability of the method. Data transformations are thus made transparent and predictions interpretable, which is imperative in healthcare. The combined method has further advantages: it is potentially language independent, demands few domain resources for maintenance, and can handle misspellings, abbreviations, and acronyms. To test and evaluate the combined method, we developed a clinical decision support system (CDSS) named Information System for Clinical Concept Searching (ICCS) that implements the method for clinical concept tagging, extraction, and classification. RESULTS In empirical studies the method shows high performance (recall 92.6%, precision 88.8%, F-measure 90.7%) and has demonstrated its value in clinical practice. Here we employ a real-life EHR-derived dataset to evaluate the method's performance on a classification task (detecting patient allergies) against a range of common supervised learning algorithms. The combined method achieves state-of-the-art performance compared with the alternative methods we evaluate. We also perform a qualitative analysis of common word embedding methods on the task of word similarity to examine their potential for supporting automatic feature engineering for clinical NLP tasks. CONCLUSIONS Based on the promising results, we suggest more research should be aimed at exploiting the inherent synergies between the unsupervised, supervised, and rule-based paradigms for clinical NLP.
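The word-similarity analysis mentioned above rests on cosine similarity between embedding vectors. The sketch below is a minimal, self-contained illustration, not the authors' ICCS implementation; the tiny three-dimensional vectors and vocabulary are invented for the example, whereas real clinical embeddings would be learned from an EHR corpus with a method such as word2vec.

```python
import math

# Toy embeddings (invented for illustration; real vectors would be
# learned from a clinical corpus).
embeddings = {
    "penicillin":  [0.9, 0.1, 0.2],
    "amoxicillin": [0.85, 0.15, 0.25],
    "wheelchair":  [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, k=1):
    """Rank the rest of the vocabulary by cosine similarity to `word`."""
    ranked = sorted(
        ((other, cosine(embeddings[word], vec))
         for other, vec in embeddings.items() if other != word),
        key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

print(most_similar("penicillin"))  # "amoxicillin" ranks above "wheelchair"
```

In a similarity-based vocabulary-building workflow, a clinical expert would seed a concept with a few terms and review the nearest neighbours suggested this way.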
Affiliation(s)
- Geir Thore Berge
- Department of Information Systems, University of Agder, Kristiansand, Norway
- Department of Technology and eHealth, Sørlandet Hospital Trust, Kristiansand, Norway
- Tor Oddbjørn Tveit
- Department of Technology and eHealth, Sørlandet Hospital Trust, Kristiansand, Norway
- Department of Anesthesia and Intensive Care, Sørlandet Hospital Trust, Kristiansand, Norway
- Anna Linda Ruthjersen
- Department of Technology and eHealth, Sørlandet Hospital Trust, Kristiansand, Norway
- Jivitesh Sharma
- Department of Technology and eHealth, Sørlandet Hospital Trust, Kristiansand, Norway
- Department of ICT, University of Agder, Grimstad, Norway
2
Humbert-Droz M, Corley J, Tamang S, Gevaert O. Development and validation of MedDRA Tagger: a tool for extraction and structuring medical information from clinical notes. medRxiv 2022:2022.12.14.22283470. PMID: 36561189; PMCID: PMC9774225; DOI: 10.1101/2022.12.14.22283470.
Abstract
Rapid, automated extraction of clinical information from patients' notes is a desirable though difficult task. Natural language processing (NLP) and machine learning have great potential to automate and accelerate such applications, but developing these models can require a large amount of labeled clinical text, which is slow and laborious to produce. To address this gap, we propose the MedDRA Tagger, a fast annotation tool that makes use of industrial-strength libraries such as spaCy, biomedical ontologies, and weak supervision to annotate and extract clinical concepts at scale. The tool can be used to annotate clinical text and obtain labels for training machine learning models and further refining clinical concept extraction performance, or to extract clinical concepts for observational studies. To demonstrate the usability and versatility of our tool, we present three different use cases: we use the tagger to identify patients with a primary brain cancer diagnosis, we show evidence of rising mental health symptoms at the population level, and we trace the evolution of COVID-19 symptomatology throughout three waves between February 2020 and October 2021. Validation of our tool showed good performance on both specific annotations from our development set (F1 score 0.81) and an open-source annotated dataset (F1 score 0.79). We successfully demonstrate the versatility of our pipeline with three different use cases. Finally, we note that the modular nature of our tool allows for straightforward adaptation to other biomedical ontologies. We also show that our tool is independent of the underlying EHR system and is therefore generalizable.
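The core of such a tagger, matching ontology terms against note text, can be sketched in a few lines. The actual MedDRA Tagger builds on spaCy pipelines and the full MedDRA ontology; the sketch below instead uses a plain-Python greedy longest-match over a tiny invented term dictionary, purely to illustrate the idea of dictionary-based concept tagging.

```python
# Minimal dictionary-based concept tagger (illustrative only; the real
# tool uses spaCy and the MedDRA ontology). Surface forms and labels
# below are invented examples.
TERMS = {
    "shortness of breath": "Dyspnoea",
    "headache": "Headache",
    "cough": "Cough",
}

def tag_concepts(text):
    """Greedy longest-match tagging of dictionary terms in `text`."""
    tokens = text.lower().split()
    spans, i = [], 0
    max_len = max(len(term.split()) for term in TERMS)
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, shrinking to one token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in TERMS:
                match = (candidate, TERMS[candidate])
                i += n
                break
        if match:
            spans.append(match)
        else:
            i += 1
    return spans

print(tag_concepts("Patient reports shortness of breath and cough"))
```

Weak supervision would then use tags produced this way as (noisy) training labels for a statistical model, rather than as final annotations.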
Affiliation(s)
- Marie Humbert-Droz
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA
- Suzanne Tamang
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
3
Abdelkader W, Navarro T, Parrish R, Cotoi C, Germini F, Iorio A, Haynes RB, Lokker C. Machine Learning Approaches to Retrieve High-Quality, Clinically Relevant Evidence From the Biomedical Literature: Systematic Review. JMIR Med Inform 2021; 9:e30401. PMID: 34499041; PMCID: PMC8461527; DOI: 10.2196/30401.
Abstract
BACKGROUND The rapid growth of the biomedical literature makes identifying strong evidence a time-consuming task. Applying machine learning to the process could be a viable solution that limits effort while maintaining accuracy. OBJECTIVE The goal of the research was to summarize the nature and comparative performance of machine learning approaches that have been applied to retrieve high-quality evidence for clinical consideration from the biomedical literature. METHODS We conducted a systematic review of studies that applied machine learning techniques to identify high-quality clinical articles in the biomedical literature. Multiple databases were searched through July 2020. Extracted data focused on the applied machine learning model, the steps in model development, and model performance. RESULTS Of 3918 retrieved studies, 10 met our inclusion criteria. All followed a supervised machine learning approach and applied a high-quality standard, chosen from a limited range of options, for training their models. The results show that machine learning can achieve a sensitivity of 95% while maintaining a high precision of 86%. CONCLUSIONS Machine learning approaches perform well in retrieving high-quality clinical studies. Performance may improve further with more sophisticated approaches such as active learning and unsupervised machine learning.
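The sensitivity (recall) and precision figures reported above combine into a single F-measure via the standard harmonic mean. The snippet below shows the arithmetic; the resulting value (about 0.90) is our own computation for illustration, not a figure from the review.

```python
def f1(precision, recall):
    """F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Sensitivity (recall) 0.95 and precision 0.86, as reported in the review:
print(round(f1(0.86, 0.95), 3))
```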
Affiliation(s)
- Wael Abdelkader
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Tamara Navarro
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Rick Parrish
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Chris Cotoi
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Federico Germini
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Department of Medicine, McMaster University, Hamilton, ON, Canada
- Alfonso Iorio
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Department of Medicine, McMaster University, Hamilton, ON, Canada
- R Brian Haynes
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
- Department of Medicine, McMaster University, Hamilton, ON, Canada
- Cynthia Lokker
- Health Information Research Unit, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada
4
Sung SF, Lin CY, Hu YH. EMR-Based Phenotyping of Ischemic Stroke Using Supervised Machine Learning and Text Mining Techniques. IEEE J Biomed Health Inform 2020; 24:2922-2931. DOI: 10.1109/jbhi.2020.2976931.
5
Khaleghi T, Murat A, Arslanturk S, Davies E. Automated Surgical Term Clustering: A Text Mining Approach for Unstructured Textual Surgery Descriptions. IEEE J Biomed Health Inform 2019; 24:2107-2118. PMID: 31796420; DOI: 10.1109/jbhi.2019.2956973.
Abstract
High costs in health care and the everlasting need for quality improvement in care delivery are increasingly motivating novel predictive studies in healthcare informatics. Surgical services affect both operating theatre costs and revenues and play a critical role in care quality. The efficiency of such units relies heavily on effective operational planning and inventory management. A key ingredient in such planning activities is the structured and unstructured data available before the day of surgery from electronic health records and other information systems. Unstructured data, such as the textual features of procedure descriptions and notes, provide additional information where structured data alone are not sufficient. To utilize textual information effectively with text mining, textual features should be easily identifiable, i.e., free of typographical errors and ad hoc abbreviations. While numerous spelling correction and abbreviation identification tools exist, they are not suitable for surgical medical text because they require a dictionary and cannot accommodate ad hoc words such as abbreviations. This study proposes a novel preprocessing framework for surgical text data that detects misspellings and abbreviations prior to the application of any text mining or predictive modeling. The proposed approach helps extract the most salient text features from the unstructured principal procedure descriptions and additional notes by effectively reducing the dimension of the raw feature set. The transformed text feature set thus improves subsequent prediction tasks in surgery units. We test and validate the proposed approach using datasets from multiple hospitals' surgical departments and benchmark feature sets.
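A standard building block for the misspelling detection described above is edit (Levenshtein) distance between an observed token and known terms. The sketch below is a generic illustration of that building block, not the paper's framework; the paper's contribution is precisely handling ad hoc abbreviations that plain dictionary lookup like this misses, and the vocabulary here is invented.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_term(token, vocabulary, max_dist=2):
    """Suggest the nearest known term within `max_dist` edits, else None."""
    best = min(vocabulary, key=lambda term: edit_distance(token, term))
    return best if edit_distance(token, best) <= max_dist else None

vocab = {"laparoscopy", "appendectomy", "arthroscopy"}  # invented sample
print(closest_term("appendectmy", vocab))
```

A dictionary-free approach, as in the paper, would instead infer likely corrections from term co-occurrence in the corpus itself.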
6
A Lightweight API-Based Approach for Building Flexible Clinical NLP Systems. J Healthc Eng 2019; 2019:3435609. PMID: 31511785; PMCID: PMC6714318; DOI: 10.1155/2019/3435609.
Abstract
Natural language processing (NLP) has become essential for secondary use of clinical data. Over the last two decades, many clinical NLP systems have been developed in both academia and industry. However, nearly all existing systems are restricted to specific clinical settings, mainly because they were developed for and tested with specific datasets, and they often fail to scale up. Therefore, using existing NLP systems for one's own clinical purposes requires substantial resources and long-term commitments for customization and testing, and maintenance is likewise troublesome and time-consuming. This research presents a lightweight approach for building clinical NLP systems with limited resources. Following the design science research approach, we propose a lightweight architecture designed to be composable, extensible, and configurable. It treats NLP as an external component that can be accessed independently and orchestrated in a pipeline via web APIs. To validate its feasibility, we developed a web-based prototype for clinical concept extraction with six well-known NLP APIs and evaluated it on three clinical datasets. In comparison with available benchmarks for the datasets, three high F1 scores (0.861, 0.724, and 0.805) were obtained in the evaluation. One test yielded a low F1 score (0.373), which is probably due to the small size of that test dataset. The development and evaluation of the prototype demonstrate that our approach has great potential for building effective clinical NLP systems with limited resources.
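The composable-pipeline idea can be sketched with plain function composition: each stage is an independent component that, in the paper's architecture, would be a web-API call rather than a local function. The stage names and logic below are invented for illustration only.

```python
# Sketch of a composable NLP pipeline in the spirit of the paper's
# architecture. Each stage takes and returns a document dict; in the
# real system each stage would wrap an external NLP web API.
def lowercase(doc):
    return {**doc, "text": doc["text"].lower()}

def tokenize(doc):
    return {**doc, "tokens": doc["text"].split()}

def extract_concepts(doc, vocabulary=("fever", "cough")):
    # Invented toy vocabulary; a real component would call a concept
    # extraction service.
    return {**doc, "concepts": [t for t in doc["tokens"] if t in vocabulary]}

def pipeline(*stages):
    """Compose stages left-to-right into a single callable."""
    def run(doc):
        for stage in stages:
            doc = stage(doc)
        return doc
    return run

nlp = pipeline(lowercase, tokenize, extract_concepts)
print(nlp({"text": "Fever and cough for 3 days"})["concepts"])
```

Because stages share only a document-shaped payload, any one of them can be swapped for a different API endpoint without touching the others, which is the configurability the paper argues for.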
7
Xu K, Zhou Z, Gong T, Hao T, Liu W. SBLC: a hybrid model for disease named entity recognition based on semantic bidirectional LSTMs and conditional random fields. BMC Med Inform Decis Mak 2018; 18:114. PMID: 30526592; PMCID: PMC6284263; DOI: 10.1186/s12911-018-0690-y.
Abstract
Background Disease named entity recognition (NER) is a fundamental step in the information processing of medical texts. However, disease NER involves complex issues in practice, such as descriptive modifiers. Accurate disease NER is still an open and essential research problem in medical information extraction and text mining. Methods A hybrid model named Semantic Bidirectional LSTM and CRF (SBLC) is proposed for the disease named entity recognition task. The model leverages word embeddings, bidirectional Long Short-Term Memory networks, and Conditional Random Fields. A publicly available NCBI disease dataset is used to evaluate the model by comparison with nine state-of-the-art baseline methods, including cTAKES, MetaMap, DNorm, C-Bi-LSTM-CRF, TaggerOne, and DNER. Results The results show that the SBLC model achieves an F1 score of 0.862 and outperforms the other methods. In addition, the model does not rely on external domain dictionaries and can thus be applied more conveniently in many areas of medical text processing. Conclusions In the performance comparison, the proposed SBLC model achieved the best results, demonstrating its effectiveness in disease named entity recognition.
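The decoding step of a (Bi)LSTM-CRF tagger such as SBLC is Viterbi search over emission and transition scores. The sketch below shows that step in isolation; the emission and transition numbers are invented toy values, whereas in SBLC they would come from the BiLSTM outputs and the learned CRF weights.

```python
# Minimal Viterbi decoder for BIO-style disease tagging.
def viterbi(emissions, transitions, tags):
    """emissions: one {tag: score} dict per token;
    transitions[(prev, cur)]: transition score."""
    score = {t: emissions[0][t] for t in tags}   # scores for first token
    back = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            prev_t = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev_t] + transitions[(prev_t, t)] + em[t]
            ptr[t] = prev_t
        score, back = new_score, back + [ptr]
    best = max(tags, key=lambda t: score[t])     # backtrack best path
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-Dis", "I-Dis"]
trans = {(a, b): 0.0 for a in tags for b in tags}
trans[("B-Dis", "I-Dis")] = 2.0    # reward B -> I continuation
trans[("O", "I-Dis")] = -5.0       # penalize I without a preceding entity
# Toy emission scores for the three tokens of "type 2 diabetes":
ems = [{"O": 1.0, "B-Dis": 2.0, "I-Dis": 0.0},
       {"O": 0.5, "B-Dis": 0.0, "I-Dis": 1.5},
       {"O": 0.0, "B-Dis": 0.5, "I-Dis": 2.0}]
print(viterbi(ems, trans, tags))
```

The transition scores are what let the CRF layer enforce tag-sequence constraints (for example, that I-Dis should not follow O) that per-token classification cannot.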
Affiliation(s)
- Kai Xu
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
- Zhanfan Zhou
- School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
- Tao Gong
- Educational Testing Service, Princeton, NJ, USA
- Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies, Guangzhou, China
- Tianyong Hao
- School of Computer Science, South China Normal University, Guangzhou, China
- Wenyin Liu
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
8
Segura-Bedmar I, Colón-Ruíz C, Tejedor-Alonso MÁ, Moro-Moro M. Predicting of anaphylaxis in big data EMR by exploring machine learning approaches. J Biomed Inform 2018; 87:50-59. DOI: 10.1016/j.jbi.2018.09.012.
9
Karystianis G, Thayer K, Wolfe M, Tsafnat G. Evaluation of a rule-based method for epidemiological document classification towards the automation of systematic reviews. J Biomed Inform 2017; 70:27-34. PMID: 28455150; DOI: 10.1016/j.jbi.2017.04.004.
Abstract
INTRODUCTION Most data extraction efforts in epidemiology focus on obtaining targeted information from clinical trials. In contrast, limited research has been conducted on identifying information in observational studies, a major source of human evidence in many fields, including environmental health. The recognition of key epidemiological information (e.g., exposures) through text mining techniques can assist in the automation of systematic reviews and other evidence summaries. METHOD We designed and applied a knowledge-driven, rule-based approach to identify targeted information (study design, participant population, exposure, outcome, confounding factors, and the country where the study was conducted) from abstracts of epidemiological studies included in several systematic reviews of environmental health exposures. The rules are based on common syntactic patterns observed in text and are thus not specific to any one systematic review. To validate the general applicability of our approach, we compared the data extracted using our approach against hand curation for 35 epidemiological study abstracts manually selected for inclusion in two systematic reviews. RESULTS The F-score, precision, and recall ranged from 70% to 98%, 81% to 100%, and 54% to 97%, respectively. The highest precision was observed for exposure, outcome, and population (100%), while recall was best for exposure and study design (97% and 89%, respectively). The lowest recall was observed for population (54%), which also had the lowest F-score (70%). CONCLUSION Our text-mining approach demonstrated encouraging performance in identifying targeted information from abstracts of observational epidemiological studies related to environmental exposures. We have demonstrated that rules based on generic syntactic patterns in one corpus can be applied to other observational study designs by simply interchanging the dictionaries used to identify particular characteristics (e.g., outcomes, exposures). At the document level, the recognised information can assist in the selection and categorisation of studies included in a systematic review.
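Syntactic-pattern rules of the kind described can be approximated with regular expressions keyed to small dictionaries. The patterns below are invented simplifications for two of the targeted characteristics (study design and country), not the authors' actual rule set.

```python
import re

# Invented, simplified patterns in the spirit of the paper's rules:
# each maps a lexical/syntactic pattern to one extracted characteristic.
RULES = {
    "study_design": re.compile(
        r"\b(cohort|case-control|cross-sectional)\s+study\b", re.I),
    "country": re.compile(r"\bconducted in ([A-Z][a-z]+)\b"),
}

def extract(abstract):
    """Apply each rule; return the first match per characteristic."""
    out = {}
    for field, pattern in RULES.items():
        m = pattern.search(abstract)
        if m:
            out[field] = m.group(1)
    return out

text = ("A cohort study conducted in Australia examined lead exposure "
        "and cognitive outcomes.")
print(extract(text))
```

Swapping the dictionary of design keywords (the alternation in `study_design`) is exactly the kind of "interchanging the dictionaries" the authors describe for porting rules to a new review topic.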
Affiliation(s)
- George Karystianis
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, Australia.
- Kristina Thayer
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Mary Wolfe
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Guy Tsafnat
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, Australia
10
Kim YM, Delen D. Medical informatics research trend analysis: A text mining approach. Health Informatics J 2016; 24:432-452. PMID: 30376768; DOI: 10.1177/1460458216678443.
Abstract
The objective of this research is to identify the major subject areas of medical informatics and explore the time-variant changes therein. As such, it can inform the field about where medical informatics research has been and where it is heading. Furthermore, by identifying subject areas, this study delineates the development trends and the boundaries of medical informatics as an academic field. To conduct the study, we first identified 26,307 articles in the PubMed archives that were published in the top medical informatics journals between 2002 and 2013. Then, employing a text mining-based, semi-automated analytic approach, we clustered major research topics by analyzing the most frequently appearing subject terms extracted from the abstracts of these articles. The results indicated that some subject areas, such as biomedical research, are declining, while others, such as health information technology (HIT), Internet-enabled research, and electronic medical/health records (EMR/EHR), are growing. The changes within the research subject areas can largely be attributed to the increasing capabilities and use of HIT. The Internet, for example, has changed the way medical research is conducted in the health care field. While discovering new medical knowledge through clinical and biological experiments remains important, the utilization of EMR/EHR data has enabled researchers to discover novel medical insights buried deep inside massive data sets, and hence data analytics research has become a common complement in the medical field, rapidly growing in popularity.
11
Scuba W, Tharp M, Mowery D, Tseytlin E, Liu Y, Drews FA, Chapman WW. Knowledge Author: facilitating user-driven, domain content development to support clinical information extraction. J Biomed Semantics 2016; 7:42. PMID: 27338146; PMCID: PMC4919842; DOI: 10.1186/s13326-016-0086-9.
Abstract
BACKGROUND Clinical natural language processing (NLP) systems require a semantic schema comprising domain-specific concepts, their lexical variants, and associated modifiers to accurately extract information from clinical texts. An NLP system leverages this schema to structure concepts and extract meaning from free text. In the clinical domain, creating a semantic schema typically requires input from both a domain expert, such as a clinician, and an NLP expert who represents the clinical concepts derived from the clinician's domain expertise in a computable format usable by an NLP system. The goal of this work is to develop a web-based tool, Knowledge Author, that bridges the gap between the clinical domain expert and NLP system development by facilitating the development of domain content represented in a semantic schema for extracting information from clinical free text. RESULTS Knowledge Author is a web-based recommendation system that supports users in developing the domain content necessary for clinical NLP applications. Knowledge Author's schematic model leverages a set of semantic types derived from the Secondary Use Clinical Element Models and the Common Type System to allow the user to quickly create and modify domain-related concepts. Features such as collaborative development and domain content suggestions, provided through the mapping of concepts to the Unified Medical Language System Metathesaurus, further support the content creation process. Two proof-of-concept studies were performed to evaluate the system. The first evaluated Knowledge Author's flexibility in creating a broad range of concepts: of a dataset of 115 concepts, 87 (76%) could be created using Knowledge Author. The second evaluated the effectiveness of Knowledge Author's output in an NLP system by extracting concepts and associated modifiers representing a clinical element, carotid stenosis, from 34 clinical free-text radiology reports using Knowledge Author and an NLP system, pyConText. Knowledge Author's domain content produced high recall for concepts (targeted findings: 86%) and varied recall for modifiers (certainty: 91%, sidedness: 80%, neurovascular anatomy: 46%). CONCLUSION Knowledge Author can support clinical domain content development for information extraction by enabling semantic schema creation by domain experts.
Affiliation(s)
- William Scuba
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Melissa Tharp
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Danielle Mowery
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
- Eugene Tseytlin
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, 15206, USA
- Yang Liu
- University of California, San Diego, CA, 92093, USA
- Frank A Drews
- Department of Psychology, University of Utah, Salt Lake City, UT, 84108, USA
- Wendy W Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, 84108, USA
12
Hernandez-Boussard T, Tamang S, Blayney D, Brooks J, Shah N. New Paradigms for Patient-Centered Outcomes Research in Electronic Medical Records: An Example of Detecting Urinary Incontinence Following Prostatectomy. EGEMS 2016; 4:1231. PMID: 27347492; PMCID: PMC4899050; DOI: 10.13063/2327-9214.1231.
Abstract
Introduction: National initiatives to develop quality metrics emphasize the need to include patient-centered outcomes. Patient-centered outcomes are complex, require documentation of patient communications, and have not been routinely collected by healthcare providers. The widespread implementation of electronic health records (EHRs) offers opportunities to assess patient-centered outcomes within the routine healthcare delivery system. The objective of this study was to test the feasibility and accuracy of identifying patient-centered outcomes within the EHR. Methods: Data from patients with localized prostate cancer undergoing prostatectomy were used to develop and test algorithms to accurately identify patient-centered outcomes in post-operative EHRs; we used urinary incontinence as the use case. Standard data mining techniques were used to extract and annotate free text and structured data to assess urinary incontinence recorded within the EHRs. Results: A total of 5,349 prostate cancer patients were identified in our EHR system between 1998 and 2013. Among these EHRs, 30.3% had a text mention of urinary incontinence within 90 days post-operatively, compared with less than 1.0% with a structured data field for urinary incontinence (i.e., an ICD-9 code). Our workflow had good precision and recall for urinary incontinence (positive predictive value: 0.73; sensitivity: 0.84). Discussion: Our data indicate that important patient-centered outcomes, such as urinary incontinence, are being captured in EHRs as free text and highlight the long-standing importance of accurate clinician documentation. Standard data mining algorithms can accurately and efficiently identify these outcomes in existing EHRs; the complete assessment of these outcomes is essential to move practice into the patient-centered realm of healthcare.
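Detecting a free-text outcome mention such as urinary incontinence typically needs at least a simple negation check, since clinical notes often record a symptom's absence. The sketch below is a generic, NegEx-style illustration with an invented negation list and context window; it is not the study's actual workflow.

```python
import re

# Invented, minimal negation cue list; real systems (e.g. NegEx) use
# much richer trigger sets and scope rules.
NEGATIONS = re.compile(r"\b(no|denies|without|negative for)\b", re.I)
TARGET = re.compile(r"\burinary incontinence\b", re.I)

def mentions_outcome(note, window=40):
    """True if the target phrase appears without a nearby preceding negation."""
    for m in TARGET.finditer(note):
        preceding = note[max(0, m.start() - window):m.start()]
        if not NEGATIONS.search(preceding):
            return True
    return False

print(mentions_outcome("Patient reports urinary incontinence at night."))
print(mentions_outcome("Denies urinary incontinence or dysuria."))
```

The 0.73 positive predictive value reported above hints at why such context handling matters: raw keyword matching over-counts negated and historical mentions.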
13
Karystianis G, Sheppard T, Dixon WG, Nenadic G. Modelling and extraction of variability in free-text medication prescriptions from an anonymised primary care electronic medical record research database. BMC Med Inform Decis Mak 2016; 16:18. PMID: 26860263; PMCID: PMC4748480; DOI: 10.1186/s12911-016-0255-x.
Abstract
Background Free-text medication prescriptions contain detailed instruction information that is key when preparing drug data for analysis. The objective of this study was to develop a novel model and automated text-mining method to extract detailed structured medication information from free-text prescriptions and to explore its variability (e.g. optional dosages) in primary care research databases. Methods We introduce a prescription model that provides minimum and maximum values for dose number, frequency, and interval, allowing variability and flexibility within a drug prescription to be modelled. We developed a rule-based text mining system to extract such structured information from free-text prescription dosage instructions. The system was applied to medication prescriptions from an anonymised primary care electronic record database (Clinical Practice Research Datalink, CPRD). Results We evaluated our approach on a test set of 220 CPRD free-text prescription directions. The system achieved an overall accuracy of 91% at the prescription level, with 97% accuracy across the attribute levels. We then analysed over 56,000 of the most common free-text prescriptions from CPRD records and found that 1 in 4 has inherent variability, i.e. a choice in how the medication is taken, specified by different minimum and maximum doses, durations, or frequencies. Conclusions Our approach provides an accurate, automated way of coding free-text prescription information, including information about flexibility and variability within a prescription. The method allows researchers to decide how best to prepare the prescription data for drug efficacy and safety analyses in any given setting, and to test various scenarios and their impact.
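The min/max prescription model can be illustrated with a single regular expression over one common direction pattern, such as "take 1-2 tablets every 4-6 hours". This is an invented simplification; the published rule system covers far more dose-unit, frequency, and duration variation than this.

```python
import re

# Matches directions like "take 1-2 tablets every 4-6 hours".
# Invented simplification of the paper's min/max prescription model.
PATTERN = re.compile(
    r"take\s+(\d+)(?:\s*-\s*(\d+))?\s+tablets?"
    r"(?:\s+every\s+(\d+)(?:\s*-\s*(\d+))?\s+hours?)?", re.I)

def parse_direction(text):
    """Return min/max dose and dosing interval; max defaults to min
    when no range is given, matching the fixed-dose case."""
    m = PATTERN.search(text)
    if not m:
        return None
    dose_min = int(m.group(1))
    dose_max = int(m.group(2) or dose_min)
    interval_min = int(m.group(3)) if m.group(3) else None
    interval_max = int(m.group(4)) if m.group(4) else interval_min
    return {"dose_min": dose_min, "dose_max": dose_max,
            "interval_min": interval_min, "interval_max": interval_max}

print(parse_direction("Take 1-2 tablets every 4-6 hours as needed"))
```

Representing every direction as a min/max pair, even when min equals max, is what lets downstream analyses quantify the "1 in 4" inherent variability the study reports.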
Affiliation(s)
- George Karystianis
- School of Computer Science, University of Manchester, Manchester, UK
- The Christie NHS Foundation Trust, Manchester, UK
- Therese Sheppard
- Arthritis Research UK Centre for Epidemiology, University of Manchester, Manchester, UK
- William G Dixon
- Arthritis Research UK Centre for Epidemiology, University of Manchester, Manchester, UK
- The Farr Institute of Health Informatics Research, Health eResearch Centre, Manchester, UK
- Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, UK
- The Farr Institute of Health Informatics Research, Health eResearch Centre, Manchester, UK
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
| |
14
An Introduction to Natural Language Processing: How You Can Get More From Those Electronic Notes You Are Generating. Pediatr Emerg Care 2015; 31:536-41. [PMID: 26148107 DOI: 10.1097/pec.0000000000000484] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]
Abstract
Electronically stored clinical documents may contain both structured data and unstructured data. The use of structured clinical data varies by facility, but clinicians are familiar with coded data such as International Classification of Diseases, Ninth Revision, and Systematized Nomenclature of Medicine-Clinical Terms codes, and commonly with other data such as patient chief complaints or laboratory results. Most electronic health records store much more clinical information as unstructured data; for example, clinical narrative such as the history of present illness, procedure notes, and clinical decision making are stored as unstructured data. Despite the importance of this information, electronic capture or retrieval of unstructured clinical data has been challenging. The field of natural language processing (NLP) is undergoing rapid development, and existing tools can be successfully used for quality improvement, research, healthcare coding, and even billing compliance. In this brief review, we provide examples of successful uses of NLP on emergency medicine physician visit notes for various projects, discuss the challenges of retrieving specific data, and finally present practical methods that can run on a standard personal computer as well as high-end, state-of-the-art funded processes run by leading NLP informatics researchers.
15
Bailey LC, Mistry KB, Tinoco A, Earls M, Rallins MC, Hanley K, Christensen K, Jones M, Woods D. Addressing electronic clinical information in the construction of quality measures. Acad Pediatr 2014; 14:S82-9. [PMID: 25169464 DOI: 10.1016/j.acap.2014.06.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/25/2013] [Revised: 06/10/2014] [Accepted: 06/12/2014] [Indexed: 10/24/2022]
Abstract
Electronic health records (EHR) and registries play a central role in health care and provide access to detailed clinical information at the individual, institutional, and population level. Use of these data for clinical quality/performance improvement and cost management has been a focus of policy initiatives over the past decade. The Children's Health Insurance Program Reauthorization Act of 2009 (CHIPRA)-mandated Pediatric Quality Measurement Program supports development and testing of quality measures for children on the basis of electronic clinical information, including de novo measures and respecification of existing measures designed for other data sources. Drawing on the experience of Centers of Excellence, we review both structural and pragmatic considerations in e-measurement. The presence of primary observations in EHR-derived data makes it possible to measure outcomes in ways that are difficult with administrative data alone. However, relevant information may be located in narrative text, making it difficult to interpret. EHR systems are collecting more discrete data, but the structure, semantics, and adoption of data elements vary across vendors and sites. EHR systems also differ in their ability to incorporate pediatric concepts such as variable dosing and growth percentiles. This variability complicates quality measurement, as do the limitations of established measure formats, such as the Quality Data Model, when applied to e-measurement. Addressing these challenges will require investment by vendors, researchers, and clinicians alike in developing better pediatric content for standard terminologies and data models, encouraging wider adoption of technical standards that support reliable quality measurement, better harmonizing data collection with clinical work flow in EHRs, and better understanding the behavior and potential of e-measures.
Affiliation(s)
- L Charles Bailey
- Department of Pediatrics, Children's Hospital of Philadelphia and Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pa.
- Aldo Tinoco
- National Committee for Quality Assurance, Washington, DC
- Marian Earls
- Community Care of North Carolina, Greensboro, NC
- Donna Woods
- Feinberg School of Medicine, Northwestern University, Chicago, Ill
16
Gobbel GT, Garvin J, Reeves R, Cronin RM, Heavirland J, Williams J, Weaver A, Jayaramaraja S, Giuse D, Speroff T, Brown SH, Xu H, Matheny ME. Assisted annotation of medical free text using RapTAT. J Am Med Inform Assoc 2014; 21:833-41. [PMID: 24431336 DOI: 10.1136/amiajnl-2013-002255] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
OBJECTIVE To determine whether assisted annotation using interactive training can reduce the time required to annotate a clinical document corpus without introducing bias. MATERIALS AND METHODS A tool, RapTAT, was designed to assist annotation by iteratively pre-annotating probable phrases of interest within a document, presenting the annotations to a reviewer for correction, and then using the corrected annotations for further machine learning-based training before pre-annotating subsequent documents. Annotators reviewed 404 clinical notes either manually or using RapTAT assistance for concepts related to quality of care during heart failure treatment. Notes were divided into 20 batches of 19-21 documents for iterative annotation and training. RESULTS The number of correct RapTAT pre-annotations increased significantly and annotation time per batch decreased by ~50% over the course of annotation. Annotation rate increased from batch to batch for assisted but not manual reviewers. Pre-annotation F-measure increased from 0.5 to 0.6 to >0.80 (relative to both assisted reviewer and reference annotations) over the first three batches and more slowly thereafter. Overall inter-annotator agreement was significantly higher between RapTAT-assisted reviewers (0.89) than between manual reviewers (0.85). DISCUSSION The tool reduced workload by decreasing the number of annotations needing to be added and helping reviewers to annotate at an increased rate. Agreement between the pre-annotations and reference standard, and agreement between the pre-annotations and assisted annotations, were similar throughout the annotation process, which suggests that pre-annotation did not introduce bias. CONCLUSIONS Pre-annotations generated by a tool capable of interactive training can reduce the time required to create an annotated document corpus by up to 50%.
Affiliation(s)
- Glenn T Gobbel
- Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Jennifer Garvin
- IDEAS Center, SLC VA Healthcare System, Salt Lake City, Utah, USA; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, Utah, USA; Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, Utah, USA; Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Salt Lake City, Utah, USA
- Ruth Reeves
- Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Robert M Cronin
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Julia Heavirland
- IDEAS Center, SLC VA Healthcare System, Salt Lake City, Utah, USA
- Jenifer Williams
- IDEAS Center, SLC VA Healthcare System, Salt Lake City, Utah, USA
- Allison Weaver
- IDEAS Center, SLC VA Healthcare System, Salt Lake City, Utah, USA
- Shrimalini Jayaramaraja
- Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Dario Giuse
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Theodore Speroff
- Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Steven H Brown
- Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas, USA
- Michael E Matheny
- Department of Veterans Affairs Medical Center, Geriatric Research, Education and Clinical Center (GRECC), Nashville, Tennessee, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; Division of General Internal Medicine & Public Health, Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
17
Ping XO, Tseng YJ, Chung Y, Wu YL, Hsu CW, Yang PM, Huang GT, Lai F, Liang JD. Information extraction for tracking liver cancer patients' statuses: from mixture of clinical narrative report types. Telemed J E Health 2013; 19:704-10. [PMID: 23869395 DOI: 10.1089/tmj.2012.0241] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE To provide an efficient way of tracking patients' condition over long periods of time and to facilitate the collection of clinical data from different types of narrative reports, it is critical to develop an efficient method for smoothly analyzing the clinical data accumulated in narrative reports. MATERIALS AND METHODS To facilitate liver cancer clinical research, a method was developed for extracting clinical factors from various types of narrative clinical reports, including ultrasound reports, radiology reports, pathology reports, operation notes, admission notes, and discharge summaries. An information extraction (IE) module was developed for tracking disease progression in liver cancer patients over time, and a rule-based classifier was developed for answering whether patients met the clinical research eligibility criteria. The classifier provided the answers and direct/indirect evidence (evidence sentences) for the clinical questions. To evaluate the implemented IE module and the classifier, gold-standard annotations and answers were developed manually, and the results of the implemented system were compared with the gold standard. RESULTS The IE module achieved an F-score from 92.40% to 99.59%, and the classifier achieved accuracy from 96.15% to 100%. CONCLUSIONS The application was successfully applied to the various types of narrative clinical reports. It might also be applied to key information extraction for other types of cancer patients.
Affiliation(s)
- Xiao-Ou Ping
- Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
18
Kottke TE, Baechler CJ. An algorithm that identifies coronary and heart failure events in the electronic health record. Prev Chronic Dis 2013; 10:E29. [PMID: 23449283 PMCID: PMC3592787 DOI: 10.5888/pcd10.120097] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Introduction The advent of universal health care coverage in the United States and the use of electronic health records can make the medical record a disease surveillance tool. The objective of our study was to identify criteria that accurately categorize acute coronary and heart failure events by using electronic health record data exclusively, so that the medical record can be used for surveillance without manual record review. Methods We serially compared 3 computer algorithms to manual record review. The first 2 algorithms relied on ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) codes, troponin levels, electrocardiogram (ECG) data, and echocardiograph data. The third algorithm relied on a detailed coding system, Intelligent Medical Objects, Inc. (IMO) interface terminology, troponin levels, and echocardiograph data. Results Cohen’s κ for the initial algorithm was 0.47 (95% confidence interval [CI], 0.41–0.54). Cohen’s κ was 0.61 (95% CI, 0.55–0.68) for the second algorithm. Cohen’s κ for the third algorithm was 0.99 (95% CI, 0.98–1.00). Conclusion Electronic medical record data are sufficient to categorize coronary heart disease and heart failure events without manual record review. However, only moderate agreement with medical record review can be achieved when the classification is based on 4-digit ICD-9-CM codes, because ICD-9-CM 410.9 includes both myocardial infarction with elevation of the ST segment on ECG (STEMI) and myocardial infarction without elevation of the ST segment on ECG (nSTEMI). Nearly perfect agreement can be achieved using IMO interface terminology, a more detailed coding system that maps to ICD-9-CM, ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification), and SNOMED CT (Systematized Nomenclature of Medicine-Clinical Terms).
Affiliation(s)
- Thomas E Kottke
- HealthPartners Institute for Education and Research, Minneapolis, MN 55440-1524, USA.
19
Wu Y, Denny JC, Rosenbloom ST, Miller RA, Giuse DA, Xu H. A comparative study of current Clinical Natural Language Processing systems on handling abbreviations in discharge summaries. AMIA Annu Symp Proc 2012; 2012:997-1003. [PMID: 23304375 PMCID: PMC3540461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
Clinical Natural Language Processing (NLP) systems extract clinical information from narrative clinical texts in many settings. Previous research mentions the challenges of handling abbreviations in clinical texts, but provides little insight into how well current NLP systems correctly recognize and interpret abbreviations. In this paper, we compared the performance of three existing clinical NLP systems in handling abbreviations: MetaMap, MedLEE, and cTAKES. The evaluation used an expert-annotated gold standard set of clinical documents (derived from 32 de-identified patient discharge summaries) containing 1,112 abbreviations. The existing NLP systems achieved suboptimal performance in abbreviation identification, with F-scores ranging from 0.165 to 0.601. MedLEE achieved the best F-score of 0.601 for all abbreviations and 0.705 for clinically relevant abbreviations. This study suggested that accurate identification of clinical abbreviations is a challenging task and that more advanced abbreviation recognition modules might improve existing clinical NLP systems.
Affiliation(s)
- Yonghui Wu
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA
20
Haerian K, Varn D, Vaidya S, Ena L, Chase HS, Friedman C. Detection of pharmacovigilance-related adverse events using electronic health records and automated methods. Clin Pharmacol Ther 2012; 92:228-34. [PMID: 22713699 DOI: 10.1038/clpt.2012.54] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Electronic health records (EHRs) are an important source of data for detection of adverse drug reactions (ADRs). However, adverse events are frequently due not to medications but to the patients' underlying conditions. Mining to detect ADRs from EHR data must account for confounders. We developed an automated method using natural-language processing (NLP) and a knowledge source to differentiate cases in which the patient's disease is responsible for the event rather than a drug. Our method was applied to 199,920 hospitalization records, concentrating on two serious ADRs: rhabdomyolysis (n = 687) and agranulocytosis (n = 772). Our method automatically identified 75% of the cases, those with disease etiology. The sensitivity and specificity were 93.8% (confidence interval: 88.9-96.7%) and 91.8% (confidence interval: 84.0-96.2%), respectively. The method resulted in considerable saving of time: for every 1 h spent in development, there was a saving of at least 20 h in manual review. The review of the remaining 25% of the cases therefore became more feasible, allowing us to identify the medications that had caused the ADRs.
Affiliation(s)
- K Haerian
- Department of Biomedical Informatics, Columbia University, New York, NY, USA.
21
Harkema H, Chapman WW, Saul M, Dellon ES, Schoen RE, Mehrotra A. Developing a natural language processing application for measuring the quality of colonoscopy procedures. J Am Med Inform Assoc 2011; 18 Suppl 1:i150-6. [PMID: 21946240 PMCID: PMC3241178 DOI: 10.1136/amiajnl-2011-000431] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2011] [Accepted: 08/18/2011] [Indexed: 12/24/2022] Open
Abstract
OBJECTIVE The quality of colonoscopy procedures for colorectal cancer screening is often inadequate and varies widely among physicians. Routine measurement of quality is limited by the costs of manual review of free-text patient charts. Our goal was to develop a natural language processing (NLP) application to measure colonoscopy quality. MATERIALS AND METHODS Using a set of quality measures published by physician specialty societies, we implemented an NLP engine that extracts 21 variables for 19 quality measures from free-text colonoscopy and pathology reports. We evaluated the performance of the NLP engine on a test set of 453 colonoscopy reports and 226 pathology reports, considering accuracy in extracting the values of the target variables from text, and the reliability of the outcomes of the quality measures as computed from the NLP-extracted information. RESULTS The average accuracy of the NLP engine over all variables was 0.89 (range: 0.62-1.0) and the average F measure over all variables was 0.74 (range: 0.49-0.89). The average agreement score, measured as Cohen's κ, between the manually established and NLP-derived outcomes of the quality measures was 0.62 (range: 0.09-0.86). DISCUSSION For nine of the 19 colonoscopy quality measures, the agreement score was 0.70 or above, which we consider a sufficient score for the NLP-derived outcomes of these measures to be practically useful for quality measurement. CONCLUSION The use of NLP for information extraction from free-text colonoscopy and pathology reports creates opportunities for large scale, routine quality measurement, which can support quality improvement in colonoscopy care.
Affiliation(s)
- Henk Harkema
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA.
22
Lin JW, Chang CH, Lin MW, Ebell MH, Chiang JH. Automating the process of critical appraisal and assessing the strength of evidence with information extraction technology. J Eval Clin Pract 2011; 17:832-8. [PMID: 21707873 DOI: 10.1111/j.1365-2753.2011.01712.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
BACKGROUND Critical appraisal, one of the most crucial steps in the practice of evidence-based medicine, is expertise-dependent and time-consuming. The objective of this study was to develop and evaluate an automated text-mining system that could determine the evidence level provided by a medical article. METHODS A text processor was designed and built to interpret the abstracts of medical literature. The system extracted information about: (1) the impact factor of the journal; (2) study design; (3) human subject involvement; (4) number of subjects; (5) P-value; and (6) confidence intervals. We used a classification tree algorithm (C4.5) to create a decision tree using supervised classification. Each article was categorized into evidence level A, B or C, and the output was compared to that determined by domain experts (the reference standard). RESULTS We used a corpus of 3180 cardiovascular disease original research articles, of which 1108 were previously assigned evidence level A, 1705 level B and 367 level C by domain experts. The abstracts were analysed by our automated system and an evidence level was assigned. The algorithm accurately classified 85% of the articles. The agreement between computer and domain experts was substantial (κ-value: 0.78). Cross-validation showed consistent results across repeated tests. CONCLUSION The automated engine accurately classified the evidence level. Misclassification might have resulted from incomplete information retrieval and inaccurate data extraction. Further efforts will focus on assessing relevance and using additional study design features to refine evidence level classification.
Affiliation(s)
- Jou-Wei Lin
- Cardiovascular Center, National Taiwan University Hospital Yun-Lin Branch, Dou-Liou City, Taiwan
23
Buczak AL, Babin S, Moniz L. Data-driven approach for creating synthetic electronic medical records. BMC Med Inform Decis Mak 2010; 10:59. [PMID: 20946670 PMCID: PMC2972239 DOI: 10.1186/1472-6947-10-59] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2010] [Accepted: 10/14/2010] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND New algorithms for disease outbreak detection are being developed to take advantage of full electronic medical records (EMRs) that contain a wealth of patient information. However, due to privacy concerns, even anonymized EMRs cannot be shared among researchers, resulting in great difficulty in comparing the effectiveness of these algorithms. To bridge the gap between novel bio-surveillance algorithms operating on full EMRs and the lack of non-identifiable EMR data, a method for generating complete and synthetic EMRs was developed. METHODS This paper describes a novel methodology for generating complete synthetic EMRs both for an outbreak illness of interest (tularemia) and for background records. The method developed has three major steps: 1) synthetic patient identity and basic information generation; 2) identification of care patterns that the synthetic patients would receive based on the information present in real EMR data for similar health problems; 3) adaptation of these care patterns to the synthetic patient population. RESULTS We generated EMRs, including visit records, clinical activity, laboratory orders/results and radiology orders/results, for 203 synthetic tularemia outbreak patients. Validation of the records by a medical expert revealed problems in 19% of the records; these were subsequently corrected. We also generated background EMRs for over 3000 patients in the 4-11 yr age group. Validation of those records by a medical expert revealed problems in fewer than 3% of these background patient EMRs, and the errors were subsequently rectified. CONCLUSIONS A data-driven method was developed for generating fully synthetic EMRs. The method is general and can be applied to any data set that has similar data elements (such as laboratory and radiology orders and results, clinical activity, prescription orders). The pilot synthetic outbreak records were for tularemia, but our approach may be adapted to other infectious diseases. The pilot synthetic background records were in the 4-11 year old age group. The adaptations that must be made to the algorithms to produce synthetic background EMRs for other age groups are indicated.
Affiliation(s)
- Anna L Buczak
- Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd, Laurel, MD 20723-6099, USA
- Steven Babin
- Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd, Laurel, MD 20723-6099, USA
- Linda Moniz
- Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Rd, Laurel, MD 20723-6099, USA