51
Xie F, Zheng C, Yuh-Jer Shen A, Chen W. Extracting and analyzing ejection fraction values from electronic echocardiography reports in a large health maintenance organization. Health Informatics J 2016; 23:319-328. [PMID: 27271114 DOI: 10.1177/1460458216651917]
Abstract
The left ventricular ejection fraction value is an important prognostic indicator of cardiovascular outcomes, including morbidity and mortality, and is often used clinically to indicate severity of heart disease. However, it is usually reported in free-text echocardiography reports. We developed and validated a computerized algorithm to extract ejection fraction values from echocardiography reports and applied the algorithm to a large volume of unstructured echocardiography reports generated between 1995 and 2011 in a large health maintenance organization. A total of 621,856 echocardiography reports with a description of ejection fraction values or systolic function were identified, of which 70 percent contained numeric ejection fraction values and the rest (30%) were text descriptions explicitly indicating left ventricular systolic function. Of these extracted ejection fraction values, 12.1 percent (16.0% for men and 8.4% for women) were below 45 percent. Validation based on a random sample of 200 reports yielded 95.0 percent sensitivity and 96.9 percent positive predictive value.
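The abstract does not reproduce the rule set itself, but the task it describes (capturing a numeric EF, or mapping a qualitative phrase to an EF range) is typically handled with regular expressions plus a phrase dictionary. A minimal Python sketch of that idea; the patterns, phrases, and assumed EF ranges below are illustrative, not the authors' algorithm:

```python
import re

# Illustrative patterns only; the study's actual rules are not published in the abstract.
EF_NUMERIC = re.compile(
    r"\b(?:ejection\s+fraction|LVEF|EF)\b\s*(?:is|of|=|:)?\s*"
    r"(?:(\d{1,2})\s*(?:-|to)\s*)?(\d{1,2})\s*%",
    re.IGNORECASE,
)

# Qualitative wordings mapped to assumed EF ranges.
QUALITATIVE = {
    "normal systolic function": (55, 70),
    "mildly reduced systolic function": (45, 54),
    "moderately reduced systolic function": (30, 44),
    "severely reduced systolic function": (10, 29),
}

def extract_ef(report_text: str):
    """Return (low, high) EF bounds from a report, or None if nothing matches."""
    m = EF_NUMERIC.search(report_text)
    if m:
        high = int(m.group(2))
        low = int(m.group(1)) if m.group(1) else high
        return (low, high)
    lowered = report_text.lower()
    for phrase, bounds in QUALITATIVE.items():
        if phrase in lowered:
            return bounds
    return None

print(extract_ef("LV systolic function: ejection fraction 50-55%."))  # (50, 55)
print(extract_ef("Severely reduced systolic function."))              # (10, 29)
```

A production version would need a much larger phrase dictionary plus section detection and negation handling, which this sketch omits.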
Affiliation(s)
- Fagen Xie
- Kaiser Permanente Southern California, USA
- Wansu Chen
- Kaiser Permanente Southern California, USA
52
Sauer BC, Jones BE, Globe G, Leng J, Lu CC, He T, Teng CC, Sullivan P, Zeng Q. Performance of a Natural Language Processing (NLP) Tool to Extract Pulmonary Function Test (PFT) Reports from Structured and Semistructured Veteran Affairs (VA) Data. EGEMS 2016; 4:1217. [PMID: 27376095 PMCID: PMC4909376 DOI: 10.13063/2327-9214.1217]
Abstract
Introduction/Objective: Pulmonary function tests (PFTs) are objective estimates of lung function, but are not reliably stored within the Veteran Health Affairs data systems as structured data. The aim of this study was to validate the natural language processing (NLP) tool we developed—which extracts spirometric values and responses to bronchodilator administration—against expert review, and to estimate the number of additional spirometric tests identified beyond the structured data. Methods: All patients at seven Veteran Affairs Medical Centers with a diagnostic code for asthma Jan 1, 2006–Dec 31, 2012 were included. Evidence of spirometry with a bronchodilator challenge (BDC) was extracted from structured data as well as clinical documents. NLP’s performance was compared against a human reference standard using a random sample of 1,001 documents. Results: In the validation set NLP demonstrated a precision of 98.9 percent (95 percent confidence intervals (CI): 93.9 percent, 99.7 percent), recall of 97.8 percent (95 percent CI: 92.2 percent, 99.7 percent), and an F-measure of 98.3 percent for the forced vital capacity pre- and post pairs and precision of 100 percent (95 percent CI: 96.6 percent, 100 percent), recall of 100 percent (95 percent CI: 96.6 percent, 100 percent), and an F-measure of 100 percent for the forced expiratory volume in one second pre- and post pairs for bronchodilator administration. Application of the NLP increased the proportion identified with complete bronchodilator challenge by 25 percent. Discussion/Conclusion: This technology can improve identification of PFTs for epidemiologic research. Caution must be taken in assuming that a single domain of clinical data can completely capture the scope of a disease, treatment, or clinical test.
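The precision, recall, and F-measure figures with 95% confidence intervals quoted above can be reproduced from a confusion matrix; the sketch below uses the Wilson score interval, which is one common choice and is assumed here rather than taken from the paper, together with hypothetical counts:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score 95% confidence interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - half), min(1.0, center + half))

def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts from comparing NLP output to a human reference standard.
tp, fp, fn = 90, 1, 2
precision, recall, f1 = prf(tp, fp, fn)
lo, hi = wilson_ci(tp, tp + fp)
print(f"precision={precision:.3f} (95% CI {lo:.3f}-{hi:.3f})")
lo, hi = wilson_ci(tp, tp + fn)
print(f"recall={recall:.3f} (95% CI {lo:.3f}-{hi:.3f})")
print(f"F-measure={f1:.3f}")
```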
Affiliation(s)
- Brian C Sauer
- Salt Lake IDEAS Center, Veteran Affairs; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Barbara E Jones
- Salt Lake IDEAS Center, Veteran Affairs; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Jianwei Leng
- Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Chao-Chin Lu
- Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Tao He
- Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Chia-Chen Teng
- Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Patrick Sullivan
- Department of Pharmacy Practice, School of Pharmacy, Regis University
- Qing Zeng
- Salt Lake IDEAS Center, Veteran Affairs; Department of Biomedical Informatics, School of Medicine, University of Utah
53
Mowery DL, Chapman BE, Conway M, South BR, Madden E, Keyhani S, Chapman WW. Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis. J Biomed Semantics 2016; 7:26. [PMID: 27175226 PMCID: PMC4863379 DOI: 10.1186/s13326-016-0065-1]
Abstract
BACKGROUND In the United States, 795,000 people suffer strokes each year; 10-15% of these strokes can be attributed to stenosis caused by plaque in the carotid artery, a major stroke phenotype risk factor. Studies comparing treatments for the management of asymptomatic carotid stenosis are challenging for at least two reasons: 1) administrative billing codes (i.e., Current Procedural Terminology (CPT) codes) that identify carotid images do not denote which neurovascular arteries are affected and 2) the majority of the image reports are negative for carotid stenosis. Studies that rely on manual chart abstraction can be labor-intensive, expensive, and time-consuming. Natural Language Processing (NLP) can expedite the process of manual chart abstraction by automatically filtering reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings, thus potentially reducing effort, costs, and time. METHODS In this pilot study, we conducted an information content analysis of carotid stenosis mentions in terms of their report location (sections), report format (structure), and linguistic description (expressions) from Veteran Health Administration free-text reports. We assessed the ability of an NLP algorithm, pyConText, to discern reports with significant carotid stenosis findings from reports with no/insignificant carotid stenosis findings given these three document composition factors for two report types: radiology (RAD) and text integration utility (TIU) notes. RESULTS We observed that most carotid mentions are recorded in prose using categorical expressions, within the Findings and Impression sections for RAD reports and within neither of these designated sections for TIU notes. For RAD reports, pyConText performed with high sensitivity (88%), specificity (84%), and negative predictive value (95%) and reasonable positive predictive value (70%). For TIU notes, pyConText performed with high specificity (87%) and negative predictive value (92%), reasonable sensitivity (73%), and moderate positive predictive value (58%). pyConText performed with the highest sensitivity when processing the full report rather than the Findings or Impression sections independently. CONCLUSION We conclude that pyConText can reduce chart review effort by filtering reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings from the Veteran Health Administration electronic health record, and hence has utility for expediting a comparative effectiveness study of treatment strategies for stroke prevention.
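pyConText follows the ConText idea of linking a target mention (here, carotid stenosis) to nearby modifiers such as negation or severity. The snippet below is a deliberately simplified, generic illustration of that idea in plain Python; it is not pyConText's API, and the keyword lists are assumptions rather than the study's lexicon:

```python
import re

# Illustrative target and modifier lexicons (assumptions, not the study's rule set).
TARGETS = re.compile(r"\b(stenosis|narrowing|occlusion)\b", re.IGNORECASE)
SEVERE = re.compile(r"\b(severe|critical|high[- ]grade|70|80|90)\b", re.IGNORECASE)
NEGATED = re.compile(r"\b(no|without|negative for|free of)\b", re.IGNORECASE)

def classify_sentence(sentence: str, window: int = 40) -> str:
    """Label a sentence as 'significant', 'insignificant', or 'no finding'."""
    m = TARGETS.search(sentence)
    if not m:
        return "no finding"
    # Look for modifiers in a character window on either side of the target mention.
    start, end = max(0, m.start() - window), m.end() + window
    context = sentence[start:end]
    if NEGATED.search(context):
        return "insignificant"
    if SEVERE.search(context):
        return "significant"
    return "insignificant"

def classify_report(report: str) -> str:
    """Flag the report if any sentence carries a significant finding."""
    labels = [classify_sentence(s) for s in re.split(r"[.\n]", report) if s.strip()]
    return "significant" if "significant" in labels else "insignificant/none"

print(classify_report("No evidence of carotid stenosis."))
print(classify_report("Severe stenosis of the right internal carotid artery."))
```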
Affiliation(s)
- Danielle L. Mowery
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- IDEAS Center, Veterans Affairs Health Care System, Salt Lake City, UT, USA
- Brian E. Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- IDEAS Center, Veterans Affairs Health Care System, Salt Lake City, UT, USA
- Mike Conway
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- Brett R. South
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- IDEAS Center, Veterans Affairs Health Care System, Salt Lake City, UT, USA
- Erin Madden
- San Francisco Veterans Affairs Health Care System, San Francisco, CA, USA
- Salomeh Keyhani
- San Francisco Veterans Affairs Health Care System, San Francisco, CA, USA
- Wendy W. Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA
- IDEAS Center, Veterans Affairs Health Care System, Salt Lake City, UT, USA
54
A Natural Language Processing Tool for Large-Scale Data Extraction from Echocardiography Reports. PLoS One 2016; 11:e0153749. [PMID: 27124000 PMCID: PMC4849652 DOI: 10.1371/journal.pone.0153749]
Abstract
Large volumes of data are continuously generated from clinical notes and diagnostic studies catalogued in electronic health records (EHRs). Echocardiography is one of the most commonly ordered diagnostic tests in cardiology. This study sought to explore the feasibility and reliability of using natural language processing (NLP) for large-scale and targeted extraction of multiple data elements from echocardiography reports. An NLP tool, EchoInfer, was developed to automatically extract data pertaining to cardiovascular structure and function from heterogeneously formatted echocardiographic data sources. EchoInfer was applied to echocardiography reports (2004 to 2013) available from 3 different on-going clinical research projects. EchoInfer analyzed 15,116 echocardiography reports from 1684 patients, and extracted 59 quantitative and 21 qualitative data elements per report. EchoInfer achieved a precision of 94.06%, a recall of 92.21%, and an F1-score of 93.12% across all 80 data elements in 50 reports. Physician review of 400 reports demonstrated that EchoInfer achieved a recall of 92–99.9% and a precision of >97% in four data elements, including three quantitative and one qualitative data element. Failure of EchoInfer to correctly identify or reject reported parameters was primarily related to non-standardized reporting of echocardiography data. EchoInfer provides a powerful and reliable NLP-based approach for the large-scale, targeted extraction of information from heterogeneous data sources. The use of EchoInfer may have implications for the clinical management and research analysis of patients undergoing echocardiographic evaluation.
55
Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health. Adv Exp Med Biol 2016; 939:139-166. [PMID: 27807747 DOI: 10.1007/978-981-10-1503-8_7]
Abstract
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text, found in biomedical publications and clinical notes, is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.
56
Bellows BK, DuVall SL, Kamauu AWC, Supina D, Babcock T, LaFleur J. Healthcare costs and resource utilization of patients with binge-eating disorder and eating disorder not otherwise specified in the Department of Veterans Affairs. Int J Eat Disord 2015; 48:1082-91. [PMID: 25959636 DOI: 10.1002/eat.22427]
Abstract
OBJECTIVE The objective of this study was to compare the one-year healthcare costs and utilization of patients with binge-eating disorder (BED) to patients with eating disorder not otherwise specified without BED (EDNOS-only) and to matched patients without an eating disorder (NED). METHODS A natural language processing (NLP) algorithm identified adults with BED from clinical notes in the Department of Veterans Affairs (VA) electronic health record database from 2000 to 2011. Patients with EDNOS-only were identified using ICD-9 code (307.50) and those with NLP-identified BED were excluded. First diagnosis date defined the index date for both groups. Patients with NED were randomly matched 4:1, as available, to patients with BED on age, sex, BMI, depression diagnosis, and index month. Patients with cost data (2005-2011) were included. Total healthcare, inpatient, outpatient, and pharmacy costs were examined. Generalized linear models were used to compare total one-year healthcare costs while adjusting for baseline patient characteristics. RESULTS There were 257 BED, 743 EDNOS-only, and 823 matched NED patients identified. The mean (SD) total unadjusted one-year costs, in 2011 US dollars, were $33,716 ($38,928) for BED, $37,052 ($40,719) for EDNOS-only, and $19,548 ($35,780) for NED patients. When adjusting for patient characteristics, BED patients had one-year total healthcare costs $5,589 higher than EDNOS-only (p = 0.06) and $18,152 higher than matched NED patients (p < 0.001). DISCUSSION This study is the first to use NLP to identify BED patients and quantify their healthcare costs and utilization. Patients with BED had similar one-year total healthcare costs to EDNOS-only patients, but significantly higher costs than patients with NED.
Affiliation(s)
- Brandon K Bellows
- VA Salt Lake City Health Care System, Salt Lake City, Utah; Department of Pharmacotherapy, University of Utah College of Pharmacy, Salt Lake City, Utah
- Scott L DuVall
- VA Salt Lake City Health Care System, Salt Lake City, Utah; Department of Pharmacotherapy, University of Utah College of Pharmacy, Salt Lake City, Utah; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, Utah
- Joanne LaFleur
- VA Salt Lake City Health Care System, Salt Lake City, Utah; Department of Pharmacotherapy, University of Utah College of Pharmacy, Salt Lake City, Utah
57
Fine-grained information extraction from German transthoracic echocardiography reports. BMC Med Inform Decis Mak 2015; 15:91. [PMID: 26563260 PMCID: PMC4643516 DOI: 10.1186/s12911-015-0215-x]
Abstract
Background Information extraction techniques that get structured representations out of unstructured data make a large amount of clinically relevant information about patients accessible for semantic applications. These methods typically rely on standardized terminologies that guide this process. Many languages and clinical domains, however, lack appropriate resources and tools, as well as evaluations of their applications, especially if detailed conceptualizations of the domain are required. For instance, German transthoracic echocardiography reports have not been targeted sufficiently before, despite their importance for clinical trials. This work therefore aimed at the development and evaluation of an information extraction component with a fine-grained terminology that can recognize almost all relevant information stated in German transthoracic echocardiography reports at the University Hospital of Würzburg. Methods A domain expert validated and iteratively refined an automatically inferred base terminology. The terminology was used by an ontology-driven information extraction system that outputs attribute-value pairs. The final component was mapped to the central elements of a standardized terminology and evaluated on documents with different layouts. Results The final system achieved state-of-the-art precision (micro average 0.996) and recall (micro average 0.961) on 100 test documents that represent more than 90% of all reports. In particular, principal aspects as defined in a standardized external terminology were recognized with F1 = 0.989 (micro average) and F1 = 0.963 (macro average). As a result of keyword matching and restrained concept extraction, the system also obtained high precision on unstructured or exceptionally short documents and on documents with uncommon layouts. Conclusions The developed terminology and the proposed information extraction system allow extraction of fine-grained information from German semi-structured transthoracic echocardiography reports with very high precision and high recall on the majority of documents at the University Hospital of Würzburg. Extracted results populate a clinical data warehouse which supports clinical research.
58
Chen W, Kowatch R, Lin S, Splaingard M, Huang Y. Interactive Cohort Identification of Sleep Disorder Patients Using Natural Language Processing and i2b2. Appl Clin Inform 2015; 6:345-63. [PMID: 26171080 DOI: 10.4338/aci-2014-11-ra-0106]
Abstract
UNLABELLED Nationwide Children's Hospital established an i2b2 (Informatics for Integrating Biology & the Bedside) application for sleep disorder cohort identification. Discrete data were gleaned from semistructured sleep study reports. The system was shown to work more efficiently than the traditional manual chart review method, and it also enabled searches that were previously not possible. OBJECTIVE We report on the development and implementation of the sleep disorder i2b2 cohort identification system using natural language processing of semi-structured documents. METHODS We developed a natural language processing approach to automatically parse concepts and their values from semi-structured sleep study documents. Two parsers were developed: a regular expression parser for extracting numeric concepts and an NLP-based tree parser for extracting textual concepts. Concepts were further organized into i2b2 ontologies based on document structures and in-domain knowledge. RESULTS 26,550 concepts were extracted, with 99% being textual concepts. 1.01 million facts were extracted from sleep study documents, including demographic information, sleep study lab results, medications, procedures, and diagnoses, among others. The average accuracy of terminology parsing was over 83% when compared against expert annotations. The system is capable of capturing both standard and non-standard terminologies. The time for cohort identification has been reduced significantly, from a few weeks to a few seconds. CONCLUSION Natural language processing was shown to be powerful for quickly converting large amounts of semi-structured or unstructured clinical data into discrete concepts, which, in combination with intuitive domain-specific ontologies, allows fast and effective interactive cohort identification through the i2b2 platform for research and clinical use.
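A regular-expression parser for numeric concepts, as mentioned in the methods, can be sketched as a table of concept patterns applied to each report; the concept names and patterns below are assumptions for illustration, not the hospital's ontology:

```python
import re

# Assumed numeric concepts from a sleep study report; not the actual i2b2 ontology.
NUMERIC_CONCEPTS = {
    "apnea_hypopnea_index": re.compile(r"(?:AHI|apnea[- ]hypopnea index)\D{0,10}?(\d+(?:\.\d+)?)", re.IGNORECASE),
    "total_sleep_time_min": re.compile(r"total sleep time\D{0,10}?(\d+(?:\.\d+)?)", re.IGNORECASE),
    "sleep_efficiency_pct": re.compile(r"sleep efficiency\D{0,10}?(\d+(?:\.\d+)?)", re.IGNORECASE),
}

def parse_numeric_concepts(report: str) -> dict:
    """Extract concept -> value pairs that could be loaded as discrete i2b2 facts."""
    facts = {}
    for concept, pattern in NUMERIC_CONCEPTS.items():
        m = pattern.search(report)
        if m:
            facts[concept] = float(m.group(1))
    return facts

text = "Total sleep time: 412.5 minutes. Sleep efficiency: 88%. AHI = 7.2 events/hour."
print(parse_numeric_concepts(text))
# {'apnea_hypopnea_index': 7.2, 'total_sleep_time_min': 412.5, 'sleep_efficiency_pct': 88.0}
```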
Affiliation(s)
- W Chen
- Research Information Solutions and Innovations, Columbus, OH
- R Kowatch
- Center for Innovation in Pediatric Practice, Columbus, OH
- S Lin
- Research Information Solutions and Innovations, Columbus, OH
- M Splaingard
- Sleep Disorder Center, Nationwide Children's Hospital, Columbus, OH
- Y Huang
- Research Information Solutions and Innovations, Columbus, OH
59
Meng F, Morioka C. Automating the generation of lexical patterns for processing free text in clinical documents. J Am Med Inform Assoc 2015; 22:980-6. [PMID: 25977405 DOI: 10.1093/jamia/ocv012]
Abstract
OBJECTIVE Many tasks in natural language processing utilize lexical pattern-matching techniques, including information extraction (IE), negation identification, and syntactic parsing. However, it is generally difficult to derive patterns that achieve acceptable levels of recall while also remaining highly precise. MATERIALS AND METHODS We present a multiple sequence alignment (MSA)-based technique that automatically generates patterns, thereby leveraging language usage to determine the context of words that influence a given target. MSAs capture the commonalities among word sequences and are able to reveal areas of linguistic stability and variation. In this way, MSAs provide a systematic approach to generating lexical patterns that are generalizable, which will both increase recall levels and maintain high levels of precision. RESULTS The MSA-generated patterns exhibited consistent F1, F0.5, and F2 scores compared to two baseline techniques for IE across four different tasks. Both baseline techniques performed well for some tasks and less well for others, but MSA was found to consistently perform at a high level for all four tasks. DISCUSSION The performance of MSA on the four extraction tasks indicates the method's versatility. The results show that the MSA-based patterns are able to handle the extraction of individual data elements as well as relations between two concepts without the need for large amounts of manual intervention. CONCLUSION We presented an MSA-based framework for generating lexical patterns that showed consistently high levels of both performance and recall over four different extraction tasks when compared to baseline methods.
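The MSA idea is to align many example word sequences, keep the positions that stay constant, and generalize the varying positions into wildcard slots. The toy sketch below approximates this with pairwise token alignment via difflib, which is only a stand-in for a true multiple sequence aligner; the example sentences are invented:

```python
from difflib import SequenceMatcher

def generalize(seq_a: list, seq_b: list, wildcard: str = "<*>") -> list:
    """Keep tokens shared by both sequences; collapse differing runs into a wildcard."""
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    pattern = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            pattern.extend(seq_a[a0:a1])
        else:
            # A variable region: the aligned sequences disagree here.
            if not pattern or pattern[-1] != wildcard:
                pattern.append(wildcard)
    return pattern

examples = [
    "ejection fraction is estimated at 55 %".split(),
    "ejection fraction is estimated at 30 %".split(),
    "ejection fraction is approximately 45 %".split(),
]

# Fold the examples together pairwise to approximate a multiple alignment.
pattern = examples[0]
for ex in examples[1:]:
    pattern = generalize(pattern, ex)
print(" ".join(pattern))   # ejection fraction is <*> %
```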
Affiliation(s)
- Frank Meng
- Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, CA, USA; MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Craig Morioka
- Medical Imaging Informatics Group, Department of Radiological Sciences, University of California, Los Angeles, CA, USA; Department of Radiology, VA Greater Los Angeles Healthcare System, Los Angeles, CA, USA
60
Iyngkaran P, Biddargardi N, Bastiampillai T, Beneby G. Making sense of health care delivery: where does the close to community health care worker fit in? The case for congestive heart failure. Indian Heart J 2015; 67:250-8. [PMID: 26138183 PMCID: PMC4495591 DOI: 10.1016/j.ihj.2015.03.013]
Abstract
Close to community health care workers (CTC-HCW) is an increasingly used term to describe the emergence of a new partner in health services delivery. In strengthening arguments for this part of the health workforce the authorities, health staffers, supporters, sceptics and perhaps clients will look to the academicians and the evidence base to determine the fate of this group. There is no doubt, CTC-HCW are a vital resource, whose importance is tied to socio-demo-geographic variables. Regardless of what the common perceptions of its importance are, the evolving evidence base could suggest either way. In this short commentary we would like to highlight the importance of a balanced and common sense approach in these arguments. An important example is heart failure where the majority have an associated comorbidity and one in four would also suffer with cognitive or mood disturbances. It is unclear how the CTC-HCW would fare for this devastating syndrome. In moving forward it is important we understand there are: strengths and limitations in the evidence gathering processes; indecision as to the questions; uncertainty of the starting points to gather evidence; and sociodemogeographic biases, which have to be factored before determining the fate of this much needed health care resource.
Affiliation(s)
- P Iyngkaran
- Consultant Cardiologist, Research Fellow & Senior Lecturer, Flinders University, NT Medical School, Darwin, Australia
- N Biddargardi
- A/Prof eHealth Research, Medical School, Flinders University, Australia; Manager, Mental Health Observatory Research Unit, Country Health SALHN, Margaret Tobin Centre, GPO Box 2100, Adelaide 5001, South Australia, Australia
- T Bastiampillai
- Clinical Director, Southern Adelaide Mental Health Services, Australia; Associate Professor, Mental Health, Flinders University, Australia
- G Beneby
- Chief Medical Officer, Commonwealth of the Bahamas
61
Berndt DJ, McCart JA, Finch DK, Luther SL. A Case Study of Data Quality in Text Mining Clinical Progress Notes. ACM Trans Manag Inf Syst 2015. [DOI: 10.1145/2669368]
Abstract
Text analytic methods are often aimed at extracting useful information from the vast array of unstructured, free format text documents that are created by almost all organizational processes. The success of any text mining application rests on the quality of the underlying data being analyzed, including both predictive features and outcome labels. In this case study, some focused experiments regarding data quality are used to assess the robustness of Statistical Text Mining (STM) algorithms when applied to clinical progress notes. In particular, the experiments consider the impacts of task complexity (by removing signals), training set size, and target outcome quality. While this research is conducted using a dataset drawn from the medical domain, the data quality issues explored are of more general interest.
Affiliation(s)
- Donald J. Berndt
- University of South Florida and VA Consortium for Healthcare Informatics Research; HSR&D Center of Innovation on Disability and Rehabilitation Research, James A. Haley Veterans Hospital, Tampa, FL
- James A. McCart
- VA Consortium for Healthcare Informatics Research; HSR&D Center of Innovation on Disability and Rehabilitation Research, James A. Haley Veterans Hospital, Tampa, FL
- Dezon K. Finch
- VA Consortium for Healthcare Informatics Research; HSR&D Center of Innovation on Disability and Rehabilitation Research, James A. Haley Veterans Hospital, Tampa, FL
- Stephen L. Luther
- VA Consortium for Healthcare Informatics Research; HSR&D Center of Innovation on Disability and Rehabilitation Research, James A. Haley Veterans Hospital, Tampa, FL
62
Meystre SM, Kim Y, Heavirland J, Williams J, Bray BE, Garvin J. Heart Failure Medications Detection and Prescription Status Classification in Clinical Narrative Documents. Stud Health Technol Inform 2015; 216:609-613. [PMID: 26262123 PMCID: PMC5009609]
Abstract
Angiotensin Converting Enzyme Inhibitors (ACEI) and Angiotensin II Receptor Blockers (ARB) are two common medication classes used for heart failure treatment. The ADAHF (Automated Data Acquisition for Heart Failure) project aimed at automatically extracting heart failure treatment performance metrics from clinical narrative documents, and these medications are an important component of the performance metrics. We developed two different systems to detect these medications: rule-based and machine learning-based. The rule-based system used dictionary lookups with fuzzy string searching and performed well even though our corpus contains many misspelled medication names. The machine learning-based system used lexical and morphological features and produced similar results. The best performance was achieved when combining the two methods, reaching 99.3% recall and 98.8% precision. To determine the prescription status of each medication (i.e., active, discontinued, or negative), we implemented an SVM classifier with lexical features and achieved good performance, reaching 95.49% accuracy in a five-fold cross-validation evaluation.
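The rule-based arm is a dictionary lookup with fuzzy string matching so that misspelled drug names still match. A minimal sketch using difflib's similarity ratio; the drug list and the 0.8 cutoff are assumptions, not the ADAHF configuration:

```python
from difflib import SequenceMatcher

# Assumed ACEI/ARB lexicon; the real system used a larger curated dictionary.
MEDICATIONS = ["lisinopril", "enalapril", "captopril", "losartan", "valsartan", "candesartan"]

def fuzzy_find_medications(note: str, threshold: float = 0.8):
    """Return (token, matched drug, similarity) for tokens close to a known drug name."""
    hits = []
    for raw_token in note.lower().split():
        token = raw_token.strip(".,;:()")
        for drug in MEDICATIONS:
            score = SequenceMatcher(None, token, drug).ratio()
            if score >= threshold:
                hits.append((token, drug, round(score, 2)))
    return hits

note = "Continue lisinopril 10 mg daily; patient previously on losartin."
print(fuzzy_find_medications(note))
# [('lisinopril', 'lisinopril', 1.0), ('losartin', 'losartan', 0.88)]
```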
Affiliation(s)
- Stéphane M Meystre
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Youngjun Kim
- School of Computing, University of Utah, Salt Lake City, Utah, USA
- Bruce E Bray
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Jennifer Garvin
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
63
Wells QS, Farber-Eger E, Crawford DC. Extraction of echocardiographic data from the electronic medical record is a rapid and efficient method for study of cardiac structure and function. J Clin Bioinforma 2014; 4:12. [PMID: 25276338 PMCID: PMC4177384 DOI: 10.1186/2043-9113-4-12]
Abstract
Background Measures of cardiac structure and function are important human phenotypes that are associated with a range of clinical outcomes. Studying these traits in large populations can be time consuming and costly. Utilizing data from large electronic medical records (EMRs) is one possible solution to this problem. We describe the extraction and filtering of quantitative transthoracic echocardiographic data from the Epidemiologic Architecture for Genes Linked to Environment (EAGLE) study, a large, racially diverse, EMR-based cohort (n = 15,863). Results There were 6,076 echocardiography reports for 2,834 unique adult subjects. Missing data were uncommon with over 90% of data points present. Data irregularities are primarily related to inconsistent use of measurement units and transcriptional errors. The reported filtering method requires manual review of very few data points (<1%), and filtered echocardiographic parameters are similar to published data from epidemiologic populations of similar ethnicity. Moreover, the cohort is comparable in size, and in some cases larger than community-based cohorts of similar race/ethnicity. Conclusions These results demonstrate that echocardiographic data can be efficiently extracted from EMRs, and suggest that EMR-based cohorts have the potential to make major contributions toward the study of epidemiologic and genotype-phenotype associations for cardiac structure and function in diverse populations.
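The filtering step mainly has to reconcile inconsistent units and flag implausible values for manual review. A sketch of how such a filter might look for one parameter; the canonical unit and plausibility range are illustrative assumptions, not the EAGLE rules:

```python
# Assumed canonical unit (cm) and plausibility range for LV end-diastolic diameter.
PLAUSIBLE_LVEDD_CM = (2.0, 9.0)

def normalize_lvedd(value: float, unit: str):
    """Convert an LVEDD measurement to cm and flag values outside a plausible range."""
    unit = unit.strip().lower()
    if unit == "mm":
        value_cm = value / 10.0
    elif unit == "cm":
        value_cm = value
    else:
        return None, "unknown unit -> manual review"
    low, high = PLAUSIBLE_LVEDD_CM
    if not (low <= value_cm <= high):
        return value_cm, "out of range -> manual review"
    return value_cm, "ok"

for raw in [(4.8, "cm"), (48, "mm"), (48, "cm"), (5.1, "")]:
    print(raw, "->", normalize_lvedd(*raw))
```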
Affiliation(s)
- Quinn S Wells
- Department of Medicine, Vanderbilt University, Nashville, TN 37232, USA; Department of Pharmacology, Vanderbilt University, Nashville, TN 37232, USA; Vanderbilt University Medical Center, 2525 West End Avenue, Suite 300, Nashville, TN 37203, USA
- Eric Farber-Eger
- Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232, USA
- Dana C Crawford
- Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232, USA; Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN 37232, USA
64
Can Natural Language Processing Fulfill the Promise of Electronic Medical Records? J Card Fail 2014; 20:465-6. [DOI: 10.1016/j.cardfail.2014.04.020]
65
Pathak J, Kho AN, Denny JC. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. J Am Med Inform Assoc 2014; 20:e206-11. [PMID: 24302669 DOI: 10.1136/amiajnl-2013-002428]
Affiliation(s)
- Jyotishman Pathak
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
66
Bradford W, Hurdle JF, LaSalle B, Facelli JC. Development of a HIPAA-compliant environment for translational research data and analytics. J Am Med Inform Assoc 2014; 21:185-9. [PMID: 23911553 PMCID: PMC3912719 DOI: 10.1136/amiajnl-2013-001769]
Abstract
High-performance computing centers (HPC) traditionally have far less restrictive privacy management policies than those encountered in healthcare. We show how an HPC can be re-engineered to accommodate clinical data while retaining its utility in computationally intensive tasks such as data mining, machine learning, and statistics. We also discuss deploying protected virtual machines. A critical planning step was to engage the university's information security operations and the information security and privacy office. Access to the environment requires a double authentication mechanism. The first level of authentication requires access to the university's virtual private network and the second requires that the users be listed in the HPC network information service directory. The physical hardware resides in a data center with controlled room access. All employees of the HPC and its users take the university's local Health Insurance Portability and Accountability Act training series. In the first 3 years, researcher count has increased from 6 to 58.
Affiliation(s)
- Wayne Bradford
- Center for High Performance Computing, University of Utah Health Sciences Center, Salt Lake City, Utah, USA
- John F Hurdle
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Bernie LaSalle
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
- Julio C Facelli
- Center for High Performance Computing, University of Utah Health Sciences Center, Salt Lake City, Utah, USA
- Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
67
Ng K, Ghoting A, Steinhubl SR, Stewart WF, Malin B, Sun J. PARAMO: a PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. J Biomed Inform 2013; 48:160-70. [PMID: 24370496 DOI: 10.1016/j.jbi.2013.12.012]
Abstract
OBJECTIVE Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that can be used to simplify and expedite this process for health data. METHODS To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported. RESULTS We assess the performance of PARAMO on various workloads using three datasets derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center and an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency against a standard approach. In particular, PARAMO can build 800 different models on a 300,000-patient data set in 3 hours in parallel, compared to 9 days if run sequentially. CONCLUSION This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors and speed up the research workflow and reuse of health information. This platform is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.
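The core mechanics described in the methods (build a dependency graph of pipeline tasks, order it topologically, run independent tasks in parallel) can be sketched compactly. The toy version below uses a thread pool rather than Map-Reduce, and the task names are placeholders rather than PARAMO's actual pipeline specification:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

def run(task: str) -> str:
    # Placeholder for real work (cohort queries, feature builds, model fits).
    return f"{task} done"

# Each task maps to the set of tasks it depends on (placeholder pipeline).
dependencies = {
    "cohort": set(),
    "features": {"cohort"},
    "labels": {"cohort"},
    "cross_validation": {"features", "labels"},
    "classification": {"cross_validation"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while sorter.is_active():
        ready = sorter.get_ready()              # tasks whose prerequisites have finished
        for task, result in zip(ready, pool.map(run, ready)):
            print(result)
            sorter.done(task)                   # unlocks downstream tasks
```

After "cohort" completes, "features" and "labels" become ready together and are dispatched to the pool in the same round, which is the parallelism the platform exploits at scale.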
Affiliation(s)
- Kenney Ng
- IBM TJ Watson Research Center, Yorktown Heights, NY, United States
- Amol Ghoting
- IBM TJ Watson Research Center, Yorktown Heights, NY, United States
- Steven R Steinhubl
- Scripps Translational Science Institute, La Jolla, CA, United States; Geisinger Medical Center, Danville, PA, United States
- Bradley Malin
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, United States; Department of Electrical Engineering and Computer Science, School of Engineering, Vanderbilt University, Nashville, TN, United States
- Jimeng Sun
- IBM TJ Watson Research Center, Yorktown Heights, NY, United States
68
Validating a natural language processing tool to exclude psychogenic nonepileptic seizures in electronic medical record-based epilepsy research. Epilepsy Behav 2013; 29:578-80. [PMID: 24135384 DOI: 10.1016/j.yebeh.2013.09.025]
Abstract
RATIONALE As electronic health record (EHR) systems become more available, they will serve as an important resource for collecting epidemiologic data in epilepsy research. However, since clinicians do not have a systematic method for coding psychogenic nonepileptic seizures (PNES), patients with PNES are often misclassified as having epilepsy, leading to sampling error. This study validates a natural language processing (NLP) tool that uses linguistic information to help identify patients with PNES. METHODS Using the VA national clinical database, 2200 notes of Iraq and Afghanistan veterans who completed video electroencephalograph (VEEG) monitoring were reviewed manually, and the veterans were identified as having documented PNES or not. Reviewers identified PNES-related vocabulary to inform an NLP tool called the Yale cTAKES Extension (YTEX). Using NLP techniques, YTEX annotates syntactic constructs, named entities, and their negation context in the EHR. These annotations are passed to a classifier to detect patients without PNES. The classifier was evaluated by calculating positive predictive value (PPV), sensitivity, and F-score. RESULTS Of the 742 Iraq and Afghanistan veterans who received a diagnosis of epilepsy or seizure disorder by VEEG, 44 had documented events on VEEG: 22 veterans (3.0%) had definite PNES only, 20 (2.7%) had probable PNES, and 2 (0.3%) had both PNES and epilepsy documented. The remaining 698 veterans did not have events captured during the VEEG admission and/or did not have a definitive diagnosis. Our classifier achieved a PPV of 93%, a sensitivity of 99%, and an F-score of 96%. CONCLUSION Our study demonstrates that the YTEX NLP tool and classifier are highly accurate in excluding PNES, diagnosed with VEEG, in EHR systems. The tool may be very valuable in preventing false positive identification of patients with epilepsy in EHR-based epidemiologic research.
69
Abhyankar S, Demner-Fushman D. A simple method to extract key maternal data from neonatal clinical notes. AMIA Annu Symp Proc 2013; 2013:2-9. [PMID: 24551317 PMCID: PMC3900117]
Abstract
Knowledge about maternal history is critical for guiding certain aspects of newborn clinical care as well as for research on neonatal issues. However, often the only maternal history available in the newborn record is in the clinical notes. We are using data from the MIMIC-II database for a clinical study on newborns admitted to the intensive care unit. Important maternal data were only available in the newborn notes, so we developed a simple algorithm to extract those data. We manually derived patterns for maternal age, gravida/para status, and laboratory results by reviewing a small set of notes. Using regular expressions and specific filters for notes and results, we extracted maternal data with recall of 0.91-0.99 and precision of 0.95-1.0 for the 289 infants in our study. Our methods could be used with other research datasets and with clinical documentation systems to extract maternal data into a more useful, structured format.
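Maternal age and gravida/para status are usually written in compact shorthand ("31 yo G2 P1" and variants), which is why simple regular expressions work well. A small sketch of that approach; the patterns are illustrative, not the ones published by the authors:

```python
import re

AGE = re.compile(r"(\d{2})[- ]?(?:yo|y/o|year[- ]old)", re.IGNORECASE)
GRAVIDA_PARA = re.compile(r"\bG\s*(\d+)\s*,?\s*P\s*(\d+)", re.IGNORECASE)

def extract_maternal_data(note: str) -> dict:
    """Pull maternal age and gravida/para counts from a neonatal admission note."""
    data = {}
    if (m := AGE.search(note)):
        data["maternal_age"] = int(m.group(1))
    if (m := GRAVIDA_PARA.search(note)):
        data["gravida"], data["para"] = int(m.group(1)), int(m.group(2))
    return data

note = "Baby girl born at 38 weeks to a 31 yo G2 P1 mother, GBS negative."
print(extract_maternal_data(note))   # {'maternal_age': 31, 'gravida': 2, 'para': 1}
```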
70
Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc 2013; 21:221-30. [PMID: 24201027 PMCID: PMC3932460 DOI: 10.1136/amiajnl-2013-001935]
Abstract
Objective To summarize literature describing approaches aimed at automatically identifying patients with a common phenotype. Materials and methods We performed a review of studies describing systems or reporting techniques developed for identifying cohorts of patients with specific phenotypes. Every full text article published in (1) Journal of American Medical Informatics Association, (2) Journal of Biomedical Informatics, (3) Proceedings of the Annual American Medical Informatics Association Symposium, and (4) Proceedings of Clinical Research Informatics Conference within the past 3 years was assessed for inclusion in the review. Only articles using automated techniques were included. Results Ninety-seven articles met our inclusion criteria. Forty-six used natural language processing (NLP)-based techniques, 24 described rule-based systems, 41 used statistical analyses, data mining, or machine learning techniques, while 22 described hybrid systems. Nine articles described the architecture of large-scale systems developed for determining cohort eligibility of patients. Discussion We observe that there is a rise in the number of studies associated with cohort identification using electronic medical records. Statistical analyses or machine learning, followed by NLP techniques, are gaining popularity over the years in comparison with rule-based systems. Conclusions There are a variety of approaches for classifying patients into a particular phenotype. Different techniques and data sources are used, and good performance is reported on datasets at respective institutions. However, no system makes comprehensive use of electronic medical records addressing all of their known weaknesses.
Affiliation(s)
- Chaitanya Shivade
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
71
Bellows BK, LaFleur J, Kamauu AWC, Ginter T, Forbush TB, Agbor S, Supina D, Hodgkins P, DuVall SL. Automated identification of patients with a diagnosis of binge eating disorder from narrative electronic health records. J Am Med Inform Assoc 2013; 21:e163-8. [PMID: 24201026 DOI: 10.1136/amiajnl-2013-001859]
Abstract
Binge eating disorder (BED) does not have an International Classification of Diseases, 9th or 10th edition code, but is included under 'eating disorder not otherwise specified' (EDNOS). This historical cohort study identified patients with clinician-diagnosed BED from electronic health records (EHR) in the Department of Veterans Affairs between 2000 and 2011 using natural language processing (NLP) and compared their characteristics to patients identified by EDNOS diagnosis codes. NLP identified 1487 BED patients with classification accuracy of 91.8% and sensitivity of 96.2% compared to human review. After applying study inclusion criteria, 525 patients had NLP-identified BED only, 1354 had EDNOS only, and 68 had both BED and EDNOS. Patient characteristics were similar between the groups. This is the first study to use NLP as a method to identify BED patients from EHR data and will allow further epidemiological study of patients with BED in systems with adequate clinical notes.
72
Davis MF, Sriram S, Bush WS, Denny JC, Haines JL. Automated extraction of clinical traits of multiple sclerosis in electronic medical records. J Am Med Inform Assoc 2013; 20:e334-40. [PMID: 24148554 PMCID: PMC3861927 DOI: 10.1136/amiajnl-2013-001999]
Abstract
Objectives The clinical course of multiple sclerosis (MS) is highly variable, and research data collection is costly and time consuming. We evaluated natural language processing techniques applied to electronic medical records (EMR) to identify MS patients and the key clinical traits of their disease course. Materials and methods We used four algorithms based on ICD-9 codes, text keywords, and medications to identify individuals with MS from a de-identified, research version of the EMR at Vanderbilt University. Using a training dataset of the records of 899 individuals, algorithms were constructed to identify and extract detailed information regarding the clinical course of MS from the text of the medical records, including clinical subtype, presence of oligoclonal bands, year of diagnosis, year and origin of first symptom, Expanded Disability Status Scale (EDSS) scores, timed 25-foot walk scores, and MS medications. Algorithms were evaluated on a test set validated by two independent reviewers. Results We identified 5789 individuals with MS. For all clinical traits extracted, precision was at least 87% and specificity was greater than 80%. Recall values for clinical subtype, EDSS scores, and timed 25-foot walk scores were greater than 80%. Discussion and conclusion This collection of clinical data represents one of the largest databases of detailed, clinical traits available for research on MS. This work demonstrates that detailed clinical information is recorded in the EMR and can be extracted for research purposes with high reliability.
Affiliation(s)
- Mary F Davis
- Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
73
Byrd RJ, Steinhubl SR, Sun J, Ebadollahi S, Stewart WF. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. Int J Med Inform 2013; 83:983-92. [PMID: 23317809 DOI: 10.1016/j.ijmedinf.2012.12.005]
Abstract
OBJECTIVE Early detection of Heart Failure (HF) could mitigate the enormous individual and societal burden from this disease. Clinical detection is based, in part, on recognition of the multiple signs and symptoms comprising the Framingham HF diagnostic criteria that are typically documented, but not necessarily synthesized, by primary care physicians well before more specific diagnostic studies are done. We developed a natural language processing (NLP) procedure to identify Framingham HF signs and symptoms among primary care patients, using electronic health record (EHR) clinical notes, as a prelude to pattern analysis and clinical decision support for early detection of HF. DESIGN We developed a hybrid NLP pipeline that performs two levels of analysis: (1) At the criteria mention level, a rule-based NLP system is constructed to annotate all affirmative and negative mentions of Framingham criteria. (2) At the encounter level, we construct a system to label encounters according to whether any Framingham criterion is asserted, denied, or unknown. MEASUREMENTS Precision, recall, and F-score are used as performance metrics for criteria mention extraction and for encounter labeling. RESULTS Our criteria mention extractions achieve a precision of 0.925, a recall of 0.896, and an F-score of 0.910. Encounter labeling achieves an F-score of 0.932. CONCLUSION Our system accurately identifies and labels affirmations and denials of Framingham diagnostic criteria in primary care clinical notes and may help in the attempt to improve the early detection of HF. With adaptation and tooling, our development methodology can be repeated in new problem settings.
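The two-level design (annotate each affirmative or negated criterion mention, then roll mentions up to an encounter-level label of asserted, denied, or unknown) can be sketched roughly as below; the criterion keywords and negation cues are illustrative, not the system's actual rule set:

```python
import re

# A few illustrative Framingham HF criteria and negation cues (not the full rule set).
CRITERIA = {
    "paroxysmal nocturnal dyspnea": r"paroxysmal nocturnal dyspnea|PND",
    "rales": r"\brales\b|crackles",
    "ankle edema": r"ankle (?:edema|oedema)|pedal edema",
}
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)

def annotate_mentions(note: str):
    """Yield (criterion, 'affirmed' | 'negated') for each mention in the note."""
    for sentence in re.split(r"[.\n]", note):
        for name, pattern in CRITERIA.items():
            if re.search(pattern, sentence, re.IGNORECASE):
                yield name, ("negated" if NEGATION.search(sentence) else "affirmed")

def label_encounter(note: str) -> str:
    """'asserted' if any criterion is affirmed, 'denied' if only negated, else 'unknown'."""
    statuses = [status for _, status in annotate_mentions(note)]
    if "affirmed" in statuses:
        return "asserted"
    return "denied" if statuses else "unknown"

note = "Patient denies paroxysmal nocturnal dyspnea. Bibasilar crackles on exam."
print(list(annotate_mentions(note)))
print(label_encounter(note))   # asserted: crackles are affirmed despite the negated PND
```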
Affiliation(s)
- Roy J Byrd
- IBM T. J. Watson Research Center, Yorktown Heights, NY, United States
- Steven R Steinhubl
- Geisinger Medical Center, Center for Health Research, Danville, PA, United States
- Jimeng Sun
- IBM T. J. Watson Research Center, Yorktown Heights, NY, United States
- Walter F Stewart
- Sutter Health, Research, Development, & Dissemination, Concord, CA, United States
74
Combining HCI, Natural Language Processing, and Knowledge Discovery - Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field. Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data 2013. [DOI: 10.1007/978-3-642-39146-0_2]
75
Liu M, Shah A, Jiang M, Peterson NB, Dai Q, Aldrich MC, Chen Q, Bowton EA, Liu H, Denny JC, Xu H. A study of transportability of an existing smoking status detection module across institutions. AMIA Annu Symp Proc 2012; 2012:577-86. [PMID: 23304330 PMCID: PMC3540509]
Abstract
Electronic Medical Records (EMRs) are valuable resources for clinical observational studies. Smoking status of a patient is one of the key factors for many diseases, but it is often embedded in narrative text. Natural language processing (NLP) systems have been developed for this specific task, such as the smoking status detection module in the clinical Text Analysis and Knowledge Extraction System (cTAKES). This study examined transportability of the smoking module in cTAKES on the Vanderbilt University Hospital's EMR data. Our evaluation demonstrated that modest effort of change is necessary to achieve desirable performance. We modified the system by filtering notes, annotating new data for training the machine learning classifier, and adding rules to the rule-based classifiers. Our results showed that the customized module achieved significantly higher F-measures at all levels of classification (i.e., sentence, document, patient) compared to the direct application of the cTAKES module to the Vanderbilt data.
Affiliation(s)
- Mei Liu
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN, USA
76
Jonnalagadda SR, Del Fiol G, Medlin R, Weir C, Fiszman M, Mostafa J, Liu H. Automatically extracting sentences from Medline citations to support clinicians' information needs. J Am Med Inform Assoc 2012; 20:995-1000. [PMID: 23100128 DOI: 10.1136/amiajnl-2012-001347]
Abstract
OBJECTIVE Online health knowledge resources contain answers to most of the information needs raised by clinicians in the course of care. However, significant barriers limit the use of these resources for decision-making, especially clinicians' lack of time. In this study we assessed the feasibility of automatically generating knowledge summaries for a particular clinical topic composed of relevant sentences extracted from Medline citations. METHODS The proposed approach combines information retrieval and semantic information extraction techniques to identify relevant sentences from Medline abstracts. We assessed this approach in two case studies on the treatment alternatives for depression and Alzheimer's disease. RESULTS A total of 515 of 564 (91.3%) sentences retrieved in the two case studies were relevant to the topic of interest. About one-third of the relevant sentences described factual knowledge or a study conclusion that can be used for supporting information needs at the point of care. CONCLUSIONS The high rate of relevant sentences is desirable, given that clinicians' lack of time is one of the main barriers to using knowledge resources at the point of care. Sentence rank was not significantly associated with relevancy, possibly due to most sentences being highly relevant. Sentences located closer to the end of the abstract and sentences with treatment and comparative predications were likely to be conclusive sentences. Our proposed technical approach to helping clinicians meet their information needs is promising. The approach can be extended for other knowledge resources and information need types.
77
Verspoor K, Cohen KB, Hunter L. The textual characteristics of traditional and Open Access scientific journals are similar. BMC Bioinformatics 2009; 10:183. [PMID: 19527520 PMCID: PMC2714574 DOI: 10.1186/1471-2105-10-183]
Abstract
BACKGROUND Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption. RESULTS We collected two sets of documents, one consisting only of Open Access publications and the other consisting only of traditional journal publications. We examined them for differences in surface linguistic structures that have obvious consequences for the ease or difficulty of natural language processing and for differences in semantic content as reflected in lexical items. Regarding surface linguistic structures, we examined the incidence of conjunctions, negation, passives, and pronominal anaphora, and found that the two collections did not differ. We also examined the distribution of sentence lengths and found that both collections were characterized by the same mode. Regarding lexical items, we found that the Kullback-Leibler divergence between the two collections was low, and was lower than the divergence between either collection and a reference corpus. Where small differences did exist, log likelihood analysis showed that they were primarily in the area of formatting and in specific named entities. CONCLUSION We did not find structural or semantic differences between the Open Access and traditional journal collections.
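The lexical comparison rests on the Kullback-Leibler divergence between the word distributions of the two collections, D_KL(P||Q) = sum over w of P(w) log(P(w)/Q(w)). A small sketch with add-one smoothing over a shared vocabulary (the smoothing choice is an assumption; some smoothing is needed to keep zero counts from making the divergence undefined):

```python
from collections import Counter
from math import log2

def kl_divergence(corpus_p: list, corpus_q: list) -> float:
    """D_KL(P || Q) over word unigrams, with add-one smoothing on a shared vocabulary."""
    counts_p, counts_q = Counter(corpus_p), Counter(corpus_q)
    vocab = set(counts_p) | set(counts_q)
    total_p = sum(counts_p.values()) + len(vocab)
    total_q = sum(counts_q.values()) + len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (counts_p[word] + 1) / total_p
        q = (counts_q[word] + 1) / total_q
        divergence += p * log2(p / q)
    return divergence

open_access = "the protein binds the receptor and inhibits signaling".split()
traditional = "the protein binds a receptor and blocks signaling".split()
print(round(kl_divergence(open_access, traditional), 4))
```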
Affiliation(s)
- Karin Verspoor
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, PO Box 6511, MS 8303, Aurora, CO 80045, USA
- K Bretonnel Cohen
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, PO Box 6511, MS 8303, Aurora, CO 80045, USA
- Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, PO Box 6511, MS 8303, Aurora, CO 80045, USA
78
Aniba MR, Siguenza S, Friedrich A, Plewniak F, Poch O, Marchler-Bauer A, Thompson JD. Knowledge-based expert systems and a proof-of-concept case study for multiple sequence alignment construction and analysis. Brief Bioinform 2009; 10:11-23. [PMID: 18971242 PMCID: PMC2638625 DOI: 10.1093/bib/bbn045]
Abstract
The traditional approach to bioinformatics analyses relies on independent task-specific services and applications, using different input and output formats, often idiosyncratic, and frequently not designed to inter-operate. In general, such analyses were performed by experts who manually verified the results obtained at each step in the process. Today, the amount of bioinformatics information continuously being produced means that handling the various applications used to study this information presents a major data management and analysis challenge to researchers. It is now impossible to manually analyse all this information and new approaches are needed that are capable of processing the large-scale heterogeneous data in order to extract the pertinent information. We review the recent use of integrated expert systems aimed at providing more efficient knowledge extraction for bioinformatics research. A general methodology for building knowledge-based expert systems is described, focusing on the unstructured information management architecture, UIMA, which provides facilities for both data and process management. A case study involving a multiple alignment expert system prototype called AlexSys is also presented.
Affiliation(s)
- Mohamed Radhouene Aniba
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), F-67400 Illkirch, France