1
|
Hassanpour S, Langlotz CP. Unsupervised Topic Modeling in a Large Free Text Radiology Report Repository. J Digit Imaging 2017; 29:59-62. [PMID: 26353748 DOI: 10.1007/s10278-015-9823-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022] Open
Abstract
Radiology report narrative contains a large amount of information about the patient's health and the radiologist's interpretation of medical findings. Most of this critical information is entered in free text format, even when structured radiology report templates are used. The radiology report narrative varies in use of terminology and language among different radiologists and organizations. The free text format and the subtlety and variations of natural language hinder the extraction of reusable information from radiology reports for decision support, quality improvement, and biomedical research. Therefore, as the first step to organize and extract the information content in a large multi-institutional free text radiology report repository, we have designed and developed an unsupervised machine learning approach to capture the main concepts in a radiology report repository and partition the reports based on their main foci. In this approach, radiology reports are modeled in a vector space and compared to each other through a cosine similarity measure. This similarity is used to cluster radiology reports and identify the repository's underlying topics. We applied our approach on a repository of 1,899,482 radiology reports from three major healthcare organizations. Our method identified 19 major radiology report topics in the repository and clustered the reports accordingly to these topics. Our results are verified by a domain expert radiologist and successfully explain the repository's primary topics and extract the corresponding reports. The results of our system provide a target-based corpus and framework for information extraction and retrieval systems for radiology reports.
Collapse
Affiliation(s)
- Saeed Hassanpour
- Department of Radiology, Stanford University, 300 Pasteur Drive, Stanford, CA, 94305, USA.
| | - Curtis P Langlotz
- Department of Radiology, Stanford University, 300 Pasteur Drive, Stanford, CA, 94305, USA
| |
Collapse
|
2
|
Facilitating surveillance of pulmonary invasive mold diseases in patients with haematological malignancies by screening computed tomography reports using natural language processing. PLoS One 2014; 9:e107797. [PMID: 25250675 PMCID: PMC4175456 DOI: 10.1371/journal.pone.0107797] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2014] [Accepted: 08/23/2014] [Indexed: 01/22/2023] Open
Abstract
Purpose Prospective surveillance of invasive mold diseases (IMDs) in haematology patients should be standard of care but is hampered by the absence of a reliable laboratory prompt and the difficulty of manual surveillance. We used a high throughput technology, natural language processing (NLP), to develop a classifier based on machine learning techniques to screen computed tomography (CT) reports supportive for IMDs. Patients and Methods We conducted a retrospective case-control study of CT reports from the clinical encounter and up to 12-weeks after, from a random subset of 79 of 270 case patients with 33 probable/proven IMDs by international definitions, and 68 of 257 uninfected-control patients identified from 3 tertiary haematology centres. The classifier was trained and tested on a reference standard of 449 physician annotated reports including a development subset (n = 366), from a total of 1880 reports, using 10-fold cross validation, comparing binary and probabilistic predictions to the reference standard to generate sensitivity, specificity and area under the receiver-operating-curve (ROC). Results For the development subset, sensitivity/specificity was 91% (95%CI 86% to 94%)/79% (95%CI 71% to 84%) and ROC area was 0.92 (95%CI 89% to 94%). Of 25 (5.6%) missed notifications, only 4 (0.9%) reports were regarded as clinically significant. Conclusion CT reports are a readily available and timely resource that may be exploited by NLP to facilitate continuous prospective IMD surveillance with translational benefits beyond surveillance alone.
Collapse
|
3
|
Sarioglu E, Choi HA, Yadav K. Clinical report classification using Natural Language Processing and Topic Modeling. PROCEEDINGS OF THE ... INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS. INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS 2012; 2012:204-209. [PMID: 37767274 PMCID: PMC10530625 DOI: 10.1109/icmla.2012.173] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/29/2023]
Abstract
Large amount of electronic clinical data encompasses important information in free text format. To be able to help guide medical decision-making, text needs to be efficiently processed and coded. In this research, we investigate techniques to improve classification of Emergency Department computed tomography (CT) reports. The proposed system uses Natural Language Processing (NLP) to generate structured output from the reports and then machine learning techniques to code for the presence of clinically important injuries for traumatic orbital fracture victims. Topic modeling of the corpora is also utilized as an alternative representation of the patient reports. Our results show that both NLP and topic modeling improves raw text classification results. Within NLP features, filtering the codes using modifiers produces the best performance. Topic modeling shows mixed results. Topic vectors provide good dimensionality reduction and get comparable classification results as with NLP features. However, binary topic classification fails to improve upon raw text classification.
Collapse
Affiliation(s)
- Efsun Sarioglu
- Computer Science Department The George Washington University, Washington, DC, USA
| | - Hyeong-Ah Choi
- Computer Science Department The George Washington University, Washington, DC, USA
| | - Kabir Yadav
- Department of Emergency Medicine The George Washington University, Washington, DC, USA
| |
Collapse
|
4
|
Mavandadi S, Feng S, Yu F, Dimitrov S, Nielsen-Saines K, Prescott WR, Ozcan A. A mathematical framework for combining decisions of multiple experts toward accurate and remote diagnosis of malaria using tele-microscopy. PLoS One 2012; 7:e46192. [PMID: 23071544 PMCID: PMC3469564 DOI: 10.1371/journal.pone.0046192] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2012] [Accepted: 08/28/2012] [Indexed: 11/19/2022] Open
Abstract
We propose a methodology for digitally fusing diagnostic decisions made by multiple medical experts in order to improve accuracy of diagnosis. Toward this goal, we report an experimental study involving nine experts, where each one was given more than 8,000 digital microscopic images of individual human red blood cells and asked to identify malaria infected cells. The results of this experiment reveal that even highly trained medical experts are not always self-consistent in their diagnostic decisions and that there exists a fair level of disagreement among experts, even for binary decisions (i.e., infected vs. uninfected). To tackle this general medical diagnosis problem, we propose a probabilistic algorithm to fuse the decisions made by trained medical experts to robustly achieve higher levels of accuracy when compared to individual experts making such decisions. By modelling the decisions of experts as a three component mixture model and solving for the underlying parameters using the Expectation Maximisation algorithm, we demonstrate the efficacy of our approach which significantly improves the overall diagnostic accuracy of malaria infected cells. Additionally, we present a mathematical framework for performing ‘slide-level’ diagnosis by using individual ‘cell-level’ diagnosis data, shedding more light on the statistical rules that should govern the routine practice in examination of e.g., thin blood smear samples. This framework could be generalized for various other tele-pathology needs, and can be used by trained experts within an efficient tele-medicine platform.
Collapse
Affiliation(s)
- Sam Mavandadi
- Electrical Engineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- Bioengineering Department, University of California Los Angeles, Los Angeles, California, United States of America
| | - Steve Feng
- Electrical Engineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- Bioengineering Department, University of California Los Angeles, Los Angeles, California, United States of America
| | - Frank Yu
- Electrical Engineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- Bioengineering Department, University of California Los Angeles, Los Angeles, California, United States of America
| | - Stoyan Dimitrov
- Electrical Engineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- Bioengineering Department, University of California Los Angeles, Los Angeles, California, United States of America
| | - Karin Nielsen-Saines
- Division of Infectious Diseases, Department of Pediatrics, School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
| | | | - Aydogan Ozcan
- Electrical Engineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- Bioengineering Department, University of California Los Angeles, Los Angeles, California, United States of America
- California NanoSystems Institute, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Surgery, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- * E-mail:
| |
Collapse
|
5
|
Rosenbloom ST, Denny JC, Xu H, Lorenzi N, Stead WW, Johnson KB. Data from clinical notes: a perspective on the tension between structure and flexible documentation. J Am Med Inform Assoc 2011; 18:181-6. [PMID: 21233086 DOI: 10.1136/jamia.2010.007237] [Citation(s) in RCA: 226] [Impact Index Per Article: 16.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Clinical documentation is central to patient care. The success of electronic health record system adoption may depend on how well such systems support clinical documentation. A major goal of integrating clinical documentation into electronic heath record systems is to generate reusable data. As a result, there has been an emphasis on deploying computer-based documentation systems that prioritize direct structured documentation. Research has demonstrated that healthcare providers value different factors when writing clinical notes, such as narrative expressivity, amenability to the existing workflow, and usability. The authors explore the tension between expressivity and structured clinical documentation, review methods for obtaining reusable data from clinical notes, and recommend that healthcare providers be able to choose how to document patient care based on workflow and note content needs. When reusable data are needed from notes, providers can use structured documentation or rely on post-hoc text processing to produce structured data, as appropriate.
Collapse
Affiliation(s)
- S Trent Rosenbloom
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
| | | | | | | | | | | |
Collapse
|
6
|
Soysal E, Cicekli I, Baykal N. Design and evaluation of an ontology based information extraction system for radiological reports. Comput Biol Med 2010; 40:900-11. [PMID: 20970122 DOI: 10.1016/j.compbiomed.2010.10.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2009] [Revised: 08/15/2010] [Accepted: 10/05/2010] [Indexed: 10/18/2022]
|
7
|
Reliability of zygapophysial joint space measurements made from magnetic resonance imaging scans of acute low back pain subjects: comparison of 2 statistical methods. J Manipulative Physiol Ther 2010; 33:220-5. [PMID: 20350676 DOI: 10.1016/j.jmpt.2010.01.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2009] [Revised: 11/05/2009] [Indexed: 11/22/2022]
Abstract
OBJECTIVE This purpose of this study was to assess the reliability of measurements made of the zygapophysial (Z) joint space from the magnetic resonance imaging scans of subjects with acute low back pain using new equipment and 2 different methods of statistical analysis. If found to be reliable, the methods of Z joint measurement can be applied to scans taken before and after spinal manipulation in a larger study of acute low back pain subjects. METHODS Three observers measured the central anterior-to-posterior distance of the left and right L4/L5 and L5/S1 Z joint space from 5 subject scans (20 digitizer measurements, rounded to 0.1 mm) on 2 separate occasions separated by 4 weeks. Observers were blinded to each other and their previous work. Intra- and interobserver reliability was calculated by means of intraclass correlation coefficients and also by mean differences using the methods of Bland and Altman (1986). A mean difference of less than +/-0.4 mm was considered clinically acceptable. RESULTS Intraclass correlation coefficients showed intraobserver reliabilities of 0.95 (95% confidence interval, 0.87-0.98), 0.83 (0.62-0.92), and 0.92 (0.83-0.96) for each of the 3 observers and interobserver reliabilities of 0.90 (0.82-0.95), 0.79 (0.61-0.90), and 0.84 (0.75-0.90) for the first and second measurements and overall reliability, respectively. The mean difference between the first and second measurements was -0.04 mm (+/-1.96 SD = -0.37 to 0.29), 0.23 (-0.48 to 0.94), 0.25 (-0.24 to 0.75), and 0.15 (-0.44 to 0.74) for each of the 3 observers and the overall agreement, respectively. CONCLUSIONS Both statistical methods were found to be useful and complementary and showed the measurements to be highly reliable.
Collapse
|
8
|
Chiang JH, Lin JW, Yang CW. Automated evaluation of electronic discharge notes to assess quality of care for cardiovascular diseases using Medical Language Extraction and Encoding System (MedLEE). J Am Med Inform Assoc 2010; 17:245-52. [PMID: 20442141 DOI: 10.1136/jamia.2009.000182] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
Abstract
The objective of this study was to develop and validate an automated acquisition system to assess quality of care (QC) measures for cardiovascular diseases. This system combining searching and retrieval algorithms was designed to extract QC measures from electronic discharge notes and to estimate the attainment rates to the current standards of care. It was developed on the patients with ST-segment elevation myocardial infarction and tested on the patients with unstable angina/non-ST-segment elevation myocardial infarction, both diseases sharing almost the same QC measures. The system was able to reach a reasonable agreement (kappa value) with medical experts from 0.65 (early reperfusion rate) to 0.97 (beta-blockers and lipid-lowering agents before discharge) for different QC measures in the test set, and then applied to evaluate QC in the patients who underwent coronary artery bypass grafting surgery. The result has validated a new tool to reliably extract QC measures for cardiovascular diseases.
Collapse
Affiliation(s)
- Jung-Hsien Chiang
- Institute of Medical Informatics and Department of Computer Science, National Cheng Kung University, Tainan, Taiwan.
| | | | | |
Collapse
|
9
|
Gu HH, Hripcsak G, Chen Y, Morrey CP, Elhanan G, Cimino J, Geller J, Perl Y. Evaluation of a UMLS Auditing Process of Semantic Type Assignments. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2007; 2007:294-298. [PMID: 18693845 PMCID: PMC2655790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Received: 03/13/2007] [Revised: 07/17/2007] [Accepted: 10/11/2007] [Indexed: 05/26/2023]
Abstract
The UMLS is a terminological system that integrates many source terminologies. Each concept in the UMLS is assigned one or more semantic types from the Semantic Network, an upper level ontology for biomedicine. Due to the complexity of the UMLS, errors exist in the semantic type assignments. Finding assignment errors may unearth modeling errors. Even with sophisticated tools, discovering assignment errors requires manual review. In this paper we describe the evaluation of an auditing project of UMLS semantic type assignments. We studied the performance of the auditors who reviewed potential errors. We found that four auditors, interacting according to a multi-step protocol, identified a high rate of errors (one or more errors in 81% of concepts studied) and that results were sufficiently reliable (0.67 to 0.70) for the two most common types of errors. However, reliability was low for each individual auditor, suggesting that review of potential errors is resource-intensive.
Collapse
|
10
|
Hiissa M, Pahikkala T, Suominen H, Lehtikunnas T, Back B, Karsten H, Salanterä S, Salakoski T. Towards automated classification of intensive care nursing narratives. Int J Med Inform 2007; 76 Suppl 3:S362-8. [PMID: 17513166 DOI: 10.1016/j.ijmedinf.2007.03.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2006] [Revised: 03/20/2007] [Accepted: 03/28/2007] [Indexed: 01/09/2023]
Abstract
BACKGROUND Nursing narratives are an important part of patient documentation, but the possibilities to utilize them in the direct care process are limited due to the lack of proper tools. One solution to facilitate the utilization of narrative data could be to classify them according to their content. OBJECTIVES Our objective is to address two issues related to designing an automated classifier: domain experts' agreement on the content of classes Breathing, Blood Circulation and Pain, as well as the ability of a machine-learning-based classifier to learn the classification patterns of the nurses. METHODS The data we used were a set of Finnish intensive care nursing narratives, and we used the regularized least-squares (RLS) algorithm for the automatic classification. The agreement of the nurses was assessed by using Cohen's kappa, and the performance of the algorithm was measured using area under ROC curve (AUC). RESULTS On average, the values of kappa were around 0.8. The agreement was highest in the class Blood Circulation, and lowest in the class Breathing. The RLS algorithm was able to learn the classification patterns of the three nurses on an acceptable level; the values of AUC were generally around 0.85. CONCLUSIONS Our results indicate that the free text in nursing documentation can be automatically classified and this can offer a way to develop electronic patient records.
Collapse
Affiliation(s)
- Marketta Hiissa
- Turku Centre for Computer Science, Joukahaisenkatu 3-5 B, 20520 Turku, Finland.
| | | | | | | | | | | | | | | |
Collapse
|
11
|
Harber P, Crawford L, Cheema A, Schacter L. Computer algorithm for automated work group classification from free text: the DREAM technique. J Occup Environ Med 2007; 49:41-9. [PMID: 17215712 DOI: 10.1097/01.jom.0000251826.37828.2e] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
OBJECTIVE This study developed and tested a computer method to automatically assign subjects to aggregate work groups based on their free text work descriptions. METHODS The Double Root Extended Automated Matcher (DREAM) algorithm classifies individuals based on pairs of subjects' free text word roots in common with those of standard classification systems and several explicitly defined linkages between term roots and aggregates. RESULTS DREAM effectively analyzed free text from 5887 participants in a multisite chronic obstructive pulmonary disease prevention study (Lung Health Study). For a test set of 533 cases, DREAMs classifications compared favorably with those of a four-human panel. The humans rated the accuracy of DREAM as good or better in 80% of the test cases. CONCLUSIONS Automated text interpretation is a promising tool for analyzing large data sets for applications in data mining, research, and surveillance. Work descriptive information is most useful when it can link an individual to aggregate entities that have occupational health relevance. Determining the appropriate group requires considerable expertise. This article describes a new method for making such assignments using a computer algorithm to reduce dependence on the limited number of occupational health experts. In addition, computer algorithms foster consistency of assignments.
Collapse
Affiliation(s)
- Philip Harber
- Division of Occupational and Environmental Medicine, Department of Family Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California 90024, USA.
| | | | | | | |
Collapse
|
12
|
Wall SP, Mayorga O, Banfield CE, Wall ME, Aisic I, Auerbach C, Gennis P. Computer-Assisted Categorizing of Head Computed Tomography Reports for Clinical Decision Rule Research. Ann Emerg Med 2006; 48:551-7, 557.e1-25. [PMID: 16997422 DOI: 10.1016/j.annemergmed.2006.06.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2005] [Revised: 03/15/2006] [Accepted: 06/08/2006] [Indexed: 10/24/2022]
Abstract
STUDY OBJECTIVE To develop software that categorizes electronic head computed tomography (CT) reports into groups useful for clinical decision rule research. METHODS Data were obtained from the Second National Emergency X-Radiography Utilization Study, a cohort of head injury patients having received head CT. CT reports were reviewed manually for presence or absence of clinically important subdural or epidural hematoma, defined as greater than 1.0 cm in width or causing mass effect. Manual categorization was done by 2 independent researchers blinded to each other's results. A third researcher adjudicated discrepancies. A random sample of 300 reports with radiologic abnormalities was selected for software development. After excluding reports categorized manually or by software as indeterminate (neither positive nor negative), we calculated sensitivity and specificity by using manual categorization as the standard. System efficiency was defined as the percentage of reports categorized as positive or negative, regardless of accuracy. Software was refined until analysis of the training data yielded sensitivity and specificity approximating 95% and efficiency exceeding 75%. To test the system, we calculated sensitivity, specificity, and efficiency, using the remaining 1,911 reports. RESULTS Of the 1,911 reports, 160 had clinically important subdural or epidural hematoma. The software exhibited good agreement with manual categorization of all reports, including indeterminate ones (weighted kappa 0.62; 95% confidence interval [CI] 0.58 to 0.65). Sensitivity, specificity, and efficiency of the computerized system for identifying manual positives and negatives were 96% (95% CI 91% to 98%), 98% (95% CI 98% to 99%), and 79% (95% CI 77% to 80%), respectively. CONCLUSION Categorizing head CT reports by computer for clinical decision rule research is feasible.
Collapse
Affiliation(s)
- Stephen P Wall
- Department of Emergency Medicine, Jacobi Medical Center, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
| | | | | | | | | | | | | |
Collapse
|
13
|
|
14
|
Chapman WW, Dowling JN, Wagner MM. Generating a reliable reference standard set for syndromic case classification. J Am Med Inform Assoc 2005; 12:618-29. [PMID: 16049227 PMCID: PMC1294033 DOI: 10.1197/jamia.m1841] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2005] [Accepted: 06/07/2005] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE To generate and measure the reliability for a reference standard set with representative cases from seven broad syndromic case definitions and several narrower syndromic definitions used for biosurveillance. DESIGN From 527,228 eligible patients between 1990 and 2003, we generated a set of patients potentially positive for seven syndromes by classifying all eligible patients according to their ICD-9 primary discharge diagnoses. We selected a representative subset of the cases for chart review by physicians, who read emergency department reports and assigned values to 14 variables related to the seven syndromes. MEASUREMENTS (1) Positive predictive value of the ICD-9 diagnoses; (2) prevalence of the syndromic definitions and related variables; (3) agreement between physician raters demonstrated by kappa, kappa corrected for bias and prevalence, and Finn's r; and (4) reliability of the reference standard classifications demonstrated by generalizability coefficients. RESULTS Positive predictive value for ICD-9 classification ranged from 0.33 for botulinic to 0.86 for gastrointestinal. We generated between 80 and 566 positive cases for six of the seven syndromic definitions. Rash syndrome exhibited low prevalence (34 cases). Agreement between physician raters was high, with kappa > 0.70 for most variables. Ratings showed no bias. Finn's r was >0.70 for all variables. Generalizability coefficients were >0.70 for all variables but three. CONCLUSION Of the 27 syndromes generated by the 14 variables, 21 showed high enough prevalence, agreement, and reliability to be used as reference standard definitions against which an automated syndromic classifier could be compared. 
Syndromic definitions that showed poor agreement or low prevalence include febrile botulinic syndrome, febrile and nonfebrile rash syndrome, respiratory syndrome explained by a nonrespiratory or noninfectious diagnosis, and febrile and nonfebrile gastrointestinal syndrome explained by a nongastrointestinal or noninfectious diagnosis.
Collapse
Affiliation(s)
- Wendy W Chapman
- RODS Laboratory, Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15213-2582, USA.
| | | | | |
Collapse
|
15
|
Abstract
OBJECTIVE The use of icons and other graphical components in user interfaces has become nearly ubiquitous. The interpretation of such icons is based on the assumption that different users perceive the shapes similarly. At the most basic level, different users must agree on which shapes are similar and which are different. If this similarity can be measured, it may be usable as the basis to design better icons. DESIGN The purpose of this study was to evaluate a novel method for categorizing the visual similarity of graphical primitives, called Presentation Discovery, in the domain of mammography. Six domain experts were given 50 common textual mammography findings and asked to draw how they would represent those findings graphically. Nondomain experts sorted the resulting graphics into groups based on their visual characteristics. The resulting groups were then analyzed using traditional statistics and hypothesis discovery tools. Strength of agreement was evaluated using computational simulations of sorting behavior. MEASUREMENTS Sorter agreement was measured at both the individual graphical and concept-group levels using a novel simulation-based method. "Consensus clusters" of graphics were derived using a hierarchical clustering algorithm. RESULTS The multiple sorters were able to reliably group graphics into similar groups that strongly correlated with underlying domain concepts. Visual inspection of the resulting consensus clusters indicated that graphical primitives that could be informative in the design of icons were present. CONCLUSION The method described provides a rigorous alternative to intuitive design processes frequently employed in the design of icons and other graphical interface components.
Collapse
Affiliation(s)
- Philip R O Payne
- Department of Biomedical Informatics, Columbia University, 622 West 168th Street, VC5, New York, NY 10025, USA.
| | | |
Collapse
|
16
|
Chung J, Murphy S. Concept-value pair extraction from semi-structured clinical narrative: a case study using echocardiogram reports. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2005; 2005:131-5. [PMID: 16779016 PMCID: PMC1560613] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
Abstract
The task of gathering detailed patient information from narrative text presents a significant barrier to clinical research. A prototype information extraction system was developed to identify concepts and their associated values from narrative echocardiogram reports. The system uses a Unified Medical Language System compatible architecture and takes advantage of canonical language use patterns to identify sentence templates with which concepts and their related values can be identified. The data extracted from this system will be used to enrich an existing database used by clinical researchers in a large university healthcare system to identify potential research candidates fulfilling clinical inclusion criteria. The system was developed and evaluated using ten clinical concepts. Concept-value pairs extracted by the system were compared with findings extracted manually by the author. The system was able to recall 78% [95%CI, 76-80%] of the relevant findings, with a precision of 99% [95%CI, 98-99%].
Collapse
Affiliation(s)
- Jeanhee Chung
- Laboratory of Computer Science, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
| | | |
Collapse
|
17
|
Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003; 36:462-77. [PMID: 14759819 DOI: 10.1016/j.jbi.2003.11.003] [Citation(s) in RCA: 228] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2003] [Indexed: 11/16/2022]
Abstract
Interpretation of semantic propositions in free-text documents such as MEDLINE citations would provide valuable support for biomedical applications, and several approaches to semantic interpretation are being pursued in the biomedical informatics community. In this paper, we describe a methodology for interpreting linguistic structures that encode hypernymic propositions, in which a more specific concept is in a taxonomic relationship with a more general concept. In order to effectively process these constructions, we exploit underspecified syntactic analysis and structured domain knowledge from the Unified Medical Language System (UMLS). After introducing the syntactic processing on which our system depends, we focus on the UMLS knowledge that supports interpretation of hypernymic propositions. We first use semantic groups from the Semantic Network to ensure that the two concepts involved are compatible; hierarchical information in the Metathesaurus then determines which concept is more general and which more specific. A preliminary evaluation of a sample based on the semantic group Chemicals and Drugs provides 83% precision. An error analysis was conducted and potential solutions to the problems encountered are presented. The research discussed here serves as a paradigm for investigating the interaction between domain knowledge and linguistic structure in natural language processing, and could also make a contribution to research on automatic processing of discourse structure. Additional implications of the system we present include its integration in advanced semantic interpretation processors for biomedical text and its use for information extraction in specific domains. The approach has the potential to support a range of applications, including information retrieval and ontology engineering.
Affiliation(s)
- Thomas C Rindflesch
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894, USA.
18
van Ast JF, Talmon JL, Renier WO, Ahles PPM, Hasman A. Development of diagnostic reference frames for seizures. Part 1: inter-participant agreement in the selection of symptoms. Int J Med Inform 2003; 70:285-92. [PMID: 12909180 DOI: 10.1016/s1386-5056(03)00047-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
OBJECTIVE Our aim is to develop reliable descriptions of various seizure types, which will be used as a basis for decision support. We use expert opinions in this process. In this contribution we evaluate the inter-participant agreement in the selection of frequently occurring symptoms for the description of seizure types. METHOD We compared the actual agreement among participants with the agreement that would result from random symptom selection as well as with the maximal agreement attainable. For each seizure type we calculated the reliability coefficients of the responses. RESULTS For all seizure types we found that the agreement in symptom selection among the participants is significantly higher than expected by chance, but not reaching the maximum agreement attainable. The reliability coefficients varied between 0.56 and 0.74 for the various seizure types. CONCLUSION Although the participants do not reach the maximum agreement attainable in the selection of symptoms, the majority agreement on characteristic frequently occurring symptoms for the different seizure types does approach the maximum agreement attainable. Therefore, we conclude that expert opinions can be used for building descriptions of seizure types. However, to derive a reliable set of symptoms for the construction of the diagnostic reference frames (DRFs) more participants are needed.
Affiliation(s)
- J F van Ast
- Department of Medical Informatics, University of Maastricht, PO Box 616, 6200 MD Maastricht, Netherlands.
19
Pratt W, Yetisgen-Yildiz M. A study of biomedical concept identification: MetaMap vs. people. AMIA Annu Symp Proc 2003:529-33. [PMID: 14728229 PMCID: PMC1479976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 04/28/2023]
Abstract
Although huge amounts of unstructured text are available as a rich source of biomedical knowledge, processing this unstructured knowledge requires tools that identify concepts from free-form text. MetaMap is one tool that system developers in biomedicine have commonly used for such a task, but few have studied how well it accomplishes this task in general. In this paper, we report on a study that compares MetaMap's performance against that of six people. Such studies are challenging because the task is inherently subjective and establishing consensus is difficult. Nonetheless, for those concepts that subjects generally agreed on, MetaMap was able to identify most concepts, if they were represented in the UMLS. However, MetaMap identified many other concepts that people did not. We also report on our analysis of the types of failures that MetaMap exhibited as well as trends in the way people chose to identify concepts.
Affiliation(s)
- Wanda Pratt
- Biomedical and Health Informatics, School of Medicine, University of Washington, Seattle, USA
20
Fiszman M, Rindflesch TC, Kilicoglu H. Integrating a hypernymic proposition interpreter into a semantic processor for biomedical texts. AMIA Annu Symp Proc 2003:239-43. [PMID: 14728170 PMCID: PMC1479962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 04/28/2023]
Abstract
Semantic processing provides the potential for producing high quality results in natural language processing (NLP) applications in the biomedical domain. In this paper, we address a specific semantic phenomenon, the hypernymic proposition, and concentrate on integrating the interpretation of such predications into a more general semantic processor in order to improve overall accuracy. A preliminary evaluation assesses the contribution of hypernymic propositions in providing more specific semantic predications and thus improving effectiveness in retrieving treatment propositions in MEDLINE abstracts. Finally, we discuss the generalization of this methodology to additional semantic propositions as well as other types of biomedical texts.
Affiliation(s)
- Marcelo Fiszman
- National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, Maryland 20894, USA
21
Mamlin BW, Heinze DT, McDonald CJ. Automated extraction and normalization of findings from cancer-related free-text radiology reports. AMIA Annu Symp Proc 2003:420-4. [PMID: 14728207 PMCID: PMC1479955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 04/28/2023]
Abstract
We describe the performance of a particular natural language processing system that uses knowledge vectors to extract findings from radiology reports. LifeCode (A-Life Medical, Inc.) has been successfully coding reports for billing purposes for several years. In this study, we describe the use of LifeCode to code all findings within a set of 500 cancer-related radiology reports against a test set in which all findings were manually tagged. The system was trained with 1400 reports prior to running the test set. RESULTS LifeCode had a recall of 84.5% and precision of 95.7% in the coding of cancer-related radiology report findings. CONCLUSION Despite the use of a modest-sized training set and minimal training iterations, when applied to cancer-related reports the system achieved recall and precision measures comparable to other reputable natural language processors in this domain.
Affiliation(s)
- Burke W Mamlin
- Regenstrief Institute for Health Care, Indianapolis, Indiana, USA
22
Huang Y, Lowe HJ, Hersh WR. A pilot study of contextual UMLS indexing to improve the precision of concept-based representation in XML-structured clinical radiology reports. J Am Med Inform Assoc 2003; 10:580-7. [PMID: 12925544 PMCID: PMC264436 DOI: 10.1197/jamia.m1369] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE Despite the advantages of structured data entry, much of the patient record is still stored as unstructured or semistructured narrative text. The issue of representing clinical document content remains problematic. The authors' prior work using an automated UMLS document indexing system has been encouraging but has been affected by the generally low indexing precision of such systems. In an effort to improve precision, the authors have developed a context-sensitive document indexing model to calculate the optimal subset of UMLS source vocabularies used to index each document section. This pilot study was performed to evaluate the utility of this indexing approach on a set of clinical radiology reports. DESIGN A set of clinical radiology reports that had been indexed manually using UMLS concept descriptors was indexed automatically by the SAPHIRE indexing engine. Using the data generated by this process, the authors developed a system that simulated indexing, at the document section level, of the same document set using many permutations of a subset of the UMLS constituent vocabularies. MEASUREMENTS The precision and recall scores generated by simulated indexing for each permutation of two or three UMLS constituent vocabularies were determined. RESULTS While there was considerable variation in precision and recall values across the different subtypes of radiology reports, the overall effect of this indexing strategy using the best combination of two or three UMLS constituent vocabularies was an improvement in precision without significant impact on recall. CONCLUSION In this pilot study a contextual indexing strategy improved overall precision in a set of clinical radiology reports.
Affiliation(s)
- Yang Huang
- Stanford Medical Informatics, The Office of Information Resources and Technology, Stanford University School of Medicine, California 94305, USA.
23
Bindels R, Hasman A, van Wersch JWJ, Pop P, Winkens RAG. The reliability of assessing the appropriateness of requested diagnostic tests. Med Decis Making 2003; 23:31-7. [PMID: 12583453 DOI: 10.1177/0272989x02239647] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Despite its poor reliability, peer assessment is the traditional method of assessing the appropriateness of health care activities. This article describes the reliability of human assessment of the appropriateness of diagnostic test requests. The authors used a random selection of 1217 tests from 253 request forms submitted by general practitioners in the Maastricht region of The Netherlands. Three reviewers independently assessed the appropriateness of each requested test. Interrater kappa values ranged from 0.33 to 0.42, and kappa values of intrarater agreement ranged from 0.48 to 0.68. The joint reliability coefficient of the 3 reviewers was 0.66. This reliability is sufficient to review test ordering over a series of cases but is not sufficient to make case-by-case assessments. Sixteen reviewers are needed to obtain a joint reliability of 0.95. The authors conclude that there is substantial variation in assessment concerning what is an appropriately requested diagnostic test and that this feedback method is not reliable enough to make a case-by-case assessment. Computer support may be beneficial in making the process of peer review more uniform.
Affiliation(s)
- Rianne Bindels
- Department of Medical Informatics, University of Maastricht, The Netherlands.
24
Hripcsak G, Austin JHM, Alderson PO, Friedman C. Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports. Radiology 2002; 224:157-63. [PMID: 12091676 DOI: 10.1148/radiol.2241011118] [Citation(s) in RCA: 134] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
PURPOSE To evaluate translation of chest radiographic reports by using natural language processing and to compare the findings with those in the literature. MATERIALS AND METHODS A natural language processor coded 10 years of narrative chest radiographic reports from an urban academic medical center. Coding for 150 reports was compared with manual coding. Frequencies and co-occurrences of 24 clinical conditions (diseases, abnormalities, and clinical states) were estimated. The ratio of right to left lung mass, association of pleural effusion with other conditions, and frequency of bullet and stab wounds were compared with independent observations. The sensitivity and specificity of the system's pneumothorax coding were compared with those of manual financial coding. RESULTS The system coded 889,921 reports on 251,186 patients. On the basis of manual coding of 150 reports, the processor's sensitivity (0.81) and specificity (0.99) were comparable to those previously reported for natural language processing and for expert coders. The frequencies of the selected conditions ranged from 0.22 for pleural effusion to 0.0004 for tension pneumothorax. The database confirmed earlier observations that lung cancer occurs in a 3:2 right-to-left ratio. The association of pleural effusion with other conditions mirrored that in the literature. Bullet and stab wounds decreased during 10 years at a rate consistent with crime statistics. A review of pneumothorax cases showed that the database (sensitivity, 1.00; specificity, 0.996) was more accurate than financial discharge coding (sensitivity, 0.17; P =.002; specificity, 0.996; not significant). CONCLUSION Internal and external validation in this study confirmed the accuracy of natural language processing for translating chest radiographic narrative reports into a large database of information.
Affiliation(s)
- George Hripcsak
- Department of Medical Informatics, Columbia University, 622 W 168th St, VC-5, New York, NY 10032, USA.
25
Abstract
Agreement measures are used frequently in reliability studies that involve categorical data. Simple measures like observed agreement and specific agreement can reveal a good deal about the sample. Chance-corrected agreement in the form of the kappa statistic is used frequently based on its correspondence to an intraclass correlation coefficient and the ease of calculating it, but its magnitude depends on the tasks and categories in the experiment. It is helpful to separate the components of disagreement when the goal is to improve the reliability of an instrument or of the raters. Approaches based on modeling the decision making process can be helpful here, including tetrachoric correlation, polychoric correlation, latent trait models, and latent class models. Decision making models can also be used to better understand the behavior of different agreement metrics. For example, if the observed prevalence of responses in one of two available categories is low, then there is insufficient information in the sample to judge raters' ability to discriminate cases, and kappa may underestimate the true agreement and observed agreement may overestimate it.
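The abstract's closing point, that a rare category can make kappa underestimate agreement while observed agreement overestimates it, can be sketched numerically. The counts below are hypothetical, chosen only to illustrate the effect:

```python
# Illustrative sketch (counts are invented, not from the paper): observed
# agreement vs. Cohen's kappa for two raters on a binary task, showing how
# low prevalence of one category depresses kappa even at high raw agreement.

def agreement_stats(a, b, c, d):
    """2x2 table for two raters: a = both yes, b = rater1 yes / rater2 no,
    c = rater1 no / rater2 yes, d = both no."""
    n = a + b + c + d
    p_obs = (a + d) / n                                   # observed agreement
    p_yes1, p_yes2 = (a + b) / n, (a + c) / n             # marginal "yes" rates
    p_exp = p_yes1 * p_yes2 + (1 - p_yes1) * (1 - p_yes2)  # chance agreement
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return p_obs, kappa

# Balanced prevalence: 90% raw agreement yields kappa of 0.80
print(agreement_stats(45, 5, 5, 45))
# Rare category: the same 90% raw agreement, but kappa collapses
print(agreement_stats(2, 5, 5, 88))
```

With a balanced sample both measures tell the same story; with a rare category the chance-agreement term dominates, which is the behavior the decision-making models discussed above are meant to disentangle.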
Affiliation(s)
- George Hripcsak
- Department of Medical Informatics, Columbia University, 622 West 168th Street, VC5, New York, NY 10032, USA.
26
Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. J Am Med Inform Assoc 2002; 9:1-15. [PMID: 11751799 PMCID: PMC349383 DOI: 10.1136/jamia.2002.0090001] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
Abstract
Medical informatics systems are often designed to perform at the level of human experts. Evaluation of the performance of these systems is often constrained by lack of reference standards, either because the appropriate response is not known or because no simple appropriate response exists. Even when performance can be assessed, it is not always clear whether the performance is sufficient or reasonable. These challenges can be addressed if an evaluator enlists the help of clinical domain experts. 1) The experts can carry out the same tasks as the system, and then their responses can be combined to generate a reference standard. 2) The experts can judge the appropriateness of system output directly. 3) The experts can serve as comparison subjects with which the system can be compared. These are separate roles that have different implications for study design, metrics, and issues of reliability and validity. Diagrams help delineate the roles of experts in complex study designs.
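The first expert role described above, combining individual responses into a reference standard, is often implemented as a simple majority vote. A minimal sketch, with entirely hypothetical judgments:

```python
# Sketch of a majority-vote reference standard (data are invented):
# pool several experts' binary judgments per case, then score a system
# against the pooled standard.
from collections import Counter

def majority_reference(judgments):
    """judgments: one list of 0/1 labels per expert, aligned by case."""
    standard = []
    for case_labels in zip(*judgments):        # iterate case by case
        votes = Counter(case_labels)
        standard.append(votes.most_common(1)[0][0])
    return standard

experts = [
    [1, 0, 1, 1, 0],   # expert 1
    [1, 0, 0, 1, 0],   # expert 2
    [1, 1, 1, 1, 0],   # expert 3
]
reference = majority_reference(experts)
system = [1, 0, 1, 0, 0]
accuracy = sum(r == s for r, s in zip(reference, system)) / len(reference)
print(reference, accuracy)
```

An odd number of experts avoids ties; with an even panel the vote rule needs an explicit tie-breaking policy.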
Affiliation(s)
- George Hripcsak
- Department of Medical Informatics, Columbia University, New York, New York 10032, USA.
27
Chapman WW, Fizman M, Chapman BE, Haug PJ. A comparison of classification algorithms to automatically identify chest X-ray reports that support pneumonia. J Biomed Inform 2001; 34:4-14. [PMID: 11376542 DOI: 10.1006/jbin.2001.1000] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
We compared the performance of expert-crafted rules, a Bayesian network, and a decision tree at automatically identifying chest X-ray reports that support acute bacterial pneumonia. We randomly selected 292 chest X-ray reports, 75 (25%) of which were from patients with a hospital discharge diagnosis of bacterial pneumonia. The reports were encoded by our natural language processor and then manually corrected for mistakes. The encoded observations were analyzed by three expert systems to determine whether the reports supported pneumonia. The reference standard for radiologic support of pneumonia was the majority vote of three physicians. We compared (a) the performance of the expert systems against each other and (b) the performance of the expert systems against that of four physicians who were not part of the gold standard. Output from the expert systems and the physicians was transformed so that comparisons could be made with both binary and probabilistic output. Metrics of comparison for binary output were sensitivity (sens), precision (prec), and specificity (spec). The metric of comparison for probabilistic output was the area under the receiver operating characteristic (ROC) curve. We used McNemar's test to determine statistical significance for binary output and univariate z-tests for probabilistic output. Measures of performance of the expert systems for binary (probabilistic) output were as follows: Rules--sens, 0.92; prec, 0.80; spec, 0.86 (Az, 0.960); Bayesian network--sens, 0.90; prec, 0.72; spec, 0.78 (Az, 0.945); decision tree--sens, 0.86; prec, 0.85; spec, 0.91 (Az, 0.940). Comparisons of the expert systems against each other using binary output showed a significant difference between the rules and the Bayesian network and between the decision tree and the Bayesian network. Comparisons of expert systems using probabilistic output showed no significant differences.
Comparisons of binary output against physicians showed differences between the Bayesian network and two physicians. Comparisons of probabilistic output against physicians showed a difference between the decision tree and one physician. The expert systems performed similarly for the probabilistic output but differed in measures of sensitivity, precision, and specificity produced by the binary output. All three expert systems performed similarly to physicians.
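The binary metrics and the paired-classifier significance test named in the abstract can be sketched as follows. The counts are hypothetical, not reconstructed from the study:

```python
# Hedged sketch (invented counts): sensitivity, precision, and specificity
# from a 2x2 confusion matrix, plus a continuity-corrected McNemar test
# for comparing two classifiers on the same paired cases.
import math

def binary_metrics(tp, fp, fn, tn):
    sens = tp / (tp + fn)   # fraction of true positives detected
    prec = tp / (tp + fp)   # fraction of positive calls that are correct
    spec = tn / (tn + fp)   # fraction of true negatives detected
    return sens, prec, spec

def mcnemar_p(b, c):
    """b, c: the two discordant cells -- cases only classifier A got right,
    and cases only classifier B got right."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For a chi-square with 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2/2))
    return math.erfc(math.sqrt(chi2 / 2))

print(binary_metrics(tp=69, fp=17, fn=6, tn=200))  # sensitivity 0.92 here
print(mcnemar_p(b=25, c=10))
```

McNemar's test ignores the concordant cells entirely, which is why it suits paired comparisons such as rules vs. Bayesian network on the same 292 reports.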
Affiliation(s)
- W W Chapman
- Center for Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, USA
28
Jordan DA, McKeown KR, Concepcion KJ, Feiner SK, Hatzivassiloglou V. Generation and evaluation of intraoperative inferences for automated health care briefings on patient status after bypass surgery. J Am Med Inform Assoc 2001; 8:267-80. [PMID: 11320071 PMCID: PMC131034 DOI: 10.1136/jamia.2001.0080267] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE The authors present a system that scans electronic records from cardiac surgery and uses inference rules to identify and classify abnormal events (e.g., hypertension) that may occur during critical surgical points (e.g., start of bypass). This vital information is used as the content of automatically generated briefings designed by MAGIC, a multimedia system that they are developing to brief intensive care unit clinicians on patient status after cardiac surgery. By recognizing patterns in the patient record, inferences concisely summarize detailed patient data. DESIGN The authors present the development of inference rules that identify important information about patient status and describe their implementation and an experiment they carried out to validate their correctness. The data for a set of 24 patients were analyzed independently by the system and by 46 physicians. MEASUREMENTS The authors measured accuracy, specificity, and sensitivity by comparing system inferences against physician judgments, in cases where all three physicians agreed and against the majority opinion in all cases. RESULTS For laboratory inferences, evaluation shows that the system has an average accuracy of 98 percent (full agreement) and 96 percent (majority model). An analysis of interrater agreement, however, showed that physicians do not agree on abnormal hemodynamic events and could not serve as a gold standard for evaluating hemodynamic events. Analysis of discrepancies reveals possibilities for system improvement and causes of physician disagreement. CONCLUSIONS This evaluation shows that the laboratory inferences of the system have high accuracy. The lack of agreement among physicians highlights the need for an objective quality-assurance tool for hemodynamic inferences. The system provides such a tool by implementing inferencing procedures established in the literature.
Affiliation(s)
- Steven K. Feiner
- Affiliation of the authors: Columbia University, New York, New York
29
Abstract
Computer decision support systems are computer applications designed to aid clinicians in making diagnostic and therapeutic decisions in patient care. They can simplify access to data needed to make decisions, provide reminders and prompts at the time of a patient encounter, assist in establishing a diagnosis and in entering appropriate orders, and alert clinicians when new patterns in patient data are recognized. Decision support systems that present patient-specific recommendations in a form that can save clinicians time have been shown to be highly effective, sustainable tools for changing clinician behavior. Designing and implementing such systems is challenging because of the computing infrastructure required, the need for patient data in a machine-processible form, and the changes to existing workflow that may result. Despite these difficulties, there is substantial evidence from trials in a wide range of clinical settings that computer decision support systems help clinicians do a better job caring for patients. As computer-based records and order-entry systems become more common, automated decision support systems will be used more broadly.
Affiliation(s)
- T H Payne
- VA Puget Sound Health Care System, University of Washington, Seattle, WA 98108, USA.
30
Fiszman M, Chapman WW, Aronsky D, Evans RS, Haug PJ. Automatic detection of acute bacterial pneumonia from chest X-ray reports. J Am Med Inform Assoc 2000; 7:593-604. [PMID: 11062233 PMCID: PMC129668 DOI: 10.1136/jamia.2000.0070593] [Citation(s) in RCA: 164] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVE To evaluate the performance of a natural language processing system in extracting pneumonia-related concepts from chest x-ray reports. METHODS DESIGN Four physicians, three lay persons, a natural language processing system, and two keyword searches (designated AAKS and KS) detected the presence or absence of three pneumonia-related concepts and inferred the presence or absence of acute bacterial pneumonia from 292 chest x-ray reports. Gold standard: Majority vote of three independent physicians. Reliability of the gold standard was measured. OUTCOME MEASURES Recall, precision, specificity, and agreement (using Finn's R statistic) with respect to the gold standard. Differences between the physicians and the other subjects were tested using the McNemar test for each pneumonia concept and for the disease inference of acute bacterial pneumonia. RESULTS Reliability of the reference standard ranged from 0.86 to 0.96. Recall, precision, specificity, and agreement (Finn's R) for the inference on acute bacterial pneumonia were, respectively, 0.94, 0.87, 0.91, and 0.84 for physicians; 0.95, 0.78, 0.85, and 0.75 for the natural language processing system; 0.46, 0.89, 0.95, and 0.54 for lay persons; 0.79, 0.63, 0.71, and 0.49 for AAKS; and 0.87, 0.70, 0.77, and 0.62 for KS. The McNemar pairwise comparisons showed differences between one physician and the natural language processing system for the infiltrate concept and between another physician and the natural language processing system for the inference on acute bacterial pneumonia. The comparisons also showed that most physicians were significantly different from the other subjects in all pneumonia concepts and the disease inference. CONCLUSION In extracting pneumonia-related concepts from chest x-ray reports, the performance of the natural language processing system was similar to that of physicians and better than that of lay persons and keyword searches.
The encoded pneumonia information has the potential to support several pneumonia-related applications used in our institution. The applications include a decision support system called the antibiotic assistant, a computerized clinical protocol for pneumonia, and a quality assurance application in the radiology department.
Affiliation(s)
- M Fiszman
- The University of Utah, Salt Lake City, Utah, USA.
31
Fiszman M, Haug PJ. Using medical language processing to support real-time evaluation of pneumonia guidelines. Proc AMIA Symp 2000:235-9. [PMID: 11079880 PMCID: PMC2244071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/18/2023] Open
Abstract
OBJECTIVE To evaluate whether a medical language processing (MLP) system is able to support real-time computerization of community-acquired pneumonia (CAP) guidelines. METHODS Prospective validation study in the emergency department of a tertiary care facility. All the chest x-ray reports available in real-time for an admission decision during a five-week period were included. The MLP system was compared to a physician for the automatic selection of eligible patients and on the extraction of radiographic findings required by five different CAP guidelines. The gold standard comprised three independent physicians, and reliability measures were calculated. The outcome measures were the area under the receiver operating characteristic curve (AUC) for selecting eligible patients, and sensitivity, positive predictive value (PPV), and specificity for the extraction of radiographic findings. RESULTS During the five-week period, 243 reports were available in real-time. The AUCs for selecting eligible CAP patients were 89.7% (CI: 84.2%, 93.7%) for the MLP system, and 93.3% (CI: 83.9%, 97.8%) for the physician. The average sensitivity, PPV, and specificity for radiographic findings that assessed localization and extension of CAP were, respectively, 94%, 87%, 96% (physician) and 34%, 90%, 95% (MLP system). Both the MLP system and the physician had average sensitivity, PPV, and specificity of 97%, 97%, and 99%, respectively, when localization was not an issue. Reliability measures for the gold standard were above 70%. CONCLUSION The MLP system was able to support real-time computerization of guidelines by selecting eligible patients and extracting radiographic findings that do not assess localization and extension of CAP.
Affiliation(s)
- M Fiszman
- Department of Medical Informatics, LDS Hospital, University of Utah, USA
32
Abstract
OBJECTIVE The task of ad hoc classification is to automatically place a large number of text documents into nonstandard categories that are determined by a user. The authors examine the use of statistical information retrieval techniques for ad hoc classification of dictated mammography reports. DESIGN The authors' approach is the automated generation of a classification algorithm based on positive and negative evidence that is extracted from relevance-judged documents. Test documents are sorted into three conceptual bins: membership in a user-defined class, exclusion from the user-defined class, and uncertain. Documentation of absent findings through the use of negation and conjunction, a hallmark of interpretive test results, is managed by expansion and tokenization of these phrases. MEASUREMENTS Classifier performance is evaluated using a single measure, the F measure, which provides a weighted combination of recall and precision of document sorting into true positive and true negative bins. RESULTS Single terms are the most effective text feature in the classification profile, with some improvement provided by the addition of pairs of unordered terms to the profile. Excessive iterations of automated classifier enhancement degrade performance because of overtraining. Performance is best when the proportions of relevant and irrelevant documents in the training collection are close to equal. Special handling of negation phrases improves performance when the number of terms in the classification profile is limited. CONCLUSIONS The ad hoc classifier system is a promising approach for the classification of large collections of medical documents. NegExpander can distinguish positive evidence from negative evidence when the negative evidence plays an important role in the classification.
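The single evaluation measure named above, the F measure, is the weighted harmonic combination of precision and recall. A minimal sketch with illustrative values (not taken from the study):

```python
# Sketch of the F measure: beta = 1 weights precision and recall equally;
# beta > 1 weights recall more heavily. Inputs below are illustrative.

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0   # convention: undefined harmonic mean treated as 0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.6))          # balanced F1
print(f_measure(0.8, 0.6, beta=2))  # recall-weighted F2
```

Because it is a harmonic mean, the F measure is dragged toward the weaker of the two components, which makes it a stricter single summary than the arithmetic mean of precision and recall.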
Affiliation(s)
- D B Aronow
- Center for Intelligent Information Retrieval, University of Massachusetts, Amherst 01003, USA.
33

34
Friedman C, Knirsch C, Shagina L, Hripcsak G. Automating a severity score guideline for community-acquired pneumonia employing medical language processing of discharge summaries. Proc AMIA Symp 1999:256-60. [PMID: 10566360 PMCID: PMC2232753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/14/2023] Open
Abstract
Obtaining encoded variables is often a key obstacle to automating clinical guidelines. Frequently the pertinent information occurs as text in patient reports, but text is inadequate for the task. This paper describes a retrospective study that automates determination of severity classes for patients with community-acquired pneumonia (i.e., classifies patients into risk classes 1-5), a common and costly clinical problem. Most of the variables for the automated application were obtained by writing queries based on output generated by MedLEE, a natural language processor that encodes clinical information in text. Comorbidities, vital signs, and symptoms from discharge summaries as well as information from chest x-ray reports were used. The results were very good: when compared with a reference standard obtained manually by an independent expert, the automated application demonstrated an accuracy, sensitivity, and specificity of 93%, 92%, and 93%, respectively, for processing discharge summaries, and 96%, 87%, and 98%, respectively, for chest x-rays. The accuracy for vital sign values was 85%, and the accuracy for determining the exact risk class was 80%. The remaining 20% that did not match exactly differed by only one class.
Affiliation(s)
- C Friedman
- Department of Computer Science, Queens College CUNY, USA