51
Daymont C, Ross ME, Russell Localio A, Fiks AG, Wasserman RC, Grundmeier RW. Automated identification of implausible values in growth data from pediatric electronic health records. J Am Med Inform Assoc 2018; 24:1080-1087. [PMID: 28453637 DOI: 10.1093/jamia/ocx037] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Received: 09/26/2016] [Accepted: 03/17/2017] [Indexed: 11/14/2022]
Abstract
Objective Large electronic health record (EHR) datasets are increasingly used to facilitate research on growth, but measurement and recording errors can lead to biased results. We developed and tested an automated method for identifying implausible values in pediatric EHR growth data. Materials and Methods Using deidentified data from 46 primary care sites, we developed an algorithm to identify weight and height values that should be excluded from analysis, including implausible values and values that were recorded repeatedly without remeasurement. The foundation of the algorithm is a comparison of each measurement, expressed as a standard deviation score, with a weighted moving average of a child's other measurements. We evaluated the performance of the algorithm by (1) comparing its results with the judgment of physician reviewers for a stratified random selection of 400 measurements and (2) evaluating its accuracy in a dataset with simulated errors. Results Of 2 000 595 growth measurements from 280 610 patients 1 to 21 years old, 3.8% of weight and 4.5% of height values were identified as implausible or excluded for other reasons. The proportion excluded varied widely by primary care site. The automated method had a sensitivity of 97% (95% confidence interval [CI], 94-99%) and a specificity of 90% (95% CI, 85-94%) for identifying implausible values compared to physician judgment, and identified 95% (weight) and 98% (height) of simulated errors. Discussion and Conclusion This automated, flexible, and validated method for preparing large datasets will facilitate the use of pediatric EHR growth datasets for research.
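The core comparison described in the abstract, each measurement's standard deviation score against a weighted moving average of the same child's other measurements, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the `Measurement` type, the inverse-distance weighting with exponent `power`, and the cutoff `threshold` are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Measurement:
    age_days: float   # age at measurement
    sd_score: float   # weight or height expressed as a standard deviation score


def weighted_average_of_others(
    measurements: List[Measurement], index: int, power: float = 1.5
) -> Optional[float]:
    """Weighted mean SD score of all measurements except the one at `index`.

    Weights fall off with distance in age (an assumed weighting scheme).
    Returns None when the child has no other measurements.
    """
    target = measurements[index]
    numerator = 0.0
    denominator = 0.0
    for i, m in enumerate(measurements):
        if i == index:
            continue
        weight = 1.0 / (abs(m.age_days - target.age_days) + 1.0) ** power
        numerator += weight * m.sd_score
        denominator += weight
    return numerator / denominator if denominator > 0 else None


def flag_implausible(
    measurements: List[Measurement], threshold: float = 3.0
) -> List[bool]:
    """Flag measurements whose SD score deviates from the weighted average
    of the child's other measurements by more than `threshold` (assumed)."""
    flags = []
    for i, m in enumerate(measurements):
        average = weighted_average_of_others(measurements, i)
        flags.append(average is not None and abs(m.sd_score - average) > threshold)
    return flags


# Hypothetical series for one child: the third value is far from the rest.
series = [
    Measurement(365, 0.1),
    Measurement(455, 0.2),
    Measurement(545, 5.5),
    Measurement(730, 0.0),
    Measurement(910, 0.3),
]
flags = flag_implausible(series)
```

In this sketch a single extreme value also pulls the averages computed for its neighbours toward itself, so a practical method would need further steps to limit that contamination; the abstract does not describe how the published algorithm handles this.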
Affiliation(s)
- Carrie Daymont
  - Departments of Pediatrics and Public Health Sciences, Penn State College of Medicine, Hershey, PA, USA
- Michelle E Ross
  - Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- A Russell Localio
  - Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Alexander G Fiks
  - Department of Biomedical and Health Informatics
  - Pediatric Research Consortium
  - Center for Pediatric Clinical Effectiveness
  - PolicyLab, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Pediatric Research in Office Settings, American Academy of Pediatrics, Elk Grove, IL, USA
- Richard C Wasserman
  - Pediatric Research in Office Settings, American Academy of Pediatrics, Elk Grove, IL, USA
  - Department of Pediatrics, University of Vermont, Burlington, VT, USA
52
Automatic information extraction from unstructured mammography reports using distributed semantics. J Biomed Inform 2018; 78:78-86. [PMID: 29329701 DOI: 10.1016/j.jbi.2017.12.016] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/31/2017] [Revised: 11/29/2017] [Accepted: 12/30/2017] [Indexed: 11/20/2022]
Abstract
To date, the methods developed for automated extraction of information from radiology reports are mainly rule-based or dictionary-based and therefore require substantial manual effort to build. Recent efforts to develop automated systems for entity detection have been undertaken, but little work has been done to automatically extract relations and their associated named entities in narrative radiology reports with accuracy comparable to rule-based methods. Our goal is to extract relations in an unsupervised way from radiology reports without specifying prior domain knowledge. We propose a hybrid approach for information extraction (IE) that combines a dependency-based parse tree with distributed semantics to generate structured information frames about particular findings/abnormalities from free-text mammography reports. The proposed IE system obtains an F1-score of 0.94 in terms of completeness of the content in the information frames, which outperforms a state-of-the-art rule-based system in this domain by a significant margin. The proposed system can be leveraged in a variety of applications, such as decision support and information retrieval, and may also easily scale to other radiology domains, since there is no need to tune the system with hand-crafted information extraction rules.
53
Manimaran J, Velmurugan T. Evaluation of lexicon- and syntax-based negation detection algorithms using clinical text data. BIO-ALGORITHMS AND MED-SYSTEMS 2017. [DOI: 10.1515/bams-2017-0016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
Background: Clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing (NLP) system. In recent development modules of cTAKES, a negation detection (ND) algorithm is used to improve annotation capabilities and simplify automatic identification of negative context in large clinical documents. In this research, two types of ND algorithms, lexicon-based and syntax-based, are analyzed using a database made openly available by the National Center for Biomedical Computing. The aim of this analysis is to find the pros and cons of these algorithms. Methods: Patient medical reports collected from three institutions and included in the 2010 i2b2/VA Clinical NLP Challenge are the input data for this analysis. This database includes patient discharge summaries and progress notes. The patient data is fed into five ND algorithms: NegEx, ConText, pyConTextNLP, DEEPEN and Negation Resolution (NR). NegEx, ConText and pyConTextNLP are lexicon-based, whereas DEEPEN and NR are syntax-based. The results from these five ND algorithms are post-processed and compared with the annotated data. Finally, the performance of these ND algorithms is evaluated by computing standard measures including F-measure, kappa statistics and ROC, among others, as well as the execution time of each algorithm. Results: The algorithms are tested through practical implementation, and each is evaluated on the accuracy of its results and its computational time in order to find a robust and reliable ND algorithm. Conclusions: The performance of the chosen ND algorithms is analyzed based on the results produced by this research approach. The time and accuracy of each algorithm are calculated and compared to suggest the best method.
54
Sohn S, Wang Y, Wi CI, Krusemark EA, Ryu E, Ali MH, Juhn YJ, Liu H. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. J Am Med Inform Assoc 2017; 25:353-359. [PMID: 29202185 PMCID: PMC7378885 DOI: 10.1093/jamia/ocx138] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 05/31/2017] [Revised: 09/20/2017] [Accepted: 10/25/2017] [Indexed: 12/11/2022]
Abstract
Objective To assess clinical documentation variations across health care institutions using different electronic medical record systems and investigate how they affect natural language processing (NLP) system portability. Materials and Methods Birth cohorts from Mayo Clinic and Sanford Children’s Hospital (SCH) were used in this study (n = 298 for each). Documentation variations regarding asthma between the 2 cohorts were examined in various aspects: (1) overall corpus at the word level (ie, lexical variation), (2) topics and asthma-related concepts (ie, semantic variation), and (3) clinical note types (ie, process variation). We compared those statistics and explored NLP system portability for asthma ascertainment in 2 stages: prototype and refinement. Results There exist notable lexical variations (word-level similarity = 0.669) and process variations (differences in major note types containing asthma-related concepts). However, semantic-level corpora were relatively homogeneous (topic similarity = 0.944, asthma-related concept similarity = 0.971). The NLP system for asthma ascertainment had an F-score of 0.937 at Mayo, and produced 0.813 (prototype) and 0.908 (refinement) when applied at SCH. Discussion The criteria for asthma ascertainment are largely dependent on asthma-related concepts. Therefore, we believe that semantic similarity is important to estimate NLP system portability. As the Mayo Clinic and SCH corpora were relatively homogeneous at a semantic level, the NLP system, developed at Mayo Clinic, was imported to SCH successfully with proper adjustments to deal with the intrinsic corpus heterogeneity.
Affiliation(s)
- Sunghwan Sohn
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
- Chung-Il Wi
  - Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN, USA
- Euijung Ryu
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
- Mir H Ali
  - Department of Pediatrics, Sanford Children's Hospital, Sioux Falls, SD, USA
- Young J Juhn
  - Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN, USA
- Hongfang Liu
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
55
Yim WW, Kwan SW, Yetisgen M. Classifying tumor event attributes in radiology reports. J Assoc Inf Sci Technol 2017. [DOI: 10.1002/asi.23937] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022]
Affiliation(s)
- Wen-wai Yim
  - Palo Alto Veterans Affairs, Biomedical Informatics Research, Stanford University, 1265 Welch Road; Stanford CA 94305
- Sharon W. Kwan
  - Department of Radiology; Interventional Radiology Section, University of Washington Medical Center, 1959 NE Pacific Street; Seattle WA 98195 USA
- Meliha Yetisgen
  - Biomedical and Health Informatics, Linguistics; University of Washington, Box 358047; Seattle WA 98195 USA
56
Deléger L, Campillos L, Ligozat AL, Névéol A. Design of an extensive information representation scheme for clinical narratives. J Biomed Semantics 2017; 8:37. [PMID: 28893314 PMCID: PMC5594525 DOI: 10.1186/s13326-017-0135-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/10/2016] [Accepted: 07/26/2017] [Indexed: 12/26/2022]
Abstract
BACKGROUND Knowledge representation frameworks are essential to the understanding of complex biomedical processes, and to the analysis of biomedical texts that describe them. Combined with natural language processing (NLP), they have the potential to contribute to retrospective studies by unlocking important phenotyping information contained in the narrative content of electronic health records (EHRs). This work aims to develop an extensive information representation scheme for clinical information contained in EHR narratives, and to support secondary use of EHR narrative data to answer clinical questions. METHODS We review recent work that proposed information representation schemes and applied them to the analysis of clinical narratives. We then propose a unifying scheme that supports the extraction of information to address a large variety of clinical questions. RESULTS We devised a new information representation scheme for clinical narratives that comprises 13 entities, 11 attributes and 37 relations. The associated annotation guidelines can be used to consistently apply the scheme to clinical narratives and are available at https://cabernet.limsi.fr/annotation_guide_for_the_merlot_french_clinical_corpus-Sept2016.pdf . CONCLUSION The information scheme includes many elements of the major schemes described in the clinical natural language processing literature, as well as a uniquely detailed set of relations.
Affiliation(s)
- Louise Deléger
  - French National Institute for Agricultural Research (INRA), Domaine de Vilvert, Jouy en Josas, Paris, 78352, France; LIMSI, CNRS, Université Paris - Saclay, Rue John von Neumann, Orsay, 91405, France
- Leonardo Campillos
  - LIMSI, CNRS, Université Paris - Saclay, Rue John von Neumann, Orsay, 91405, France
- Anne-Laure Ligozat
  - LIMSI, CNRS, Université Paris - Saclay, Rue John von Neumann, Orsay, 91405, France; ENSIIE, 1 square de la résistance, Évry Cedex, 91025, France
- Aurélie Névéol
  - LIMSI, CNRS, Université Paris - Saclay, Rue John von Neumann, Orsay, 91405, France
57
Névéol A, Zweigenbaum P. Clinical Natural Language Processing in 2014: Foundational Methods Supporting Efficient Healthcare. Yearb Med Inform 2017; 10:194-8. [PMID: 26293868 DOI: 10.15265/iy-2015-035] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Indexed: 01/01/2023]
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2014 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that the field is tackling text analysis methods of increasing depth. The full review process highlighted five papers addressing foundational methods in clinical NLP using clinically relevant texts from online forums or encyclopedias, clinical texts from Electronic Health Records, and included studies specifically aiming at a practical clinical outcome. The increased access to clinical data that was made possible with the recent progress of de-identification paved the way for the scientific community to address complex NLP problems such as word sense disambiguation, negation, temporal analysis and specific information nugget extraction. These advances in turn allowed for efficient application of NLP to clinical problems such as cancer patient triage. Another line of research investigates online clinically relevant texts and brings interesting insight on communication strategies to convey health-related information. CONCLUSIONS The field of clinical NLP is thriving through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques for concrete healthcare purposes. Clinical NLP is becoming mature for practical applications with a significant clinical impact.
Affiliation(s)
- A Névéol
  - Aurélie Névéol, LIMSI CNRS UPR 3251, Rue John von Neumann, Campus Universitaire d'Orsay, 91405 Orsay cedex, France, E-mail: {neveol,pz}@limsi.fr
58
Miller T, Dligach D, Bethard S, Lin C, Savova G. Towards generalizable entity-centric clinical coreference resolution. J Biomed Inform 2017; 69:251-258. [PMID: 28438706 DOI: 10.1016/j.jbi.2017.04.015] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 11/25/2016] [Revised: 04/13/2017] [Accepted: 04/19/2017] [Indexed: 11/16/2022]
Abstract
OBJECTIVE This work investigates the problem of clinical coreference resolution in a model that explicitly tracks entities, and aims to measure the performance of that model in both traditional in-domain train/test splits and cross-domain experiments that measure the generalizability of learned models. METHODS The two methods we compare are a baseline mention-pair coreference system that operates over pairs of mentions with best-first conflict resolution and a mention-synchronous system that incrementally builds coreference chains. We develop new features that incorporate distributional semantics, discourse features, and entity attributes. We use two new coreference datasets with similar annotation guidelines - the THYME colon cancer dataset and the DeepPhe breast cancer dataset. RESULTS The mention-synchronous system performs similarly on in-domain data but performs much better on new data. Part of speech tag features prove superior in feature generalizability experiments over other word representations. Our methods show generalization improvement but there is still a performance gap when testing in new domains. DISCUSSION Generalizability of clinical NLP systems is important and under-studied, so future work should attempt to perform cross-domain and cross-institution evaluations and explicitly develop features and training regimens that favor generalizability. A performance-optimized version of the mention-synchronous system will be included in the open source Apache cTAKES software.
Affiliation(s)
- Timothy Miller
  - Boston Children's Hospital, Boston, MA, United States; Harvard Medical School, Boston, MA, United States
- Chen Lin
  - Boston Children's Hospital, Boston, MA, United States
- Guergana Savova
  - Boston Children's Hospital, Boston, MA, United States; Harvard Medical School, Boston, MA, United States
59
Névéol A, Zweigenbaum P. Clinical Natural Language Processing in 2015: Leveraging the Variety of Texts of Clinical Interest. Yearb Med Inform 2016; 25:234-239. [PMID: 27830256 PMCID: PMC5171575 DOI: 10.15265/iy-2016-049] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/29/2023]
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Section editors first selected a shortlist of candidate best papers that were then peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that clinical NLP is making use of a variety of texts of clinical interest to contribute to the analysis of clinical information and the building of a body of clinical knowledge. The full review process highlighted five papers analyzing patient-authored texts or seeking to connect and aggregate multiple sources of information. They provide a contribution to the development of methods, resources, applications, and sometimes a combination of these aspects. CONCLUSIONS The field of clinical NLP continues to thrive through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques to impact clinical practice. Foundational progress in the field makes it possible to leverage a larger variety of texts of clinical interest for healthcare purposes.
Affiliation(s)
- A Névéol
  - Aurélie Névéol, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France, E-mail:
- P Zweigenbaum
  - Pierre Zweigenbaum, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France, E-mail:
60
Henriksson A, Kvist M, Dalianis H, Duneld M. Identifying adverse drug event information in clinical notes with distributional semantic representations of context. J Biomed Inform 2015; 57:333-49. [PMID: 26291578 DOI: 10.1016/j.jbi.2015.08.013] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 05/05/2015] [Revised: 07/19/2015] [Accepted: 08/10/2015] [Indexed: 10/23/2022]
Abstract
For the purpose of post-marketing drug safety surveillance, which has traditionally relied on the voluntary reporting of individual cases of adverse drug events (ADEs), other sources of information are now being explored, including electronic health records (EHRs), which give us access to enormous amounts of longitudinal observations of the treatment of patients and their drug use. Adverse drug events, which can be encoded in EHRs with certain diagnosis codes, are, however, heavily underreported. It is therefore important to develop capabilities to process, by means of computational methods, the more unstructured EHR data in the form of clinical notes, where clinicians may describe and reason around suspected ADEs. In this study, we report on the creation of an annotated corpus of Swedish health records for the purpose of learning to identify information pertaining to ADEs present in clinical notes. To this end, three key tasks are tackled: recognizing relevant named entities (disorders, symptoms, drugs), labeling attributes of the recognized entities (negation, speculation, temporality), and relationships between them (indication, adverse drug event). For each of the three tasks, leveraging models of distributional semantics - i.e., unsupervised methods that exploit co-occurrence information to model, typically in vector space, the meaning of words - and, in particular, combinations of such models, is shown to improve the predictive performance. The ability to make use of such unsupervised methods is critical when faced with large amounts of sparse and high-dimensional data, especially in domains where annotated resources are scarce.
Affiliation(s)
- Aron Henriksson
  - Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
- Maria Kvist
  - Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden; Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Sweden
- Hercules Dalianis
  - Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
- Martin Duneld
  - Department of Computer and Systems Sciences (DSV), Stockholm University, Sweden
61
Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, Beesley C, Dexter P, Max Schmidt C, Liu H, Palakal M. DEEPEN: A negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform 2015; 54:213-9. [PMID: 25791500 PMCID: PMC5863758 DOI: 10.1016/j.jbi.2015.02.010] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 09/16/2014] [Revised: 01/22/2015] [Accepted: 02/24/2015] [Indexed: 12/01/2022]
Abstract
In Electronic Health Records (EHRs), much valuable information regarding patients' conditions is embedded in free text format. Natural language processing (NLP) techniques have been developed to extract clinical information from free text. One challenge faced in clinical NLP is that the meaning of clinical entities is heavily affected by modifiers such as negation. A negation detection algorithm, NegEx, applies a simplistic approach that has been shown to be powerful in clinical NLP. However, because it does not consider the contextual relationship between words within a sentence, NegEx fails to correctly capture the negation status of concepts in complex sentences. Incorrect negation assignment could cause inaccurate diagnosis of patients' conditions or contaminated study cohorts. We developed a negation algorithm called DEEPEN to decrease NegEx's false positives by taking into account the dependency relationship between negation words and concepts within a sentence using the Stanford dependency parser. The system was developed and tested using EHR data from Indiana University (IU) and was further evaluated on a Mayo Clinic dataset to assess its generalizability. The evaluation results demonstrate that DEEPEN, which incorporates dependency parsing into NegEx, can reduce the number of incorrect negation assignments for patients with positive findings, and therefore improve the identification of patients with the target clinical findings in EHRs.
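The kind of false positive the abstract attributes to approaches that ignore sentence structure can be illustrated with a toy trigger-window negation check. This is a simplified sketch written for this listing, not the NegEx or DEEPEN implementation: the trigger list, the five-token window, and the function name `is_negated_by_window` are assumptions.

```python
import re

# Illustrative trigger words only; real lexicon-based systems use much
# larger curated lists and additional rules.
NEGATION_TRIGGERS = ("no", "not", "denies", "without")


def is_negated_by_window(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if a trigger word occurs within `window` tokens
    before any occurrence of `concept` in `sentence`."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == concept_tokens:
            preceding = tokens[max(0, start - window):start]
            if any(token in NEGATION_TRIGGERS for token in preceding):
                return True
    return False


# Correctly negated: the trigger applies to the concept.
simple = is_negated_by_window("Patient denies chest pain.", "chest pain")

# False positive: "no" modifies "change", yet the cyst is reported as present.
complex_case = is_negated_by_window("No change in the pancreatic cyst.", "cyst")
```

Both calls return `True`, although only the first concept is actually negated. Checking whether the trigger and the concept are linked in a dependency parse, as the abstract describes for DEEPEN, is one way to reject the second case.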
Affiliation(s)
- Saeed Mehrabi
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA; Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Anand Krishnan
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
- Sunghwan Sohn
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Alexandra M Roch
  - Department of Surgery, Indiana University, Indianapolis, IN, USA
- Heidi Schmidt
  - Department of Surgery, Indiana University, Indianapolis, IN, USA
- Paul Dexter
  - Regenstrief Institute, Indianapolis, IN, USA
- C Max Schmidt
  - Department of Surgery, Indiana University, Indianapolis, IN, USA
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Mathew Palakal
  - School of Informatics and Computing, Indiana University, Indianapolis, IN, USA
62
Mehrabi S, Krishnan A, Roch AM, Schmidt H, Li D, Kesterson J, Beesley C, Dexter P, Schmidt M, Palakal M, Liu H. Identification of Patients with Family History of Pancreatic Cancer--Investigation of an NLP System Portability. Stud Health Technol Inform 2015; 216:604-608. [PMID: 26262122 PMCID: PMC5863760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
In this study we have developed a rule-based natural language processing (NLP) system to identify patients with family history of pancreatic cancer. The algorithm was developed in an Unstructured Information Management Architecture (UIMA) framework and consisted of section segmentation, relation discovery, and negation detection. The system was evaluated on data from two institutions. The family history identification precision was consistent across the institutions, shifting from 88.9% on the Indiana University (IU) dataset to 87.8% on the Mayo Clinic dataset. Customizing the algorithm on the Mayo Clinic data increased its precision to 88.1%. The family member relation discovery achieved precision, recall, and F-measure of 75.3%, 91.6% and 82.6%, respectively. Negation detection resulted in precision of 99.1%. The results show that rule-based NLP approaches for specific information extraction tasks are portable across institutions; however, customization of the algorithm on the new dataset improves its performance.
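The relation between the precision, recall, and F-measure figures reported for family member relation discovery can be checked with a short calculation. The sketch assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is ours.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; with beta = 1 this is the harmonic mean of
    precision and recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Reported values for family member relation discovery.
relation_f1 = f_measure(0.753, 0.916)
```

With the rounded inputs 0.753 and 0.916 this gives roughly 0.827, which agrees with the reported 82.6% to within the rounding of the published precision and recall.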
Affiliation(s)
- Saeed Mehrabi
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Anand Krishnan
  - School of Informatics and Computing, Indiana University, Indianapolis, IN
- Heidi Schmidt
  - Department of Surgery, Indiana University, Indianapolis, IN
- DingCheng Li
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Paul Dexter
  - Regenstrief Institute Inc., Indianapolis, IN
- Max Schmidt
  - Department of Surgery, Indiana University, Indianapolis, IN
- Mathew Palakal
  - School of Informatics and Computing, Indiana University, Indianapolis, IN
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN