1. Adverse drug event detection using natural language processing: A scoping review of supervised learning methods. PLoS One 2023; 18:e0279842. PMID: 36595517; PMCID: PMC9810201; DOI: 10.1371/journal.pone.0279842.
Abstract
To reduce adverse drug events (ADEs), hospitals need a system to support them in monitoring ADE occurrence routinely, rapidly, and at scale. Natural language processing (NLP), a computerized approach to analyzing text data, has shown promising results for ADE detection in the context of pharmacovigilance. However, a detailed qualitative assessment and critical appraisal of NLP methods for ADE detection in the context of ADE monitoring in hospitals is lacking. Therefore, we conducted a scoping review to close this knowledge gap and to provide directions for future research and practice. We included articles where NLP was applied to detect ADEs in clinical narratives within electronic health records of inpatients. Quantitative and qualitative data items relating to NLP methods were extracted and critically appraised. Out of 1,065 articles screened for eligibility, 29 met the inclusion criteria. The most frequent tasks included named entity recognition (n = 17; 58.6%) and relation extraction/classification (n = 15; 51.7%). Clinical involvement was reported in nine studies (31%). Multiple NLP modelling approaches seem suitable, with Long Short-Term Memory and Conditional Random Field methods most commonly used. Although the reported overall performance of the systems was high, it gives an inflated impression, given a steep drop in performance when predicting the ADE entity or ADE relation class. When annotating corpora, treating an ADE as a relation between a drug and a non-drug entity seems the best practice. Future research should focus on semi-automated methods to reduce the manual annotation effort, and on examining implementation of the NLP methods in practice.
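The annotation practice recommended above, treating an ADE as a relation between a drug entity and a non-drug entity, can be sketched as a minimal data structure. This is an invented toy schema for illustration, not the format used by any of the reviewed systems:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # character offset in the note (inclusive)
    end: int     # character offset in the note (exclusive)
    label: str   # e.g. "DRUG" or "CONDITION"
    text: str

@dataclass(frozen=True)
class AdeRelation:
    """An ADE modelled as a typed relation, not a standalone entity."""
    drug: Entity     # must be a DRUG entity
    effect: Entity   # the non-drug entity (symptom/condition)

    def __post_init__(self):
        if self.drug.label != "DRUG":
            raise ValueError("ADE relation must start from a DRUG entity")
        if self.effect.label == "DRUG":
            raise ValueError("ADE effect must be a non-drug entity")

note = "Patient developed a rash after starting amoxicillin."
drug = Entity(40, 51, "DRUG", "amoxicillin")
rash = Entity(20, 24, "CONDITION", "rash")
ade = AdeRelation(drug=drug, effect=rash)
print(ade.drug.text, "->", ade.effect.text)
```

Keeping the offsets on the entities lets the relation be checked against the source text at any time.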
2. French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023; 137:104252. PMID: 36464228; PMCID: PMC9845184; DOI: 10.1016/j.jbi.2022.104252.
Abstract
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing, both for translational information extraction applications and for providing context for downstream tasks such as relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.
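As a toy illustration of the task itself (not of any specific system the survey covers), a minimal dictionary-based linker maps a normalized mention to an ontology identifier, with a fuzzy fallback for small spelling variations; the tiny lexicon below is illustrative only:

```python
import difflib
from typing import Optional

# Illustrative mini-lexicon: surface form -> ontology identifier
LEXICON = {
    "myocardial infarction": "MESH:D009203",
    "heart attack": "MESH:D009203",   # synonym maps to the same concept
    "diabetes mellitus": "MESH:D003920",
}

def link(mention: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a text span to an ontology ID: exact match first, then fuzzy."""
    key = " ".join(mention.lower().split())  # case/whitespace normalization
    if key in LEXICON:
        return LEXICON[key]
    close = difflib.get_close_matches(key, LEXICON, n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None

print(link("Heart Attack"))            # exact hit after normalization
print(link("myocardial  infarcton"))   # small typo, resolved by fuzzy match
```

Real BEL systems replace the dictionary with learned candidate generation and ranking, but the input/output contract is the same: span of text in, normalized identifier out.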
Affiliation(s)
- Evan French
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
3. Furrer L, Cornelius J, Rinaldi F. Parallel sequence tagging for concept recognition. BMC Bioinformatics 2022; 22:623. PMID: 35331131; PMCID: PMC8943923; DOI: 10.1186/s12859-021-04511-y.
Abstract
BACKGROUND Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined serially, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modelled as sequence-labelling tasks operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. RESULTS We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set. CONCLUSIONS Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows a good trade-off to be achieved between established knowledge (training set) and novel information (unseen concepts).
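One simple harmonisation strategy for two parallel token-level predictions can be sketched as follows. The paper evaluates several strategies; this "concept-wins-if-confident" rule is an invented example of the general idea, not their exact method:

```python
def harmonise(ner_tags, nen_ids, nen_conf, threshold=0.5):
    """Merge span tags from an NER classifier with concept IDs from an
    NEN classifier into one output sequence, token by token."""
    merged = []
    for tag, cid, conf in zip(ner_tags, nen_ids, nen_conf):
        if cid != "O" and conf >= threshold:
            merged.append(cid)   # trust the concept-level prediction
        elif tag != "O":
            merged.append(tag)   # fall back to the coarse NER span
        else:
            merged.append("O")
    return merged

# Toy predictions over four tokens (concept ID shown for illustration)
ner = ["B-Chem", "O", "B-Gene", "O"]
nen = ["CHEBI:15377", "O", "O", "O"]
conf = [0.9, 0.1, 0.3, 0.2]
print(harmonise(ner, nen, conf))  # ['CHEBI:15377', 'O', 'B-Gene', 'O']
```

The point of calibrating the threshold per annotation set, as the conclusions suggest, is that the best balance between the two classifiers differs across ontologies.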
Affiliation(s)
- Lenz Furrer
- Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Joseph Cornelius
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Fabio Rinaldi
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Fondazione Bruno Kessler, Trento, Italy
4. Artificial Intelligence and Cardiovascular Genetics. Life (Basel) 2022; 12:279. PMID: 35207566; PMCID: PMC8875522; DOI: 10.3390/life12020279.
Abstract
Polygenic diseases, genetic disorders caused by the combined action of multiple genes, pose unique and significant challenges for the diagnosis and management of affected patients. A major goal of cardiovascular medicine has been to understand how genetic variation leads to the clinical heterogeneity seen in polygenic cardiovascular diseases (CVDs). Recent advances and emerging technologies in artificial intelligence (AI), coupled with the ever-increasing availability of next-generation sequencing (NGS) technologies, now provide researchers with unprecedented possibilities for dynamic and complex biological genomic analyses. Combining these technologies may lead to a deeper understanding of heterogeneous polygenic CVDs, better prognostic guidance, and, ultimately, more personalized medicine. Advances will likely be achieved through increasingly frequent and robust genomic characterization of patients, as well as the integration of genomic data with other clinical data, such as cardiac imaging, coronary angiography, and clinical biomarkers. This review discusses the current opportunities and limitations of genomics, provides a brief overview of AI, and identifies the current applications, limitations, and future directions of AI in genomics.
5. Lossio-Ventura JA, Sun R, Boussard S, Hernandez-Boussard T. Clinical concept recognition: Evaluation of existing systems on EHRs. Front Artif Intell 2022; 5:1051724. PMID: 36714202; PMCID: PMC9880223; DOI: 10.3389/frai.2022.1051724.
Abstract
Objective The adoption of electronic health records (EHRs) has produced enormous amounts of data, creating research opportunities in clinical data science. Several concept recognition systems have been developed to facilitate clinical information extraction from these data. While studies exist that compare the performance of many concept recognition systems, they are typically developed internally and may be biased by differing internal implementations, parameter choices, and the limited number of systems included in the evaluations. The goal of this research is to evaluate the performance of existing systems at retrieving relevant clinical concepts from EHRs. Methods We investigated six concept recognition systems: CLAMP, cTAKES, MetaMap, NCBO Annotator, QuickUMLS, and ScispaCy. Extracted clinical concepts included procedures, disorders, medications, and anatomical locations. System performance was evaluated on two datasets: the 2010 i2b2 dataset and MIMIC-III. Additionally, we assessed the performance of these systems in five challenging situations: negation, severity, abbreviation, ambiguity, and misspelling. Results For clinical concept extraction, CLAMP achieved the best performance on exact and inexact matching, with F-scores of 0.70 and 0.94, respectively, on i2b2, and 0.39 and 0.50, respectively, on MIMIC-III. Across the five challenging situations, ScispaCy excelled at extracting abbreviation information (F-score: 0.86), followed by NCBO Annotator (F-score: 0.79). CLAMP performed best at extracting severity terms (F-score: 0.73), followed by NCBO Annotator (F-score: 0.68), and outperformed the other systems at extracting negated concepts (F-score: 0.63). Conclusions Several concept recognition systems exist to extract clinical information from unstructured data. This study provides an external evaluation by end users of six commonly used systems across different extraction tasks. Our findings suggest that CLAMP provides the most comprehensive set of annotations for clinical concept extraction tasks and their associated challenges. Comparing standard extraction tasks across systems provides guidance to other clinical researchers when selecting a concept recognition system relevant to their clinical information extraction task.
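The exact/inexact distinction behind the F-scores above can be sketched generically. This assumes spans are (start, end, type) triples and counts any character overlap between same-typed spans as an inexact hit; the study's actual scorer may differ in detail:

```python
def f1(gold, pred, exact=True):
    """Span-level F-score; inexact mode credits any character overlap
    between same-typed spans."""
    def hit(g, p):
        if g[2] != p[2]:                    # entity type must agree
            return False
        if exact:
            return g[:2] == p[:2]           # identical offsets
        return g[0] < p[1] and p[0] < g[1]  # overlapping offsets
    tp_pred = sum(any(hit(g, p) for g in gold) for p in pred)
    prec = tp_pred / len(pred) if pred else 0.0
    rec = (sum(any(hit(g, p) for p in pred) for g in gold) / len(gold)
           if gold else 0.0)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [(0, 12, "Disorder"), (20, 29, "Drug")]
pred = [(0, 12, "Disorder"), (18, 29, "Drug")]
print(f1(gold, pred, exact=True))   # 0.5: one boundary mismatch
print(f1(gold, pred, exact=False))  # 1.0: the overlap still counts
```

This also explains why inexact scores (0.94 on i2b2) sit well above exact ones (0.70): boundary disagreements are forgiven.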
Affiliation(s)
- Juan Antonio Lossio-Ventura
- Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- National Institute of Mental Health, National Institutes of Health, Bethesda, MD, United States
- Ran Sun
- Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Tina Hernandez-Boussard
- Biomedical Informatics Research, Stanford University, Stanford, CA, United States
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, United States
- Department of Surgery, Stanford University, Stanford, CA, United States
6. Converting Biomedical Text Annotated Resources into FAIR Research Objects with an Open Science Platform. Appl Sci (Basel) 2021. DOI: 10.3390/app11209648.
Abstract
Today, there are excellent resources for the semantic annotation of biomedical text. These resources range from ontologies and NLP tools to annotators and web services. Most are available either as open-source components (e.g., MetaMap) or as web services that offer free access (e.g., Whatizit). In order to use these resources in automatic text-annotation pipelines, researchers face significant technical challenges. For open-source tools, the challenges include setting up the computational environment, resolving dependencies, and compiling and installing the software. For web services, the challenge is implementing clients to communicate with the respective web APIs. Even resources that are available as Docker containers (e.g., NCBO Annotator) require significant technical skill to install and set up. This work deals with the task of creating ready-to-install-and-run Research Objects (ROs) for a large collection of components in biomedical text analysis. These components include (a) tools such as cTAKES, NOBLE Coder, MetaMap, NCBO Annotator, BeCAS, and Neji; (b) ontologies from BioPortal, NCBI BioSystems, and Open Biomedical Ontologies; and (c) text corpora such as BC4GO, the Mantra Gold Standard Corpus, and the COVID-19 Open Research Dataset. We make these resources available in OpenBio.eu, an open-science RO repository and workflow management system. All ROs can be searched, shared, edited, downloaded, commented on, and rated. We also demonstrate how one can easily connect these ROs to form a large variety of text-annotation pipelines.
7. Lauriola I, Aiolli F, Lavelli A, Rinaldi F. Learning adaptive representations for entity recognition in the biomedical domain. J Biomed Semantics 2021; 12:10. PMID: 34001263; PMCID: PMC8127187; DOI: 10.1186/s13326-021-00238-0.
Abstract
Background Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize named entities in textual documents. Several systems exist to solve this task in the biomedical domain, based on Natural Language Processing techniques and Machine Learning algorithms. A crucial step in these applications is the choice of the representation that describes the data. Several representations have been proposed in the literature, some based on strong domain knowledge and consisting of features manually defined by domain experts. Such representations usually describe the problem well, but they require substantial human effort and annotated data. On the other hand, general-purpose representations such as word embeddings do not require human domain knowledge, but they can be too general for a specific task. Results This paper investigates methods to learn the best representation directly from data, by combining several knowledge-based representations and word embeddings. Two mechanisms are considered to perform the combination: neural networks and Multiple Kernel Learning. To this end, we use a hybrid architecture for biomedical entity recognition which integrates dictionary look-up (also known as gazetteers) with machine learning techniques. Results on the CRAFT corpus clearly show the benefits of the proposed algorithm in terms of F1 score. Conclusions Our experiments show that the principled combination of general, domain-specific, word-level, and character-level representations improves the performance of entity recognition. We also discuss the contribution of each representation to the final solution.
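The simplest form of the combination described, stacking heterogeneous per-token representations into one feature vector, can be sketched as below. The features and values are invented toys; the paper's actual combination mechanisms are learned (neural networks and Multiple Kernel Learning), not plain concatenation:

```python
def token_representation(token, embedding, gazetteer):
    """Concatenate a word embedding, a dictionary-lookup flag, and
    simple character-level features into one feature vector."""
    emb = embedding.get(token.lower(), [0.0] * 3)        # unseen -> zero vector
    in_gazetteer = [1.0 if token.lower() in gazetteer else 0.0]
    char_feats = [
        1.0 if token[0].isupper() else 0.0,              # capitalisation
        1.0 if any(c.isdigit() for c in token) else 0.0, # contains a digit
    ]
    return emb + in_gazetteer + char_feats

embedding = {"aspirin": [0.2, -0.1, 0.7]}  # toy 3-dimensional embeddings
gazetteer = {"aspirin", "ibuprofen"}       # toy drug dictionary
print(token_representation("Aspirin", embedding, gazetteer))
# [0.2, -0.1, 0.7, 1.0, 1.0, 0.0]
```

The learned variants differ in that they weight (rather than merely juxtapose) the word-, character-, and knowledge-based blocks.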
Affiliation(s)
- Ivano Lauriola
- Department of Mathematics, University of Padova, Via Trieste 63, Padova, 35121, Italy
- Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy
- Fabio Aiolli
- Department of Mathematics, University of Padova, Via Trieste 63, Padova, 35121, Italy
- Alberto Lavelli
- Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy
- Fabio Rinaldi
- Fondazione Bruno Kessler, Via Sommarive 18, Trento, 38123, Italy
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA), Via Cantonale 2C, Manno, 6928, Switzerland
- Department of Quantitative Biomedicine, University of Zurich, Andreasstrasse 15, Zürich, 8050, Switzerland
- SIB Swiss Institute of Bioinformatics, Génopode, Quartier UNIL-Sorge, Lausanne, 1015, Switzerland
Collapse
|
8. Colicchio TK, Dissanayake PI, Cimino JJ. Formal representation of patients' care context data: the path to improving the electronic health record. J Am Med Inform Assoc 2021; 27:1648-1657. PMID: 32935127; PMCID: PMC7671623; DOI: 10.1093/jamia/ocaa134.
Abstract
Objective To develop a collection of concept-relationship-concept tuples to formally represent patients' care context data to inform electronic health record (EHR) development. Materials and Methods We reviewed semantic relationships reported in the literature and developed a manual annotation schema. We used the initial schema to annotate sentences extracted from narrative note sections of cardiology, urology, and ear, nose, and throat (ENT) notes. We audio-recorded ENT visits and annotated their parsed transcripts. We combined the results of each annotation into a consolidated set of concept-relationship-concept tuples, and then compared the tuples used within and across the multiple data sources. Results We annotated a total of 626 sentences. Starting with 8 relationships from the literature, we annotated 182 sentences from 8 inpatient consult notes (initial set of tuples = 43). Next, we annotated 232 sentences from 10 outpatient visit notes (enhanced set of tuples = 75). Then, we annotated 212 sentences from transcripts of 5 outpatient visits (final set of tuples = 82). The tuples from the visit transcripts covered 103 (74%) of the concepts documented in the notes of their respective visits. There were 20 (24%) tuples used across all data sources, 10 (12%) used only in inpatient notes, 15 (18%) used only in visit notes, and 7 (9%) used only in the visit transcripts. Conclusions We produced a robust set of 82 tuples useful for representing patients' care context data. We propose several applications of our tuples to improve EHR navigation, data entry, learning health systems, and decision support.
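A concept-relationship-concept tuple of the kind developed here can be represented minimally as below. The relationship names and example statements are hypothetical, in the spirit of the paper rather than drawn from its published set of 82 tuples:

```python
from typing import NamedTuple

class ContextTuple(NamedTuple):
    subject: str       # a clinical concept
    relationship: str  # the semantic link between the two concepts
    obj: str           # the related concept

# Hypothetical care-context statements extracted from a note
tuples = [
    ContextTuple("chest pain", "has_temporal_context", "two days ago"),
    ContextTuple("metoprolol", "treats", "hypertension"),
]

# Once formalized, tuples are queryable like any structured data,
# e.g. filtering by relationship type:
treatment_links = [t for t in tuples if t.relationship == "treats"]
print(treatment_links[0].subject)  # metoprolol
```

This queryability is what enables the proposed applications (navigation, decision support): free-text context becomes filterable structure.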
Affiliation(s)
- James J Cimino
- Informatics Institute, University of Alabama at Birmingham, USA
9. Diaz-Garelli F, Lenoir KM, Wells BJ. Catch Me if You Can: Acute Events Hidden in Structured Chronic Disease Diagnosis Descriptions Show Detectable Recording Patterns in EHR. AMIA Annu Symp Proc 2021; 2020:373-382. PMID: 33936410; PMCID: PMC8075503.
Abstract
Our previous research showed that the accuracy of structured cancer diagnosis (DX) description data varied across electronic health record (EHR) segments (e.g., encounter DX, problem list). We provide initial evidence corroborating these findings in EHRs from patients with diabetes. We hypothesized that the odds of recording an "uncontrolled diabetes" DX increased after a hemoglobin A1c result above 9% and that this rate would vary across EHR segments. Our statistical models revealed that each DX indicating uncontrolled diabetes was 2.6% more likely to occur post-A1c > 9% overall (adj-p = .0005) and 3.9% more likely after controlling for EHR segment (adj-p < .0001). However, odds ratios varied across segments (1.021 …).
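The odds-ratio comparison described can be reproduced schematically from a 2×2 table of counts; the counts below are invented for illustration, not the study's data:

```python
def odds_ratio(exposed_event, exposed_no_event,
               unexposed_event, unexposed_no_event):
    """OR = (a/b) / (c/d) for a 2x2 contingency table."""
    return ((exposed_event / exposed_no_event)
            / (unexposed_event / unexposed_no_event))

# Hypothetical counts: "uncontrolled diabetes" DX recorded vs. not,
# after and before an A1c result above 9%
after = (120, 880)    # (DX recorded, DX not recorded) post-A1c > 9%
before = (105, 895)   # same counts pre-A1c > 9%
or_ = odds_ratio(after[0], after[1], before[0], before[1])
print(round(or_, 3))  # 1.162
```

An OR just above 1, as here, corresponds to the small but detectable recording shift the abstract reports; the study additionally adjusts for covariates and computes the ratio per EHR segment.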
10. Ryu B, Yoon E, Kim S, Lee S, Baek H, Yi S, Na HY, Kim JW, Baek RM, Hwang H, Yoo S. Transformation of Pathology Reports Into the Common Data Model With Oncology Module: Use Case for Colon Cancer. J Med Internet Res 2020; 22:e18526. PMID: 33295294; PMCID: PMC7758167; DOI: 10.2196/18526.
Abstract
BACKGROUND Common data models (CDMs) help standardize electronic health record data and facilitate outcome analysis for observational and longitudinal research. An analysis of pathology reports is required to establish a fundamental information infrastructure for data-driven colon cancer research. The Observational Medical Outcomes Partnership (OMOP) CDM is used in distributed research networks for clinical data; however, it requires conversion of free-text pathology reports into the CDM's format, and there are few use cases of representing cancer data in a CDM. OBJECTIVE In this study, we aimed to construct a CDM database of colon cancer-related pathology with natural language processing (NLP) for a research platform that can utilize both clinical and omics data. The essential text entities from the pathology reports are extracted, standardized, and converted to the OMOP CDM format in order to utilize the pathology data in cancer research. METHODS We extracted clinical text entities, mapped them to standard concepts in the Observational Health Data Sciences and Informatics vocabularies, and built databases and defined relations for the CDM tables. Major clinical entities were extracted through NLP from pathology reports of surgical specimens, immunohistochemical studies, and molecular studies of colon cancer patients at a tertiary general hospital in South Korea. Items were extracted from each report using regular expressions in Python. Unstructured data, such as text without a discernible pattern, were handled with expert advice by adding regular expression rules. Our own dictionary was used for normalization and standardization to deal with biomarker and gene names and other ungrammatical expressions. The extracted clinical and genetic information was mapped to the Logical Observation Identifiers Names and Codes (LOINC) database and the Systematized Nomenclature of Medicine (SNOMED) standard terminologies recommended by the OMOP CDM. The database-table relationships were newly defined through SNOMED standard terminology concepts, and the standardized data were inserted into the CDM tables. For evaluation, 100 reports were randomly selected and independently annotated by a medical informatics expert and a nurse. RESULTS We examined and standardized 1848 immunohistochemical study reports, 3890 molecular study reports, and 12,352 pathology reports of surgical specimens (from 2017 to 2018). The constructed and updated database contained the following extracted colorectal entities: (1) NOTE_NLP, (2) MEASUREMENT, (3) CONDITION_OCCURRENCE, (4) SPECIMEN, and (5) FACT_RELATIONSHIP of specimen with condition and measurement. CONCLUSIONS This study prepared CDM data for a research platform to take advantage of omics, clinical, and patient data at Seoul National University Bundang Hospital for colon cancer pathology. More sophisticated preparation of the pathology data is needed for further research on cancer genomics, and various types of text narratives are the next target for additional research on the use of data in the CDM.
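The abstract states that items were extracted with regular expressions in Python; a minimal sketch of that approach looks like the following. The patterns and report wording are invented, not the hospital's actual templates:

```python
import re

REPORT = """Specimen: sigmoid colon, resection.
Tumor size: 4.5 x 3.0 cm. Histologic grade: moderately differentiated.
KRAS mutation: Not detected. MSI status: MSS."""

# One pattern per item; rules like these are accumulated as new
# report layouts are encountered
PATTERNS = {
    "tumor_size_cm": re.compile(r"Tumor size:\s*([\d.]+\s*x\s*[\d.]+)\s*cm"),
    "grade": re.compile(r"Histologic grade:\s*([^.]+)\."),
    "kras": re.compile(r"KRAS mutation:\s*([^.]+)\."),
}

def extract(report: str) -> dict:
    """Pull structured items out of a free-text report, one regex per item."""
    out = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(report)
        out[name] = m.group(1).strip() if m else None
    return out

print(extract(REPORT))
# {'tumor_size_cm': '4.5 x 3.0', 'grade': 'moderately differentiated',
#  'kras': 'Not detected'}
```

In the study's pipeline, outputs like these are then normalized against a dictionary and mapped to LOINC/SNOMED concepts before CDM insertion.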
Affiliation(s)
- Borim Ryu
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Eunsil Yoon
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Seok Kim
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Sejoon Lee
- Department of Pathology and Translational Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Hyunyoung Baek
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Soyoung Yi
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Hee Young Na
- Department of Pathology and Translational Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Ji-Won Kim
- Division of Hematology and Medical Oncology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Rong-Min Baek
- Department of Plastic Surgery, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Hee Hwang
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
- Sooyoung Yoo
- Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
11. Wang Z, Zhu Y, Li D, Yin Y, Zhang J. Feature rearrangement based deep learning system for predicting heart failure mortality. Comput Methods Programs Biomed 2020; 191:105383. PMID: 32062185; DOI: 10.1016/j.cmpb.2020.105383.
Abstract
BACKGROUND AND OBJECTIVE Heart Failure is a clinical syndrome commonly caused by structural or functional cardiac impairment. Fast and accurate mortality prediction for Heart Failure is essential to improve patient care and prevent death. However, due to the class-imbalance problem and poor feature representation in Heart Failure data, mortality prediction is difficult with simple models. To handle these problems, this study proposes a fast and accurate Heart Failure mortality prediction framework. METHODS This paper proposes a feature rearrangement based deep learning system for Heart Failure mortality prediction. The proposed framework improves prediction performance by handling the imbalance problem and achieving better feature representation. The paper also proposes a method named the feature rearrangement based convolutional layer, which demonstrates that the order of the input features matters for a convolutional network. RESULTS The proposed system is experimentally evaluated on real-world Heart Failure data collected from the EHR system of Shanghai Shuguang Hospital, from which 10,198 in-patient records were extracted between March 2009 and April 2016. Internal comparisons illustrate that the proposed framework achieves the best performance for Heart Failure mortality prediction. Extensive experiments against other machine learning methods demonstrate that the proposed method has the highest average accuracy and area under the curve when predicting the three targets of in-hospital mortality, 30-day mortality, and 1-year mortality. Finally, the top 12 essential clinical features are mined together with their chi-square scores, which can help clinicians in the treatment and study of Heart Failure. CONCLUSIONS The proposed method successfully predicts the different targets across the three observation windows. The feature rearrangement based convolutional layer and Focal loss employed in the framework help improve the accuracy of Heart Failure mortality prediction. The proposed method is fast and accurate, especially in imbalanced settings, and the paper also provides a reasonable pipeline for modelling EHR data and handling imbalance in medical data.
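The abstract argues that input-feature order matters for a convolutional layer but does not spell out the rearrangement rule. One plausible sketch, assumed here purely for illustration and not taken from the paper, is to sort feature columns by an importance score (e.g. the chi-square scores mentioned) so that high-signal features end up adjacent before a filter slides over them:

```python
def rearrange(rows, scores):
    """Reorder the columns of each sample by descending feature importance,
    so a conv filter sees a locally coherent ordering rather than the
    original arbitrary one."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return [[row[j] for j in order] for row in rows], order

samples = [[0.1, 5.0, 3.2], [0.4, 4.1, 2.9]]  # toy feature matrix
chi2 = [0.2, 1.8, 0.9]                         # toy per-feature scores
rearranged, order = rearrange(samples, chi2)
print(order)       # [1, 2, 0]
print(rearranged)  # [[5.0, 3.2, 0.1], [4.1, 2.9, 0.4]]
```

Whatever the exact rule, the key point the paper makes survives this sketch: a 1-D convolution is sensitive to which features sit next to each other.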
Affiliation(s)
- Zhe Wang
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai 200237, PR China
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, PR China
- Yiwen Zhu
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, PR China
- Dongdong Li
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, PR China
- Yichao Yin
- Shanghai Shuguang Hospital, Shanghai 200021, PR China
- Jing Zhang
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, PR China
12. Venkataraman GR, Pineda AL, Bear Don't Walk IV OJ, Zehnder AM, Ayyar S, Page RL, Bustamante CD, Rivas MA. FasTag: Automatic text classification of unstructured medical narratives. PLoS One 2020; 15:e0234647. PMID: 32569327; PMCID: PMC7307763; DOI: 10.1371/journal.pone.0234647.
Abstract
Unstructured clinical narratives are continuously recorded as part of care delivery in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes. Despite these efforts, however, label availability and accuracy are both suboptimal. In this retrospective study, we aimed to automate the assignment of top-level International Classification of Diseases version 9 (ICD-9) codes to clinical records from human and veterinary data stores using minimal manual labor and feature curation. Automating top-level annotations could in turn enable rapid cohort identification, especially in a veterinary setting. To this end, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baseline classification performance by training Decision Tree (DT) and Random Forest (RF) models, and also investigated whether transforming the data with MetaMap Lite, a clinical natural language processing tool, affected classification performance. We showed that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories, with average weighted macro F1 scores of 0.74 and 0.68, respectively. In the "neoplasia" category, the model trained on veterinary data had high validation accuracy on veterinary data and moderate accuracy on human data, with F1 scores of 0.91 and 0.70, respectively. Our LSTM method scored slightly higher than the DT and RF models. LSTM-RNN models represent a scalable architecture that could prove useful in cohort identification for comparative oncology studies. Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. Our approach is a step forward for these two domains to learn from and inform one another.
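The models' target, top-level ICD-9 assignment, amounts to bucketing a code into one of the standard chapters. A sketch using the conventional numeric chapter ranges follows (the table is abridged to a few chapters for brevity; this is generic ICD-9 structure, not code from the paper):

```python
# Conventional ICD-9-CM chapter ranges for numeric codes (abridged)
CHAPTERS = [
    (1, 139, "infectious and parasitic diseases"),
    (140, 239, "neoplasia"),
    (390, 459, "circulatory system"),
    (800, 999, "injury and poisoning"),
]

def top_level(icd9_code: str):
    """Map an ICD-9 code like '185' or '410.9' to its top-level chapter."""
    if icd9_code.startswith(("E", "V")):
        return "supplementary classification"
    major = int(icd9_code.split(".")[0])   # chapter is decided by the stem
    for lo, hi, name in CHAPTERS:
        if lo <= major <= hi:
            return name
    return None  # chapter omitted from this abridged table

print(top_level("185"))    # neoplasia (140-239)
print(top_level("410.9"))  # circulatory system
```

Because the label space collapses to roughly this chapter list, even noisy narratives can be classified usefully, which is what makes the approach viable for rapid cohort identification.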
Affiliation(s)
- Guhan Ram Venkataraman
- Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America
- Arturo Lopez Pineda
- Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America
- Oliver J. Bear Don't Walk IV
- Department of Biomedical Informatics, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY, United States of America
- Sandeep Ayyar
- Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America
- Rodney L. Page
- Department of Clinical Sciences, College of Veterinary Medicine and Biomedical Sciences, Colorado State University, Fort Collins, CO, United States of America
- Carlos D. Bustamante
- Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America
- Chan Zuckerberg Biohub, San Francisco, CA, United States of America
- Manuel A. Rivas
- Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, United States of America
Collapse
|
13
|
Narayanan A, Topaloglu U, Laurini JA, Diaz-Garelli F. Building Cancer Diagnosis Text to OncoTree Mapping Pipelines for Clinical Sequencing Data Integration and Curation. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:440-448. [PMID: 32477665 PMCID: PMC7233083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Precision oncology research seeks to derive knowledge from existing data. Current work seeks to integrate clinical and genomic data across cancer centers to enable impactful secondary use. However, the reliability of integrated data depends on the data curation method used and its systematicity. In practice, data integration and mapping are often done manually, even though crucial data such as oncological diagnoses (DX) show varying accuracy and specificity levels. We hypothesized that the mapping of text-form cancer DX to a standardized terminology (OncoTree) could be automated using existing methods (e.g., natural language processing [NLP] modules and application programming interfaces [APIs]). We found that our best-performing pipeline prototype was effective but constrained by the maturity of the underlying APIs (it accurately mapped 96.2% of the textual DX dataset to the NCI Thesaurus [NCIt], and 44.2% through NCIt to OncoTree). These results suggest the pipeline model could be viable for automating data curation. Such techniques may become increasingly reliable with further development.
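The two-stage structure of such a pipeline (free-text DX to NCIt, then NCIt to OncoTree) can be sketched with toy lookup tables standing in for the NLP modules and API calls; every mapping entry below is invented for illustration:

```python
# Toy stand-ins for the annotator/API stages; real pipelines would call
# terminology services. All codes and strings here are illustrative only.
dx_to_ncit = {
    "infiltrating duct carcinoma of breast": "C4017",
    "glioblastoma multiforme": "C3058",
}
ncit_to_oncotree = {
    "C3058": "GBM",  # second stage has lower coverage, as the paper reports
}

def map_dx(free_text_dx):
    """Map a free-text diagnosis to (NCIt code, OncoTree code); None marks a gap."""
    ncit = dx_to_ncit.get(free_text_dx.strip().lower())
    oncotree = ncit_to_oncotree.get(ncit) if ncit else None
    return ncit, oncotree

results = [map_dx(dx) for dx in ["Glioblastoma Multiforme",
                                 "Infiltrating duct carcinoma of breast",
                                 "unmappable free text"]]
```

The second tuple element being `None` for a mapped NCIt code mirrors the reported coverage drop between the NCIt stage (96.2%) and the OncoTree stage (44.2%).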
Affiliation(s)
- Adhithya Narayanan
- University of North Carolina at Chapel Hill, Chapel Hill, NC
- Wake Forest Baptist Medical Center, Winston Salem, NC
- Javier A Laurini
- Wake Forest Baptist Medical Center, Winston Salem, NC
- Montefiore Health System, Bronx, NY
- Franck Diaz-Garelli
- Wake Forest Baptist Medical Center, Winston Salem, NC
- University of North Carolina at Charlotte, Charlotte, NC
14
Rajendran S, Topaloglu U. Extracting Smoking Status from Electronic Health Records Using NLP and Deep Learning. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:507-516. [PMID: 32477672 PMCID: PMC7233082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Half a million people die every year from smoking-related causes across the United States. It is essential to identify individuals who are tobacco-dependent in order to implement preventive measures. In this study, we investigate the effectiveness of deep learning models in extracting the smoking status of patients from clinical progress notes. A natural language processing (NLP) pipeline was built that cleans the progress notes prior to processing by three deep neural networks: a CNN, a unidirectional LSTM, and a bidirectional LSTM. Each of these models was trained with a pre-trained or a post-trained word embedding layer. Three traditional machine learning models were also employed for comparison against the neural networks. Each model performed both binary and multi-class label classification. Our results showed that the CNN model with a pre-trained embedding layer performed best for both binary and multi-class label classification.
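The best-performing configuration here, a CNN over a word-embedding layer, boils down to a convolution-plus-max-pool step followed by a logistic read-out. A minimal numpy sketch of that step; the filter sizes, weights, and the binary smoking-status output are illustrative assumptions:

```python
import numpy as np

def conv1d_maxpool(emb, filters, bias):
    """Slide each filter over the token embeddings, apply ReLU, then max-pool.

    emb: (T, d) token embeddings; filters: (n_f, k, d); bias: (n_f,).
    Returns one feature per filter (global max pooling), as in CNN text models.
    """
    T, d = emb.shape
    n_f, k, _ = filters.shape
    feats = np.empty(n_f)
    for j in range(n_f):
        acts = [np.sum(filters[j] * emb[t:t + k]) + bias[j]
                for t in range(T - k + 1)]
        feats[j] = max(0.0, max(acts))   # ReLU + max over positions
    return feats

def predict_smoker(feats, w, b):
    """Binary smoking-status probability from the pooled features."""
    return 1.0 / (1.0 + np.exp(-(w @ feats + b)))

rng = np.random.default_rng(1)
emb = rng.normal(size=(12, 8))   # 12 tokens, 8-dim (pre-trained) embeddings
feats = conv1d_maxpool(emb, rng.normal(scale=0.3, size=(4, 3, 8)), np.zeros(4))
p_smoker = predict_smoker(feats, rng.normal(size=4), 0.0)
```

Global max pooling is what lets a filter fire on a decisive phrase ("quit smoking", "1 pack/day") regardless of where it appears in the note.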
Affiliation(s)
- Suraj Rajendran
- Wake Forest University School of Medicine, Winston Salem, NC
- Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA
- Umit Topaloglu
- Wake Forest University School of Medicine, Winston Salem, NC
15
Althubaiti S, Kafkas Ş, Abdelhakim M, Hoehndorf R. Combining lexical and context features for automatic ontology extension. J Biomed Semantics 2020; 11:1. [PMID: 31931870 PMCID: PMC6958746 DOI: 10.1186/s13326-019-0218-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2018] [Accepted: 12/24/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology development is often a manual, time-consuming, and expensive process. Automatic or semi-automatic identification of classes that can be added to an ontology can make ontology development more efficient. RESULTS We developed a method that uses machine learning and word embeddings to identify words and phrases used to refer to an ontology class in biomedical Europe PMC full-text articles. Once the labels and synonyms of a class are known, we use machine learning to identify its super-classes. For this purpose, we identify lexical term variants, use word embeddings to capture context information, rely on automated reasoning over ontologies to generate features, and use an artificial neural network as the classifier. We demonstrate the utility of our approach in identifying terms that refer to diseases in the Human Disease Ontology and in distinguishing between different types of diseases. CONCLUSIONS Our method is capable of discovering labels that refer to a class in an ontology but are not present in the ontology, and it can identify whether a class should be a subclass of some high-level ontology classes. Our approach can therefore be used for the semi-automatic extension and quality control of ontologies. The algorithm, corpora, and evaluation datasets are available at https://github.com/bio-ontology-research-group/ontology-extension.
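The intuition behind using embeddings to place a new label under a high-level class can be sketched with a nearest-centroid comparison. This is a crude stand-in for the paper's neural-network classifier, and the hand-built "embeddings" below are invented for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def suggest_superclass(term_vec, class_vecs):
    """Return the candidate high-level class whose vector is most similar
    to the new term's vector (nearest-centroid stand-in for the real model)."""
    return max(class_vecs, key=lambda name: cosine(term_vec, class_vecs[name]))

# Hand-built toy vectors; real ones would come from a word2vec-style model
infectious = np.array([1.0, 0.0, 0.0, 0.0])
genetic = np.array([0.0, 1.0, 0.0, 0.0])
new_term = infectious + 0.1 * genetic   # a new label lying near "infectious"
best = suggest_superclass(new_term, {"infectious disease": infectious,
                                     "genetic disease": genetic})
```

The actual method adds lexical variant features and ontology-reasoning features on top of this contextual signal before classifying.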
Affiliation(s)
- Sara Althubaiti
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Şenay Kafkas
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Marwa Abdelhakim
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Robert Hoehndorf
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
16
Braun IR, Lawrence-Dill CJ. Automated Methods Enable Direct Computation on Phenotypic Descriptions for Novel Candidate Gene Prediction. FRONTIERS IN PLANT SCIENCE 2020; 10:1629. [PMID: 31998331 PMCID: PMC6965352 DOI: 10.3389/fpls.2019.01629] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 11/19/2019] [Indexed: 06/01/2023]
Abstract
Natural language descriptions of plant phenotypes are a rich source of information for genetics and genomics research. We computationally translated descriptions of plant phenotypes into structured representations that can be analyzed to identify biologically meaningful associations. These representations include the entity-quality (EQ) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically rich format, as well as numerical vector representations generated using natural language processing (NLP) methods (such as the bag-of-words approach and document embedding). We compared the resulting phenotype similarity measures to those derived from manually curated data to determine the performance of each method. Computationally derived EQ and vector representations were as successful in recapitulating biological truth as representations created through manual EQ statement curation. Moreover, NLP methods for generating vector representations of phenotypes are scalable to large quantities of text because they require no human input. These results indicate that it is now possible to computationally and automatically produce and populate large-scale information resources that enable researchers to query phenotypic descriptions directly.
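The simplest of the vector representations mentioned, bag-of-words, reduces phenotype similarity to cosine similarity between word-count vectors. A minimal sketch with invented example descriptions (real pipelines would also tokenize, stem, and weight terms):

```python
from collections import Counter
import math

def bow_vector(text):
    """Lowercased bag-of-words counts for a phenotype description."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

p1 = bow_vector("dwarf plant with short internodes")
p2 = bow_vector("short plant dwarf stature")
p3 = bow_vector("yellow striped leaves")
sim_close = cosine_sim(p1, p2)   # overlapping vocabulary -> high similarity
sim_far = cosine_sim(p1, p3)     # disjoint vocabulary -> zero similarity
```

Document-embedding methods replace the sparse count vectors with dense learned vectors but keep the same similarity-based downstream analysis.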
Affiliation(s)
- Ian R. Braun
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, United States
- Interdepartmental Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
- Carolyn J. Lawrence-Dill
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, United States
- Interdepartmental Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States
- Department of Agronomy, Iowa State University, Ames, IA, United States
17
Cuzzola J, Bagheri E, Jovanovic J. UMLS to DBPedia link discovery through circular resolution. J Am Med Inform Assoc 2019; 25:819-826. [PMID: 29648604 DOI: 10.1093/jamia/ocy021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2017] [Accepted: 02/26/2018] [Indexed: 11/14/2022] Open
Abstract
Objective The goal of this work is to map Unified Medical Language System (UMLS) concepts to DBpedia resources using widely accepted ontology relations from the Simple Knowledge Organization System (skos:exactMatch, skos:closeMatch) and from the Resource Description Framework Schema (rdfs:seeAlso), as a result of which a complete mapping from UMLS (UMLS 2016AA) to DBpedia (DBpedia 2015-10) is made publicly available that includes 221 690 skos:exactMatch, 26 276 skos:closeMatch, and 6 784 322 rdfs:seeAlso mappings. Methods We propose a method called circular resolution that utilizes a combination of semantic annotators to map UMLS concepts to DBpedia resources. A set of annotators annotate definitions of UMLS concepts returning DBpedia resources while another set performs annotation on DBpedia resource abstracts returning UMLS concepts. Our pipeline aligns these 2 sets of annotations to determine appropriate mappings from UMLS to DBpedia. Results We evaluate our proposed method using structured data from the Wikidata knowledge base as the ground truth, which consists of 4899 already existing UMLS to DBpedia mappings. Our results show an 83% recall with 77% precision-at-one (P@1) in mapping UMLS concepts to DBpedia resources on this testing set. Conclusions The proposed circular resolution method is a simple yet effective technique for linking UMLS concepts to DBpedia resources. Experiments using Wikidata-based ground truth reveal a high mapping accuracy. In addition to the complete UMLS mapping downloadable in n-triple format, we provide an online browser and a RESTful service to explore the mappings.
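The circular-resolution idea is to annotate in both directions (UMLS definitions yielding DBpedia candidates, DBpedia abstracts yielding UMLS candidates) and keep pairs that agree. A toy sketch; the annotator outputs below are invented stand-ins for the real semantic annotators:

```python
# Invented annotator outputs for illustration; real systems would run
# semantic annotators over concept definitions and resource abstracts.
umls_def_to_dbpedia = {
    "C0011849": {"Diabetes_mellitus", "Insulin"},  # from the concept definition
    "C0020538": {"Hypertension"},
}
dbpedia_abs_to_umls = {
    "Diabetes_mellitus": {"C0011849"},             # from the resource abstract
    "Insulin": {"C0021641"},
    "Hypertension": {"C0020538"},
}

def circular_matches(cui):
    """DBpedia resources whose abstract annotation points back at the same CUI."""
    candidates = umls_def_to_dbpedia.get(cui, set())
    return {r for r in candidates if cui in dbpedia_abs_to_umls.get(r, set())}

m1 = circular_matches("C0011849")
m2 = circular_matches("C0020538")
```

Requiring agreement in both directions is what filters out spurious one-way annotations (here, "Insulin" for C0011849), at the cost of some recall.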
Affiliation(s)
- John Cuzzola
- Laboratory for Systems, Software and Semantics (LS3), Ryerson University, Ontario, Canada
- Ebrahim Bagheri
- Laboratory for Systems, Software and Semantics (LS3), Ryerson University, Ontario, Canada
- Jelena Jovanovic
- Faculty of Organizational Sciences (FOS), University of Belgrade, Belgrade, Serbia
18
Savova GK, Danciu I, Alamudun F, Miller T, Lin C, Bitterman DS, Tourassi G, Warner JL. Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records. Cancer Res 2019; 79:5463-5470. [PMID: 31395609 PMCID: PMC7227798 DOI: 10.1158/0008-5472.can-19-0579] [Citation(s) in RCA: 75] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2019] [Revised: 06/17/2019] [Accepted: 07/29/2019] [Indexed: 12/12/2022]
Abstract
Current models for correlating electronic medical records with -omics data largely ignore clinical text, which is an important source of phenotype information for patients with cancer. This data convergence has the potential to reveal new insights about cancer initiation, progression, metastasis, and response to treatment. Insights from this real-world data will catalyze clinical care, research, and regulatory activities. Natural language processing (NLP) methods are needed to extract these rich cancer phenotypes from clinical text. Here, we review the advances of NLP and information extraction methods relevant to oncology based on publications from PubMed as well as NLP and machine learning conference proceedings in the last 3 years. Given the interdisciplinary nature of the fields of oncology and information extraction, this analysis serves as a critical trail marker on the path to higher fidelity oncology phenotypes from real-world data.
Affiliation(s)
- Guergana K Savova
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Timothy Miller
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
- Harvard Medical School, Boston, Massachusetts
- Chen Lin
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
- Danielle S Bitterman
- Harvard Medical School, Boston, Massachusetts
- Dana Farber Cancer Institute, Boston, Massachusetts
19
Liu C, Ta CN, Rogers JR, Li Z, Lee J, Butler AM, Shang N, Kury FSP, Wang L, Shen F, Liu H, Ena L, Friedman C, Weng C. Ensembles of natural language processing systems for portable phenotyping solutions. J Biomed Inform 2019; 100:103318. [PMID: 31655273 DOI: 10.1016/j.jbi.2019.103318] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2019] [Revised: 09/15/2019] [Accepted: 10/21/2019] [Indexed: 02/04/2023]
Abstract
BACKGROUND Manually curating standardized phenotypic concepts such as Human Phenotype Ontology (HPO) terms from narrative text in electronic health records (EHRs) is time consuming and error prone. Natural language processing (NLP) techniques can facilitate automated phenotype extraction and thus improve the efficiency of curating clinical phenotypes from clinical texts. While individual NLP systems can perform well for a single cohort, an ensemble-based method might improve the portability of NLP pipelines across different cohorts. METHODS We compared four NLP systems, MetaMapLite, MedLEE, ClinPhen and cTAKES, and four ensemble techniques, including intersection, union, majority-voting and machine learning, for extracting generic phenotypic concepts. We addressed two important research questions regarding automated phenotype recognition. First, we evaluated the performance of different approaches in identifying generic phenotypic concepts. Second, we compared the performance of different methods to identify patient-specific phenotypic concepts. To better quantify the effects caused by concept granularity differences on performance, we developed a novel evaluation metric that considered concept hierarchies and frequencies. Each of the approaches was evaluated on a gold standard set of clinical documents annotated by clinical experts. One dataset containing 1,609 concepts derived from 50 clinical notes from two different institutions was used in both evaluations, and an additional dataset of 608 concepts derived from 50 case report abstracts obtained from PubMed was used for evaluation of identifying generic phenotypic concepts only. RESULTS For generic phenotypic concept recognition, the top three performers in the NYP/CUIMC dataset are union ensemble (F1, 0.634), training-based ensemble (F1, 0.632), and majority vote-based ensemble (F1, 0.622).
In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.642), cTAKES (F1, 0.615), and MedLEE (F1, 0.559). In the PubMed dataset, the top three are majority vote-based ensemble (F1, 0.719), training-based ensemble (F1, 0.696), and MetaMapLite (F1, 0.694). For identifying patient-specific phenotypes, the top three performers in the NYP/CUIMC dataset are majority vote-based ensemble (F1, 0.610), MedLEE (F1, 0.609), and training-based ensemble (F1, 0.585). In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.604), cTAKES (F1, 0.531), and MedLEE (F1, 0.527). CONCLUSIONS Our study demonstrates that ensembles of NLP systems can improve both generic phenotypic concept recognition and patient-specific phenotypic concept identification over individual systems. Each individual NLP system performed best when applied to the dataset it was primarily designed for; however, combining multiple NLP systems into an ensemble can generally improve performance. Specifically, an ensemble can increase the reproducibility of results across different cohorts and tasks, and thus provides a more portable phenotyping solution than individual NLP systems.
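The majority-vote ensemble, the most consistent performer above, reduces to keeping any concept extracted by at least half of the systems. A minimal sketch; the system outputs and HPO IDs below are invented for illustration:

```python
from collections import Counter

def majority_vote(system_outputs):
    """Keep a concept if at least half of the NLP systems extracted it.

    system_outputs: list of sets of concept IDs, one set per system.
    """
    counts = Counter(c for output in system_outputs for c in output)
    threshold = len(system_outputs) / 2
    return {c for c, n in counts.items() if n >= threshold}

# Four hypothetical systems (e.g. MetaMapLite, MedLEE, ClinPhen, cTAKES)
outputs = [
    {"HP:0001250", "HP:0002315"},
    {"HP:0001250", "HP:0002315", "HP:0000924"},
    {"HP:0001250"},
    {"HP:0002315", "HP:0012378"},
]
kept = majority_vote(outputs)
union_all = set.union(*outputs)   # the union ensemble keeps everything
```

The union ensemble maximizes recall at the cost of precision; majority voting trades a little recall for agreement across systems, which is what drives its portability across cohorts.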
Affiliation(s)
- Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Casey N Ta
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- James R Rogers
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Ziran Li
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Junghwan Lee
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Alex M Butler
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Ning Shang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Liwei Wang
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55901, USA
- Feichen Shen
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55901, USA
- Hongfang Liu
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55901, USA
- Lyudmila Ena
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Carol Friedman
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
20
Diaz-Garelli JF, Strowd R, Ahmed T, Wells BJ, Merrill R, Laurini J, Pasche B, Topaloglu U. A tale of three subspecialties: Diagnosis recording patterns are internally consistent but Specialty-Dependent. JAMIA Open 2019; 2:369-377. [PMID: 31984369 PMCID: PMC6951969 DOI: 10.1093/jamiaopen/ooz020] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2018] [Revised: 04/22/2019] [Accepted: 05/27/2019] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Structured diagnoses (DX) are crucial for secondary use of electronic health record (EHR) data. However, they are often suboptimally recorded. Our previous work showed initial evidence of variable DX recording patterns in oncology charts even after biopsy records are available. OBJECTIVE We verified this finding's internal and external validity. We hypothesized that this recording pattern would be preserved in a larger cohort of patients for the same disease. We also hypothesized that this effect would vary across subspecialties. METHODS We extracted DX data from EHRs of patients treated for brain, lung, and pancreatic neoplasms, identified through clinician-led chart reviews. We used statistical methods (i.e., binomial and mixed model regressions) to test our hypotheses. RESULTS We found variable recording patterns in brain neoplasm DX (i.e., a larger number of distinct DX (OR = 2.2, P < 0.0001), higher descriptive specificity scores (OR = 1.4, P < 0.0001), and much higher entropy after the BX (OR = 3.8, P = 0.004 and OR = 8.0, P < 0.0001)), confirming our initial findings. We also found strikingly different patterns for lung and pancreas DX: both showed much lower DX sequence entropy after the BX (OR = 0.198, P = 0.015 and OR = 0.099, P = 0.015, respectively, compared to OR = 3.8, P = 0.004). We also found statistically significant differences between the brain dataset and both the lung (P < 0.0001) and pancreas (P = 0.009) datasets. CONCLUSION Our results suggest that disease-specific DX entry patterns exist and are established differently by clinical subspecialty. These differences should be accounted for during clinical data reuse and data quality assessments, but also during EHR entry system design, to maximize the likelihood of accurate, precise, and consistent data entry.
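The entropy measure used to characterize DX recording variability is standard Shannon entropy over the distribution of codes in a patient's DX sequence. A minimal sketch with invented ICD-9-style codes (the paper's exact entropy definition and regression setup are not reproduced here):

```python
from collections import Counter
import math

def dx_entropy(dx_sequence):
    """Shannon entropy (bits) of the distribution of DX codes in a sequence.

    Higher entropy = more varied diagnosis recording, as in the paper's
    before/after-biopsy comparisons.
    """
    counts = Counter(dx_sequence)
    total = len(dx_sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

consistent = dx_entropy(["191.9"] * 6)                     # one code repeated
varied = dx_entropy(["191.9", "239.6", "348.5", "784.0"])  # four distinct codes
```

A chart that keeps switching among codes after the biopsy yields high entropy; a chart that settles on one code yields zero.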
Affiliation(s)
- Roy Strowd
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Tamjeed Ahmed
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Brian J Wells
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Rebecca Merrill
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Javier Laurini
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Boris Pasche
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
- Umit Topaloglu
- Wake Forest Baptist Medical Center, Winston Salem, North Carolina, USA
21
Ling AY, Kurian AW, Caswell-Jin JL, Sledge GW, Shah NH, Tamang SR. Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data. JAMIA Open 2019; 2:528-537. [PMID: 32025650 PMCID: PMC6994019 DOI: 10.1093/jamiaopen/ooz040] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2019] [Revised: 07/13/2019] [Accepted: 08/13/2019] [Indexed: 02/04/2023] Open
Abstract
Objectives Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients. Results There were 11 459 breast cancer patients in total and the median follow-up time was 96.3 months. We identified 1886 MBC patients, 512 (27.1%) of whom were de novo MBC patients and 1374 (72.9%) were recurrent MBC patients. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870. Discussion and Conclusion To enable population-based research on MBC, we developed a framework for retrospective case detection combining EMR and CCR data. Our classifier achieved good AUC, sensitivity, and specificity without expert-labeled examples.
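The classifier at the core of this framework is regularized logistic regression. A self-contained numpy sketch of L2-regularized logistic regression fit by gradient descent; the single toy feature (a count of metastasis-related mentions) and all numbers are illustrative assumptions, not the paper's feature set:

```python
import numpy as np

def train_logreg(X, y, lam=0.1, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression by gradient descent.

    A generic stand-in for the paper's classifier; real features would be
    derived from EMR notes and structured registry fields.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n + lam * w   # L2 penalty shrinks the weights
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable data: feature = count of metastasis-related mentions per chart
X = np.array([[0.0], [0.0], [1.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logreg(X, y)
probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
preds = (probs > 0.5).astype(int)
```

The regularization term is what keeps the weights finite on separable data and controls overfitting when the real feature space (note-derived terms) is large.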
Affiliation(s)
- Albee Y Ling
- Biomedical Informatics Training Program, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Allison W Kurian
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA
- George W Sledge
- Department of Medicine, Stanford University School of Medicine, Stanford, CA
- Nigam H Shah
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Center for Biomedical Informatics Research, Stanford University, CA
- Suzanne R Tamang
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Center for Population Health Sciences, Stanford University, CA
22
Tsueng G, Nanis M, Fouquier JT, Mayers M, Good BM, Su AI. Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts. Bioinformatics 2019; 36:1226-1233. [PMID: 31504205 PMCID: PMC8104067 DOI: 10.1093/bioinformatics/btz678] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2019] [Revised: 08/05/2019] [Accepted: 08/29/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. To mine valuable inferences from the large volume of literature, many researchers use information extraction algorithms to harvest information in biomedical texts. Information extraction is usually accomplished via a combination of manual expert curation and computational methods. Advances in computational methods usually depend on the time-consuming generation of gold standards by a limited number of expert curators. Citizen science is public participation in scientific research. We previously found that citizen scientists are willing and capable of performing named entity recognition of disease mentions in biomedical abstracts, but did not know if this was true with relationship extraction (RE). RESULTS In this article, we introduce the Relationship Extraction Module of the web-based application Mark2Cure (M2C) and demonstrate that citizen scientists can perform RE. We confirm the importance of accurate named entity recognition on user performance of RE and identify design issues that impacted data quality. We find that the data generated by citizen scientists can be used to identify relationship types not currently available in the M2C Relationship Extraction Module. We compare the citizen science-generated data with algorithm-mined data and identify ways in which the two approaches may complement one another. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration and natural language processing. AVAILABILITY AND IMPLEMENTATION Mark2Cure platform: https://mark2cure.org; Mark2Cure source code: https://github.com/sulab/mark2cure; and data and analysis code for this article: https://github.com/gtsueng/M2C_rel_nb. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Max Nanis
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Jennifer T Fouquier
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Michael Mayers
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Benjamin M Good
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Andrew I Su
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
23
Trivedi G, Hong C, Dadashzadeh ER, Handzel RM, Hochheiser H, Visweswaran S. Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods. Int J Med Inform 2019; 129:81-87. [PMID: 31445293 PMCID: PMC6717529 DOI: 10.1016/j.ijmedinf.2019.05.021] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2018] [Revised: 03/07/2019] [Accepted: 05/21/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Radiologic imaging of trauma patients often uncovers findings that are unrelated to the trauma. These are termed as incidental findings and identifying them in radiology examination reports is necessary for appropriate follow-up. We developed and evaluated an automated pipeline to identify incidental findings at sentence and section levels in radiology reports of trauma patients. METHODS We created an annotated dataset of 4,181 reports and investigated automated feature representations including traditional word and clinical concept (such as SNOMED CT) representations, as well as word and concept embeddings. We evaluated these representations by using them with traditional classifiers such as logistic regression and with deep learning methods such as convolutional neural networks (CNNs). RESULTS The best performance was observed using word embeddings with CNNs with F1 scores of 0.66 and 0.52 at section and sentence levels respectively. The F1 score was statistically significantly higher for sections compared to sentences (Wilcoxon; Z < 0.001, p < 0.05). Compared to using words alone, the addition of SNOMED CT concepts did not improve performance. At the sentence level, the F1 score improved significantly from 0.46 to 0.52 when using pre-trained embeddings (Wilcoxon; Z < 0.001, p < 0.05). CONCLUSION The results show that the best performance was achieved by using embeddings with CNNs at both sentence and section levels. This provides evidence that such a pipeline is capable of accurately identifying incidental findings in radiology reports in an automated manner.
Collapse
Affiliation(s)
- Gaurav Trivedi
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States.
- Charmgil Hong
- School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States.
- Esmaeel R Dadashzadeh
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States; Department of Surgery, University of Pittsburgh, Pittsburgh, PA, United States.
- Robert M Handzel
- Department of Surgery, University of Pittsburgh, Pittsburgh, PA, United States.
- Harry Hochheiser
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States.
- Shyam Visweswaran
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States.
24
Furrer L, Jancso A, Colic N, Rinaldi F. OGER++: hybrid multi-type entity recognition. J Cheminform 2019; 11:7. [PMID: 30666476 PMCID: PMC6689863 DOI: 10.1186/s13321-018-0326-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2018] [Accepted: 12/27/2018] [Indexed: 12/14/2022] Open
Abstract
Background We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step. Results We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively. Conclusions Combining knowledge-based and data-driven components yields a system with competitive performance in biomedical text mining.
Affiliation(s)
- Lenz Furrer
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Anna Jancso
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Nicola Colic
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland; Fondazione Bruno Kessler, Via Sommarive, 18, 38123, Trento, Italy.
25
Li M, He Q, Ma J, He F, Zhu Y, Chang C, Chen T. PPICurator: A Tool for Extracting Comprehensive Protein-Protein Interaction Information. Proteomics 2019; 19:e1800291. [PMID: 30521143 DOI: 10.1002/pmic.201800291] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2018] [Revised: 11/12/2018] [Indexed: 11/07/2022]
Abstract
Protein-protein interaction (PPI) extraction through biological literature curation is widely employed for proteome analysis. There is a strong need for a tool that can assist researchers in extracting comprehensive PPI information through literature curation, which is critical in protein research, for example, in the construction of protein interaction networks, the identification of protein signaling pathways, and the discovery of meaningful protein interactions. However, most current tools can only extract PPI relations. None of them are capable of extracting other important PPI information, such as interaction directions, effects, and functional annotations. To address these issues, this paper proposes PPICurator, a novel tool for extracting comprehensive PPI information with a variety of logic and syntax features based on a new support vector machine classifier. PPICurator provides a friendly web-based user interface. It is a platform that automates the extraction of comprehensive PPI information from the literature, including PPI relations as well as their confidence scores, interaction directions, effects, and functional annotations. Thus, PPICurator is more comprehensive than state-of-the-art tools. Moreover, it outperforms state-of-the-art tools in the accuracy of PPI relation extraction as measured by F-score and recall on widely used open datasets. PPICurator is available at https://ppicurator.hupo.org.cn.
Affiliation(s)
- Mansheng Li
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Qiang He
- School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Victoria, 3122, Australia
- Jie Ma
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Fuchu He
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Yunping Zhu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Cheng Chang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
- Tao Chen
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, 102206, P. R. China
26
Guan M, Cho S, Petro R, Zhang W, Pasche B, Topaloglu U. Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes. JAMIA Open 2019; 2:139-149. [PMID: 30944913 PMCID: PMC6435007 DOI: 10.1093/jamiaopen/ooy061] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2018] [Revised: 11/26/2018] [Accepted: 12/21/2018] [Indexed: 01/16/2023] Open
Abstract
Objectives Natural language processing (NLP) and machine learning approaches were used to build classifiers to identify genomic-related treatment changes in the free-text visit progress notes of cancer patients. Methods We obtained 5889 deidentified progress reports (2439 words on average) for 755 cancer patients who underwent clinical next-generation sequencing (NGS) testing at Wake Forest Baptist Comprehensive Cancer Center for our data analyses. An NLP system was implemented to process the free-text data and extract NGS-related information. Three types of recurrent neural networks (RNNs), namely gated recurrent units, long short-term memory (LSTM), and bidirectional LSTM (LSTM_Bi), were applied to classify documents into the treatment-change and no-treatment-change groups. Further, we compared the performance of the RNNs to that of 5 machine learning algorithms: naive Bayes, k-nearest neighbors, support vector machines, random forests, and logistic regression. Results Our results suggested that, overall, RNNs outperformed the traditional machine learning algorithms, and LSTM_Bi showed the best performance among the RNNs in terms of accuracy, precision, recall, and F1 score. In addition, pretrained word embeddings improved the accuracy of the LSTM by 3.4% and reduced the training time by more than 60%. Discussion and Conclusion NLP- and RNN-based text mining solutions have demonstrated advantages in information retrieval and document classification tasks for unstructured clinical progress notes.
Affiliation(s)
- Meijian Guan
- Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina, USA; Wake Forest Baptist Comprehensive Cancer Center, Winston-Salem, North Carolina, USA
- Samuel Cho
- Department of Computer Science, Wake Forest University, Winston-Salem, North Carolina, USA; Department of Physics, Wake Forest University, Winston-Salem, North Carolina, USA
- Robin Petro
- Wake Forest Baptist Comprehensive Cancer Center, Winston-Salem, North Carolina, USA
- Wei Zhang
- Wake Forest Baptist Comprehensive Cancer Center, Winston-Salem, North Carolina, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
- Boris Pasche
- Wake Forest Baptist Comprehensive Cancer Center, Winston-Salem, North Carolina, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
- Umit Topaloglu
- Wake Forest Baptist Comprehensive Cancer Center, Winston-Salem, North Carolina, USA; Department of Cancer Biology, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
27
Torii M, Yang EW, Doan S. A Preliminary Study of Clinical Concept Detection Using Syntactic Relations. AMIA Annu Symp Proc 2018; 2018:1028-1035. [PMID: 30815146 PMCID: PMC6371372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Concept detection is an integral step in natural language processing (NLP) applications in the clinical domain. Clinical concepts are detailed (e.g., "pain in left/right upper/lower arm/leg") and expressed in diverse phrase types (e.g., noun, verb, adjective, or prepositional phrases). There are rich terminological resources in the clinical domain that include many concept synonyms. Even with these resources, concept detection remains challenging due to discontinuous and/or permuted phrase occurrences. To overcome this challenge, we investigated an approach that exploits syntactic information. Syntactic patterns of concept phrases were mined from continuous, non-permuted forms of synonyms, and these patterns were used to detect discontinuous and/or permuted concept phrases. Experiments on 790 de-identified clinical notes showed that the proposed approach can potentially boost the recall of concept detection. Meanwhile, challenges and limitations were also noted. In this paper, we report and discuss our preliminary analysis and findings.
Affiliation(s)
- Manabu Torii
- Medical Informatics, Kaiser Permanente Southern California, San Diego, CA
- Elly W Yang
- Medical Informatics, Kaiser Permanente Southern California, San Diego, CA
- Son Doan
- Medical Informatics, Kaiser Permanente Southern California, San Diego, CA
28
Tchechmedjiev A, Abdaoui A, Emonet V, Zevio S, Jonquet C. SIFR annotator: ontology-based semantic annotation of French biomedical text and clinical notes. BMC Bioinformatics 2018; 19:405. [PMID: 30400805 PMCID: PMC6218966 DOI: 10.1186/s12859-018-2429-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2018] [Accepted: 10/10/2018] [Indexed: 12/01/2022] Open
Abstract
Background Despite the wide adoption of English in science, a significant amount of biomedical data are produced in other languages, such as French. Yet the majority of natural language processing and semantic tools, as well as domain terminologies and ontologies, are only available in English and cannot be readily applied to other languages due to fundamental linguistic differences. However, semantic resources are required to design semantic indexes and transform biomedical (text) data into knowledge for better information mining and retrieval. Results We present the SIFR Annotator (http://bioportal.lirmm.fr/annotator), a publicly accessible ontology-based annotation web service to process biomedical text data in French. The service, developed during the Semantic Indexing of French Biomedical Data Resources (2013–2019) project, is included in the SIFR BioPortal, an open platform to host French biomedical ontologies and terminologies based on the technology developed by the US National Center for Biomedical Ontology. The portal facilitates the use and fostering of ontologies by offering a set of services (search, mappings, metadata, versioning, visualization, recommendation), including for annotation purposes. We introduce the adaptations and improvements made in applying the technology to French, as well as a number of language-independent additional features implemented by means of a proxy architecture, in particular annotation scoring and clinical context detection. We evaluate the performance of the SIFR Annotator on different biomedical data, using available French corpora, Quaero (titles from French MEDLINE abstracts and EMEA drug labels) and CépiDC (ICD-10 coding of death certificates), and discuss our results with respect to the CLEF eHealth information extraction tasks.
Conclusions We show that the web service performs comparably to other knowledge-based annotation approaches in recognizing entities in biomedical text and reaches state-of-the-art levels in clinical context detection (negation, experiencer, temporality). Additionally, the SIFR Annotator is the first openly web-accessible tool to annotate and contextualize French biomedical text with ontology concepts, leveraging a dictionary currently made of 28 terminologies and ontologies and 333 K concepts. The code is openly available, and we also provide a Docker packaging for easy local deployment to process sensitive (e.g., clinical) data in-house (https://github.com/sifrproject).
Affiliation(s)
- Andon Tchechmedjiev
- Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, CNRS, 161, rue Ada, 34095, Montpellier cedex 5, France; LGI2P, IMT Mines Ales, Univ Montpellier, Alès, France.
- Amine Abdaoui
- Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, CNRS, 161, rue Ada, 34095, Montpellier cedex 5, France
- Vincent Emonet
- Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, CNRS, 161, rue Ada, 34095, Montpellier cedex 5, France
- Stella Zevio
- Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, CNRS, 161, rue Ada, 34095, Montpellier cedex 5, France
- Clement Jonquet
- Laboratory of Informatics, Robotics and Microelectronics of Montpellier (LIRMM), University of Montpellier, CNRS, 161, rue Ada, 34095, Montpellier cedex 5, France; Center for Biomedical Informatics Research (BMIR), Stanford University, 1265 Welch Rd, Stanford, CA, 94305, USA
29
Varghese J, Sandmann S, Dugas M. Web-Based Information Infrastructure Increases the Interrater Reliability of Medical Coders: Quasi-Experimental Study. J Med Internet Res 2018; 20:e274. [PMID: 30322834 PMCID: PMC6231825 DOI: 10.2196/jmir.9644] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2017] [Revised: 05/03/2018] [Accepted: 06/28/2018] [Indexed: 01/05/2023] Open
Abstract
Background Medical coding is essential for standardized communication and integration of clinical data. The Unified Medical Language System by the National Library of Medicine is the largest clinical terminology system for medical coders and natural language processing tools. However, the abundance of ambiguous codes leads to low rates of uniform coding among different coders. Objective The objective of our study was to measure uniform coding among different medical experts in terms of interrater reliability and to analyze the effect on interrater reliability of using an expert- and Web-based code suggestion system. Methods We conducted a quasi-experimental study in which 6 medical experts coded 602 medical items from structured quality assurance forms or free-text eligibility criteria of 20 different clinical trials. The medical item content was selected on the basis of mortality-leading diseases according to World Health Organization data. The intervention comprised using a semiautomatic code suggestion tool that is linked to a European information infrastructure providing a large medical text corpus of >300,000 medical form items with expert-assigned semantic codes. Krippendorff alpha (Kalpha) with bootstrap analysis was used for the interrater reliability analysis, and coding times were measured before and after the intervention. Results The intervention improved interrater reliability in structured quality assurance form items (from Kalpha=0.50, 95% CI 0.43-0.57 to Kalpha=0.62, 95% CI 0.55-0.69) and free-text eligibility criteria (from Kalpha=0.19, 95% CI 0.14-0.24 to Kalpha=0.43, 95% CI 0.37-0.50) while preserving or slightly reducing the mean coding time per item for all 6 coders. Regardless of the intervention, precoordination and structured items were associated with significantly higher interrater reliability, but the proportion of items that were precoordinated significantly increased after the intervention (eligibility criteria: OR 4.92, 95% CI 2.78-8.72; quality assurance: OR 1.96, 95% CI 1.19-3.25). Conclusions The Web-based code suggestion mechanism improved interrater reliability toward moderate or even substantial intercoder agreement. Precoordination and the use of structured versus free-text data elements are key drivers of higher interrater reliability.
Affiliation(s)
- Julian Varghese
- Institute of Medical Informatics, University of Münster, Münster, Germany
- Sarah Sandmann
- Institute of Medical Informatics, University of Münster, Münster, Germany
- Martin Dugas
- Institute of Medical Informatics, European Research Center for Information Systems, Münster, Germany
30
Mura C, Draizen EJ, Bourne PE. Structural biology meets data science: does anything change? Curr Opin Struct Biol 2018; 52:95-102. [PMID: 30267935 DOI: 10.1016/j.sbi.2018.09.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 08/31/2018] [Accepted: 09/07/2018] [Indexed: 01/22/2023]
Abstract
Data science has emerged from the proliferation of digital data, coupled with advances in algorithms, software and hardware (e.g., GPU computing). Innovations in structural biology have been driven by similar factors, spurring us to ask: can these two fields impact one another in deep and hitherto unforeseen ways? We posit that the answer is yes. New biological knowledge lies in the relationships between sequence, structure, function and disease, all of which play out on the stage of evolution, and data science enables us to elucidate these relationships at scale. Here, we consider the above question from the five key pillars of data science: acquisition, engineering, analytics, visualization and policy, with an emphasis on machine learning as the premier analytics approach.
Affiliation(s)
- Cameron Mura
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Eli J Draizen
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Philip E Bourne
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA; Data Science Institute, University of Virginia, Charlottesville, VA 22904, USA.
31
Campos L, Pedro V, Couto F. Impact of translation on named-entity recognition in radiology texts. Database (Oxford) 2018; 2017:4097790. [PMID: 29220455 PMCID: PMC5737072 DOI: 10.1093/database/bax064] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/30/2017] [Accepted: 08/03/2017] [Indexed: 11/17/2022]
Abstract
Radiology reports describe the results of radiography procedures and have the potential to be a useful source of information that can benefit health care systems around the world. One way to automatically extract information from the reports is by using text mining tools. The problem is that these tools are mostly developed for English, while reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of radiology information between different communities. This work explores the solution of translating the reports into English before applying the text mining tools, probing the question of which translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to radiology and a number of alternative translations (human, automatic, and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD, we studied which kind of automatic or semi-automatic translation approach is more effective for the named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar to this standard the terms extracted using other translations were. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results, we also performed a qualitative analysis of the types of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD
Affiliation(s)
- Luís Campos
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Vasco Pedro
- Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
- Francisco Couto
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
32
Diaz-Garelli JF, Wells BJ, Yelton C, Strowd R, Topaloglu U. Biopsy Records Do Not Reduce Diagnosis Variability in Cancer Patient EHRs: Are We More Uncertain After Knowing? AMIA Jt Summits Transl Sci Proc 2018; 2017:72-80. [PMID: 29888044 PMCID: PMC5961789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Diagnostic codes are crucial for analyses of electronic health record (EHR) data but their accuracy and precision are often lacking. Although providers enter precise diagnoses into progress notes, billing standards may limit the particularity of a diagnostic code. Variability also arises from the creation of multiple descriptions for a particular diagnostic code. We hypothesized that the variability of diagnostic codes would be greater before surgical pathology results were recorded in the medical record. A well annotated cohort of patients with brain neoplasms was studied. After diagnostic pathology reporting, the odds of more distinct diagnostic descriptions were 2.30 times higher (p=0.00358), entropy in diagnostic sequences was 2.26 times higher (p=0.0259) and entropy in diagnostic precision scores was 15.5 times higher (p=0.0324). Although diagnostic codes became more distinct on average after diagnostic pathology reporting, there was a paradoxical increase in the variability of the codes selected. Researchers must be aware of the inconsistencies and variability in particularity in structured diagnostic coding despite the presence of a definitive diagnosis.
Affiliation(s)
- Brian J Wells
- Wake Forest Baptist Medical Center, Winston-Salem, NC
- Caleb Yelton
- Wake Forest Baptist Medical Center, Winston-Salem, NC
- Roy Strowd
- Wake Forest Baptist Medical Center, Winston-Salem, NC
33
GNOMICS: A one-stop shop for biomedical and genomic data. AMIA Jt Summits Transl Sci Proc 2018; 2017:118-123. [PMID: 29888054 PMCID: PMC5961829] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/03/2022]
Abstract
The World Wide Web is an indispensable tool for biomedical researchers who are striving to understand the molecular basis of phenotype. However, it presents challenges in the form of a proliferation of data resources, with heterogeneity ranging from their content to their functionality to their interfaces. This often frustrates researchers, who must visit multiple sites, become familiar with their interfaces, and learn how to use them to extract knowledge. Even then, researchers may never feel sure that they have tracked down all the needed information. We envision addressing this challenge with GNOMICS (Genomic Nomenclature Omnibus and Multifaceted Informatics and Computational Suite), a suite with both a programmatic interface and a GUI. GNOMICS allows for extensible biomedical functionality, including identifier conversion, pathway enrichment, sequence alignment, and reference gathering, among others. It combines usage of other biological and chemical database application programming interfaces (APIs) to deliver uniform data that can be further manipulated and parsed.
34
Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc 2018; 24:841-844. [PMID: 28130331 DOI: 10.1093/jamia/ocw177] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2016] [Accepted: 12/09/2016] [Indexed: 11/13/2022] Open
Abstract
MetaMap is a widely used named entity recognition tool that identifies concepts from the Unified Medical Language System Metathesaurus in text. This study presents MetaMap Lite, an implementation of some of the basic MetaMap functions in Java. On several collections of biomedical literature and clinical text, MetaMap Lite demonstrated real-time speed and precision, recall, and F1 scores comparable to or exceeding those of MetaMap and other popular biomedical text processing tools, clinical Text Analysis and Knowledge Extraction System (cTAKES) and DNorm.
Affiliation(s)
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Willie J Rogers
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Alan R Aronson
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
35
Marlenga B, Berg RL, Pickett W. National Public Health Data Systems in the United States: Applications to Child Agricultural Injury Surveillance. J Rural Health 2018; 34:314-321. [DOI: 10.1111/jrh.12292] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2017] [Revised: 11/30/2017] [Accepted: 12/08/2017] [Indexed: 11/26/2022]
Affiliation(s)
- Barbara Marlenga
- National Children's Center for Rural and Agricultural Health and Safety, National Farm Medicine Center, Marshfield Clinic Research Institute, Marshfield, Wisconsin
- Richard L. Berg
- Biomedical Informatics Research Center, Marshfield Clinic Research Institute, Marshfield, Wisconsin
- William Pickett
- Department of Public Health Sciences, Queen's University, Kingston, Ontario, Canada
36
Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, Liu S, Zeng Y, Mehrabi S, Sohn S, Liu H. Clinical information extraction applications: A literature review. J Biomed Inform 2018; 77:34-49. [PMID: 29162496 PMCID: PMC5771858 DOI: 10.1016/j.jbi.2017.11.011] [Citation(s) in RCA: 316] [Impact Index Per Article: 52.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2017] [Revised: 11/01/2017] [Accepted: 11/17/2017] [Indexed: 12/24/2022]
Abstract
BACKGROUND With the rapid adoption of electronic health records (EHRs), it is desirable to harvest information and knowledge from EHRs to support automated systems at the point of care and to enable secondary use of EHRs for clinical and translational research. One critical component that facilitates the secondary use of EHR data is the information extraction (IE) task, which automatically extracts and encodes clinical information from text. OBJECTIVES We present a review of recently published research on clinical information extraction (IE) applications. METHODS A literature search was conducted for articles published from January 2009 to September 2016 in Ovid MEDLINE In-Process & Other Non-Indexed Citations, Ovid MEDLINE, Ovid EMBASE, Scopus, Web of Science, and the ACM Digital Library. RESULTS A total of 1917 publications were identified for title and abstract screening. Of these, 263 articles were selected and are discussed in this review in terms of publication venues and data sources, clinical IE tools, methods, and applications in the areas of disease- and drug-related studies and clinical workflow optimization. CONCLUSIONS Clinical IE has been used for a wide range of applications; however, there is a considerable gap between clinical studies using EHR data and studies using clinical IE. This review enabled us to gain a more concrete understanding of the gap and to propose potential solutions to bridge it.
Affiliation(s)
- Yanshan Wang
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Liwei Wang
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Majid Rastegar-Mojarad
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Sungrim Moon
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Feichen Shen
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Naveed Afzal
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Sijia Liu
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Yuqun Zeng
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Saeed Mehrabi
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Sunghwan Sohn
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
- Hongfang Liu
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States.
Collapse
37
Bada M, Vasilevsky N, Baumgartner WA, Haendel M, Hunter LE. Gold-standard ontology-based anatomical annotation in the CRAFT Corpus. Database (Oxford) 2017; 2017:4780291. [PMID: 31725864] [PMCID: PMC7243923] [DOI: 10.1093/database/bax087]
Abstract
Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. 
Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml
Affiliation(s)
- Michael Bada, William A Baumgartner, Lawrence E Hunter
- School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus, 12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA
- Nicole Vasilevsky, Melissa Haendel
- Ontology Development Group, Library, Oregon Health & Science University, 318 SW Sam Jackson Park Road, Portland, OR 97239, USA
38
Xie F, Lee J, Munoz-Plaza CE, Hahn EE, Chen W. Application of Text Information Extraction System for Real-Time Cancer Case Identification in an Integrated Healthcare Organization. J Pathol Inform 2017; 8:48. [PMID: 29416911] [PMCID: PMC5760847] [DOI: 10.4103/jpi.jpi_55_17]
Abstract
Background: Surgical pathology reports (SPRs) contain rich clinical diagnosis information. The Text Information Extraction System (TIES) is an end-to-end application that leverages natural language processing technologies and focuses on the processing of pathology and/or radiology reports. Methods: We deployed the TIES system at Kaiser Permanente Southern California and integrated SPRs into it on a daily basis. Breast cancer cases diagnosed in December 2013 from the Cancer Registry (CANREG) were used to validate the performance of the TIES system. National Cancer Institute Metathesaurus (NCIM) concept terms and codes describing breast cancer were identified through the Unified Medical Language System Terminology Service (UTS) application. The identified NCIM codes were used to search for the coded SPRs directly in the back-end datastore. The identified cases were then compared with the breast cancer patients pulled from CANREG. Results: A total of 437 breast cancer concept terms and 14 combinations of "breast" and "cancer" terms were identified from the UTS application. A total of 249 breast cancer cases diagnosed in December 2013 were pulled from CANREG. Of these 249 cases, 241 were successfully identified by the TIES system from a total of 457 reports. The TIES system also identified an additional 277 cases that were not part of the validation sample. Of these 277 cases, 11% were determined after manual examination to be highly likely cases, and 86% were in CANREG but had been diagnosed in months other than December 2013. Conclusions: The study demonstrated that the TIES system can effectively identify potential breast cancer cases in our care setting. Identified potential cases can easily be confirmed by reviewing the corresponding annotated reports through the front-end visualization interface. The TIES system is a useful tool for identifying various potential cancer cases in a timely manner and on a regular basis in support of clinical research studies.
Affiliation(s)
- Fagen Xie, Janet Lee, Corrine E Munoz-Plaza, Erin E Hahn, Wansu Chen
- Department of Research and Evaluation, Kaiser Permanente Southern California Medical Group, Pasadena, CA, USA
39
Boyce RD, Jao J, Miller T, Kane-Gill SL. Automated Screening of Emergency Department Notes for Drug-Associated Bleeding Adverse Events Occurring in Older Adults. Appl Clin Inform 2017; 8:1022-1030. [PMID: 29241242] [DOI: 10.4338/aci-2017-02-ra-0036]
Abstract
Objective To demonstrate the value of text mining for automatically identifying suspected bleeding adverse drug events (ADEs) in the emergency department (ED).
Methods A corpus of ED admission notes was manually annotated for bleeding ADEs. The notes came from patients ≥ 65 years of age who had an ICD-9 code for bleeding, a hemoglobin value ≤ 8 g/dL, or a transfusion of > 2 units of packed red blood cells. This training corpus was used to develop bleeding ADE algorithms using Random Forest and Classification and Regression Tree (CART) methods. A completely separate set of notes was annotated and used to test the classification performance of the final models using the area under the ROC curve (AUROC).
Results The best performing CART resulted in an AUROC on the training set of 0.882. The model's AUROC on the test set was 0.827. At a sensitivity of 0.679, the model had a specificity of 0.908 and a positive predictive value (PPV) of 0.814. It had a relatively simple and intuitive structure consisting of 13 decision nodes and 14 leaf nodes. Decision path probabilities ranged from 0.041 to 1.0. The AUROC for the best performing Random Forest method on the training set was 0.917. On the test set, the model's AUROC was 0.859. At a sensitivity of 0.274, the model had a specificity of 0.986 and a PPV of 0.92.
Conclusion Both models accurately identify bleeding ADEs using the presence or absence of certain clinical concepts in ED admission notes for older adult patients. The CART model is particularly noteworthy because it does not require significant technical overhead to implement. Future work should seek to replicate the results on a larger test set pulled from another institution.
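The operating-point figures reported above (sensitivity, specificity, PPV) all follow from a single confusion matrix. A minimal sketch of that arithmetic; the counts below are invented for illustration and are not the study's data:

```python
# Illustrative only: sensitivity, specificity, and PPV from a generic
# confusion matrix for a note-level ADE classifier. The counts passed in
# below are made up for the example, not taken from the study.
def operating_point(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)   # fraction of true bleeding-ADE notes caught
    specificity = tn / (tn + fp)   # fraction of non-cases correctly cleared
    ppv = tp / (tp + fp)           # positive predictive value
    return sensitivity, specificity, ppv

sens, spec, ppv = operating_point(tp=19, fp=4, tn=90, fn=9)
print(round(sens, 3), round(spec, 3), round(ppv, 3))  # → 0.679 0.957 0.826
```

Note how a threshold that trades sensitivity for specificity (as in the Random Forest operating point above) moves all three quantities together, which is why the paper reports them jointly rather than in isolation.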
Affiliation(s)
- Richard D Boyce, Jeremy Jao
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
- Taylor Miller
- Department of Pharmacy, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania, United States
- Sandra L Kane-Gill
- School of Pharmacy, University of Pittsburgh, Pittsburgh, Pennsylvania, United States
40
Basaldella M, Furrer L, Tasso C, Rinaldi F. Entity recognition in the biomedical domain using a hybrid approach. J Biomed Semantics 2017; 8:51. [PMID: 29122011] [PMCID: PMC5679148] [DOI: 10.1186/s13326-017-0157-6]
Abstract
BACKGROUND This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles. METHOD The approach uses a two-stage pipeline, combining a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which has a bias towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine learning algorithm to select the relevant entities only. For this step, we compare two different supervised machine-learning algorithms: Conditional Random Fields and Neural Networks. RESULTS In an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task. CONCLUSION These results are to our knowledge the best reported so far in this particular task.
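The high-recall first stage of such a pipeline can be pictured as a greedy longest-match dictionary lookup. A minimal sketch under strong simplifying assumptions: the toy `ONTOLOGY` dictionary below is hypothetical (the actual system uses OGER over full domain ontologies), and the second, precision-oriented stage would be a CRF or neural classifier rather than anything shown here:

```python
# Hedged sketch of dictionary-based candidate generation (not the actual
# OGER implementation): greedy longest match of ontology terms in text.
ONTOLOGY = {  # hypothetical term -> entity-type entries
    "cell membrane": "cellular_component",
    "membrane": "cellular_component",
    "apoptosis": "biological_process",
    "p53": "protein",
}

def dictionary_candidates(text):
    """Return (term, type) candidates via greedy longest-match lookup."""
    tokens = text.lower().split()
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):   # try the longest span first
            phrase = " ".join(tokens[i:j])
            if phrase in ONTOLOGY:
                match = (phrase, ONTOLOGY[phrase])
                i = j                          # jump past the matched span
                break
        if match:
            spans.append(match)
        else:
            i += 1                             # no match starting here
    return spans

cands = dictionary_candidates("P53 triggers apoptosis at the cell membrane")
print(cands)
```

Because the longest span wins, "cell membrane" is emitted as one cellular-component candidate rather than a bare "membrane" hit; the deliberate over-generation of such a matcher is what the downstream classifier is then trained to filter.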
Affiliation(s)
- Marco Basaldella, Carlo Tasso
- Università degli Studi di Udine, Via delle Scienze 208, Udine, 33100 Italy
- Lenz Furrer, Fabio Rinaldi
- University of Zurich, Institute of Computational Linguistics and Swiss Institute of Bioinformatics, Andreasstrasse 15, Zürich, CH-8050 Switzerland
41
Groth P, Cox J. Indicators for the use of robotic labs in basic biomedical research: a literature analysis. PeerJ 2017; 5:e3997. [PMID: 29134146] [PMCID: PMC5681851] [DOI: 10.7717/peerj.3997]
Abstract
Robotic labs, in which experiments are carried out entirely by robots, have the potential to provide a reproducible and transparent foundation for performing basic biomedical laboratory experiments. In this article, we investigate whether these labs could be applicable in current experimental practice. We do this by text mining 1,628 papers for occurrences of methods that are supported by commercial robotic labs. Using two different concept recognition tools, we find that 86%-89% of the papers have at least one of these methods. This and our other results provide indications that robotic labs can serve as the foundation for performing many lab-based experiments.
42
Jovanović J, Bagheri E. Semantic annotation in biomedicine: the current landscape. J Biomed Semantics 2017; 8:44. [PMID: 28938912] [PMCID: PMC5610427] [DOI: 10.1186/s13326-017-0153-x]
Abstract
The abundance and unstructured nature of biomedical texts, be it clinical or research content, impose significant challenges for the effective and efficient use of the information and knowledge stored in such texts. Annotation of biomedical documents with machine-intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. As a result, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators. Over the last dozen years, the biomedical research community has invested significant effort in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state-of-the-art biomedical semantic annotators, focusing particularly on general-purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for further improvement of today's annotators, which could make them even more capable of meeting the needs of real-world applications. To motivate and encourage further developments in this area, along the suggested and/or related directions, we review existing and potential practical applications and benefits of semantic annotators.
Affiliation(s)
- Jelena Jovanović
- Department of Software Engineering, University of Belgrade, 154 Jove Ilica Street, Belgrade, Serbia
- Ebrahim Bagheri
- Department of Electrical Engineering, Ryerson University, 245 Church Street, Toronto, Canada
43
Névéol A, Zweigenbaum P. Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing. Yearb Med Inform 2017; 26:228-234. [PMID: 29063569] [PMCID: PMC6239234] [DOI: 10.15265/iy-2017-027]
Abstract
Objectives: To summarize recent research and present a selection of the best papers published in 2016 in the field of clinical Natural Language Processing (NLP). Method: A survey of the literature was performed by the two section editors of the IMIA Yearbook NLP section. Bibliographic databases were searched for papers with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Papers were automatically ranked and then manually reviewed based on titles and abstracts. A shortlist of candidate best papers was first selected by the section editors before being peer-reviewed by independent external reviewers. Results: The five clinical NLP best papers provide contributions that range from emerging original foundational methods to transitioning solid, established research results to a practical clinical setting. They offer a framework for abbreviation disambiguation and coreference resolution, a classification method to identify clinically useful sentences, an analysis of counseling conversations to improve support for patients with mental disorders, and a grounding of gradable adjectives. Conclusions: Clinical NLP continued to thrive in 2016, with an increasing number of contributions towards applications compared to fundamental methods. Fundamental work addresses increasingly complex problems such as lexical semantics, coreference resolution, and discourse analysis. Research results translate into freely available tools, mainly for English.
Affiliation(s)
- A. Névéol
- LIMSI, CNRS, Université Paris Saclay, Orsay, France
44
Gonzalez-Hernandez G, Sarker A, O’Connor K, Savova G. Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text. Yearb Med Inform 2017; 26:214-227. [PMID: 29063568] [PMCID: PMC6250990] [DOI: 10.15265/iy-2017-029]
Abstract
Background: Natural Language Processing (NLP) methods are increasingly being utilized to mine knowledge from unstructured health-related texts. Recent advances in noisy-text processing techniques are enabling researchers and medical domain experts to go beyond the information encapsulated in published texts (e.g., clinical trials and systematic reviews) and structured questionnaires, and obtain perspectives from other unstructured sources such as Electronic Health Records (EHRs) and social media posts. Objectives: To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts. Methods: The literature review covered research published over the last five years, based on searches of PubMed, conference proceedings, and the ACM Digital Library, as well as relevant publications referenced in papers. We particularly focused on the techniques employed on EHR and social media data. Results: A set of 62 studies involving EHRs and 87 studies involving social media matched our criteria and were included in this paper. We present the purposes of these studies, outline the key NLP contributions, and discuss the general trends observed in the field, the current state of research, and important outstanding problems. Conclusions: Over recent years, there has been a continuing transition from lexical and rule-based systems to learning-based approaches, owing to the growth of annotated data sets and advances in data science. For EHRs, publicly available annotated data are still scarce, and this acts as an obstacle to research progress. In contrast, research on social media mining has seen rapid growth, particularly because the large amount of unlabeled data available via this resource compensates for the uncertainty inherent to the data. Effective mechanisms to filter out noise and to map social media expressions to standard medical concepts remain crucial open research problems. Shared tasks and other competitive challenges have been driving factors behind the implementation of open systems, and they are likely to play an imperative role in the development of future systems.
Affiliation(s)
- G. Gonzalez-Hernandez, A. Sarker, K. O’Connor
- Department of Epidemiology, Biostatistics, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- G. Savova
- Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA
45
RysannMD: A biomedical semantic annotator balancing speed and accuracy. J Biomed Inform 2017; 71:91-109. [PMID: 28552401] [DOI: 10.1016/j.jbi.2017.05.016]
Abstract
Recently, both researchers and practitioners have explored the possibility of semantically annotating large and continuously evolving collections of biomedical texts, such as research papers, medical reports, and physician notes, in order to enable their efficient and effective management and use in clinical practice or research laboratories. Such annotations can be automatically generated by biomedical semantic annotators, tools specifically designed for detecting and disambiguating biomedical concepts mentioned in text. The biomedical community has already presented several solid automated semantic annotators. However, the existing tools are either strong in their disambiguation capacity, i.e., the ability to identify the correct biomedical concept for a given piece of text among several candidate concepts, or they excel in their processing time, i.e., work very efficiently; none of the semantic annotation tools reported in the literature has both of these qualities. In this paper, we present RysannMD (Ryerson Semantic Annotator for Medical Domain), a biomedical semantic annotation tool that strikes a balance between processing time and performance while disambiguating biomedical terms. In other words, RysannMD provides reasonable disambiguation performance when choosing the right sense for a biomedical term in a given context, and does so in a reasonable time. To examine how RysannMD stands with respect to state-of-the-art biomedical semantic annotators, we conducted a series of experiments using standard benchmarking corpora, including both gold and silver standards, and four modern biomedical semantic annotators: cTAKES, MetaMap, NOBLE Coder, and Neji. The annotators were compared with respect to the quality of the produced annotations, measured against gold and silver standards using precision, recall, and F1, and with respect to speed, i.e., processing time. In the experiments, RysannMD achieved the best median F1 across the benchmarking corpora, independent of the standard used (silver/gold), biomedical subdomain, and document size. In terms of annotation speed, RysannMD scored the second-best median processing time across all the experiments. The obtained results indicate that RysannMD offers the best performance among the examined semantic annotators when both quality of annotation and speed are considered simultaneously.
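The quality side of such a benchmark reduces to precision, recall, and F1 over gold-standard annotations. A minimal sketch, assuming each annotation is represented as a (concept ID, start, end) triple; the identifiers below are arbitrary UMLS-style examples, not data from the paper:

```python
# Toy evaluation sketch: scoring one annotator's predictions against a
# gold standard, as done when benchmarking annotators such as cTAKES,
# MetaMap, NOBLE Coder, and Neji.
def prf1(gold, predicted):
    """Exact-match precision, recall, and F1 over annotation sets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("C0006142", 10, 23), ("C0027651", 40, 45)}  # (CUI, start, end)
pred = {("C0006142", 10, 23), ("C0032285", 60, 70)}
p, r, f = prf1(gold, pred)
print(p, r, round(f, 3))  # → 0.5 0.5 0.5
```

Exact span-and-concept matching, as here, is the strictest convention; published evaluations sometimes also report relaxed (overlap-based) matching, which would require a different `tp` definition.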
46
Jackson RG, Patel R, Jayatilleke N, Kolliakou A, Ball M, Gorrell G, Roberts A, Dobson RJ, Stewart R. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open 2017; 7:e012012. [PMID: 28096249] [PMCID: PMC5253558] [DOI: 10.1136/bmjopen-2016-012012]
Abstract
OBJECTIVES We sought to use natural language processing to develop a suite of language models to capture key symptoms of severe mental illness (SMI) from clinical text, to facilitate the secondary use of mental healthcare data in research. DESIGN Development and validation of information extraction applications for ascertaining symptoms of SMI in routine mental health records using the Clinical Record Interactive Search (CRIS) data resource; description of their distribution in a corpus of discharge summaries. SETTING Electronic records from a large mental healthcare provider serving a geographic catchment of 1.2 million residents in four boroughs of south London, UK. PARTICIPANTS The distribution of derived symptoms was described in 23 128 discharge summaries from 7962 patients who had received an SMI diagnosis, and 13 496 discharge summaries from 7575 patients who had received a non-SMI diagnosis. OUTCOME MEASURES Fifty SMI symptoms were identified by a team of psychiatrists for extraction based on salience and linguistic consistency in records, broadly categorised under positive, negative, disorganisation, manic and catatonic subgroups. Text models for each symptom were generated using the TextHunter tool and the CRIS database. RESULTS We extracted data for 46 symptoms with a median F1 score of 0.88. Four symptom models performed poorly and were excluded. From the corpus of discharge summaries, it was possible to extract symptomatology in 87% of patients with SMI and 60% of patients with non-SMI diagnosis. CONCLUSIONS This work demonstrates the possibility of automatically extracting a broad range of SMI symptoms from English text discharge summaries for patients with an SMI diagnosis. Descriptive data also indicated that most symptoms cut across diagnoses, rather than being restricted to particular groups.
Affiliation(s)
- Richard G Jackson, Rashmi Patel, Nishamali Jayatilleke, Anna Kolliakou, Michael Ball, Richard J Dobson, Robert Stewart
- Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Genevieve Gorrell, Angus Roberts
- Department of Computer Science, University of Sheffield, Sheffield, UK
47
Development of an Inflammatory Bowel Disease Research Registry Derived from Observational Electronic Health Record Data for Comprehensive Clinical Phenotyping. Dig Dis Sci 2016; 61:3236-3245. [PMID: 27619390] [PMCID: PMC5069178] [DOI: 10.1007/s10620-016-4278-z]
Abstract
BACKGROUND Inflammatory bowel disease (IBD) is a heterogeneous collection of chronic inflammatory disorders of the digestive tract. Clinical, genetic, and pathological heterogeneity makes it increasingly difficult to translate efficacy studies into real-world practice. Our objective was to develop a comprehensive natural history registry derived from multi-year observational data to facilitate effectiveness and clinical phenotypic research in IBD. METHODS A longitudinal, consented registry with prospectively collected data was developed at UPMC. All adult IBD patients receiving care at the tertiary care center of UPMC are eligible for enrollment. Detailed data in the electronic health record are accessible for registry research purposes. Data are exported directly from the electronic health record and temporally organized for research. RESULTS To date, there are over 2565 patients participating in the IBD research registry. All patients have demographic data, clinical disease characteristics, and disease course data including healthcare utilization, laboratory values, health-related questionnaires quantifying disease activity and quality of life, and analytical information on treatment, temporally organized for 6 years (2009-2015). The data have resulted in a detailed definition of clinical phenotypes suitable for association studies with parameters of disease outcomes and treatment response. We have established the infrastructure required to examine the effectiveness of treatment and disease course in the real-world setting of IBD. CONCLUSIONS The IBD research registry offers a unique opportunity to investigate clinical research questions regarding the natural course of the disease, phenotype association studies, effectiveness of treatment, and quality of care research.
48
Alnazzawi N, Thompson P, Ananiadou S. Mapping Phenotypic Information in Heterogeneous Textual Sources to a Domain-Specific Terminological Resource. PLoS One 2016; 11:e0162287. [PMID: 27643689] [PMCID: PMC5028053] [DOI: 10.1371/journal.pone.0162287]
Abstract
Biomedical literature articles and narrative content from Electronic Health Records (EHRs) both constitute rich sources of disease-phenotype information. Phenotype concepts may be mentioned in text in multiple ways, using phrases with a variety of structures. This variability stems partly from the different backgrounds of the authors, but also from the different writing styles typically used in each text type. Since EHR narrative reports and literature articles contain different but complementary types of valuable information, combining details from each text type can help to uncover new disease-phenotype associations. However, the alternative ways in which the same concept may be mentioned in each source constitutes a barrier to the automatic integration of information. Accordingly, identification of the unique concepts represented by phrases in text can help to bridge the gap between text types. We describe our development of a novel method, PhenoNorm, which integrates a number of different similarity measures to allow automatic linking of phenotype concept mentions to known concepts in the UMLS Metathesaurus, a biomedical terminological resource. PhenoNorm was developed using the PhenoCHF corpus—a collection of literature articles and narratives in EHRs, annotated for phenotypic information relating to congestive heart failure (CHF). We evaluate the performance of PhenoNorm in linking CHF-related phenotype mentions to Metathesaurus concepts, using a newly enriched version of PhenoCHF, in which each phenotype mention has an expert-verified link to a concept in the UMLS Metathesaurus. We show that PhenoNorm outperforms a number of alternative methods applied to the same task. Furthermore, we demonstrate PhenoNorm's wider utility, by evaluating its ability to link mentions of various other types of medically-related information, occurring in texts covering wider subject areas, to concepts in different terminological resources. We show that PhenoNorm can maintain performance levels, and that its accuracy compares favourably to other methods applied to these tasks.
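The abstract describes combining several similarity measures to link a phenotype mention to its best-matching terminological concept. As a rough sketch of that general approach (not the published PhenoNorm method; the function names, weights, and the toy concept IDs and terms below are all illustrative, not drawn from the paper or the UMLS), a normalizer might mix character-level and token-level similarity when ranking candidate concepts:

```python
from difflib import SequenceMatcher

# Toy candidate dictionary mapping concept ID -> preferred term.
# IDs and terms are illustrative placeholders, not real UMLS entries.
CANDIDATES = {
    "C0018801": "heart failure",
    "C0018802": "congestive heart failure",
    "C0013404": "dyspnea",
}

def char_sim(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_sim(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def normalize(mention: str, w_char: float = 0.5) -> str:
    """Return the candidate concept ID whose preferred term best matches
    the mention, scoring each candidate by a weighted mix of measures."""
    mention = mention.lower()
    def score(term: str) -> float:
        return w_char * char_sim(mention, term) + (1 - w_char) * token_sim(mention, term)
    return max(CANDIDATES, key=lambda cui: score(CANDIDATES[cui]))

print(normalize("congestive cardiac failure"))  # → C0018802
```

Blending measures this way lets a near-synonymous variant ("congestive cardiac failure") still resolve to the closest concept even though no single measure matches it exactly.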
Collapse
Affiliation(s)
- Noha Alnazzawi
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
- Paul Thompson
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
- Sophia Ananiadou
- National Centre for Text Mining, Manchester Institute of Biotechnology, Manchester University, Manchester, United Kingdom
49
Hochheiser H, Castine M, Harris D, Savova G, Jacobson RS. An information model for computable cancer phenotypes. BMC Med Inform Decis Mak 2016; 16:121. [PMID: 27629872 PMCID: PMC5024416 DOI: 10.1186/s12911-016-0358-4] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Standards, methods, and tools supporting the integration of clinical data and genomic information are an area of significant need and rapid growth in biomedical informatics. Integration of cancer clinical data and cancer genomic information poses unique challenges, because of the high volume and complexity of clinical data, as well as the heterogeneity and instability of cancer genome data when compared with germline data. Current information models of clinical and genomic data are not sufficiently expressive to represent individual observations and to aggregate those observations into longitudinal summaries over the course of cancer care. These models are acutely needed to support the development of systems and tools for generating the so-called clinical "deep phenotype" of individual cancer patients, a process which remains almost entirely manual in cancer research and precision medicine. METHODS Reviews of existing ontologies and interviews with cancer researchers were used to inform iterative development of a cancer phenotype information model. We translated a subset of the Fast Healthcare Interoperability Resources (FHIR) models into the OWL 2 Description Logic (DL) representation, and added extensions as needed for modeling cancer phenotypes with terms derived from the NCI Thesaurus. Models were validated with domain experts and evaluated against competency questions. RESULTS The DeepPhe Information model represents cancer phenotype data at increasing levels of abstraction from mention level in clinical documents to summaries of key events and findings. We describe the model using breast cancer as an example, depicting methods to represent phenotypic features of cancers, tumors, treatment regimens, and specific biologic behaviors that span the entire course of a patient's disease. CONCLUSIONS We present a multi-scale information model for representing individual document mentions, document level classifications, episodes along a disease course, and phenotype summarization, linking individual observations to high-level summaries in support of subsequent integration and analysis.
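The abstract's central idea is a multi-scale model: mention-level annotations roll up into document-level classifications, episodes along the disease course, and a patient-level phenotype summary that remains traceable to its evidence. A minimal sketch of that layering, with hypothetical class and field names (the actual DeepPhe model is expressed in OWL 2 DL over FHIR resources, not as the Python types shown here):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mention:             # lowest level: a text span in one clinical document
    text: str
    doc_id: str
    concept_id: str        # e.g. a terminology code; value here is illustrative

@dataclass
class DocumentFinding:     # document-level classification aggregating mentions
    doc_id: str
    label: str
    mentions: List[Mention] = field(default_factory=list)

@dataclass
class Episode:             # a phase of care along the disease course
    name: str              # e.g. "diagnosis", "treatment"
    findings: List[DocumentFinding] = field(default_factory=list)

@dataclass
class PhenotypeSummary:    # patient-level summary linking back to evidence
    patient_id: str
    episodes: List[Episode] = field(default_factory=list)

    def evidence(self) -> List[Mention]:
        """Trace high-level conclusions back to mention-level evidence."""
        return [m for e in self.episodes for f in e.findings for m in f.mentions]

m = Mention("invasive ductal carcinoma", "note-1", "C4017")
summary = PhenotypeSummary(
    "pt-01",
    [Episode("diagnosis", [DocumentFinding("note-1", "BreastCancerDx", [m])])],
)
print(len(summary.evidence()))  # → 1
```

The point of the layering is that each abstraction level keeps explicit links downward, so a summary-level statement can always be audited against the document mentions that support it.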
Affiliation(s)
- Harry Hochheiser
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Boulevard, Rm 523, Pittsburgh, PA 15206-3701, USA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Melissa Castine
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Boulevard, Rm 523, Pittsburgh, PA 15206-3701, USA
- David Harris
- Boston Children's Hospital and Harvard Medical School, Boston, MA, USA
- Guergana Savova
- Boston Children's Hospital and Harvard Medical School, Boston, MA, USA
- Rebecca S Jacobson
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Boulevard, Rm 523, Pittsburgh, PA 15206-3701, USA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA; University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA