1
Schäfer H, Idrissi-Yaghir A, Arzideh K, Damm H, Pakull TM, Schmidt CS, Bahn M, Lodde G, Livingstone E, Schadendorf D, Nensa F, Horn PA, Friedrich CM. BioKGrapher: Initial evaluation of automated knowledge graph construction from biomedical literature. Comput Struct Biotechnol J 2024; 24:639-660. [PMID: 39502384 PMCID: PMC11536026 DOI: 10.1016/j.csbj.2024.10.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 10/11/2024] [Accepted: 10/11/2024] [Indexed: 11/08/2024] Open
Abstract
Background The growth of biomedical literature presents challenges in extracting and structuring knowledge. Knowledge Graphs (KGs) offer a solution by representing relationships between biomedical entities. However, manual construction of KGs is labor-intensive and time-consuming, highlighting the need for automated methods. This work introduces BioKGrapher, a tool for automatic KG construction using large-scale publication data, with a focus on biomedical concepts related to specific medical conditions. BioKGrapher allows researchers to construct KGs from PubMed IDs. Methods The BioKGrapher pipeline begins with Named Entity Recognition and Linking (NER+NEL) to extract and normalize biomedical concepts from PubMed, mapping them to the Unified Medical Language System (UMLS). Extracted concepts are weighted and re-ranked using Kullback-Leibler divergence and local frequency balancing. These concepts are then integrated into hierarchical KGs, with relationships formed using terminologies like SNOMED CT and NCIt. Downstream applications include multi-label document classification using Adapter-infused Transformer models. Results BioKGrapher effectively aligns generated concepts with clinical practice guidelines from the German Guideline Program in Oncology (GGPO), achieving F1-scores of up to 0.6. In multi-label classification, Adapter-infused models using a BioKGrapher cancer-specific KG improved micro F1-scores by up to 0.89 percentage points over a non-specific KG and 2.16 points over base models across three BERT variants. The drug-disease extraction case study identified indications for Nivolumab and Rituximab. Conclusion BioKGrapher is a tool for automatic KG construction, aligning with the GGPO and enhancing downstream task performance. It offers a scalable solution for managing biomedical knowledge, with potential applications in literature recommendation, decision support, and drug repurposing.
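The abstract above mentions weighting concepts with Kullback-Leibler divergence. As a rough illustration only, the sketch below scores each concept by its pointwise contribution to the KL divergence between a condition-specific corpus and a background corpus. The function name `kl_concept_weights`, the add-one smoothing, and the toy counts are assumptions made for this example; the paper's actual weighting and its local frequency balancing step are not reproduced here.

```python
import math
from collections import Counter


def kl_concept_weights(domain_counts, background_counts, smoothing=1.0):
    """Score each concept by its pointwise term p * log(p / q).

    p is the smoothed relative frequency in the condition-specific corpus,
    q the smoothed relative frequency in the background corpus.
    Additive smoothing is an assumption of this sketch, used to avoid log(0).
    """
    vocab = set(domain_counts) | set(background_counts)
    d_total = sum(domain_counts.values()) + smoothing * len(vocab)
    b_total = sum(background_counts.values()) + smoothing * len(vocab)
    weights = {}
    for concept in vocab:
        p = (domain_counts.get(concept, 0) + smoothing) / d_total
        q = (background_counts.get(concept, 0) + smoothing) / b_total
        weights[concept] = p * math.log(p / q)
    return weights


# Hypothetical concept counts, invented purely for illustration.
domain = Counter({"melanoma": 40, "nivolumab": 25, "patient": 35})
background = Counter({"melanoma": 5, "nivolumab": 2, "patient": 93})

weights = kl_concept_weights(domain, background)
ranked = sorted(weights, key=weights.get, reverse=True)
total_kl = sum(weights.values())
```

Concepts that are frequent in the condition-specific corpus but rare in the background rank first, while generic terms receive low or negative scores. The sum of all pointwise terms is the KL divergence itself, which is never negative.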
Affiliation(s)
- Henning Schäfer
- Institute for Transfusion Medicine, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, Dortmund, 44227, Germany
- Ahmad Idrissi-Yaghir
- Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, Dortmund, 44227, Germany
- Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Kamyar Arzideh
- Institute for AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Hendrik Damm
- Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, Dortmund, 44227, Germany
- Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Tabea M.G. Pakull
- Institute for Transfusion Medicine, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, Dortmund, 44227, Germany
- Cynthia S. Schmidt
- Institute for Transfusion Medicine, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Mikel Bahn
- Institute for AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Georg Lodde
- Department of Dermatology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Elisabeth Livingstone
- Department of Dermatology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Dirk Schadendorf
- Department of Dermatology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Felix Nensa
- Institute for AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Peter A. Horn
- Institute for Transfusion Medicine, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Christoph M. Friedrich
- Department of Computer Science, University of Applied Sciences and Arts Dortmund (FHDO), Emil-Figge Str. 42, Dortmund, 44227, Germany
- Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
2
Wu J, Dong H, Li Z, Wang H, Li R, Patra A, Dai C, Ali W, Scordis P, Wu H. A hybrid framework with large language models for rare disease phenotyping. BMC Med Inform Decis Mak 2024; 24:289. [PMID: 39375687 PMCID: PMC11460004 DOI: 10.1186/s12911-024-02698-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 09/26/2024] [Indexed: 10/09/2024] Open
Abstract
PURPOSE Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based natural language processing (NLP) tools with large language models (LLMs) to improve rare disease identification from unstructured clinical reports. METHODS We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. SemEHR, a dictionary-based NLP tool, is employed to extract rare disease mentions from clinical notes. To refine the results and improve accuracy, we leverage various LLMs, including LLaMA3, Phi3-mini, and domain-specific models like OpenBioLLM and BioMistral. Different prompting strategies, such as zero-shot, few-shot, and knowledge-augmented generation, are explored to optimize the LLMs' performance. RESULTS The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. LLaMA3 and Phi3-mini achieve the highest F1 scores in rare disease identification. Few-shot prompting with 1-3 examples yields the best results, while knowledge-augmented generation shows limited improvement. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients. CONCLUSION The hybrid approach combining dictionary-based NLP tools with LLMs shows great promise for improving rare disease identification from unstructured clinical reports. By leveraging the strengths of both techniques, the method demonstrates superior performance and the potential to uncover hidden rare disease cases. 
Further research is needed to address limitations related to ontology mapping and overlapping case identification, and to integrate the approach into clinical practice for early diagnosis and improved patient outcomes.
Affiliation(s)
- Jinge Wu
- Institute of Health Informatics, University College London, London, UK.
- UCB Pharma UK, Slough, UK.
- Hang Dong
- Department of Computer Science, University of Exeter, Exeter, UK
- Zexi Li
- The Nuffield Department of Surgical Sciences, University of Oxford, Oxford, UK
- Haowei Wang
- Division of Medicine, University College London, London, UK
- Runci Li
- EGA Institute for Women's Health, University College London, London, UK
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK.
- School of Health and Wellbeing, University of Glasgow, Glasgow, UK.
3
Zhang H, Jethani N, Jones S, Genes N, Major VJ, Jaffe IS, Cardillo AB, Heilenbach N, Ali NF, Bonanni LJ, Clayburn AJ, Khera Z, Sadler EC, Prasad J, Schlacter J, Liu K, Silva B, Montgomery S, Kim EJ, Lester J, Hill TM, Avoricani A, Chervonski E, Davydov J, Small W, Chakravartty E, Grover H, Dodson JA, Brody AA, Aphinyanaphongs Y, Masurkar A, Razavian N. Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores. medRxiv 2024:2023.07.10.23292373. [PMID: 38405784 PMCID: PMC10888985 DOI: 10.1101/2023.07.10.23292373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Importance Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR. Objective Evaluate ChatGPT and LlaMA-2 performance in extracting MMSE and CDR scores, including their associated dates. Methods Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' Kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows TRIPOD reporting guidelines for model validation. Results For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), true-negative rates of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), true-negative rates of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of wrong test reported instead of MMSE, and 19 cases of reporting a wrong date.
Conclusions In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
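The abstract above reports accuracy, sensitivity, true-negative rate, and precision. For readers less familiar with how these relate, here is a minimal sketch computing them from confusion-matrix counts. The function name `extraction_metrics` and the counts are hypothetical and are not taken from the study.

```python
def extraction_metrics(tp, fp, tn, fn):
    """Compute standard binary evaluation metrics from confusion counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Returns None for a metric whose denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),          # also called recall
        "true_negative_rate": ratio(tn, tn + fp),   # also called specificity
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = extraction_metrics(tp=80, fp=20, tn=90, fn=10)
```

With these invented counts, accuracy is 170/200 and precision is 80/100. A high true-negative rate can coexist with low precision when positives are rare, which is one way to read a result with a true-negative rate near 100% and precision below 50%.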
Affiliation(s)
- Abraham A Brody
- NYU Rory Meyers College of Nursing, NYU Grossman School of Medicine
4
Amar F, April A, Abran A. Electronic Health Record and Semantic Issues Using Fast Healthcare Interoperability Resources: Systematic Mapping Review. J Med Internet Res 2024; 26:e45209. [PMID: 38289660 PMCID: PMC10865191 DOI: 10.2196/45209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Revised: 03/07/2023] [Accepted: 12/19/2023] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND The increasing use of electronic health records and the Internet of Things has led to interoperability issues at different levels (structural and semantic). Standards are important not only for successfully exchanging data but also for appropriately interpreting them (semantic interoperability). Thus, to facilitate the semantic interoperability of data exchanged in health care, considerable resources have been deployed to improve the quality of shared clinical data by structuring and mapping them to the Fast Healthcare Interoperability Resources (FHIR) standard. OBJECTIVE The aims of this study are 2-fold: to inventory the studies on FHIR semantic interoperability resources and terminologies and to identify and classify the approaches and contributions proposed in these studies. METHODS A systematic mapping review (SMR) was conducted using 10 electronic databases as sources of information for inventory and review studies published during 2012 to 2022 on the development and improvement of semantic interoperability using the FHIR standard. RESULTS A total of 70 FHIR studies were selected and analyzed to identify FHIR resource types and terminologies from a semantic perspective. The proposed semantic approaches were classified into 6 categories, namely mapping (31/126, 24.6%), terminology services (18/126, 14.3%), resource description framework or web ontology language-based proposals (24/126, 19%), annotation proposals (18/126, 14.3%), machine learning (ML) and natural language processing (NLP) proposals (20/126, 15.9%), and ontology-based proposals (15/126, 11.9%). From 2012 to 2022, there has been continued research in 6 categories of approaches as well as in new and emerging annotations and ML and NLP proposals. This SMR also classifies the contributions of the selected studies into 5 categories: framework or architecture proposals, model proposals, technique proposals, comparison services, and tool proposals. 
The most frequent type of contribution is the proposal of a framework or architecture to enable semantic interoperability. CONCLUSIONS This SMR provides a classification of the different solutions proposed to address semantic interoperability using FHIR at different levels: collecting, extracting and annotating data, modeling electronic health record data from legacy systems, and applying transformation and mapping to FHIR models and terminologies. The use of ML and NLP for unstructured data is promising and has been applied to specific use case scenarios. In addition, terminology services are needed to accelerate their use and adoption; furthermore, techniques and tools to automate annotation and ontology comparison should help reduce human interaction.
Affiliation(s)
- Fouzia Amar
- École de technologie supérieure - ETS, Montreal, QC, Canada
- Alain April
- École de technologie supérieure - ETS, Montreal, QC, Canada
- Alain Abran
- École de technologie supérieure - ETS, Montreal, QC, Canada
5
Abdulnazar A, Roller R, Schulz S, Kreuzthaler M. Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT. Digit Health 2024; 10:20552076241288681. [PMID: 39493636 PMCID: PMC11531008 DOI: 10.1177/20552076241288681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 09/03/2024] [Indexed: 11/05/2024] Open
Abstract
Objective Clinical narratives provide comprehensive patient information. Achieving interoperability involves mapping relevant details to standardized medical vocabularies. Typically, natural language processing divides this task into named entity recognition (NER) and medical concept normalization (MCN). State-of-the-art results require supervised setups with abundant training data. However, the limited availability of annotated data due to sensitivity and time constraints poses challenges. This study addressed the need for unsupervised medical concept annotation (MCA) to overcome these limitations and support the creation of annotated datasets. Method We use an unsupervised SapBERT-based bi-encoder model to analyze n-grams from narrative text and measure their similarity to SNOMED CT concepts. At the end, we apply a syntactical re-ranker. For evaluation, we use the semantic tags of SNOMED CT candidates to assess the NER phase and their concept IDs to assess the MCN phase. The approach is evaluated with both English and German narratives. Result Without training data, our unsupervised approach achieves an F1 score of 0.765 in English and 0.557 in German for MCN. Evaluation at the semantic tag level reveals that "disorder" has the highest F1 scores, 0.871 and 0.648 on English and German datasets. Furthermore, the MCA approach on the semantic tag "disorder" shows F1 scores of 0.839 and 0.696 in English and 0.685 and 0.437 in German for NER and MCN, respectively. Conclusion This unsupervised approach demonstrates potential for initial annotation (pre-labeling) in manual annotation tasks. While promising for certain semantic tags, challenges remain, including false positives, contextual errors, and variability of clinical language, requiring further fine-tuning.
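The method above compares n-gram embeddings with concept embeddings. The sketch below shows only the similarity-ranking step of a bi-encoder setup, using hand-written toy vectors in place of real SapBERT embeddings. The function names, the three-dimensional vectors, and the concept labels are assumptions for illustration; the syntactical re-ranker and actual SNOMED CT identifiers are not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_concepts(mention_vec, concept_vecs, top_k=2):
    """Return the top_k (label, similarity) pairs, most similar first."""
    scored = [(label, cosine(mention_vec, vec)) for label, vec in concept_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy embeddings standing in for encoder outputs (purely illustrative).
concepts = {
    "myocardial infarction": [1.0, 0.0, 0.0],
    "headache": [0.0, 1.0, 0.0],
    "fracture": [0.0, 0.0, 1.0],
}
mention = [0.9, 0.1, 0.0]  # hypothetical embedding of the n-gram "heart attack"

candidates = rank_concepts(mention, concepts, top_k=2)
```

In a real pipeline both the n-grams and the concept names would be encoded by the same pretrained model, and the candidate list would then be passed to a re-ranking stage.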
Affiliation(s)
- Akhila Abdulnazar
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
- CBmed GmbH – Center for Biomarker Research in Medicine, Graz, Austria
- Roland Roller
- German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
- Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
- Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
6
Msosa YJ, Grauslys A, Zhou Y, Wang T, Buchan I, Langan P, Foster S, Walker M, Pearson M, Folarin A, Roberts A, Maskell S, Dobson R, Kullu C, Kehoe D. Trustworthy Data and AI Environments for Clinical Prediction: Application to Crisis-Risk in People With Depression. IEEE J Biomed Health Inform 2023; 27:5588-5598. [PMID: 37669205 DOI: 10.1109/jbhi.2023.3312011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2023]
Abstract
Depression is a common mental health condition that often occurs in association with other chronic illnesses, and varies considerably in severity. Electronic Health Records (EHRs) contain rich information about a patient's medical history and can be used to train, test and maintain predictive models to support and improve patient care. This work evaluated the feasibility of implementing an environment for predicting mental health crisis among people living with depression based on both structured and unstructured EHRs. A large EHR from a mental health provider, Mersey Care, was pseudonymised and ingested into the Natural Language Processing (NLP) platform CogStack, allowing text content in binary clinical notes to be extracted. All unstructured clinical notes and summaries were semantically annotated by MedCAT and BioYODIE NLP services. Cases of crisis in patients with depression were then identified. Random forest models, gradient boosting trees, and Long Short-Term Memory (LSTM) networks, with varying feature arrangement, were trained to predict the occurrence of crisis. The results showed that all the prediction models can use a combination of structured and unstructured EHR information to predict crisis in patients with depression with good and useful accuracy. The LSTM network that was trained on a modified dataset with only the 1000 most important features from the random forest model with temporality showed the best performance, with a mean AUC of 0.901 and a standard deviation of 0.006 using a training dataset and a mean AUC of 0.810 and a standard deviation of 0.01 using a hold-out test dataset. Comparing the results from the technical evaluation with the views of psychiatrists shows that there are now opportunities to refine and integrate such prediction models into pragmatic point-of-care clinical decision support tools for supporting mental healthcare delivery.
7
Nath N, Lee SH, Lee I. Application of specialized word embeddings and named entity and attribute recognition to the problem of unsupervised automated clinical coding. Comput Biol Med 2023; 165:107422. [PMID: 37722157 DOI: 10.1016/j.compbiomed.2023.107422] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/01/2023] [Revised: 07/30/2023] [Accepted: 08/28/2023] [Indexed: 09/20/2023]
Abstract
Notes documented by clinicians, such as patient histories, hospital courses, lab reports and others, are often annotated with standardized clinical codes by medical coders to facilitate a variety of secondary processing applications such as billing and statistical analyses. Clinical coding, traditionally manual and labor-intensive, has seen a surge in research interest from deep learning researchers seeking to automate it. However, deep learning methods require large volumes of annotated clinical data for training and offer little to explain why codes were assigned to pieces of text. In this paper, we propose an unsupervised method which does not need annotated clinical text and is fully interpretable, by using Named Entity and Attribute Recognition and word embeddings specialized for the clinical domain. These methods successfully glean important information from large volumes of clinical notes and encode it effectively in order to perform automatic clinical coding.
Affiliation(s)
- Namrata Nath
- UniSA STEM, University of South Australia, GPO Box 2471, Adelaide, SA, 5001, Australia.
- Sang-Heon Lee
- UniSA STEM, University of South Australia, Adelaide, Australia
- Ivan Lee
- UniSA STEM, University of South Australia, Adelaide, Australia
8
Wang L, Ambite JL, Appaji A, Bijsterbosch J, Dockes J, Herrick R, Kogan A, Lander H, Marcus D, Moore SM, Poline JB, Rajasekar A, Sahoo SS, Turner MD, Wang X, Wang Y, Turner JA. NeuroBridge: a prototype platform for discovery of the long-tail neuroimaging data. Front Neuroinform 2023; 17:1215261. [PMID: 37720825 PMCID: PMC10500076 DOI: 10.3389/fninf.2023.1215261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Accepted: 08/01/2023] [Indexed: 09/19/2023] Open
Abstract
Introduction Open science initiatives have enabled sharing of large amounts of already collected data. However, significant gaps remain regarding how to find appropriate data, including underutilized data that exist in the long tail of science. We demonstrate the NeuroBridge prototype and its ability to search PubMed Central full-text papers for information relevant to neuroimaging data collected from schizophrenia and addiction studies. Methods The NeuroBridge architecture contained the following components: (1) Extensible ontology for modeling study metadata: subject population, imaging techniques, and relevant behavioral, cognitive, or clinical data. Details are described in the companion paper in this special issue; (2) A natural-language based document processor that leveraged pre-trained deep-learning models on a small-sample document corpus to establish efficient representations for each article as a collection of machine-recognized ontological terms; (3) Integrated search using ontology-driven similarity to query PubMed Central and NeuroQuery, which provides fMRI activation maps along with PubMed source articles. Results The NeuroBridge prototype contains a corpus of 356 papers from 2018 to 2021 describing schizophrenia and addiction neuroimaging studies, of which 186 were annotated with the NeuroBridge ontology. The search portal on the NeuroBridge website https://neurobridges.org/ provides an interactive Query Builder, where the user builds queries by selecting NeuroBridge ontology terms to preserve the ontology tree structure. For each return entry, links to the PubMed abstract as well as to the PMC full-text article, if available, are presented. For each of the returned articles, we provide a list of clinical assessments described in the Section "Methods" of the article. Articles returned from NeuroQuery based on the same search are also presented. 
Conclusion The NeuroBridge prototype combines ontology-based search with natural-language text-mining approaches to demonstrate that papers relevant to a user's research question can be identified. The NeuroBridge prototype takes a first step toward identifying potential neuroimaging data described in full-text papers. Toward the overall goal of discovering "enough data of the right kind," ongoing work includes validating the document processor with a larger corpus, extending the ontology to include detailed imaging data, and extracting information regarding data availability from the returned publications and incorporating XNAT-based neuroimaging databases to enhance data accessibility.
Affiliation(s)
- Lei Wang
- Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
- José Luis Ambite
- Information Sciences Institute and Computer Science, University of Southern California, Los Angeles, CA, United States
- Abhishek Appaji
- Department of Medical Electronics Engineering, BMS College of Engineering, Bangalore, India
- Janine Bijsterbosch
- Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
- Jerome Dockes
- Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada
- Rick Herrick
- Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
- Alex Kogan
- Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
- Howard Lander
- Renaissance Computing Institute, Chapel Hill, NC, United States
- Daniel Marcus
- Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
- Stephen M. Moore
- Department of Radiology, Washington University in St. Louis, St. Louis, MO, United States
- Jean-Baptiste Poline
- Department of Neurology and Neurosurgery, McGill University, Montreal, QC, Canada
- Arcot Rajasekar
- Renaissance Computing Institute, Chapel Hill, NC, United States
- School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Satya S. Sahoo
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, United States
- Matthew D. Turner
- Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
- Xiaochen Wang
- College of Information Sciences and Technology, Pennsylvania State University, State College, PA, United States
- Yue Wang
- School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Jessica A. Turner
- Psychiatry and Behavioral Health Department, The Ohio State University Wexner Medical Center, Columbus, OH, United States
9
Andrew NE, Beare R, Ravipati T, Parker E, Snowdon D, Naude K, Srikanth V. Developing a linked electronic health record derived data platform to support research into healthy ageing. Int J Popul Data Sci 2023; 8:2129. [PMID: 37670961 PMCID: PMC10476553 DOI: 10.23889/ijpds.v8i1.2129] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 09/07/2023] Open
Abstract
Introduction Digitalisation of Electronic Health Record (EHR) data has created unique opportunities for research. However, these data are routinely collected for operational purposes and so are not curated to the standard required for research. Harnessing such routine data at large scale allows efficient and long-term epidemiological and health services research. Objectives To describe the establishment of a linked EHR derived data platform in the National Centre for Healthy Ageing, Melbourne, Australia, aimed at enabling research targeting national health priority areas in ageing. Methods Our approach incorporated: data validation, curation and warehousing to ensure quality and completeness; end-user engagement and consensus on the platform content; implementation of an artificial intelligence (AI) pipeline for extraction of text-based data items; early consumer involvement; and implementation of routine collection of patient reported outcome measures, in a multisite public health service. Results Data for a cohort of >800,000 patients collected over a 10-year period have been curated within the platform's research data warehouse. So far 117 items have been identified as suitable for inclusion, from 11 research relevant datasets held within the health service EHR systems. Data access, extraction and release processes, guided by the Five Safes Framework, are being tested through project use-cases. A natural language processing (NLP) pipeline has been implemented and a framework for the routine collection and incorporation of patient reported outcome measures developed. Conclusions We highlight the importance of establishing comprehensive processes for the foundations of a data platform utilising routine data not collected for research purposes. These robust foundations will facilitate future expansion through linkages to other datasets for the efficient and cost-effective study of health related to ageing at a large scale.
Affiliation(s)
- Nadine E. Andrew
- National Centre for Healthy Ageing, Frankston, Victoria, Australia
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Richard Beare
- National Centre for Healthy Ageing, Frankston, Victoria, Australia
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Tanya Ravipati
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Emily Parker
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- David Snowdon
- National Centre for Healthy Ageing, Frankston, Victoria, Australia
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Kim Naude
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Velandai Srikanth
- National Centre for Healthy Ageing, Frankston, Victoria, Australia
- Department of Medicine, Peninsula Clinical School, Central Clinical School, Monash University, Frankston, Victoria, Australia
- Department of Medicine & Geriatric Medicine, Frankston Hospital, Peninsula Health, Melbourne, Australia
| |
|
10
|
Dong H, Suárez-Paniagua V, Zhang H, Wang M, Casey A, Davidson E, Chen J, Alex B, Whiteley W, Wu H. Ontology-driven and weakly supervised rare disease identification from clinical notes. BMC Med Inform Decis Mak 2023; 23:86. [PMID: 37147628 PMCID: PMC10162001 DOI: 10.1186/s12911-023-02181-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 04/21/2023] [Indexed: 05/07/2023] Open
Abstract
BACKGROUND Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. METHODS We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-driven framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations. RESULTS The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes). CONCLUSION The study provides empirical evidence for the task by applying a weakly supervised NLP pipeline on clinical notes. The proposed weakly supervised deep learning approach requires no human annotation except for validation and testing, by leveraging ontologies, NER+L tools, and contextual representations. The study also demonstrates that Natural Language Processing (NLP) can complement traditional ICD-based approaches to better estimate rare diseases in clinical notes. We discuss the usefulness and limitations of the weak supervision approach and propose directions for future studies.
Affiliation(s)
- Hang Dong
- Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, United Kingdom.
- Health Data Research UK, London, United Kingdom.
- Department of Computer Science, University of Oxford, Oxford, United Kingdom.
| | - Víctor Suárez-Paniagua
- Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, United Kingdom
- Health Data Research UK, London, United Kingdom
| | - Huayu Zhang
- Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
| | - Minhong Wang
- Institute of Health Informatics, University College London, London, United Kingdom
| | - Arlene Casey
- Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
| | - Emma Davidson
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Jiaoyan Chen
- Department of Computer Science, The University of Manchester, Manchester, United Kingdom
| | - Beatrice Alex
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, United Kingdom
| | - William Whiteley
- Health Data Research UK, London, United Kingdom
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | - Honghan Wu
- Health Data Research UK, London, United Kingdom.
- Institute of Health Informatics, University College London, London, United Kingdom.
| |
|
11
|
He T, Belouali A, Patricoski J, Lehmann H, Ball R, Anagnostou V, Kreimeyer K, Botsis T. Trends and opportunities in computable clinical phenotyping: A scoping review. J Biomed Inform 2023; 140:104335. [PMID: 36933631 DOI: 10.1016/j.jbi.2023.104335] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/09/2022] [Revised: 03/07/2023] [Accepted: 03/09/2023] [Indexed: 03/18/2023]
Abstract
Identifying patient cohorts meeting the criteria of specific phenotypes is essential in biomedicine and particularly timely in precision medicine. Many research groups deliver pipelines that automatically retrieve and analyze data elements from one or more sources to automate this task and deliver high-performing computable phenotypes. We applied a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines to conduct a thorough scoping review on computable clinical phenotyping. Five databases were searched using a query that combined the concepts of automation, clinical context, and phenotyping. Subsequently, four reviewers screened 7960 records (after removing over 4000 duplicates) and selected 139 that satisfied the inclusion criteria. This dataset was analyzed to extract information on target use cases, data-related topics, phenotyping methodologies, evaluation strategies, and portability of developed solutions. Most studies supported patient cohort selection without discussing the application to specific use cases, such as precision medicine. Electronic Health Records were the primary source in 87.1% (N = 121) of all studies, and International Classification of Diseases codes were heavily used in 55.4% (N = 77) of all studies; however, only 25.9% (N = 36) of the records described compliance with a common data model. In terms of the presented methods, traditional Machine Learning (ML) was the dominant method, often combined with natural language processing and other approaches, while external validation and portability of computable phenotypes were pursued in many cases. These findings revealed that defining target use cases precisely, moving away from sole ML strategies, and evaluating the proposed solutions in the real setting are essential opportunities for future work. There is also momentum and an emerging need for computable phenotyping to support clinical and epidemiological research and precision medicine.
Affiliation(s)
- Ting He
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
| | - Anas Belouali
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Jessica Patricoski
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Harold Lehmann
- Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Robert Ball
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US FDA, Silver Spring, MD, USA
| | - Valsamo Anagnostou
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Kory Kreimeyer
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Taxiarchis Botsis
- Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA; Biomedical Informatics and Data Science Section, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
| |
|
12
|
Venkatesh KP, Raza MM, Kvedar JC. Automating the overburdened clinical coding system: challenges and next steps. NPJ Digit Med 2023; 6:16. [PMID: 36737496 PMCID: PMC9898522 DOI: 10.1038/s41746-023-00768-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/16/2023] [Accepted: 01/27/2023] [Indexed: 02/05/2023] Open
|
13
|
Farajidavar N, O'Gallagher K, Bean D, Nabeebaccus A, Zakeri R, Bromage D, Kraljevic Z, Teo JTH, Dobson RJ, Shah AM. Diagnostic signature for heart failure with preserved ejection fraction (HFpEF): a machine learning approach using multi-modality electronic health record data. BMC Cardiovasc Disord 2022; 22:567. [PMID: 36567336 PMCID: PMC9791783 DOI: 10.1186/s12872-022-03005-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/15/2022] [Accepted: 12/12/2022] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Heart failure with preserved ejection fraction (HFpEF) is thought to be highly prevalent yet remains underdiagnosed. Evidence-based treatments are available that increase quality of life and decrease hospitalization. We sought to develop a data-driven diagnostic model to predict from electronic health records (EHR) the likelihood of HFpEF among patients with unexplained dyspnea and preserved left ventricular EF. METHODS AND RESULTS The derivation cohort comprised patients with dyspnea and echocardiography results. Structured and unstructured data were extracted using an automated informatics pipeline. Patients were retrospectively diagnosed as HFpEF (cases), non-HF (control cohort I), or HF with reduced EF (HFrEF; control cohort II). The ability of clinical parameters and investigations to discriminate cases from controls was evaluated by extreme gradient boosting. A likelihood scoring system was developed and validated in a separate test cohort. The derivation cohort included 1585 consecutive patients: 133 cases of HFpEF (9%), 194 non-HF cases (Control cohort I) and 1258 HFrEF cases (Control cohort II). Two HFpEF diagnostic signatures were derived, comprising symptoms, diagnoses and investigation results. A final prediction model was generated based on the averaged likelihood scores from these two models. In a validation cohort consisting of 269 consecutive patients [with 66 HFpEF cases (24.5%)], the diagnostic power of detecting HFpEF had an AUROC of 90% (P < 0.001) and average precision of 74%. CONCLUSION This diagnostic signature enables discrimination of HFpEF from non-cardiac dyspnea or HFrEF from EHR and can assist in the diagnostic evaluation in patients with unexplained dyspnea. This approach will enable identification of HFpEF patients who may then benefit from new evidence-based therapies.
Affiliation(s)
- Nazli Farajidavar
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
| | - Kevin O'Gallagher
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- King's College Hospital NHS Foundation Trust, London, UK
| | - Daniel Bean
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Health Data Research UK London, Institute of Health Informatics, University College London, London, UK
| | - Adam Nabeebaccus
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- King's College Hospital NHS Foundation Trust, London, UK
| | - Rosita Zakeri
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- King's College Hospital NHS Foundation Trust, London, UK
| | - Daniel Bromage
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- King's College Hospital NHS Foundation Trust, London, UK
| | - Zeljko Kraljevic
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
| | - James T H Teo
- King's College Hospital NHS Foundation Trust, London, UK
| | - Richard J Dobson
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Health Data Research UK London, Institute of Health Informatics, University College London, London, UK
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
| | - Ajay M Shah
- King's College London British Heart Foundation Centre of Excellence, School of Cardiovascular and Metabolic Medicine and Sciences, King's College London, James Black Centre, 125 Coldharbour Lane, London, SE5 9NU, UK.
- King's College Hospital NHS Foundation Trust, London, UK.
| |
|
14
|
Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022; 5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open
Abstract
Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, that clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
Affiliation(s)
- Honghan Wu
- Institute of Health Informatics, University College London, London, UK.
| | - Minhong Wang
- Institute of Health Informatics, University College London, London, UK
| | - Jinge Wu
- Institute of Health Informatics, University College London, London, UK
- Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Farah Francis
- Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Yun-Hsuan Chang
- Institute of Health Informatics, University College London, London, UK
| | - Alex Shavick
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
| | - Hang Dong
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Department of Computer Science, University of Oxford, Oxford, UK
| | | | | | - Adam P Levine
- Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
| | - Luke T Slater
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
| | - Alex Handy
- Institute of Health Informatics, University College London, London, UK
- University College London Hospitals NHS Trust, London, UK
| | - Andreas Karwath
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
| | - Georgios V Gkoutos
- Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
| | - Claude Chelala
- Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
| | - Anoop Dinesh Shah
- Institute of Health Informatics, University College London, London, UK
| | - Robert Stewart
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK
- South London and Maudsley NHS Foundation Trust, London, UK
| | - Nigel Collier
- Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
| | - Beatrice Alex
- Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
| | | | - Cathie Sudlow
- Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Angus Roberts
- Department of Biostatistics & Health Informatics, King's College London, London, UK
| | - Richard J B Dobson
- Institute of Health Informatics, University College London, London, UK
- Department of Biostatistics & Health Informatics, King's College London, London, UK
| |
|
15
|
Yang Z, Wang S, Rawat BPS, Mitra A, Yu H. Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding. PROCEEDINGS OF THE CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING. CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING 2022; 2022:1767-1781. [PMID: 36848298 PMCID: PMC9958514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023]
Abstract
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average length of 3,000+ tokens. This task is challenging due to a high-dimensional space of multi-label assignment (tens of thousands of ICD codes) and the long-tail challenge: only a few codes (common diseases) are frequently assigned while most codes (rare diseases) are infrequently assigned. This study addresses the long-tail challenge by adapting a prompt-based fine-tuning technique with label semantics, which has been shown to be effective under the few-shot setting. To further enhance the performance in the medical domain, we propose a knowledge-enhanced Longformer by injecting three kinds of domain-specific knowledge: hierarchy, synonym, and abbreviation, with additional pretraining using contrastive learning. Experiments on MIMIC-III-full, a benchmark dataset of code assignment, show that our proposed method outperforms the previous state-of-the-art method by 14.5% in macro F1 (from 10.3 to 11.8, P<0.001). To further test our model in the few-shot setting, we created a new rare diseases coding dataset, MIMIC-III-rare50, on which our model improves macro F1 from 17.1 to 30.4 and micro F1 from 17.2 to 32.6 compared to the previous method.
Affiliation(s)
- Zhichao Yang
- College of Information and Computer Sciences, University of Massachusetts Amherst
| | - Shufan Wang
- College of Information and Computer Sciences, University of Massachusetts Amherst
| | | | - Avijit Mitra
- College of Information and Computer Sciences, University of Massachusetts Amherst
| | - Hong Yu
- College of Information and Computer Sciences, University of Massachusetts Amherst
- Department of Computer Science, University of Massachusetts Lowell
| |
|
16
|
Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cabán T, Harris PA. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc 2022; 29:1642-1653. [PMID: 35818340 PMCID: PMC9382376 DOI: 10.1093/jamia/ocac105] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 08/09/2021] [Revised: 05/23/2022] [Accepted: 06/20/2022] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVES The HL7® fast healthcare interoperability resources (FHIR®) specification has emerged as the leading interoperability standard for the exchange of healthcare data. We conducted a scoping review to identify trends and gaps in the use of FHIR for clinical research. MATERIALS AND METHODS We reviewed published literature, federally funded project databases, application websites, and other sources to discover FHIR-based papers, projects, and tools (collectively, "FHIR projects") available to support clinical research activities. RESULTS Our search identified 203 different FHIR projects applicable to clinical research. Most were associated with preparations to conduct research, such as data mapping to and from FHIR formats (n = 66, 32.5%) and managing ontologies with FHIR (n = 30, 14.8%), or post-study data activities, such as sharing data using repositories or registries (n = 24, 11.8%), general research data sharing (n = 23, 11.3%), and management of genomic data (n = 21, 10.3%). With the exception of phenotyping (n = 19, 9.4%), fewer FHIR-based projects focused on needs within the clinical research process itself. DISCUSSION Funding and usage of FHIR-enabled solutions for research are expanding, but most projects appear focused on establishing data pipelines and linking clinical systems such as electronic health records, patient-facing data systems, and registries, possibly due to the relative newness of FHIR and the incentives for FHIR integration in health information systems. Fewer FHIR projects were associated with research-only activities. CONCLUSION The FHIR standard is becoming an essential component of the clinical research enterprise. To develop FHIR's full potential for clinical research, funding and operational stakeholders should address gaps in FHIR-based research tools and methods.
Affiliation(s)
- Stephany N Duda
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
| | - Nan Kennedy
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Douglas Conway
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Alex C Cheng
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
| | - Viet Nguyen
- Stratametrics LLC, Salt Lake City, Utah, USA
- HL7 Da Vinci Project, Ann Arbor, Michigan, USA
| | - Teresa Zayas-Cabán
- National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| | - Paul A Harris
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
| |
|
17
|
Abstract
In the medical field, text classification based on natural language processing (NLP) has shown good results and has strong prospects for practical application, including clinical value. However, most existing research focuses on English electronic medical record data, and there is less research on natural language processing tasks for Chinese electronic medical records. Most current Chinese electronic medical records are non-institutionalized texts, which generally have low utilization rates and inconsistent terminology, often mingling patients’ symptoms, medications, diagnoses, and other essential information. In this paper, we propose a Capsule network model for electronic medical record classification, which combines LSTM and GRU models and relies on a unique routing structure to extract complex Chinese medical text features. The experimental results show that this model outperforms several other baseline models and achieves excellent results with an F1 value of 73.51% on the Chinese electronic medical record dataset, at least 4.1% better than other baseline models.
|
18
|
Xu D, Miller T. A simple neural vector space model for medical concept normalization using concept embeddings. J Biomed Inform 2022; 130:104080. [PMID: 35472514 PMCID: PMC9351985 DOI: 10.1016/j.jbi.2022.104080] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/19/2022] [Revised: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 11/24/2022]
Abstract
OBJECTIVE Medical concept normalization (MCN), the task of linking textual mentions to concepts in an ontology, provides a solution to unify different ways of referring to the same concept. In this paper, we present a simple neural MCN model that takes mentions as input and directly predicts concepts. MATERIALS AND METHODS We evaluate our proposed model on clinical datasets from ShARe/CLEF eHealth 2013 shared task and 2019 n2c2/OHNLP shared task track 3. Our neural MCN model consists of an encoder and a normalized temperature-scaled softmax (NT-softmax) layer that maximizes the cosine similarity score of matching the mention to the correct concept. We adopt SAPBERT as the encoder and initialize the weights in the NT-softmax layer with pre-computed concept embeddings from SAPBERT. RESULTS Our proposed neural model achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. Yet this model is simpler than most prior work: it requires no complex pipelines, no hand-crafted rules, and no preprocessing, making it simpler to apply in new settings. DISCUSSION Analyses of our proposed model show that the NT-softmax is better than the conventional softmax on the MCN task, and both the CUI-less threshold parameter and the initialization of the weight vectors in the NT-softmax layer contribute to the improvements. CONCLUSION We propose a simple neural model for clinical MCN, a one-step approach with simpler inference and more effective performance than prior work. Our analyses demonstrate future work on MCN may require more effort on unseen concepts.
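Editor's illustration (not from the paper): the abstract describes scoring a mention against concept embeddings by cosine similarity, scaling by a temperature, and falling back to "CUI-less" below a threshold. A minimal sketch of that idea in plain Python follows. The vectors, the concept names, the temperature of 0.1, the threshold of 0.5, and the choice to apply the threshold to the raw cosine score are all assumptions made for illustration; the authors' actual encoder (SAPBERT) and decision rule are not reproduced here.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_softmax(mention_vec, concept_vecs, temperature=0.1):
    """Return (cosine similarities, softmax probabilities over concepts).

    Both the mention and the concept vectors are L2-normalized, so the dot
    product is the cosine similarity; dividing by the temperature sharpens
    the resulting distribution.
    """
    m = l2_normalize(mention_vec)
    sims = [sum(a * b for a, b in zip(m, l2_normalize(c))) for c in concept_vecs]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return sims, [e / total for e in exps]


def predict(mention_vec, concepts, temperature=0.1, cuiless_threshold=0.5):
    """Pick the most probable concept, or "CUI-less" if similarity is too low.

    `concepts` maps hypothetical concept names to embedding vectors.
    """
    names = list(concepts)
    sims, probs = nt_softmax(mention_vec, [concepts[n] for n in names], temperature)
    best = max(range(len(names)), key=lambda i: probs[i])
    if sims[best] < cuiless_threshold:
        return "CUI-less"
    return names[best]


# Toy two-dimensional embeddings, invented for this example.
toy_concepts = {"CONCEPT_A": [0.9, 0.1], "CONCEPT_B": [0.0, 1.0]}
print(predict([1.0, 0.0], toy_concepts))
```

In a trained model the concept vectors would be learned weights of the output layer rather than fixed toy values; the sketch only shows the scoring and thresholding step.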
Affiliation(s)
- Dongfang Xu
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School Boston, MA, USA.
| | - Timothy Miller
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School Boston, MA, USA
| |
|
19
|
Christy SM, Reich RR, Rathwell JA, Vadaparampil ST, Isaacs-Soriano KA, Friedman MS, Roetzheim RG, Giuliano AR. Using the Electronic Health Record to Characterize the Hepatitis C Virus Care Cascade. Public Health Rep 2022; 137:498-505. [PMID: 33831316 PMCID: PMC9109542 DOI: 10.1177/00333549211005812] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/31/2023] Open
Abstract
OBJECTIVES Chronic hepatitis C virus (HCV) infection is one of the main causes of hepatocellular carcinoma. Before initiating a multilevel HCV screening intervention, we sought to (1) describe concordance between the electronic health record (EHR) data warehouse and manual medical record review in recording aspects of HCV testing and treatment and (2) estimate the percentage of patients with chronic HCV infection who initiated and completed HCV treatment using manual medical record review. METHODS We examined the medical records for 177 patients (100 randomly selected patients born during 1945-1965 without evidence of HCV testing and 77 adult patients of any birth cohort who had completed HCV testing) with a primary care or relevant specialist visit at an academic health care system in Tampa, Florida, from 2015 through 2018. We used the Cohen κ coefficient to examine the degree of concordance between the searchable data warehouse and the medical record review abstractions. Descriptive statistics characterized referral to and receipt of treatment among patients with chronic HCV infection from medical record review. RESULTS We found generally good concordance between the data warehouse abstraction and medical record review for HCV testing data (κ ranged from 0.66 to 0.87). However, the data warehouse failed to capture data on HCV treatment variables. According to medical record review, 28 patients had chronic HCV infection; 16 patients were prescribed treatment, 14 initiated treatment, and 9 achieved and had a reported posttreatment undetected HCV viral load. CONCLUSIONS Using data warehouse data provides generally reliable HCV testing information. However, without the use of natural language processing and purposeful EHR design, manual medical record reviews will likely be required to characterize treatment initiation and completion.
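Editor's illustration (not from the paper): the concordance measure named in the abstract, the Cohen κ coefficient, compares observed agreement between two sources with the agreement expected by chance. The sketch below computes it in plain Python; the two rating lists are invented toy data standing in for "data warehouse" and "manual review" flags and do not reflect the study's records.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical ratings.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of
    agreement and p_e is the agreement expected from the marginal frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both sources used a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented example: 1 = "HCV test recorded", 0 = "not recorded".
warehouse = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
chart_review = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohen_kappa(warehouse, chart_review), 2))
```

With these toy lists the two sources agree on 8 of 10 records and chance agreement is 0.5, so κ comes out at about 0.6, just below the range the abstract reports for testing data.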
Affiliation(s)
- Shannon M. Christy
- Department of Health Outcomes and Behavior, Division of Population Science, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Gastrointestinal Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Oncologic Sciences, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
| | - Richard R. Reich
- Biostatistics and Bioinformatics Shared Resource, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
| | - Julie A. Rathwell
- Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Cancer Epidemiology, Division of Population Science, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
| | - Susan T. Vadaparampil
- Department of Health Outcomes and Behavior, Division of Population Science, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Oncologic Sciences, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
| | - Kimberly A. Isaacs-Soriano
- Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Cancer Epidemiology, Division of Population Science, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
| | - Mark S. Friedman
- Department of Gastrointestinal Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Oncologic Sciences, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
| | - Richard G. Roetzheim
- Department of Health Outcomes and Behavior, Division of Population Science, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Center for Immunization and Infection Research in Cancer, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Family Medicine, Morsani College of Medicine,
University of South Florida, Tampa, FL, USA
| | - Anna R. Giuliano
- Center for Immunization and Infection Research in Cancer, H. Lee
Moffitt Cancer Center and Research Institute, Tampa, FL, USA
- Department of Oncologic Sciences, Morsani College of Medicine,
University of South Florida, Tampa, FL, USA
- Department of Cancer Epidemiology, Division of Population Science,
H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
|
20
|
Schreiber S, Irving PM, Sharara AI, Martín-Arranz MD, Hébuterne X, Penchev P, Danese S, Anthopoulos P, Akhundova-Unadkat G, Baert F. Review article: randomised controlled trials in inflammatory bowel disease - common challenges and potential solutions. Aliment Pharmacol Ther 2022; 55:658-669. [PMID: 35132657 DOI: 10.1111/apt.16781] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/05/2021] [Revised: 11/19/2021] [Accepted: 01/10/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND Recruitment rates for Crohn's disease and ulcerative colitis clinical trials continue to decrease annually. The inability to reach recruitment targets and complete trials has serious implications for stakeholders in the inflammatory bowel disease (IBD) community. Action is required to ensure patients with an unmet medical need have access to new therapies to improve the management of their IBD. AIMS Identify challenges contributing to recruitment decline in IBD clinical trials and propose potential solutions. METHODS PubMed and Google were used to identify literature, regulatory guidelines and conference proceedings related to IBD clinical trials and related concepts. Data on IBD clinical trials conducted between 1989 and 2020 were extracted from the Trialtrove database. RESULTS Key aspects that may improve recruitment rates were identified. An increasingly patient-centric approach should be taken to study design including improvements to the readability of key trial documentation and inclusion of patient representatives in trial planning. Placebo is unappealing to patients; approaches including platform trials should be explored to minimise placebo exposure. Non-invasive imaging, biomarkers and novel digital endpoints should continue to be examined to reduce the burden on patients. Reducing the administrative burden associated with trials via the use of electronic signatures, for example, may benefit study sites and investigators. Changes implemented to IBD trials during the COVID-19 pandemic provided examples of how trial conduct can be rapidly and constructively adapted. CONCLUSIONS To improve recruitment in Crohn's disease and ulcerative colitis trials, the IBD community should address a broad range of issues related to clinical trial conduct.
Affiliation(s)
- Stefan Schreiber
- Department of Internal Medicine I, University Hospital Schleswig-Holstein, Christian-Albrechts-University, Kiel, Germany
- Ala I Sharara
- Division of Gastroenterology, Department of Internal Medicine, American University of Beirut Medical Center, Beirut, Lebanon
- María Dolores Martín-Arranz
- Department of Gastroenterology, La Paz University Hospital, Madrid, Spain; School of Medicine, Universidad Autónoma de Madrid, Madrid, Spain; Institute for Health Research, La Paz Hospital, Madrid, Spain
- Xavier Hébuterne
- Department of Gastroenterology and Clinical Nutrition, CHU of Nice and University Côte d'Azur, Nice, France
- Plamen Penchev
- Department of Gastroenterology, Medical University of Sofia, Sofia, Bulgaria
- Silvio Danese
- Gastroenterology and Endoscopy, IRCCS Ospedale San Raffaele and University Vita-Salute San Raffaele, Milan, Italy
- Filip Baert
- Department of Gastroenterology, AZ Delta, Roeselare, Belgium
|
21
|
Deep contextual multi-task feature fusion for enhanced concept, negation and speculation detection from clinical notes. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] Open
|
22
|
Vorisek CN, Lehne M, Klopfenstein SAI, Mayer PJ, Bartschke A, Haese T, Thun S. Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: A Systematic Review (Preprint). JMIR Med Inform 2021; 10:e35724. [PMID: 35852842 PMCID: PMC9346559 DOI: 10.2196/35724] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 12/29/2021] [Revised: 04/22/2022] [Accepted: 05/18/2022] [Indexed: 01/04/2023] Open
Abstract
Background The standard Fast Healthcare Interoperability Resources (FHIR) is widely used in health information technology. However, its use as a standard for health research is still less prevalent. To use existing data sources more efficiently for health research, data interoperability becomes increasingly important. FHIR provides solutions by offering resource domains such as “Public Health & Research” and “Evidence-Based Medicine” while using already established web technologies. Therefore, FHIR could help standardize data across different data sources and improve interoperability in health research. Objective The aim of our study was to provide a systematic review of existing literature and determine the current state of FHIR implementations in health research and possible future directions. Methods We searched the PubMed/MEDLINE, Embase, Web of Science, IEEE Xplore, and Cochrane Library databases for studies published from 2011 to 2022. Studies investigating the use of FHIR in health research were included. Articles published before 2011, abstracts, reviews, editorials, and expert opinions were excluded. We followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines and registered this study with PROSPERO (CRD42021235393). Data synthesis was done in tables and figures. Results We identified a total of 998 studies, of which 49 studies were eligible for inclusion. Of the 49 studies, most (73%, n=36) covered the domain of clinical research, whereas the remaining studies focused on public health or epidemiology (6%, n=3) or did not specify their research domain (20%, n=10). Studies used FHIR for data capture (29%, n=14), standardization of data (41%, n=20), analysis (12%, n=6), recruitment (14%, n=7), and consent management (4%, n=2). 
Most (55%, 27/49) of the studies had a generic approach, and 55% (12/22) of the studies focusing on specific medical specialties (infectious disease, genomics, oncology, environmental health, imaging, and pulmonary hypertension) reported their solutions to be conferrable to other use cases. Most (63%, 31/49) of the studies reported using additional data models or terminologies: Systematized Nomenclature of Medicine Clinical Terms (29%, n=14), Logical Observation Identifiers Names and Codes (37%, n=18), International Classification of Diseases 10th Revision (18%, n=9), Observational Medical Outcomes Partnership common data model (12%, n=6), and others (43%, n=21). Only 4 (8%) studies used a FHIR resource from the domain “Public Health & Research.” Limitations using FHIR included the possible change in the content of FHIR resources, safety, legal matters, and the need for a FHIR server. Conclusions Our review found that FHIR can be implemented in health research, and the areas of application are broad and generalizable in most use cases. The implementation of international terminologies was common, and other standards such as the Observational Medical Outcomes Partnership common data model could be used as a complement to FHIR. Limitations such as the change of FHIR content, lack of FHIR implementation, safety, and legal matters need to be addressed in future releases to expand the use of FHIR and, therefore, interoperability in health research.
Affiliation(s)
- Carina Nina Vorisek
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Moritz Lehne
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Sophie Anne Ines Klopfenstein
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Institute for Medical Informatics, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Paula Josephine Mayer
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Alexander Bartschke
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Thomas Haese
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Sylvia Thun
- Core Facility Digital Medicine and Interoperability, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
|
23
|
Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak 2021; 21:352. [PMID: 34922517 PMCID: PMC8684237 DOI: 10.1186/s12911-021-01706-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/27/2021] [Accepted: 12/01/2021] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute. RESULTS We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, comparing their pros and cons with those of MedTAG. We highlight that MedTAG is one of the very few open-source tools provided with an open license and a straightforward installation procedure supporting cross-platform use. CONCLUSIONS MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 of the 22 criteria specified in the same study.
Affiliation(s)
- Fabio Giachelle
- Department of Information Engineering, University of Padua, Padua, Italy
- Ornella Irrera
- Department of Information Engineering, University of Padua, Padua, Italy
- Gianmaria Silvello
- Department of Information Engineering, University of Padua, Padua, Italy
|
24
|
Mirza L, Das-Munshi J, Chaturvedi J, Wu H, Kraljevic Z, Searle T, Shaari S, Mascio A, Skiada N, Roberts A, Bean D, Stewart R, Dobson R, Bendayan R. Investigating the association between physical health comorbidities and disability in individuals with severe mental illness. Eur Psychiatry 2021; 64:e77. [PMID: 34842128 PMCID: PMC8727716 DOI: 10.1192/j.eurpsy.2021.2255] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/08/2021] [Revised: 11/12/2021] [Accepted: 11/13/2021] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Research suggests that an increased risk of physical comorbidities might have a key role in the association between severe mental illness (SMI) and disability. We examined the association between physical multimorbidity and disability in individuals with SMI. METHODS Data were extracted from the clinical record interactive search system at South London and Maudsley Biomedical Research Centre. Our sample (n = 13,933) consisted of individuals who had received a primary or secondary SMI diagnosis between 2007 and 2018 and had available data for the Health of the Nation Outcome Scales (HoNOS) as the disability measure. Physical comorbidities were defined using Chapters II-XIV of the International Classification of Diseases (ICD-10). RESULTS More than 60% of the sample had complex multimorbidity. The most commonly affected organ systems were neurological (34.7%), dermatological (15.4%), and circulatory (14.8%). All specific comorbidities (ICD-10 Chapters) were associated with higher levels of disability (HoNOS total scores). Individuals with musculoskeletal, skin/dermatological, respiratory, endocrine, neurological, hematological, or circulatory disorders had significant difficulties in more than five HoNOS domains, while others had fewer domains affected. CONCLUSIONS Individuals with SMI and musculoskeletal, skin/dermatological, respiratory, endocrine, neurological, hematological, or circulatory disorders are at higher risk of disability compared to those who do not have those comorbidities. Individuals with SMI and physical comorbidities are at greater risk of reporting difficulties associated with activities of daily living, hallucinations, and cognitive functioning. Therefore, these should be targeted for prevention and intervention programs.
Affiliation(s)
- Luwaiza Mirza
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Jayati Das-Munshi
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Jaya Chaturvedi
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Honghan Wu
- Health Data Research UK London, University College London, London, United Kingdom
- Institute of Health Informatics, University College London, London, United Kingdom
- Zeljko Kraljevic
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Thomas Searle
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Shaweena Shaari
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Aurelie Mascio
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Naoko Skiada
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Angus Roberts
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Health Data Research UK London, University College London, London, United Kingdom
- Institute of Health Informatics, University College London, London, United Kingdom
- Daniel Bean
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Health Data Research UK London, University College London, London, United Kingdom
- Robert Stewart
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Richard Dobson
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
- Health Data Research UK London, University College London, London, United Kingdom
- Institute of Health Informatics, University College London, London, United Kingdom
- Rebecca Bendayan
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, United Kingdom
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, United Kingdom
|
25
|
Dong H, Suarez-Paniagua V, Zhang H, Wang M, Whitfield E, Wu H. Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2021; 2021:2294-2298. [PMID: 34891745 DOI: 10.1109/embc46164.2021.9630043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The identification of rare diseases from clinical notes with Natural Language Processing (NLP) is challenging due to the few cases available for machine learning and the need for data annotation by clinical experts. We propose a method using ontologies and weak supervision. The approach includes two steps: (i) Text-to-UMLS, linking text mentions to concepts in the Unified Medical Language System (UMLS), with a named entity linking tool (e.g. SemEHR) and weak supervision based on customised rules and contextual representations from Bidirectional Encoder Representations from Transformers (BERT), and (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). Using MIMIC-III US intensive care discharge summaries as a case study, we show that the Text-to-UMLS process can be greatly improved with weak supervision, without any annotated data from domain experts. Our analysis shows that the overall pipeline processing discharge summaries can surface rare disease cases, which are mostly uncaptured in manual ICD codes of the hospital admissions.
|
26
|
Newman-Griffis D, Divita G, Desmet B, Zirikly A, Rosé CP, Fosler-Lussier E. Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets. J Am Med Inform Assoc 2021; 28:516-532. [PMID: 33319905 DOI: 10.1093/jamia/ocaa269] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/11/2020] [Revised: 09/13/2020] [Accepted: 11/17/2020] [Indexed: 12/18/2022] Open
Abstract
OBJECTIVES Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity (words or phrases that may refer to different concepts) has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research. MATERIALS AND METHODS We identified ambiguous strings in datasets derived from the 2 available clinical corpora for concept normalization and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets with potential ambiguity in the Unified Medical Language System (UMLS) to assess how representative available datasets are of ambiguity in clinical language. RESULTS We found that <15% of strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity. The percentage of strings in common between any pair of datasets ranged from 2% to only 36%; of these, 40% were annotated with different sets of concepts, severely limiting generalization. Finally, we observed 12 distinct types of ambiguity, distributed unequally across the available datasets, reflecting diverse linguistic and medical phenomena. DISCUSSION Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods. 
CONCLUSIONS Our findings identify 3 opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.
Affiliation(s)
- Denis Newman-Griffis
- Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA; Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
- Guy Divita
- Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Bart Desmet
- Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Ayah Zirikly
- Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA
- Carolyn P Rosé
- Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA; Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Eric Fosler-Lussier
- Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA
|
27
|
Zeng K, Xu Y, Lin G, Liang L, Hao T. Automated classification of clinical trial eligibility criteria text based on ensemble learning and metric learning. BMC Med Inform Decis Mak 2021; 21:129. [PMID: 34330259 PMCID: PMC8323220 DOI: 10.1186/s12911-021-01492-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/15/2021] [Accepted: 04/08/2021] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Eligibility criteria are the primary strategy for screening the target participants of a clinical trial. Automated classification of clinical trial eligibility criteria text by using machine learning methods improves recruitment efficiency to reduce the cost of clinical research. However, existing methods suffer from poor classification performance due to the complexity and imbalance of eligibility criteria text data. METHODS An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models including Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal Loss is used as a loss function to address the data imbalance problem. Metric learning is employed to train the embedding of each base model for feature discrimination. Soft Voting is applied to achieve final classification of the ensemble model. The dataset is from the standard evaluation task 3 of the 5th China Health Information Processing Conference, containing 38,341 eligibility criteria texts in 44 categories. RESULTS Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that our model achieved a significant improvement. CONCLUSIONS A model for classifying eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that the classification performance was improved by our ensemble model significantly. 
In addition, metric learning was able to improve word embedding representation and the focal loss reduced the impact of data imbalance on model performance.
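Two of the components named in this abstract, focal loss and soft voting, can be sketched briefly. This is an illustrative sketch only: it uses the standard per-example focal loss, -(1 - p_t)^γ · log(p_t), without a class-weighting term, and the function names, the γ value, and the probability vectors are assumptions rather than details taken from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example, given predicted class probabilities.

    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    examples, so rare, hard classes weigh more during training.
    gamma = 0 gives ordinary cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def soft_vote(model_probs):
    """Average the class probabilities of several models and pick the argmax."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    averaged = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged


# Hypothetical outputs of three base models for a two-class example.
predictions = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label, averaged = soft_vote(predictions)

easy = focal_loss([0.9, 0.1], target=0)  # confident and correct: small loss
hard = focal_loss([0.3, 0.7], target=0)  # true class under-predicted: larger loss
```

Here the first model alone would choose class 0, but the averaged probabilities favour class 1, which is the point of soft voting over picking any single model's output.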
Affiliation(s)
- Kun Zeng
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
- Yibin Xu
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
- Ge Lin
- National Engineering Research Center of Digital Life, Sun Yat-Sen University, Guangzhou, China
- Likeng Liang
- School of Computer Science, South China Normal University, Guangzhou, China
- Tianyong Hao
- School of Computer Science, South China Normal University, Guangzhou, China
|
28
|
Canales L, Menke S, Marchesseau S, D'Agostino A, Del Rio-Bermudez C, Taberna M, Tello J. Assessing the Performance of Clinical Natural Language Processing Systems: Development of an Evaluation Methodology. JMIR Med Inform 2021; 9:e20492. [PMID: 34297002 PMCID: PMC8367121 DOI: 10.2196/20492] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 05/20/2020] [Revised: 07/31/2020] [Accepted: 06/17/2021] [Indexed: 12/22/2022] Open
Abstract
Background Clinical natural language processing (cNLP) systems are of crucial importance due to their increasing capability in extracting clinically important information from free text contained in electronic health records (EHRs). The conversion of a nonstructured representation of a patient’s clinical history into a structured format enables medical doctors to generate clinical knowledge at a level that was not possible before. Finally, the interpretation of the insights provided by cNLP systems has great potential to drive decisions about clinical practice. However, carrying out robust evaluations of those cNLP systems is a complex task that is hindered by a lack of standard guidance on how to systematically approach them. Objective Our objective was to offer natural language processing (NLP) experts a methodology for the evaluation of cNLP systems to assist them in carrying out this task. By following the proposed phases, the robustness and representativeness of the performance metrics of their own cNLP systems can be assured. Methods The proposed evaluation methodology comprised five phases: (1) the definition of the target population, (2) the statistical document collection, (3) the design of the annotation guidelines and annotation project, (4) the external annotations, and (5) the cNLP system performance evaluation. We presented the application of all phases to evaluate the performance of a cNLP system called “EHRead Technology” (developed by Savana, an international medical company), applied in a study on patients with asthma. As part of the evaluation methodology, we introduced the Sample Size Calculator for Evaluations (SLiCE), a software tool that calculates the number of documents needed to achieve a statistically useful and resourceful gold standard. Results The application of the proposed evaluation methodology on a real use-case study of patients with asthma revealed the benefit of the different phases for cNLP system evaluations. 
By using SLiCE to adjust the number of documents needed, a meaningful and resourceful gold standard was created. In the presented use-case, using as few as 519 EHRs, it was possible to evaluate the performance of the cNLP system and obtain performance metrics for the primary variable within the expected CIs. Conclusions We showed that our evaluation methodology can offer guidance to NLP experts on how to approach the evaluation of their cNLP systems. By following the five phases, NLP experts can assure the robustness of their evaluation and avoid unnecessary investment of human and financial resources. Besides the theoretical guidance, we offer SLiCE as an easy-to-use, open-source Python library.
Affiliation(s)
- Lea Canales
- Department of Software and Computing System, University of Alicante, Alicante, Spain
|
29
|
Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, Mascio A, Zhu L, Folarin AA, Roberts A, Bendayan R, Richardson MP, Stewart R, Shah AD, Wong WK, Ibrahim Z, Teo JT, Dobson RJB. Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit. Artif Intell Med 2021; 117:102083. [PMID: 34127232 DOI: 10.1016/j.artmed.2021.102083] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Received: 10/07/2020] [Revised: 03/24/2021] [Accepted: 04/28/2021] [Indexed: 11/30/2022]
Abstract
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of information extraction (IE) technologies to enable clinical analysis. We present the open source Medical Concept Annotation Toolkit (MedCAT) that provides: (a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; (b) a feature-rich annotation interface for customizing and training IE models; and (c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ∼8.8B words from ∼17M clinical records and further fine-tuning with ∼6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
Affiliation(s)
- Zeljko Kraljevic
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Thomas Searle
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Anthony Shek
- Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Lukasz Roguski
- Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK
- Kawsar Noor
- Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK
- Daniel Bean
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK
- Aurelie Mascio
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Leilei Zhu
- Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK
- Amos A Folarin
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Angus Roberts
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Rebecca Bendayan
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Mark P Richardson
- Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Robert Stewart
- Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK
- Anoop D Shah
- Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK
- Wai Keong Wong
- Institute of Health Informatics, University College London, London, UK; NIHR BRC Clinical Research Informatics Unit, University College London Hospitals, NHS Foundation Trust, London, UK
- Zina Ibrahim
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- James T Teo
- Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Department of Neurology, King's College Hospital NHS Foundation Trust, London, UK
- Richard J B Dobson
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK; Health Data Research UK London, University College London, London, UK; Institute of Health Informatics, University College London, London, UK; NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, UK.
30
Rannikmäe K, Wu H, Tominey S, Whiteley W, Allen N, Sudlow C. Developing automated methods for disease subtyping in UK Biobank: an exemplar study on stroke. BMC Med Inform Decis Mak 2021; 21:191. [PMID: 34130677 PMCID: PMC8204419 DOI: 10.1186/s12911-021-01556-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/25/2021] [Accepted: 06/08/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Better phenotyping of routinely collected coded data would be useful for research and health improvement. For example, the precision of coded data for hemorrhagic stroke (intracerebral hemorrhage [ICH] and subarachnoid hemorrhage [SAH]) may be as poor as < 50%. This work aimed to investigate the feasibility and added value of automated methods applied to clinical radiology reports to improve stroke subtyping. METHODS From a sub-population of 17,249 Scottish UK Biobank participants, we ascertained those with an incident stroke code in hospital, death record or primary care administrative data by September 2015, and ≥ 1 clinical brain scan report. We used a combination of natural language processing and clinical knowledge inference on brain scan reports to assign a stroke subtype (ischemic vs ICH vs SAH) for each participant and assessed performance by precision and recall at entity and patient levels. RESULTS Of 225 participants with an incident stroke code, 207 had a relevant brain scan report and were included in this study. Entity level precision and recall ranged from 78 to 100%. Automated methods showed precision and recall at patient level that were very good for ICH (both 89%), good for SAH (both 82%), but, as expected, lower for ischemic stroke (73%, and 64%, respectively), suggesting coded data remains the preferred method for identifying the latter stroke subtype. CONCLUSIONS Our automated method applied to radiology reports provides a feasible, scalable and accurate solution to improve disease subtyping when used in conjunction with administrative coded health data. Future research should validate these findings in a different population setting.
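As an illustration of how per-subtype precision and recall at the patient level can be computed, the following sketch compares one gold label per patient with one predicted label. The labels and the toy data are hypothetical and are not drawn from the study.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str], predicted: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Precision and recall for one label, one gold/predicted pair per patient."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    # Return 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical patient-level subtype assignments.
gold = ["ICH", "ICH", "SAH", "ischemic", "ischemic"]
pred = ["ICH", "ischemic", "SAH", "ischemic", "ICH"]
for subtype in ("ICH", "SAH", "ischemic"):
    print(subtype, precision_recall(gold, pred, subtype))
```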
Affiliation(s)
- Kristiina Rannikmäe
- Centre for Medical Informatics, University of Edinburgh, NINE Edinburgh BioQuarter, 9 Little France Road, Edinburgh, EH16 4UX, UK
- Health Data Research UK, London, UK
- Honghan Wu
- Health Data Research UK, London, UK
- Institute of Health Informatics, University College London, London, UK
- William Whiteley
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, UK
- Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Naomi Allen
- Nuffield Department of Population Health, University of Oxford, Oxford, UK
- UK Biobank, Stockport, UK
- Cathie Sudlow
- Centre for Medical Informatics, University of Edinburgh, NINE Edinburgh BioQuarter, 9 Little France Road, Edinburgh, EH16 4UX, UK
- Health Data Research UK, London, UK
- BHF Data Science Centre, London, UK
31
Park J, You SC, Jeong E, Weng C, Park D, Roh J, Lee DY, Cheong JY, Choi JW, Kang M, Park RW. A Framework (SOCRATex) for Hierarchical Annotation of Unstructured Electronic Health Records and Integration Into a Standardized Medical Database: Development and Usability Study. JMIR Med Inform 2021; 9:e23983. [PMID: 33783361 PMCID: PMC8044740 DOI: 10.2196/23983] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/31/2020] [Revised: 11/14/2020] [Accepted: 01/23/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Although electronic health records (EHRs) have been widely used in secondary assessments, clinical documents are relatively less utilized owing to the lack of standardized clinical text frameworks across different institutions. OBJECTIVE This study aimed to develop a framework for processing unstructured clinical documents of EHRs and integration with standardized structured data. METHODS We developed a framework known as Staged Optimization of Curation, Regularization, and Annotation of clinical text (SOCRATex). SOCRATex has the following four aspects: (1) extracting clinical notes for the target population and preprocessing the data, (2) defining the annotation schema with a hierarchical structure, (3) performing document-level hierarchical annotation using the annotation schema, and (4) indexing annotations for a search engine system. To test the usability of the proposed framework, proof-of-concept studies were performed on EHRs. We defined three distinctive patient groups and extracted their clinical documents (ie, pathology reports, radiology reports, and admission notes). The documents were annotated and integrated into the Observational Medical Outcomes Partnership (OMOP)-common data model (CDM) database. The annotations were used for creating Cox proportional hazard models with different settings of clinical analyses to measure (1) all-cause mortality, (2) thyroid cancer recurrence, and (3) 30-day hospital readmission. RESULTS Overall, 1055 clinical documents of 953 patients were extracted and annotated using the defined annotation schemas. The generated annotations were indexed into an unstructured textual data repository. Using the annotations of pathology reports, we identified that node metastasis and lymphovascular tumor invasion were associated with all-cause mortality among colon and rectum cancer patients (both P=.02). 
The other analyses involving measuring thyroid cancer recurrence using radiology reports and 30-day hospital readmission using admission notes in depressive disorder patients also showed results consistent with previous findings. CONCLUSIONS We propose a framework for hierarchical annotation of textual data and integration into a standardized OMOP-CDM medical database. The proof-of-concept studies demonstrated that our framework can effectively process and integrate diverse clinical documents with standardized structured data for clinical research.
Affiliation(s)
- Jimyung Park
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Seng Chan You
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Seoul, Republic of Korea
- Eugene Jeong
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, United States
- Dongsu Park
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Jin Roh
- Department of Pathology, Ajou University Hospital, Suwon, Republic of Korea
- Dong Yun Lee
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Jae Youn Cheong
- Department of Gastroenterology, Ajou University School of Medicine, Suwon, Republic of Korea
- Jin Wook Choi
- Department of Radiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Mira Kang
- Department of Digital Health, Samsung Advanced Institute for Health Sciences & Technology, Sungkyunkwan University, Seoul, Republic of Korea
- Rae Woong Park
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
32
Ford E, Curlewis K, Squires E, Griffiths LJ, Stewart R, Jones KH. The Potential of Research Drawing on Clinical Free Text to Bring Benefits to Patients in the United Kingdom: A Systematic Review of the Literature. Front Digit Health 2021; 3:606599. [PMID: 34713089 PMCID: PMC8521813 DOI: 10.3389/fdgth.2021.606599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2020] [Accepted: 01/15/2021] [Indexed: 11/13/2022] Open
Abstract
Background: The analysis of clinical free text from patient records for research has potential to contribute to the medical evidence base but access to clinical free text is frequently denied by data custodians who perceive that the privacy risks of data-sharing are too high. Engagement activities with patients and regulators, where views on the sharing of clinical free text data for research have been discussed, have identified that stakeholders would like to understand the potential clinical benefits that could be achieved if access to free text for clinical research were improved. We aimed to systematically review all UK research studies which used clinical free text and report direct or potential benefits to patients, synthesizing possible benefits into an easy to communicate taxonomy for public engagement and policy discussions. Methods: We conducted a systematic search for articles which reported primary research using clinical free text, drawn from UK health record databases, which reported a benefit or potential benefit for patients, actionable in a clinical environment or health service, and not solely methods development or data quality improvement. We screened eligible papers and thematically analyzed information about clinical benefits reported in the paper to create a taxonomy of benefits. Results: We identified 43 papers and derived five themes of benefits: health-care quality or services improvement, observational risk factor-outcome research, drug prescribing safety, case-finding for clinical trials, and development of clinical decision support. Five papers compared study quality with and without free text and found an improvement of accuracy when free text was included in analytical models. Conclusions: Findings will help stakeholders weigh the potential benefits of free text research against perceived risks to patient privacy. 
The taxonomy can be used to aid public and policy discussions, and identified studies could form a public-facing repository which will help the health-care text analysis research community better communicate the impact of their work.
Affiliation(s)
- Elizabeth Ford
- Department of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, United Kingdom
- Keegan Curlewis
- Department of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, United Kingdom
- Emma Squires
- Swansea Medical School, University of Swansea, Swansea, United Kingdom
- Lucy J. Griffiths
- Swansea Medical School, University of Swansea, Swansea, United Kingdom
- Robert Stewart
- King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
- Kerina H. Jones
- Swansea Medical School, University of Swansea, Swansea, United Kingdom
33
DeLozier S, Speltz P, Brito J, Tang LA, Wang J, Smith JC, Giuse D, Phillips E, Williams K, Strickland T, Davogustto G, Roden D, Denny JC. Real-time clinical note monitoring to detect conditions for rapid follow-up: A case study of clinical trial enrollment in drug-induced torsades de pointes and Stevens-Johnson syndrome. J Am Med Inform Assoc 2021; 28:126-131. [PMID: 33120413 DOI: 10.1093/jamia/ocaa213] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/17/2020] [Revised: 07/07/2020] [Accepted: 08/20/2020] [Indexed: 11/13/2022] Open
Abstract
Identifying acute events as they occur is challenging in large hospital systems. Here, we describe an automated method to detect 2 rare adverse drug events (ADEs), drug-induced torsades de pointes and Stevens-Johnson syndrome and toxic epidermal necrolysis, in near real time for participant recruitment into prospective clinical studies. A text processing system searched clinical notes from the electronic health record (EHR) for relevant keywords and alerted study personnel via email of potential patients for chart review or in-person evaluation. Between 2016 and 2018, the automated recruitment system resulted in capture of 138 true cases of drug-induced rare events, improving recall from 43% to 93%. Our focused electronic alert system maintained 2-year enrollment, including across an EHR migration from a bespoke system to Epic. Real-time monitoring of EHR notes may accelerate research for certain conditions less amenable to conventional study recruitment paradigms.
Affiliation(s)
- Sarah DeLozier
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Peter Speltz
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jason Brito
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Leigh Anne Tang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Janey Wang
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Joshua C Smith
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Dario Giuse
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Elizabeth Phillips
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Kristina Williams
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Teresa Strickland
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Giovanni Davogustto
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Dan Roden
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
34
Xiong Y, Chen S, Chen Q, Yan J, Tang B. Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study. JMIR Med Inform 2020; 8:e23357. [PMID: 33372664 PMCID: PMC7803475 DOI: 10.2196/23357] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/10/2020] [Revised: 11/10/2020] [Accepted: 11/16/2020] [Indexed: 12/03/2022] Open
Abstract
Background With the popularity of electronic health records (EHRs), the quality of health care has been improved. However, there are also some problems caused by EHRs, such as the growing use of copy-and-paste and templates, resulting in EHRs of low quality in content. In order to minimize data redundancy in different documents, Harvard Medical School and Mayo Clinic organized a national natural language processing (NLP) clinical challenge (n2c2) on clinical semantic textual similarity (ClinicalSTS) in 2019. The task of this challenge is to compute the semantic similarity among clinical text snippets. Objective In this study, we aim to investigate novel methods to model ClinicalSTS and analyze the results. Methods We propose a semantically enhanced text matching model for the 2019 n2c2/Open Health NLP (OHNLP) challenge on ClinicalSTS. The model includes 3 representation modules to encode clinical text snippet pairs at different levels: (1) character-level representation module based on convolutional neural network (CNN) to tackle the out-of-vocabulary problem in NLP; (2) sentence-level representation module that adopts a pretrained language model bidirectional encoder representation from transformers (BERT) to encode clinical text snippet pairs; and (3) entity-level representation module to model clinical entity information in clinical text snippets. In the case of entity-level representation, we compare 2 methods. One encodes entities by the entity-type label sequence corresponding to text snippet (called entity I), whereas the other encodes entities by their representation in MeSH, a knowledge graph in the medical domain (called entity II). Results We conduct experiments on the ClinicalSTS corpus of the 2019 n2c2/OHNLP challenge for model performance evaluation. The model only using BERT for text snippet pair encoding achieved a Pearson correlation coefficient (PCC) of 0.848. 
When character-level representation and entity-level representation are individually added into our model, the PCC increased to 0.857 and 0.854 (entity I)/0.859 (entity II), respectively. When both character-level representation and entity-level representation are added into our model, the PCC further increased to 0.861 (entity I) and 0.868 (entity II). Conclusions Experimental results show that both character-level information and entity-level information can effectively enhance the BERT-based STS model.
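The Pearson correlation coefficient reported above compares predicted similarity scores with gold scores. A minimal pure-Python sketch follows; the function name and the example scores are hypothetical and do not come from the ClinicalSTS corpus.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold and predicted similarity scores on a 0-5 scale.
gold_scores = [0.0, 1.5, 3.0, 4.5, 5.0]
model_scores = [0.4, 1.2, 2.9, 4.0, 4.8]
print(round(pearson(gold_scores, model_scores), 3))
```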
Affiliation(s)
- Ying Xiong
- Harbin Institute of Technology, Shenzhen, China
- Shuai Chen
- Harbin Institute of Technology, Shenzhen, China
- Qingcai Chen
- Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
- Jun Yan
- Yidu Cloud Technology Company Limited, Beijing, China
- Buzhou Tang
- Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China
35
Kersloot MG, van Putten FJP, Abu-Hanna A, Cornet R, Arts DL. Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. J Biomed Semantics 2020; 11:14. [PMID: 33198814 PMCID: PMC7670625 DOI: 10.1186/s13326-020-00231-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 07/17/2020] [Accepted: 11/03/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations. METHODS Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized by way of induction. These results were used to define recommendations. RESULTS Two thousand three hundred fifty five unique studies were identified. Two hundred fifty six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. 
A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed. CONCLUSION We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
Affiliation(s)
- Martijn G. Kersloot
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
- Florentien J. P. van Putten
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ameen Abu-Hanna
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ronald Cornet
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Derk L. Arts
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
36
Application of BERT to Enable Gene Classification Based on Clinical Evidence. BioMed Research International 2020; 2020:5491963. [PMID: 33083472 PMCID: PMC7563092 DOI: 10.1155/2020/5491963] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/31/2020] [Revised: 08/31/2020] [Accepted: 09/07/2020] [Indexed: 12/29/2022]
Abstract
The identification of profiled cancer-related genes plays an essential role in cancer diagnosis and treatment. Based on literature research, the classification of genetic mutations continues to be done manually nowadays. Manual classification of genetic mutations is pathologist-dependent, subjective, and time-consuming. To improve the accuracy of clinical interpretation, scientists have proposed computational-based approaches for automatic analysis of mutations with the advent of next-generation sequencing technologies. Nevertheless, some challenges, such as multiple classifications, the complexity of texts, redundant descriptions, and inconsistent interpretation, have limited the development of algorithms. To overcome these difficulties, we have adapted a deep learning method named Bidirectional Encoder Representations from Transformers (BERT) to classify genetic mutations based on text evidence from an annotated database. During the training, three challenging features such as the extreme length of texts, biased data presentation, and high repeatability were addressed. Finally, the BERT+abstract demonstrates satisfactory results with 0.80 logarithmic loss, 0.6837 recall, and 0.705 F-measure. It is feasible for BERT to classify the genomic mutation text within literature-based datasets. Consequently, BERT is a practical tool for facilitating and significantly speeding up cancer research towards tumor progression, diagnosis, and the design of more precise and effective treatments.
37
Liu S, Wang Y, Wen A, Wang L, Hong N, Shen F, Bedrick S, Hersh W, Liu H. Implementation of a Cohort Retrieval System for Clinical Data Repositories Using the Observational Medical Outcomes Partnership Common Data Model: Proof-of-Concept System Validation. JMIR Med Inform 2020; 8:e17376. [PMID: 33021486 PMCID: PMC7576539 DOI: 10.2196/17376] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/10/2019] [Revised: 06/04/2020] [Accepted: 07/28/2020] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Widespread adoption of electronic health records has enabled the secondary use of electronic health record data for clinical research and health care delivery. Natural language processing techniques have shown promise in their capability to extract the information embedded in unstructured clinical data, and information retrieval techniques provide flexible and scalable solutions that can augment natural language processing systems for retrieving and ranking relevant records. OBJECTIVE In this paper, we present the implementation of a cohort retrieval system that can execute textual cohort selection queries on both structured data and unstructured text: Cohort Retrieval Enhanced by Analysis of Text from Electronic Health Records (CREATE). METHODS CREATE is a proof-of-concept system that leverages a combination of structured queries and information retrieval techniques on natural language processing results to improve cohort retrieval performance using the Observational Medical Outcomes Partnership Common Data Model to enhance model portability. The natural language processing component was used to extract common data model concepts from textual queries. We designed a hierarchical index to support the common data model concept search utilizing information retrieval techniques and frameworks. RESULTS Our case study on 5 cohort identification queries, evaluated using the precision at 5 information retrieval metric at both the patient-level and document-level, demonstrates that CREATE achieves a mean precision at 5 of 0.90, which outperforms systems using only structured data or only unstructured text with mean precision at 5 values of 0.54 and 0.74, respectively. CONCLUSIONS The implementation and evaluation of Mayo Clinic Biobank data demonstrated that CREATE outperforms cohort retrieval systems that only use one of either structured data or unstructured text in complex textual cohort queries.
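For readers unfamiliar with the precision at 5 metric, the sketch below shows how it and its mean over several queries can be computed. The patient identifiers and relevance judgments are hypothetical, and the code is not part of CREATE.

```python
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k ranked items that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


def mean_precision_at_k(runs: Dict[str, List[str]],
                        judgments: Dict[str, Set[str]], k: int = 5) -> float:
    """Average precision at k over all queries in `runs`."""
    scores = [precision_at_k(ranking, judgments.get(query, set()), k)
              for query, ranking in runs.items()]
    return sum(scores) / len(scores)


# Hypothetical rankings and relevance judgments for two cohort queries.
runs = {"q1": ["p1", "p2", "p3", "p4", "p5", "p6"],
        "q2": ["p9", "p8", "p7", "p6", "p5"]}
judgments = {"q1": {"p1", "p3", "p6"}, "q2": {"p9", "p8", "p7", "p6"}}
print(mean_precision_at_k(runs, judgments))
```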
Affiliation(s)
- Sijia Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Andrew Wen
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Liwei Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Na Hong
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Feichen Shen
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Steven Bedrick
- Department of Computer Science and Electrical Engineering, Oregon Health & Science University, Portland, OR, United States
| | - William Hersh
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
| | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
|
38
|
Chamberlin SR, Bedrick SD, Cohen AM, Wang Y, Wen A, Liu S, Liu H, Hersh WR. Evaluation of patient-level retrieval from electronic health record data for a cohort discovery task. JAMIA Open 2020; 3:395-404. [PMID: 33215074 PMCID: PMC7660955 DOI: 10.1093/jamiaopen/ooaa026] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/13/2019] [Revised: 04/17/2020] [Accepted: 06/03/2020] [Indexed: 11/24/2022] Open
Abstract
OBJECTIVE Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval methods using electronic health records for different types of cohort definition retrieval. MATERIALS AND METHODS We developed a test collection consisting of about 100 000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated information retrieval tasks using word-based approaches were performed, varying 4 different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics. RESULTS The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics. Structured queries generally performed better than automated queries on measures of recall and precision but were still not able to recall all relevant patients found by the automated queries. CONCLUSION While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Future work will focus on using the test collection to develop and evaluate new approaches to query structure, weighting algorithms, and application of semantic methods.
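For readers unfamiliar with B-Pref, the measure used in this abstract, the following is a minimal sketch of one common formulation (the variant that normalises by the smaller of the judged-relevant and judged-nonrelevant counts). It is an illustration only, not the evaluation code used in the study; the function name and the example identifiers are hypothetical, and the ranking is assumed to contain no duplicates.

```python
from typing import Sequence, Set


def bpref(ranked_ids: Sequence[str], relevant: Set[str], nonrelevant: Set[str]) -> float:
    """B-Pref: rewards rankings that place judged-relevant items above
    judged-nonrelevant ones. Unjudged items are ignored entirely."""
    num_rel = len(relevant)
    if num_rel == 0:
        return 0.0
    denom = min(num_rel, len(nonrelevant))
    nonrel_seen = 0  # judged-nonrelevant items encountered so far
    total = 0.0
    for item in ranked_ids:
        if item in nonrelevant:
            nonrel_seen += 1
        elif item in relevant:
            if denom == 0:
                # No judged-nonrelevant items exist, so nothing can outrank this one.
                total += 1.0
            else:
                total += 1.0 - min(nonrel_seen, denom) / denom
    # Relevant items that were never retrieved contribute zero.
    return total / num_rel


# Hypothetical patient-level ranking: "x" is unjudged, "d4" is relevant but not retrieved.
ranking = ["d1", "n1", "d2", "x", "n2", "d3"]
score = bpref(ranking, relevant={"d1", "d2", "d3", "d4"}, nonrelevant={"n1", "n2"})
```

Because unjudged items are skipped, the measure is suited to test collections with incomplete relevance judgments, which is the usual motivation for choosing it.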
Affiliation(s)
- Steven R Chamberlin
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA
| | - Steven D Bedrick
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA
- Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA
| | - Aaron M Cohen
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA
| | - Yanshan Wang
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Andrew Wen
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Sijia Liu
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Hongfang Liu
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - William R Hersh
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA
|
39
|
Tissot HC, Shah AD, Brealey D, Harris S, Agbakoba R, Folarin A, Romao L, Roguski L, Dobson R, Asselbergs FW. Natural Language Processing for Mimicking Clinical Trial Recruitment in Critical Care: A Semi-Automated Simulation Based on the LeoPARDS Trial. IEEE J Biomed Health Inform 2020; 24:2950-2959. [PMID: 32149659 DOI: 10.1109/jbhi.2020.2977925] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/07/2022]
Abstract
Clinical trials often fail to recruit an adequate number of appropriate patients. Identifying eligible trial participants is resource-intensive when relying on manual review of clinical notes, particularly in critical care settings where the time window is short. Automated review of electronic health records (EHR) may help, but much of the information is in free text rather than a computable form. We applied natural language processing (NLP) to free text EHR data using the CogStack platform to simulate recruitment into the LeoPARDS study, a clinical trial aiming to reduce organ dysfunction in septic shock. We applied an algorithm to identify eligible patients using a moving 1-hour time window, and compared patients identified by our approach with those actually screened and recruited for the trial, for the time period that data were available. We manually reviewed records of a random sample of patients identified by the algorithm but not screened in the original trial. Our method identified 376 patients, including 34 patients with EHR data available who were actually recruited to LeoPARDS in our centre. The sensitivity of CogStack for identifying patients screened was 90% (95% CI 85%, 93%). Of the 203 patients identified by both manual screening and CogStack, the index date matched in 95 (47%) and CogStack was earlier in 94 (47%). In conclusion, analysis of EHR data using NLP could effectively replicate recruitment in a critical care trial, and identify some eligible patients at an earlier stage, potentially improving trial recruitment if implemented in real time.
|
40
|
Robinson PN, Haendel MA. Ontologies, Knowledge Representation, and Machine Learning for Translational Research: Recent Contributions. Yearb Med Inform 2020; 29:159-162. [PMID: 32823310 PMCID: PMC7442528 DOI: 10.1055/s-0040-1701991] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/24/2022] Open
Abstract
Objectives: To select, present, and summarize the most relevant papers published in 2018 and 2019 in the field of Ontologies and Knowledge Representation, with a particular focus on the intersection between Ontologies and Machine Learning.
Methods: A comprehensive review of the medical informatics literature was performed to select the most interesting papers published in 2018 and 2019 and that document the utility of ontologies for computational analysis, including machine learning.
Results: Fifteen articles were selected for inclusion in this survey paper. The chosen articles belong to three major themes: (i) the identification of phenotypic abnormalities in electronic health record (EHR) data using the Human Phenotype Ontology; (ii) word and node embedding algorithms to supplement natural language processing (NLP) of EHRs and other medical texts; and (iii) hybrid ontology and NLP-based approaches to extracting structured and unstructured components of EHRs.
Conclusion: Unprecedented amounts of clinically relevant data are now available for clinical and research use. Machine learning is increasingly being applied to these data sources for predictive analytics, precision medicine, and differential diagnosis. Ontologies have become an essential component of software pipelines designed to extract, code, and analyze clinical information by machine learning algorithms. The intersection of machine learning and semantics is proving to be an innovative space in clinical research.
Affiliation(s)
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA
| | - Melissa A Haendel
- Oregon Clinical & Translational Research Institute, Oregon Health & Science University, Portland, OR, USA; Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR, USA
|
41
|
Zeng K, Pan Z, Xu Y, Qu Y. An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation. JMIR Med Inform 2020; 8:e17832. [PMID: 32609092 PMCID: PMC7367522 DOI: 10.2196/17832] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/15/2020] [Revised: 03/09/2020] [Accepted: 03/14/2020] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. OBJECTIVE We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. METHODS We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. RESULTS Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. CONCLUSIONS We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
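The macro F1 value quoted in this abstract is an average of per-class F1 scores, which is why it need not equal the harmonic mean of the overall precision and recall figures. A minimal sketch of the computation follows; it is not the authors' evaluation script, and the function name and the toy criterion labels are hypothetical.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("no labels to score")
    per_class: List[float] = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example with three hypothetical criterion categories.
truth = ["age", "age", "disease", "disease", "consent"]
preds = ["age", "disease", "disease", "disease", "consent"]
score = macro_f1(truth, preds)
```

Every class carries equal weight in this average, so rare criterion categories influence the score as much as frequent ones, which is relevant to the class-imbalance point made in the abstract's conclusions.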
Affiliation(s)
- Kun Zeng
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
| | - Zhiwei Pan
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
| | - Yibin Xu
- School of Computer Science, South China Normal University, Guangzhou, China
| | - Yingying Qu
- School of Business, Guangdong University of Foreign Studies, Guangzhou, China
|
42
|
Jones KH, Ford EM, Lea N, Griffiths LJ, Hassan L, Heys S, Squires E, Nenadic G. Toward the Development of Data Governance Standards for Using Clinical Free-Text Data in Health Research: Position Paper. J Med Internet Res 2020; 22:e16760. [PMID: 32597785 PMCID: PMC7367542 DOI: 10.2196/16760] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/22/2019] [Revised: 03/06/2020] [Accepted: 03/23/2020] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Clinical free-text data (eg, outpatient letters or nursing notes) represent a vast, untapped source of rich information that, if more accessible for research, would clarify and supplement information coded in structured data fields. Data usually need to be deidentified or anonymized before they can be reused for research, but there is a lack of established guidelines to govern effective deidentification and use of free-text information and avoid damaging data utility as a by-product. OBJECTIVE This study aimed to develop recommendations for the creation of data governance standards to integrate with existing frameworks for personal data use, to enable free-text data to be used safely for research for patient and public benefit. METHODS We outlined data protection legislation and regulations relating to the United Kingdom for context and conducted a rapid literature review and UK-based case studies to explore data governance models used in working with free-text data. We also engaged with stakeholders, including text-mining researchers and the general public, to explore perceived barriers and solutions in working with clinical free-text. RESULTS We proposed a set of recommendations, including the need for authoritative guidance on data governance for the reuse of free-text data, to ensure public transparency in data flows and uses, to treat deidentified free-text data as potentially identifiable with use limited to accredited data safe havens, and to commit to a culture of continuous improvement to understand the relationships between the efficacy of deidentification and reidentification risks, so this can be communicated to all stakeholders. CONCLUSIONS By drawing together the findings of a combination of activities, we present a position paper to contribute to the development of data governance standards for the reuse of clinical free-text data for secondary purposes. 
While working in accordance with existing data governance frameworks, there is a need for further work to take forward the recommendations we have proposed, with commitment and investment, to assure and expand the safe reuse of clinical free-text data for public benefit.
Affiliation(s)
- Kerina H Jones
- Population Data Science, Medical School, Swansea University, Swansea, United Kingdom
| | | | - Nathan Lea
- Institute of Health Informatics, University College London, London, United Kingdom
| | - Lucy J Griffiths
- Population Data Science, Medical School, Swansea University, Swansea, United Kingdom
| | - Lamiece Hassan
- Division of Informatics, Imaging & Data Sciences, University of Manchester, Manchester, United Kingdom
| | - Sharon Heys
- Population Data Science, Medical School, Swansea University, Swansea, United Kingdom
| | - Emma Squires
- Population Data Science, Medical School, Swansea University, Swansea, United Kingdom
| | - Goran Nenadic
- Department of Computer Science, University of Manchester & The Alan Turing Institute, Manchester, United Kingdom
|
43
|
Hemingway H, Lyons R, Li Q, Buchan I, Ainsworth J, Pell J, Morris A. A national initiative in data science for health: an evaluation of the UK Farr Institute. Int J Popul Data Sci 2020; 5:1128. [PMID: 32935051 PMCID: PMC7480324 DOI: 10.23889/ijpds.v5i1.1128] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023] Open
Abstract
OBJECTIVE To evaluate the extent to which the inter-institutional, inter-disciplinary mobilisation of data and skills in the Farr Institute contributed to establishing the emerging field of data science for health in the UK. DESIGN AND OUTCOME MEASURES We evaluated evidence of six domains characterising a new field of science: defining central scientific challenges, demonstrating how the central challenges might be solved, creating novel interactions among groups of scientists, training new types of experts, re-organising universities, and demonstrating impacts in society. We carried out citation, network and time trend analyses of publications, and a narrative review of infrastructure, methods and tools. SETTING Four UK centres in London, North England, Scotland and Wales (23 university partners), 2013-2018. RESULTS 1. The Farr Institute helped define a central scientific challenge publishing a research corpus, demonstrating insights from electronic health record (EHR) and administrative data at each stage of the translational cycle in 593 papers with at least one Farr Institute author affiliation on PubMed. 2. The Farr Institute offered some demonstrations of how these scientific challenges might be solved: it established the first four ISO27001 certified trusted research environments in the UK, and approved more than 1000 research users, published on 102 unique EHR and administrative data sources, although there was no clear evidence of an increase in novel, sustained record linkages. The Farr Institute established open platforms for the EHR phenotyping algorithms and validations (>70 diseases, CALIBER). Sample sizes showed some evidence of increase but remained less than 10% of the UK population in primary care-hospital care linked studies.
3. The Farr Institute created novel interactions among researchers: the co-author publication network expanded from 944 unique co-authors (based on 67 publications in the first 30 months) to 3839 unique co-authors (545 papers in the final 30 months). 4. Training expanded substantially with 3 new masters courses, training >400 people at masters, short-course and leadership level and 48 PhD students. 5. Universities reorganised, with 4/5 Centres establishing 27 new faculty (tenured) positions and 3 new university institutes. 6. Emerging evidence of impacts included: >3200 citations for the 10 most cited papers and Farr research informed eight practice-changing clinical guidelines and policies relevant to the health of millions of UK citizens. CONCLUSION The Farr Institute played a major role in establishing and growing the field of data science for health in the UK, with some initial evidence of benefits for health and healthcare. The Farr Institute has now expanded into Health Data Research (HDR) UK but key challenges remain, including how to network such activities internationally.
Affiliation(s)
- H Hemingway
- HDR UK London
- UCL Institute of Health Informatics, 222 Euston Road, London NW1 2DA
| | - R Lyons
- HDRUK Wales/Northern Ireland
- Swansea University Medical School, Fourth Floor, Data Science Building, Singleton Campus, Swansea, SA2 8PP
| | - Q Li
- UCL Institute of Health Informatics, 222 Euston Road, London NW1 2DA
- West China Hospital, Chengdu, China
| | - I Buchan
- University of Liverpool, Liverpool L69 3BX
| | - J Ainsworth
- Division of Informatics, Imaging & Data Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL
| | - J Pell
- Institute of Health and Wellbeing, University of Glasgow, 1 Lilybank Gardens, Glasgow G12 8RZ
| | | |
|
44
|
Kugathasan P, Wu H, Gaughran F, Nielsen RE, Pritchard M, Dobson R, Stewart R, Stubbs B. Association of physical health multimorbidity with mortality in people with schizophrenia spectrum disorders: Using a novel semantic search system that captures physical diseases in electronic patient records. Schizophr Res 2020; 216:408-415. [PMID: 31787481 DOI: 10.1016/j.schres.2019.10.061] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/28/2019] [Revised: 06/22/2019] [Accepted: 10/31/2019] [Indexed: 01/07/2023]
Abstract
OBJECTIVE Single physical comorbidities have been associated with premature mortality in people with schizophrenia-spectrum disorders (SSD). We investigated the association of physical multimorbidity (≥two physical health conditions) with mortality in people with SSD. METHODS A retrospective cohort study between 2013 and 2017. All people with a diagnosis of SSD (ICD-10: F20-F29), who had contact with secondary mental healthcare within South London during 2011-2012 were included. A novel semantic search system captured conditions from electronic mental health records, and all-cause mortality data were retrieved. Hazard ratios (HRs) and population attributable fractions (PAFs) were calculated for associations between physical multimorbidity and all-cause mortality. RESULTS Among the 9775 people with SSD (mean (SD) age, 45.9 (15.4); males, 59.3%), 6262 (64%) had physical multimorbidity, and 880 (9%) died during the 5-year follow-up. The top three physical multimorbidity combinations with highest mortality were cardiovascular-respiratory (HR: 2.23; 95% CI, 1.49-3.32), respiratory-skin (HR: 2.06; 95% CI, 1.31-3.24), and respiratory-digestive (HR: 1.88; 95% CI, 1.14-3.11), when adjusted for age, gender, and all other physical disease systems. Combinations of physical diseases with highest PAFs were cardiovascular-respiratory (PAF: 35.7%), neurologic-respiratory (PAF: 32.7%), as well as respiratory-skin (PAF: 29.8%). CONCLUSIONS Approximately 2/3 of patients with SSD had physical multimorbidity and the risk of mortality in these patients was further increased compared to those with none or single physical conditions. These findings suggest that in order to reduce the physical health burden and subsequent mortality in people with SSD, proactive coordinated prevention and management efforts are required and should extend beyond the current focus on single physical comorbidities.
Affiliation(s)
- Pirathiv Kugathasan
- Psychiatry, Aalborg University Hospital, Aalborg, Denmark; Department of Clinical Medicine, Aalborg University, Aalborg, Denmark
| | - Honghan Wu
- Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Scotland, United Kingdom
| | - Fiona Gaughran
- King's College London, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), De Crespigny Park, London, United Kingdom; South London and Maudsley NHS Foundation Trust, Denmark Hill, London, United Kingdom
| | - René Ernst Nielsen
- Psychiatry, Aalborg University Hospital, Aalborg, Denmark; Department of Clinical Medicine, Aalborg University, Aalborg, Denmark
| | - Megan Pritchard
- King's College London, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), De Crespigny Park, London, United Kingdom; South London and Maudsley NHS Foundation Trust, Denmark Hill, London, United Kingdom
| | - Richard Dobson
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom; Health Data Research UK London, Institute of Health Informatics, University College London, London, United Kingdom
| | - Robert Stewart
- King's College London, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), De Crespigny Park, London, United Kingdom; South London and Maudsley NHS Foundation Trust, Denmark Hill, London, United Kingdom
| | - Brendon Stubbs
- King's College London, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), De Crespigny Park, London, United Kingdom; South London and Maudsley NHS Foundation Trust, Denmark Hill, London, United Kingdom
|
45
|
Ju M, Short AD, Thompson P, Bakerly ND, Gkoutos GV, Tsaprouni L, Ananiadou S. Annotating and detecting phenotypic information for chronic obstructive pulmonary disease. JAMIA Open 2020; 2:261-271. [PMID: 31984360 PMCID: PMC6951876 DOI: 10.1093/jamiaopen/ooz009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/19/2018] [Revised: 02/21/2019] [Accepted: 03/19/2019] [Indexed: 12/29/2022] Open
Abstract
Objectives Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information. Materials and methods Since COPD phenotype descriptions often mention other concepts within them (proteins, treatments, etc.), our corpus annotations include both outermost phenotype descriptions and concepts nested within them. Our neural layered bidirectional long short-term memory conditional random field (BiLSTM-CRF) network firstly recognizes nested mentions, which are fed into subsequent BiLSTM-CRF layers, to help to recognize enclosing phenotype mentions. Results Our corpus of 30 full papers (available at: http://www.nactem.ac.uk/COPD) is annotated by experts with 27 030 phenotype-related concept mentions, most of which are automatically linked to UMLS Metathesaurus concepts. When trained using the corpus, our BiLSTM-CRF network outperforms other popular approaches in recognizing detailed phenotypic information. Discussion Information extracted by our method can facilitate efficient location and exploration of detailed information about phenotypes, for example, those specifically concerning reactions to treatments. Conclusion The importance of our corpus for developing methods to extract fine-grained information about COPD phenotypes is demonstrated through its successful use to train a layered BiLSTM-CRF network to extract phenotypic information at various levels of granularity. The minimal human intervention needed for training should permit ready adaption to extracting phenotypic information about other diseases.
Affiliation(s)
- Meizhi Ju
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Andrea D Short
- Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK
| | - Paul Thompson
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
| | - Nawar Diar Bakerly
- Salford Royal NHS Foundation Trust; and School of Health Sciences, The University of Manchester, Manchester, UK
| | - Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, UK; Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK; MRC Health Data Research UK (HDR UK); NIHR Experimental Cancer Medicine Centre, Birmingham, UK; NIHR Surgical Reconstruction and Microbiology Research Centre, Birmingham, UK; NIHR Biomedical Research Centre, Birmingham, UK
| | - Loukia Tsaprouni
- School of Health Sciences, Centre for Life and Sport Sciences, Birmingham City University, Birmingham, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, UK
|
46
|
Wu H, Hodgson K, Dyson S, Morley KI, Ibrahim ZM, Iqbal E, Stewart R, Dobson RJ, Sudlow C. Efficient Reuse of Natural Language Processing Models for Phenotype-Mention Identification in Free-text Electronic Medical Records: A Phenotype Embedding Approach. JMIR Med Inform 2019; 7:e14782. [PMID: 31845899 PMCID: PMC6938594 DOI: 10.2196/14782] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/22/2019] [Revised: 10/08/2019] [Accepted: 10/22/2019] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Much effort has been put into the use of automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to construct comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, as it requires validation and retraining on new data iteratively to achieve convergent results. OBJECTIVE The aim of this work is to minimize the effort involved in reusing NLP models on free-text medical records. METHODS We formally define and analyze the model adaptation problem in phenotype-mention identification tasks. We identify "duplicate waste" and "imbalance waste," which collectively impede efficient model reuse. We propose a phenotype embedding-based approach to minimize these sources of waste without the need for labelled data from new settings. RESULTS We conduct experiments on data from a large mental health registry to reuse NLP models in four phenotype-mention identification tasks. The proposed approach can choose the best model for a new task, identifying up to 76% waste (duplicate waste), that is, phenotype mentions without the need for validation and model retraining and with very good performance (93%-97% accuracy). It can also provide guidance for validating and retraining the selected model for novel language patterns in new tasks, saving around 80% waste (imbalance waste), that is, the effort required in "blind" model-adaptation approaches. CONCLUSIONS Adapting pretrained NLP models for new tasks can be more efficient and effective if the language pattern landscapes of old settings and new settings can be made explicit and comparable. Our experiments show that the phenotype-mention embedding approach is an effective way to model language patterns for phenotype-mention identification tasks and that its use can guide efficient NLP model reuse.
Affiliation(s)
- Honghan Wu
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
- School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
- Health Data Research UK, University of Edinburgh, Edinburgh, United Kingdom
| | - Karen Hodgson
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
| | - Sue Dyson
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
| | - Katherine I Morley
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
- Centre for Epidemiology and Biostatistics, Melbourne School of Global and Population Health, The University of Melbourne, Melbourne, Australia
| | - Zina M Ibrahim
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
- Health Data Research UK, University College London, London, United Kingdom
| | - Ehtesham Iqbal
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
| | - Robert Stewart
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
- South London and Maudsley NHS Foundation Trust, London, United Kingdom
| | - Richard JB Dobson
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom
- Health Data Research UK, University College London, London, United Kingdom
| | - Cathie Sudlow
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
- Health Data Research UK, University of Edinburgh, Edinburgh, United Kingdom
| |
|
47
|
Denaxas S, Gonzalez-Izquierdo A, Direk K, Fitzpatrick NK, Fatemifar G, Banerjee A, Dobson RJB, Howe LJ, Kuan V, Lumbers RT, Pasea L, Patel RS, Shah AD, Hingorani AD, Sudlow C, Hemingway H. UK phenomics platform for developing and validating electronic health record phenotypes: CALIBER. J Am Med Inform Assoc 2019; 26:1545-1559. [PMID: 31329239 PMCID: PMC6857510 DOI: 10.1093/jamia/ocz105] [Citation(s) in RCA: 104] [Impact Index Per Article: 20.8] [Received: 01/29/2019] [Revised: 04/25/2019] [Accepted: 05/29/2019] [Indexed: 01/13/2023] Open
Abstract
OBJECTIVE Electronic health records (EHRs) are a rich source of information on human diseases, but the information is variably structured, fragmented, curated using different coding systems, and collected for purposes other than medical research. We describe an approach for developing, validating, and sharing reproducible phenotypes from national structured EHRs in the United Kingdom with applications for translational research. MATERIALS AND METHODS We implemented a rule-based phenotyping framework, with up to 6 validation approaches. We applied our framework to a sample of 15 million individuals in a national EHR data source (population-based primary care, all ages) linked to hospitalization and death records in England. Data comprised continuous measurements (for example, blood pressure), medication information, and coded diagnoses, symptoms, procedures, and referrals, recorded using 5 controlled clinical terminologies: (1) Read (primary care, subset of SNOMED-CT [Systematized Nomenclature of Medicine Clinical Terms]), (2) International Classification of Diseases-Ninth Revision and Tenth Revision (secondary care diagnoses and cause of mortality), (3) Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures, Fourth Revision (hospital surgical procedures), and (4) DM+D prescription codes. RESULTS Using the CALIBER phenotyping framework, we created algorithms for 51 diseases, syndromes, biomarkers, and lifestyle risk factors and provide up to 6 validation approaches. The EHR phenotypes are curated in the open-access CALIBER Portal (https://www.caliberresearch.org/portal) and have been used by 40 national and international research groups in 60 peer-reviewed publications. CONCLUSIONS We describe a UK EHR phenomics approach within the CALIBER EHR data platform with initial evidence of validity and use, as an important step toward international use of UK EHR data for health research.
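To make "rule-based phenotyping" concrete, here is a minimal sketch of the simplest kind of rule: a patient is a case if any of their coded events matches a per-terminology code list. This is an assumption-laden illustration, not the CALIBER implementation. The `Event` type, the `phenotype_cases` function, and the code lists are hypothetical; only the ICD-10 category I48 (atrial fibrillation and flutter) is a real code, and the Read entry is a labelled placeholder.

```python
from dataclasses import dataclass

# Illustrative code lists only; these are NOT the CALIBER definitions.
AF_CODE_LISTS = {
    "icd10": {"I48"},                  # ICD-10 I48: atrial fibrillation and flutter
    "read": {"HYPOTHETICAL_READ_AF"},  # placeholder, not a real Read code
}


@dataclass(frozen=True)
class Event:
    """One coded record from a linked EHR source (hypothetical schema)."""
    patient_id: str
    terminology: str
    code: str


def phenotype_cases(events, code_lists):
    """Return the set of patient IDs with at least one matching code.

    Prefix matching lets a category such as I48 also catch its
    subcodes (for example I48.0).
    """
    cases = set()
    for event in events:
        codes = code_lists.get(event.terminology, set())
        if any(event.code.startswith(code) for code in codes):
            cases.add(event.patient_id)
    return cases


events = [
    Event("p1", "icd10", "I48.0"),
    Event("p2", "icd10", "I10"),
    Event("p3", "read", "HYPOTHETICAL_READ_AF"),
]
cases = phenotype_cases(events, AF_CODE_LISTS)
```

A real phenotype definition would typically add further rules (date windows, exclusion codes, measurement thresholds); keeping the code lists as plain data is what makes such definitions easy to share and review.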
Affiliation(s)
- Spiros Denaxas
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- The Alan Turing Institute, London, United Kingdom
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Arturo Gonzalez-Izquierdo
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, United Kingdom
| | - Kenan Direk
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, United Kingdom
| | - Natalie K Fitzpatrick
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
| | - Ghazaleh Fatemifar
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
| | - Amitava Banerjee
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Richard J B Dobson
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- Department of Biostatistics and Health Informatics, Institute of Psychiatry Psychology and Neuroscience, King’s College London, London, United Kingdom
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Laurence J Howe
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Valerie Kuan
- Health Data Research UK, London, United Kingdom
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - R Tom Lumbers
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Laura Pasea
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
| | - Riyaz S Patel
- Institute of Cardiovascular Science, University College London, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Anoop D Shah
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| | - Aroon D Hingorani
- Health Data Research UK, London, United Kingdom
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Cathie Sudlow
- Centre for Medical Informatics, Usher Institute of Population Health Science and Informatics, University of Edinburgh, Edinburgh, United Kingdom
- Health Data Research UK, Scotland, United Kingdom
| | - Harry Hemingway
- Institute of Health Informatics, University College London, London, United Kingdom
- Health Data Research UK, London, United Kingdom
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, United Kingdom
- British Heart Foundation Research Accelerator, University College London, London, United Kingdom
| |
|
48
|
Bean DM, Teo J, Wu H, Oliveira R, Patel R, Bendayan R, Shah AM, Dobson RJB, Scott PA. Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data. PLoS One 2019; 14:e0225625. [PMID: 31765395 PMCID: PMC6876873 DOI: 10.1371/journal.pone.0225625] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/25/2019] [Accepted: 11/09/2019] [Indexed: 12/03/2022] Open
Abstract
Atrial fibrillation (AF) is the most common arrhythmia and significantly increases stroke risk. This risk is effectively managed by oral anticoagulation (OAC). Recent studies using national registry data indicate increased use of anticoagulation resulting from changes in guidelines and the availability of newer drugs. The aim of this study is to develop and validate an open source risk scoring pipeline for free-text electronic health record (EHR) data using natural language processing. AF patients discharged from 1st January 2011 to 1st October 2017 were identified from discharge summaries (N = 10,030, 64.6% male, average age 75.3 ± 12.3 years). A natural language processing pipeline was developed to identify risk factors in clinical text and calculate risk for ischaemic stroke (CHA2DS2-VASc) and bleeding (HAS-BLED). Scores were validated against two independent experts for 40 patients. Automatic risk scores were in strong agreement with the two independent experts for CHA2DS2-VASc (average kappa 0.78 vs experts, compared to 0.85 between experts). Agreement was lower for HAS-BLED (average kappa 0.54 vs experts, compared to 0.74 between experts). In high-risk patients (CHA2DS2-VASc ≥2) OAC use has increased significantly over the last 7 years, driven by the availability of direct oral anticoagulants (DOACs) and the transitioning of patients from antiplatelet (AP) medication alone to OAC. Factors independently associated with OAC use included components of the CHA2DS2-VASc and HAS-BLED scores as well as discharging specialty and frailty. OAC use was highest in patients discharged under cardiology (69%). Electronic health record text can be used for automatic calculation of clinical risk scores at scale. Open source tools are available today for this task but require further validation. Analysis of routinely collected EHR data can replicate findings from large-scale curated registries.
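As an illustration of the scoring step described above, the sketch below computes CHA2DS2-VASc from risk factors that are assumed to have been extracted already; in the study those inputs come from an NLP pipeline over clinical text, which is not reproduced here. The weights follow the standard published definition of the score; the function name and parameter names are hypothetical.

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_tia, vascular_disease):
    """Compute the CHA2DS2-VASc stroke-risk score (range 0 to 9).

    Inputs are assumed to be already-extracted risk factors:
    `age` in years, the rest booleans.
    """
    score = 0
    score += 1 if chf else 0                 # C: congestive heart failure
    score += 1 if hypertension else 0        # H: hypertension
    if age >= 75:                            # A2: age 75 or older scores 2
        score += 2
    elif age >= 65:                          # A: age 65 to 74 scores 1
        score += 1
    score += 1 if diabetes else 0            # D: diabetes mellitus
    score += 2 if stroke_tia else 0          # S2: prior stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0    # V: vascular disease
    score += 1 if female else 0              # Sc: sex category (female)
    return score


# A 76-year-old man with hypertension and no other risk factors.
example = cha2ds2_vasc(age=76, female=False, chf=False, hypertension=True,
                       diabetes=False, stroke_tia=False, vascular_disease=False)
```

The example patient scores 3 (2 for age, 1 for hypertension), which falls in the high-risk group (CHA2DS2-VASc ≥2) that the abstract analyses for OAC use.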
Affiliation(s)
- Daniel M. Bean
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, England, United Kingdom
- Health Data Research UK London, University College London, London, England, United Kingdom
| | - James Teo
- Department of Stroke and Neurology, King’s College Hospital NHS Foundation Trust, London, England, United Kingdom
| | - Honghan Wu
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Scotland, United Kingdom
- School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
- Health Data Research UK Scotland, Edinburgh, Scotland, United Kingdom
| | - Ricardo Oliveira
- Unidade de Doenças Imunomediadas Sistémicas (UDIMS), S. Medicina IV, Hospital Prof. Doutor Fernando Fonseca, Amadora, Portugal
| | - Raj Patel
- Department of Haematology, King’s College Hospital NHS Foundation Trust, London, England, United Kingdom
| | - Rebecca Bendayan
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, England, United Kingdom
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, England, United Kingdom
| | - Ajay M. Shah
- British Heart Foundation Centre, King’s College London, London, England, United Kingdom
- Department of Cardiology, King’s College Hospital NHS Foundation Trust, London, England, United Kingdom
| | - Richard J. B. Dobson
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, England, United Kingdom
- Health Data Research UK London, University College London, London, England, United Kingdom
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London, London, England, United Kingdom
- Institute of Health Informatics, University College London, London, England, United Kingdom
| | - Paul A. Scott
- British Heart Foundation Centre, King’s College London, London, England, United Kingdom
- Department of Cardiology, King’s College Hospital NHS Foundation Trust, London, England, United Kingdom
| |
|
49
|
Vezertzis K, Lambrou GI, Koutsouris D. Development of Patient Databases for Endocrinological Clinical and Pharmaceutical Trials: A Survey. Rev Recent Clin Trials 2019; 15:5-21. [PMID: 31744453 DOI: 10.2174/1574887114666191118122714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2019] [Revised: 10/22/2019] [Accepted: 11/05/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND According to European legislation, a clinical trial is research involving patients that also includes a research end-product. The main objective of the clinical trial is to prove that the research product, i.e. a proposed medication or treatment, is effective and safe for patients. The implementation, development, and operation of a patient database, which will function as a matrix of samples with the appropriate parameterization, may provide appropriate tools to generate samples for clinical trials. AIMS The aim of the present work is to review the literature with respect to the up-to-date progress on the development of databases for clinical trials and patient recruitment using free and open-source software in the field of endocrinology. METHODS An electronic literature search was conducted by the authors covering 1984 to June 2019. Original articles and systematic reviews were selected, the titles and abstracts of papers were screened to determine whether they met the eligibility criteria, and full texts of the selected articles were retrieved. RESULTS The present review indicates that electronic health records are related to both patient recruitment and decision support systems in the domain of endocrinology. Free and open-source software provides integrated solutions concerning electronic health records, patient recruitment, and decision support systems. CONCLUSION Patient recruitment relates closely to the electronic health record. There is maturity at the academic and research level, which may lead to good practices for the deployment of the electronic health record in selecting the right patients for clinical trials.
Affiliation(s)
- Konstantinos Vezertzis
- School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, National Technical University of Athens, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
| | - George I Lambrou
- School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, National Technical University of Athens, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
- First Department of Pediatrics, Choremeio Research Laboratory, National and Kapodistrian University of Athens, Thivon & Levadeias 8, 11527, Goudi, Athens, Greece
| | - Dimitrios Koutsouris
- School of Electrical and Computer Engineering, Biomedical Engineering Laboratory, National Technical University of Athens, Heroon Polytecniou 9, Athens, 15780, Athens, Greece
| |
|
50
|
von Martial S, Brix TJ, Klotz L, Neuhaus P, Berger K, Warnke C, Meuth SG, Wiendl H, Dugas M. EMR-integrated minimal core dataset for routine health care and multiple research settings: A case study for neuroinflammatory demyelinating diseases. PLoS One 2019; 14:e0223886. [PMID: 31613917 PMCID: PMC6793844 DOI: 10.1371/journal.pone.0223886] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/25/2019] [Accepted: 10/01/2019] [Indexed: 11/18/2022] Open
Abstract
Although routine health care and clinical trials usually require the documentation of similar information, the two data collections are performed independently of each other, resulting in redundant documentation efforts. Standardizing routine documentation can enable secondary use for medical research. Neuroinflammatory demyelinating diseases (NIDs) represent a heterogeneous group of diseases requiring further research to improve patient management. The aim of this work is to develop, implement and evaluate a minimal core dataset in routine health care with a focus on secondary use, as a case study for NIDs. To this end, a draft minimal core dataset for NIDs was created by analyzing routine, clinical trial, registry, and biobank documentation and existing data standards for NIDs. Data elements (DEs) were converted into the standard format Operational Data Model, semantically annotated and analyzed via frequency analysis. The analysis produced 1958 DEs based on 864 distinct medical concepts. After review and finalization by an interdisciplinary team of neurologists, epidemiologists and medical computer scientists, the minimal core dataset (NID CDEs) consists of 46 common DEs capturing disease-specific information for reuse in the discharge letter and other research settings. It covers the areas of diagnosis, laboratory results, disease progress, expanded disability status scale, therapy and magnetic resonance imaging findings. The NID CDEs were implemented in two German university hospitals, and a usability study in clinical routine (n = 16 participants) showed good usability (mean System Usability Scale [SUS] score = 75). From May 2017 to February 2018, 755 patients were documented with the NID CDEs, which indicates the feasibility of developing a minimal core dataset for structured documentation based on previously used documentation standards and integrating the dataset into clinical routine. By sharing, translating and reusing the minimal dataset, a transnational harmonized documentation of patients with NIDs might be realized, supporting interoperability in medical research.
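The usability figure reported above is a mean System Usability Scale (SUS) score. As a reminder of how such a number arises, the sketch below applies the standard SUS scoring rule to one participant's ten responses (each on a 1 to 5 scale) and averages across participants. The function names and the sample responses are hypothetical; the study's raw questionnaire data are not given in the abstract.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire on the 0-100 scale.

    Standard rule: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response), and the sum is
    multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        if not 1 <= response <= 5:
            raise ValueError("each response must be between 1 and 5")
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


def mean_sus(all_responses):
    """Mean SUS score across participants."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)


# Invented responses for two participants, for illustration only.
neutral = [3] * 10                           # every item answered 3
ideal = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1]       # most favourable answers
average = mean_sus([neutral, ideal])
```

Here the neutral participant scores 50 and the ideal one scores 100, giving a mean of 75.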
Affiliation(s)
- Sophia von Martial
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Tobias J. Brix
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Luisa Klotz
- Department of Neurology, University of Münster, Münster, Germany
| | - Philipp Neuhaus
- Institute of Medical Informatics, University of Münster, Münster, Germany
| | - Klaus Berger
- Institute of Epidemiology and Social Medicine, University of Münster, Münster, Germany
| | - Clemens Warnke
- Department of Neurology, University of Köln, Köln, Germany
| | - Sven G. Meuth
- Department of Neurology, University of Münster, Münster, Germany
| | - Heinz Wiendl
- Department of Neurology, University of Münster, Münster, Germany
| | - Martin Dugas
- Institute of Medical Informatics, University of Münster, Münster, Germany
| |
|