1
Henke E, Zoch M, Peng Y, Reinecke I, Sedlmayr M, Bathelt F. Conceptual design of a generic data harmonization process for OMOP common data model. BMC Med Inform Decis Mak 2024; 24:58. [PMID: 38408983 PMCID: PMC10895818 DOI: 10.1186/s12911-024-02458-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 02/09/2024] [Indexed: 02/28/2024] Open
Abstract
BACKGROUND To gain insight into the real-life care of patients in the healthcare system, data from hospital information systems and insurance systems are required. Consequently, linking clinical data with claims data is necessary. To ensure their syntactic and semantic interoperability, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) from the Observational Health Data Sciences and Informatics (OHDSI) community was chosen. However, there is no detailed guide that would allow researchers to follow a generic process for data harmonization, i.e. the transformation of local source data into the standardized OMOP CDM format. Thus, the aim of this paper is to conceptualize a generic data harmonization process for OMOP CDM. METHODS For this purpose, we conducted a literature review focusing on publications that address the harmonization of clinical or claims data in OMOP CDM. Subsequently, the process steps used, their chronological order, and the applied OHDSI tools were extracted for each included publication. The results were then compared to derive a generic sequence of the process steps. RESULTS From the 23 included publications, a generic data harmonization process for OMOP CDM was conceptualized, consisting of nine process steps: dataset specification, data profiling, vocabulary identification, coverage analysis of vocabularies, semantic mapping, structural mapping, extract-transform-load process, qualitative data quality analysis, and quantitative data quality analysis. Furthermore, we identified seven OHDSI tools which supported five of the process steps. CONCLUSIONS The generic data harmonization process can be used as a step-by-step guide to assist other researchers in harmonizing source data in OMOP CDM.
Affiliation(s)
- Elisa Henke
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, 01307, Dresden, Germany
- Michele Zoch
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, 01307, Dresden, Germany
- Yuan Peng
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, 01307, Dresden, Germany
- Ines Reinecke
  - Data Integration Center, Center for Medical Informatics, University Hospital Carl Gustav Carus Dresden, 01307, Dresden, Germany
- Martin Sedlmayr
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, 01307, Dresden, Germany
2
Henke E, Zoch M, Kallfelz M, Ruhnke T, Leutner LA, Spoden M, Günster C, Sedlmayr M, Bathelt F. Assessing the Use of German Claims Data Vocabularies for Research in the Observational Medical Outcomes Partnership Common Data Model: Development and Evaluation Study. JMIR Med Inform 2023; 11:e47959. [PMID: 37942786 PMCID: PMC10653283 DOI: 10.2196/47959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Revised: 09/07/2023] [Accepted: 09/09/2023] [Indexed: 11/10/2023] Open
Abstract
Background National classifications and terminologies already routinely used for documentation within patient care settings enable the unambiguous representation of clinical information. However, the diversity of different vocabularies across health care institutions and countries is a barrier to achieving semantic interoperability and exchanging data across sites. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) enables the standardization of structure and medical terminology. It allows the mapping of national vocabularies into so-called standard concepts, representing normative expressions for international analyses and research. Within our project "Hybrid Quality Indicators Using Machine Learning Methods" (Hybrid-QI), we aim to harmonize source codes used in German claims data vocabularies that are currently unavailable in the OMOP CDM. Objective This study aims to increase the coverage of German vocabularies in the OMOP CDM. We aim to completely transform the source codes used in German claims data into the OMOP CDM without data loss and make German claims data usable for OMOP CDM-based research. Methods To prepare the missing German vocabularies for the OMOP CDM, we defined a vocabulary preparation approach consisting of the identification of all codes of the corresponding vocabularies, their assembly into machine-readable tables, and the translation of German designations into English. Furthermore, we used 2 proposed approaches for OMOP-compliant vocabulary preparation: the mapping to standard concepts using the Observational Health Data Sciences and Informatics (OHDSI) tool Usagi and the preparation of new 2-billion concepts (ie, concept_id >2 billion). Finally, we evaluated the prepared vocabularies regarding completeness and correctness using synthetic German claims data and calculated the coverage of German claims data vocabularies in the OMOP CDM. 
Results Our vocabulary preparation approach was able to map 3 missing German vocabularies to standard concepts and prepare 8 vocabularies as new 2-billion concepts. The completeness evaluation showed that the prepared vocabularies cover 44.3% (3288/7417) of the source codes contained in German claims data. The correctness evaluation revealed that the specified validity periods in the OMOP CDM are compliant for the majority (705,531/706,032, 99.9%) of source codes and associated dates in German claims data. The calculation of the vocabulary coverage showed a noticeable decrease of missing vocabularies from 55% (11/20) to 10% (2/20) due to our preparation approach. Conclusions By preparing 10 vocabularies, we showed that our approach is applicable to any type of vocabulary used in a source data set. The prepared vocabularies are currently limited to German vocabularies, which can only be used in national OMOP CDM research projects, because the mapping of new 2-billion concepts to standard concepts is missing. To participate in international OHDSI network studies with German claims data, future work is required to map the prepared 2-billion concepts to standard concepts.
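The percentages reported in this abstract follow directly from the stated counts. The short check below recomputes them; the helper name `percent` is an illustrative assumption and is not taken from the study.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return round(100 * numerator / denominator, 1)


# Counts exactly as quoted in the abstract above.
completeness = percent(3288, 7417)        # source codes covered by the prepared vocabularies
correctness = percent(705_531, 706_032)   # source codes with compliant validity periods
missing_before = percent(11, 20)          # missing vocabularies before preparation
missing_after = percent(2, 20)            # missing vocabularies after preparation

print(completeness, correctness, missing_before, missing_after)
# prints: 44.3 99.9 55.0 10.0
```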
Affiliation(s)
- Elisa Henke
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany
- Michéle Zoch
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany
- Thomas Ruhnke
  - Wissenschaftliches Institut der AOK (AOK Research Institute), Berlin, Germany
- Liz Annika Leutner
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany
- Melissa Spoden
  - Wissenschaftliches Institut der AOK (AOK Research Institute), Berlin, Germany
- Christian Günster
  - Wissenschaftliches Institut der AOK (AOK Research Institute), Berlin, Germany
- Martin Sedlmayr
  - Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany
3
de Groot R, Püttmann DP, Fleuren LM, Thoral PJ, Elbers PWG, de Keizer NF, Cornet R. Determining and assessing characteristics of data element names impacting the performance of annotation using Usagi. Int J Med Inform 2023; 178:105200. [PMID: 37703800 DOI: 10.1016/j.ijmedinf.2023.105200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 08/11/2023] [Accepted: 08/23/2023] [Indexed: 09/15/2023]
Abstract
INTRODUCTION Hospitals generate large amounts of data, and this data is generally modeled and labeled in a proprietary way, hampering its exchange and integration. Manually annotating data element names to internationally standardized data element identifiers is a time-consuming effort. Tools can support performing this task automatically. This study aimed to determine what factors influence the quality of automatic annotations. METHODS Data element names were taken from the Dutch COVID-19 ICU Data Warehouse, which contains data on intensive care patients with COVID-19 from 25 hospitals in the Netherlands. In this data warehouse, the data had been merged using a proprietary terminology system while also storing the original hospital labels (synonymous names). Usagi, an OHDSI annotation tool, was used to annotate the data. A gold standard was used to determine whether Usagi made correct annotations. Logistic regression was used to determine whether the number of characters, number of words, match score (Usagi's certainty), and hospital label origin influenced whether Usagi annotated correctly. RESULTS Usagi automatically annotated 30.5% of the data element names correctly and 5.5% of the synonymous names. The match score is the best predictor of Usagi finding the correct annotation. The AUC was 0.651 for the data element names and 0.752 for the synonymous names. The AUC for the individual hospital label origins varied between 0.460 and 0.905. DISCUSSION The results show that Usagi annotated the data element names better than the synonymous names. The hospital origin in the synonymous names dataset was associated with the number of correctly annotated concepts. Hospitals that performed better had shorter synonymous names with fewer words. Using shorter data element names or synonymous names should be considered to optimize the automatic annotation process.
Overall, the performance of Usagi is too poor to rely on completely for automatic annotation.
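To make the AUC figures above concrete, the following minimal sketch computes the AUC of a single predictor (such as a match score) against a gold standard using the rank-based (Mann-Whitney) definition. The function name `auc` and the toy scores and labels are hypothetical illustrations, not data or code from the study, which used logistic regression.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen correct annotation (label 1)
    receives a higher score than a randomly chosen incorrect one (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical match scores and gold-standard correctness flags.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
print(round(auc(toy_scores, toy_labels), 3))  # prints: 0.889
```

An AUC of 0.5 corresponds to a score that carries no information about correctness, which is why values such as 0.460 for some hospital label origins indicate that the match score was not a useful predictor there.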
Affiliation(s)
- Rowdy de Groot
  - Amsterdam UMC Location University of Amsterdam, Department of Medical Informatics, Amsterdam, the Netherlands
- Daniel P Püttmann
  - Amsterdam UMC Location University of Amsterdam, Department of Medical Informatics, Amsterdam, the Netherlands
- Lucas M Fleuren
  - Department of Intensive Care Medicine, Center for Critical Care Computation Intelligence (C4i), Amsterdam Medical Data Science (AMDS), Amsterdam Public Health (APH), Amsterdam Cardiovascular Science (ACS), Amsterdam Institute for Infection and Immunity (AII), Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
- Patrick J Thoral
  - Department of Intensive Care Medicine, Center for Critical Care Computation Intelligence (C4i), Amsterdam Medical Data Science (AMDS), Amsterdam Public Health (APH), Amsterdam Cardiovascular Science (ACS), Amsterdam Institute for Infection and Immunity (AII), Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
- Paul W G Elbers
  - Department of Intensive Care Medicine, Center for Critical Care Computation Intelligence (C4i), Amsterdam Medical Data Science (AMDS), Amsterdam Public Health (APH), Amsterdam Cardiovascular Science (ACS), Amsterdam Institute for Infection and Immunity (AII), Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
- Nicolette F de Keizer
  - Amsterdam UMC Location University of Amsterdam, Department of Medical Informatics, Amsterdam, the Netherlands
- Ronald Cornet
  - Amsterdam UMC Location University of Amsterdam, Department of Medical Informatics, Amsterdam, the Netherlands
4
Kumar S, Nanelia A, Mariappan R, Rajagopal A, Rajan V. Patient Representation Learning From Heterogeneous Data Sources and Knowledge Graphs Using Deep Collective Matrix Factorization: Evaluation Study. JMIR Med Inform 2022; 10:e28842. [PMID: 35049514 PMCID: PMC8814927 DOI: 10.2196/28842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2021] [Revised: 11/07/2021] [Accepted: 11/14/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Patient representation learning aims to learn features, also called representations, from input sources automatically, often in an unsupervised manner, for use in predictive models. This obviates the need for cumbersome, time- and resource-intensive manual feature engineering, especially from unstructured data such as text, images, or graphs. Most previous techniques have used neural network-based autoencoders to learn patient representations, primarily from clinical notes in electronic medical records (EMRs). Knowledge graphs (KGs), with clinical entities as nodes and their relations as edges, can be extracted automatically from biomedical literature and provide complementary information to EMR data that have been found to provide valuable predictive signals. OBJECTIVE This study aims to evaluate the efficacy of collective matrix factorization (CMF), both the classical variant and a recent neural architecture called deep CMF (DCMF), in integrating heterogeneous data sources from EMR and KG to obtain patient representations for clinical decision support tasks. METHODS Using a recent formulation for obtaining graph representations through matrix factorization within the context of CMF, we infused auxiliary information during patient representation learning. We also extended the DCMF architecture to create a task-specific end-to-end model that learns to simultaneously find effective patient representations and predictions. We compared the efficacy of such a model to that of first learning unsupervised representations and then independently learning a predictive model. We evaluated patient representation learning using CMF-based methods and autoencoders for 2 clinical decision support tasks on a large EMR data set. RESULTS Our experiments show that DCMF provides a seamless way for integrating multiple sources of data to obtain patient representations, both in unsupervised and supervised settings. 
Its performance in single-source settings is comparable with that of previous autoencoder-based representation learning methods. When DCMF is used to obtain representations from a combination of EMR and KG, where most previous autoencoder-based methods cannot be used directly, its performance is superior to that of previous nonneural methods for CMF. Infusing information from KGs into patient representations using DCMF was found to improve downstream predictive performance. CONCLUSIONS Our experiments indicate that DCMF is a versatile model that can be used to obtain representations from single and multiple data sources and combine information from EMR data and KGs. Furthermore, DCMF can be used to learn representations in both supervised and unsupervised settings. Thus, DCMF offers an effective way of integrating heterogeneous data sources and infusing auxiliary knowledge into patient representations.
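As a rough illustration of the idea behind collective matrix factorization, the sketch below factorizes two matrices that share a row entity (here assumed to be patients) so that both reconstructions use the same patient factors. It is a simplified classical CMF trained with plain gradient descent on squared error, not the DCMF architecture evaluated in the study; the function name, matrix shapes, hyperparameters, and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def collective_mf(X, Y, rank=2, steps=500, lr=0.01, reg=0.01, seed=0):
    """Jointly factorize X ~ U @ V.T and Y ~ U @ W.T with shared row factors U.

    X: (patients x features of source 1), Y: (patients x features of source 2).
    Returns the factors and the reconstruction loss recorded at each step.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = 0.1 * rng.standard_normal((n, rank))           # shared patient representations
    V = 0.1 * rng.standard_normal((X.shape[1], rank))  # factors for source 1
    W = 0.1 * rng.standard_normal((Y.shape[1], rank))  # factors for source 2
    losses = []
    for _ in range(steps):
        Rx = U @ V.T - X  # residual for source 1
        Ry = U @ W.T - Y  # residual for source 2
        losses.append(float((Rx ** 2).sum() + (Ry ** 2).sum()))
        # Gradients of 0.5 * (squared error + reg * squared factor norms).
        grad_U = Rx @ V + Ry @ W + reg * U
        grad_V = Rx.T @ U + reg * V
        grad_W = Ry.T @ U + reg * W
        U -= lr * grad_U
        V -= lr * grad_V
        W -= lr * grad_W
    return U, V, W, losses


# Synthetic low-rank data standing in for two sources describing 8 patients.
rng = np.random.default_rng(1)
true_U = rng.standard_normal((8, 2))
X = true_U @ rng.standard_normal((5, 2)).T
Y = true_U @ rng.standard_normal((4, 2)).T

U, V, W, losses = collective_mf(X, Y)
print(U.shape, losses[0] > losses[-1])
```

Because `U` receives gradient signal from both residuals, each row of `U` is a patient representation informed by both sources, which is the property the abstract relies on when combining EMR data with knowledge graph information.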
Affiliation(s)
- Alicia Nanelia
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
- Ragunathan Mariappan
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
- Vaibhav Rajan
  - Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
5
A contextual multi-task neural approach to medication and adverse events identification from clinical text. J Biomed Inform 2021; 125:103960. [PMID: 34875387 DOI: 10.1016/j.jbi.2021.103960] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 09/27/2021] [Revised: 11/04/2021] [Accepted: 11/22/2021] [Indexed: 12/27/2022]
Abstract
Effective wide-scale pharmacovigilance calls for accurate named entity recognition (NER) of medication entities such as drugs, dosages, reasons, and adverse drug events (ADE) from clinical text. The scarcity of adverse event annotations and underlying semantic ambiguities make accurate scope identification challenging. The current research explores integrating contextualized language models and multi-task learning from diverse clinical NER datasets to mitigate this challenge. We propose a novel multi-task adaptation method to refine the embeddings generated by the Bidirectional Encoder Representations from Transformers (BERT) language model to improve inter-task knowledge sharing. We integrated the adapted BERT model into a unique hierarchical multi-task neural network comprising the medication and auxiliary clinical NER tasks. We validated the model using two different versions of BERT on diverse well-studied clinical tasks: Medication and ADE (n2c2 2018/n2c2 2009), Clinical Concepts (n2c2 2010/n2c2 2012), Disorders (ShAReCLEF 2013). Overall medication extraction performance improved by up to +1.19 F1 (n2c2 2018), while generalization improved by +5.38 F1 (n2c2 2009), compared to standalone BERT baselines. ADE recognition improved significantly (McNemar's test), outperforming prior baselines. Similar benefits were observed on the auxiliary clinical and disorder tasks. We demonstrate that combining multi-dataset BERT adaptation and multi-task learning outperforms prior medication extraction methods without requiring additional features, newer training data, or ensembling. Taken together, the study contributes an initial case study towards integrating diverse clinical datasets in an end-to-end NER model for clinical decision support.