1
Gierend K, Krüger F, Genehr S, Hartmann F, Siegel F, Waltemath D, Ganslandt T, Zeleke AA. Provenance Information for Biomedical Data and Workflows: Scoping Review. J Med Internet Res 2024; 26:e51297. [PMID: 39178413 PMCID: PMC11380065 DOI: 10.2196/51297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 05/30/2024] [Accepted: 06/17/2024] [Indexed: 08/25/2024] Open
Abstract
BACKGROUND The record of the origin and the history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good scientific practice in clinical research.
OBJECTIVE This scoping review aims to identify approaches and criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking.
METHODS This scoping review followed the methodological framework developed by Arksey and O'Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan screening tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published.
RESULTS The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, cited requirements concerned the knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
CONCLUSIONS As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating automated and scalable solutions. With rising legal and scientific demands, there is an urgent need for greater transparency in implementing provenance systems in research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR2-10.2196/31750.
Affiliation(s)
- Kerstin Gierend
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Frank Krüger
  - Faculty of Engineering, Wismar University of Applied Sciences, Wismar, Germany
  - Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Sascha Genehr
  - Institute of Communications Engineering, University of Rostock, Rostock, Germany
- Francisca Hartmann
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Fabian Siegel
  - Department of Biomedical Informatics, Mannheim Institute for intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Dagmar Waltemath
  - Department of Medical Informatics, University Medicine Greifswald, Greifswald, Germany
- Thomas Ganslandt
  - Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
2
Kamdje Wabo G, Moorthy P, Siegel F, Seuchter SA, Ganslandt T. Evaluating and Enhancing the Fitness-for-Purpose of Electronic Health Record Data: Qualitative Study on Current Practices and Pathway to an Automated Approach Within the Medical Informatics for Research and Care in University Medicine Consortium. JMIR Med Inform 2024; 12:e57153. [PMID: 39158950 PMCID: PMC11369535 DOI: 10.2196/57153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 05/31/2024] [Accepted: 07/22/2024] [Indexed: 08/20/2024] Open
Abstract
BACKGROUND Leveraging electronic health record (EHR) data for clinical or research purposes heavily depends on data fitness. However, there is a lack of standardized frameworks to evaluate EHR data suitability, leading to inconsistent quality in data use projects (DUPs). This research focuses on the Medical Informatics for Research and Care in University Medicine (MIRACUM) Data Integration Centers (DICs) and examines empirical practices on assessing and automating the fitness-for-purpose of clinical data in German DIC settings.
OBJECTIVE The study aims (1) to capture and discuss how MIRACUM DICs evaluate and enhance the fitness-for-purpose of observational health care data and examine the alignment with existing recommendations and (2) to identify the requirements for designing and implementing a computer-assisted solution to evaluate EHR data fitness within MIRACUM DICs.
METHODS A qualitative approach was followed using an open-ended survey across DICs of 10 German university hospitals affiliated with MIRACUM. Data were analyzed using thematic analysis following an inductive qualitative method.
RESULTS All 10 MIRACUM DICs participated, with 17 participants revealing various approaches to assessing data fitness, including the 4-eyes principle and data consistency checks such as cross-system data value comparison. Common practices included a DUP-related feedback loop on data fitness and using self-designed dashboards for monitoring. Most experts had a computer science background and a master's degree, suggesting strong technological proficiency but potentially lacking clinical or statistical expertise. Nine key requirements for a computer-assisted solution were identified, including flexibility, understandability, extendibility, and practicability. Participants used heterogeneous data repositories for evaluating data quality criteria and practical strategies to communicate with research and clinical teams.
CONCLUSIONS The study identifies gaps between current practices in MIRACUM DICs and existing recommendations, offering insights into the complexities of assessing and reporting clinical data fitness. Additionally, a tripartite modular framework for fitness-for-purpose assessment was introduced to streamline the forthcoming implementation. It provides valuable input for developing and integrating an automated solution across multiple locations. This may include statistical comparisons to advanced machine learning algorithms for operationalizing frameworks such as the 3×3 data quality assessment framework. These findings provide foundational evidence for future design and implementation studies to enhance data quality assessments for specific DUPs in observational health care settings.
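One of the consistency checks named in the results, cross-system data value comparison, could look roughly like the sketch below. It is not taken from the study: the function name `compare_systems`, the record layout (dictionaries keyed by a record identifier), and the sample values are all assumptions made for illustration.

```python
def compare_systems(source, target, fields):
    """Compare selected fields of records held in two systems.

    source, target: dicts mapping a record ID to a dict of field values
    (hypothetical layout). Returns a list of (record_id, field, issue) tuples.
    """
    issues = []
    for record_id, src_row in source.items():
        tgt_row = target.get(record_id)
        if tgt_row is None:
            # The record never arrived in the target system.
            issues.append((record_id, None, "missing in target"))
            continue
        for field in fields:
            if src_row.get(field) != tgt_row.get(field):
                issues.append((record_id, field, "value differs"))
    return issues


# Made-up example data: one transposed birth year, one missing record.
source_system = {
    "p1": {"sex": "F", "birth_year": 1980},
    "p2": {"sex": "M", "birth_year": 1975},
    "p3": {"sex": "F", "birth_year": 1990},
}
research_repository = {
    "p1": {"sex": "F", "birth_year": 1980},
    "p2": {"sex": "M", "birth_year": 1957},
}

issues = compare_systems(source_system, research_repository, ["sex", "birth_year"])
for issue in issues:
    print(issue)
```

A check of this kind only flags disagreement between systems; deciding which value is correct still needs the human review step (for example the 4-eyes principle) that the participants described.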
Affiliation(s)
- Gaetan Kamdje Wabo
  - Center for Preventive Medicine and Digital Health Baden-Wuerttemberg, Department of Biomedical Informatics, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany
- Preetha Moorthy
  - Center for Preventive Medicine and Digital Health Baden-Wuerttemberg, Department of Biomedical Informatics, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany
- Fabian Siegel
  - Center for Preventive Medicine and Digital Health Baden-Wuerttemberg, Department of Biomedical Informatics, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany
  - Department of Urology and Urosurgery, University Medical Center of Mannheim, Medical Faculty of Mannheim, University of Heidelberg, Mannheim, Germany
- Susanne A Seuchter
  - Medical Center for Information and Communication Technology, Erlangen University Hospital, Erlangen, Germany
- Thomas Ganslandt
  - Medical Center for Information and Communication Technology, Erlangen University Hospital, Erlangen, Germany
  - Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
3
Waltemath D, Beyan O, Crameri K, Dedié A, Gierend K, Gröber P, Inau ET, Michaelis L, Reinecke I, Sedlmayr M, Thun S, Krefting D. [FAIR health data in the national and international data space]. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz 2024; 67:710-720. [PMID: 38750239 PMCID: PMC11166787 DOI: 10.1007/s00103-024-03884-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 04/19/2024] [Indexed: 06/12/2024]
Abstract
Health data are extremely important in today's data-driven world. Through automation, healthcare processes can be optimized, and clinical decisions can be supported. For any reuse of data, the quality, validity, and trustworthiness of data are essential, and it is the only way to guarantee that data can be reused sensibly. Specific requirements for the description and coding of reusable data are defined in the FAIR guiding principles for data stewardship. Various national research associations and infrastructure projects in the German healthcare sector have already clearly positioned themselves on the FAIR principles: both the infrastructures of the Medical Informatics Initiative and the University Medicine Network operate explicitly on the basis of the FAIR principles, as do the National Research Data Infrastructure for Personal Health Data and the German Center for Diabetes Research.
To ensure that a resource complies with the FAIR principles, the degree of FAIRness should first be determined (so-called FAIR assessment), followed by the prioritization for improvement steps (so-called FAIRification). Since 2016, a set of tools and guidelines have been developed for both steps, based on the different, domain-specific interpretations of the FAIR principles.
Neighboring European countries have also invested in the development of a national framework for semantic interoperability in the context of the FAIR (Findable, Accessible, Interoperable, Reusable) principles. Concepts for comprehensive data enrichment were developed to simplify data analysis, for example, in the European Health Data Space or via the Observational Health Data Sciences and Informatics network. With the support of the European Open Science Cloud, among others, structured FAIRification measures have already been taken for German health datasets.
Affiliation(s)
- Dagmar Waltemath
  - Abteilung Medizininformatik, Institut für Community Medicine, Walther-Rathenau-Straße 48, 17475, Greifswald, Deutschland
- Oya Beyan
  - Medizinische Fakultät und Uniklinik Köln, Institut für Biomedizininformatik, Universität zu Köln, Köln, Deutschland
- Katrin Crameri
  - Schweizerisches Institut für Bioinformatik, Personalisierte Gesundheitsinformatik, Basel, Schweiz
- Angela Dedié
  - Deutsches Zentrum für Diabetesforschung (DZD), Geschäftsstelle am Helmholtz Zentrum München, München, Deutschland
- Kerstin Gierend
  - Abteilung für Biomedizinische Informatik am Zentrum für Präventivmedizin und Digitale Gesundheit (CPD), Medizinische Fakultät Mannheim der Universität Heidelberg, Mannheim, Deutschland
- Petra Gröber
  - Datenintegrationszentrum Universitätsmedizin Rostock, Rostock, Deutschland
- Esther Thea Inau
  - Abteilung Medizininformatik, Institut für Community Medicine, Walther-Rathenau-Straße 48, 17475, Greifswald, Deutschland
- Lea Michaelis
  - Abteilung Medizininformatik, Institut für Community Medicine, Walther-Rathenau-Straße 48, 17475, Greifswald, Deutschland
- Ines Reinecke
  - Datenintegrationszentrum, Zentrum für Medizinische Informatik, Universitätsklinikum Carl Gustav Carus Dresden, Dresden, Deutschland
- Martin Sedlmayr
  - Institut für Medizinische Informatik und Biometrie, Med. Fakultät Carl Gustav Carus, TU Dresden, Dresden, Deutschland
- Sylvia Thun
  - Berliner Institut für Gesundheitsforschung in der Charité - Universitätsmedizin Berlin, Berlin, Deutschland
- Dagmar Krefting
  - Institut für Medizinische Informatik, Universitätsmedizin Göttingen und Deutsches Zentrum für Herz-Kreislauf-Forschung, Partner Site Göttingen, Göttingen, Deutschland
4
Gierend K, Waltemath D, Ganslandt T, Siegel F. Traceable Research Data Sharing in a German Medical Data Integration Center With FAIR (Findability, Accessibility, Interoperability, and Reusability)-Geared Provenance Implementation: Proof-of-Concept Study. JMIR Form Res 2023; 7:e50027. [PMID: 38060305 PMCID: PMC10739241 DOI: 10.2196/50027] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/16/2023] [Revised: 10/25/2023] [Accepted: 11/01/2023] [Indexed: 12/08/2023] Open
Abstract
BACKGROUND Secondary investigations into digital health records, including electronic patient data from German medical data integration centers (DICs), pave the way for enhanced future patient care. However, only limited information is captured regarding the integrity, traceability, and quality of the (sensitive) data elements. This lack of detail diminishes trust in the validity of the collected data. From a technical standpoint, adhering to the widely accepted FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for data stewardship necessitates enriching data with provenance-related metadata. Provenance offers insights into the readiness for the reuse of a data element and serves as a supplier of data governance.
OBJECTIVE The primary goal of this study is to augment the reusability of clinical routine data within a medical DIC for secondary utilization in clinical research. Our aim is to establish provenance traces that underpin the status of data integrity, reliability, and consequently, trust in electronic health records, thereby enhancing the accountability of the medical DIC. We present the implementation of a proof-of-concept provenance library integrating international standards as an initial step.
METHODS We adhered to a customized road map for a provenance framework, and examined the data integration steps across the ETL (extract, transform, and load) phases. Following a maturity model, we derived requirements for a provenance library. Using this research approach, we formulated a provenance model with associated metadata and implemented a proof-of-concept provenance class. Furthermore, we seamlessly incorporated the internationally recognized World Wide Web Consortium (W3C) provenance standard, aligned the resultant provenance records with the interoperable health care standard Fast Healthcare Interoperability Resources, and presented them in various representation formats. Ultimately, we conducted a thorough assessment of provenance trace measurements.
RESULTS This study marks the inaugural implementation of integrated provenance traces at the data element level within a German medical DIC. We devised and executed a practical method that synergizes the robustness of quality- and health standard-guided (meta)data management practices. Our measurements indicate commendable pipeline execution times, attaining notable levels of accuracy and reliability in processing clinical routine data, thereby ensuring accountability in the medical DIC. These findings should inspire the development of additional tools aimed at providing evidence-based and reliable electronic health record services for secondary use.
CONCLUSIONS The research method outlined for the proof-of-concept provenance class has been crafted to promote effective and reliable core data management practices. It aims to enhance biomedical data by imbuing it with meaningful provenance, thereby bolstering the benefits for both research and society. Additionally, it facilitates the streamlined reuse of biomedical data. As a result, the system mitigates risks, as data analysis without knowledge of the origin and quality of all data elements is rendered futile. While the approach was initially developed for the medical DIC use case, these principles can be universally applied throughout the scientific domain.
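To make the idea of a provenance trace at the data element level more concrete, the following sketch records a single ETL step using the core W3C PROV concepts (entity, activity, agent, and the relations between them). It is not the authors' provenance class: the function name `build_prov_record`, the `ex` namespace, and all identifiers are hypothetical, and the dictionary layout is only loosely modeled on PROV-JSON without claiming conformance.

```python
from datetime import datetime, timezone


def build_prov_record(source_id, derived_id, activity_id, agent_id, started, ended):
    """Return a PROV-style dict for one data element passing through one ETL step."""
    ns = "ex"  # hypothetical namespace prefix
    src, out = f"{ns}:{source_id}", f"{ns}:{derived_id}"
    act, agent = f"{ns}:{activity_id}", f"{ns}:{agent_id}"
    return {
        "prefix": {ns: "https://example.org/dic/"},  # placeholder URI
        "entity": {src: {}, out: {}},
        "activity": {
            act: {
                "prov:startTime": started.isoformat(),
                "prov:endTime": ended.isoformat(),
            }
        },
        "agent": {agent: {"prov:type": "prov:SoftwareAgent"}},
        # The step read the source element ...
        "used": {"_:u1": {"prov:activity": act, "prov:entity": src}},
        # ... produced the derived element ...
        "wasGeneratedBy": {"_:g1": {"prov:entity": out, "prov:activity": act}},
        # ... so the derived element traces back to its source.
        "wasDerivedFrom": {"_:d1": {"prov:generatedEntity": out, "prov:usedEntity": src}},
        # The ETL job is the agent responsible for the step.
        "wasAssociatedWith": {"_:a1": {"prov:activity": act, "prov:agent": agent}},
    }


record = build_prov_record(
    "lab-value-raw", "lab-value-mapped", "transform-step", "etl-job",
    datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    datetime(2023, 1, 1, 12, 0, 5, tzinfo=timezone.utc),
)
print(record["wasDerivedFrom"])
```

Keeping one such record per data element and per ETL phase is what allows a later query to walk from a value in the research repository back to its origin; mapping the records to a FHIR Provenance resource, as the abstract mentions, would be a separate step not shown here.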
Affiliation(s)
- Kerstin Gierend
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Dagmar Waltemath
  - Core Unit Data Integration Center and Medical Informatics Laboratory, University Medicine Greifswald, Greifswald, Germany
- Thomas Ganslandt
  - Chair of Medical Informatics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Fabian Siegel
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany