1
Naef A, Coduti E, Windisch PY. The Anonymous Data Warehouse: A Hands-On Framework for Anonymizing Data From Digital Health Applications. Cureus 2024; 16:e57519. [PMID: 38707006 PMCID: PMC11067565 DOI: 10.7759/cureus.57519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/02/2024] [Indexed: 05/07/2024] [Open access]
Abstract
The digital health space is growing rapidly, and so is the interest in sharing anonymized health data. However, data anonymization techniques have yet to see much coverage in the medical literature. The purpose of this article is, therefore, to provide a practical framework for anonymization with a focus on the unique properties of data from digital health applications. Literature trends, as well as common anonymization techniques, were synthesized into a framework that considers the opportunities and challenges of digital health data. A rationale for each design decision is provided, and the advantages and disadvantages are discussed. We propose a framework based on storing data separately, anonymizing the data where the identified data is located, only exporting selected data, minimizing static attributes, ensuring k-anonymity of users and their static attributes, and preventing defined metrics from acting as quasi-identifiers by using aggregation, rounding, and capping. Data anonymization requires a pragmatic approach that preserves the utility of the data while minimizing reidentification risk. The proposed framework should be modified according to the characteristics of the respective data set.
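Two of the steps named in the abstract above, k-anonymity of static attributes and rounding and capping of metrics, can be illustrated with a minimal sketch. The function names, the record layout, and the example attributes are assumptions made for illustration and are not taken from the article.

```python
from collections import Counter


def round_and_cap(value, step, cap):
    """Round a metric to the nearest multiple of `step`, then cap it at `cap`.

    Coarsening and capping make it less likely that an unusual value
    (for example, an extreme daily step count) acts as a quasi-identifier.
    """
    return min(round(value / step) * step, cap)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least `k` records."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())
```

A data set that fails the check would typically need coarser attribute bands or suppression of the small groups before export.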
Affiliation(s)
- André Naef
- Innovation Team, dacadoo AG, Zürich, CHE
2
Privacy in electronic health records: a systematic mapping study. J Public Health (Oxf) 2023. [DOI: 10.1007/s10389-022-01795-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023] [Open access]
Abstract
Main
Electronic health record (EHR) applications are digital versions of paper-based patient health information. Traditionally, medical records are made on paper. However, nowadays, advances in information and communication technology have made it possible to change medical records from paper to EHR. Therefore, preserving user data privacy is extremely important in healthcare environments. The main challenges are providing ways to make EHR systems increasingly capable of ensuring data privacy and at the same time not compromising the performance and interoperability of these systems.
Subject and methods
This systematic mapping study intends to investigate the current research on security and privacy requirements in EHR systems and identify potential research gaps in the literature. Our research was carried out in the Scopus database, the largest database of abstracts and citations of peer-reviewed literature.
Results
We have collected 848 articles related to the area. After disambiguation and filtering, we selected 30 articles for analysis. The result of such an analysis provides a comprehensive view of current research.
Conclusions
We can highlight some relevant research possibilities. First, we noticed a growing interest in privacy in EHR research in the last 6 years. Second, blockchain has been used in many EHR systems as a solution to achieve data privacy. However, it is a challenge to maintain traceability by recording metadata that can be mapped to private data of the users applying a particular mapping function that can be hosted outside the blockchain. Finally, the lack of a systematic approach between EHR solutions and existing laws or policies leads to better strategies for developing a certification process for EHR systems.
3
Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. Frontiers in Bioinformatics 2022; 2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] [Open access]
Abstract
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's eye view of algorithms for SMHD anonymization. Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes. Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm; e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes. Conclusion: This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches. Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
Affiliation(s)
- Ali Sepas
- Open Source Research Collaboration, Aalborg, Denmark
- Department of Materials and Production, Aalborg University, Aalborg, Denmark
- Ali Haider Bangash
- Open Source Research Collaboration, Aalborg, Denmark
- STMU Shifa College of Medicine, Islamabad, Pakistan
- Omar Alraoui
- Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Khaled El Emam
- Canada Research Chair in Medical AI, University of Ottawa, Ottawa, ON, Canada
4
Olatunji IE, Rauch J, Katzensteiner M, Khosla M. A Review of Anonymization for Healthcare Data. Big Data 2022. [PMID: 35271377 DOI: 10.1089/big.2021.0169] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/14/2023]
Abstract
Mining health data can lead to faster medical decisions, improvement in the quality of treatment, disease prevention, and reduced cost, and it drives innovative solutions within the healthcare sector. However, health data are highly sensitive and subject to regulations such as the General Data Protection Regulation, which aims to ensure patients' privacy. Anonymization, or the removal of patient-identifiable information, is the most conventional approach and the first important step toward adhering to the regulations and addressing privacy concerns. In this article, we review the existing anonymization techniques and their applicability to various types (relational and graph based) of health data. We also provide an overview of possible attacks on anonymized data. We illustrate via a reconstruction attack that anonymization, although necessary, is not sufficient to address patient privacy, and we discuss methods for protecting against such attacks. Finally, we discuss tools that can be used to achieve anonymization.
Affiliation(s)
- Jens Rauch
- Health Informatics Research Group, University of Applied Sciences, Osnabrück, Germany
- Megha Khosla
- L3S Research Center, Leibniz University, Hannover, Germany
5
Zhong H, Loukides G, Pissis SP. Clustering demographics and sequences of diagnosis codes. IEEE J Biomed Health Inform 2021; 26:2351-2359. [PMID: 34797768 DOI: 10.1109/jbhi.2021.3129461] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
A Relational-Sequential dataset (or RS-dataset for short) contains records comprised of a patient's values in demographic attributes and their sequence of diagnosis codes. The task of clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure we design to quantify the pairwise similarity between records of an RS-dataset. Our measure uses a tree structure that encodes hierarchical relationships between records, based on their demographics, as well as an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters, which preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.
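The center-then-assign structure described in this abstract can be sketched as follows. This is not the authors' algorithm: the farthest-first choice of centers and the toy distance (age difference plus plain edit distance over code sequences) are stand-ins assumed for illustration, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences of diagnosis codes."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def toy_distance(r1, r2):
    """Assumed record distance: scaled age gap plus edit distance of codes."""
    (age1, codes1), (age2, codes2) = r1, r2
    return abs(age1 - age2) / 10 + edit_distance(codes1, codes2)


def farthest_first_centers(records, k, distance):
    """Pick k centers greedily: each new center is the record farthest
    from the centers chosen so far (an assumed selection rule)."""
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    centers = [records[0]]
    while len(centers) < k:
        centers.append(
            max(records, key=lambda r: min(distance(r, c) for c in centers))
        )
    return centers


def assign_to_centers(records, centers, distance):
    """Build one cluster per center; each record joins its nearest center."""
    clusters = {i: [] for i in range(len(centers))}
    for record in records:
        nearest = min(
            range(len(centers)), key=lambda i: distance(record, centers[i])
        )
        clusters[nearest].append(record)
    return clusters
```

Any distance function with the same two-argument signature can be swapped in, which is where a tree-based demographic measure and a semantics-aware sequence measure would go.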
6
Pedrosa M, Zuquete A, Costa C. A Pseudonymisation Protocol With Implicit and Explicit Consent Routes for Health Records in Federated Ledgers. IEEE J Biomed Health Inform 2021; 25:2172-2183. [PMID: 33006933 DOI: 10.1109/jbhi.2020.3028454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Healthcare data for primary use (diagnosis) may be encrypted for confidentiality purposes; however, secondary uses such as feeding machine learning algorithms require open access. Full anonymity has no traceable identifiers to report diagnosis results. Moreover, implicit and explicit consent routes are of practical importance under recent data protection regulations (GDPR), translating directly into break-the-glass requirements. Pseudonymisation is an acceptable compromise when dealing with such orthogonal requirements and is an advisable measure to protect data. Our work presents a pseudonymisation protocol that is compliant with implicit and explicit consent routes. The protocol is constructed on a (t,n)-threshold secret sharing scheme and public key cryptography. The pseudonym is safely derived from a fragment of public information without requiring any data-subject's secret. The method is proven secure under reasonable cryptographic assumptions and scalable from the experimental results.
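To make the (t,n)-threshold idea concrete, the sketch below uses Shamir's secret sharing, a well-known scheme of this kind. The abstract does not state which threshold scheme the protocol uses, so Shamir's scheme, the chosen prime, and the function names are assumptions for illustration only; this is not the authors' protocol and is not hardened for production use.

```python
import secrets

# A Mersenne prime used as the field modulus (an assumed parameter).
PRIME = 2**127 - 1


def split_secret(secret, t, n, prime=PRIME):
    """Split `secret` into n shares so that any t of them can recover it."""
    if not 1 <= t <= n < prime:
        raise ValueError("require 1 <= t <= n < prime")
    if not 0 <= secret < prime:
        raise ValueError("secret must lie in [0, prime)")
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo prime
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret
```

With t = 3 and n = 5, any three shares reconstruct the secret, while fewer than three reveal nothing about it in the information-theoretic sense.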
7
Clustering datasets with demographics and diagnosis codes. J Biomed Inform 2020; 102:103360. [PMID: 31904428 DOI: 10.1016/j.jbi.2019.103360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/23/2019] [Revised: 11/30/2019] [Accepted: 12/16/2019] [Indexed: 11/21/2022]
Abstract
Clustering data derived from Electronic Health Record (EHR) systems is important to discover relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks, such as classification. However, the heterogeneity of these data makes the application of existing clustering methods difficult and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprised of over 26,000 and 52,000 records, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
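The binary representation and cosine similarity mentioned in this abstract can be illustrated with a small sketch. The feature vocabulary, the feature labels, and the function names are invented for illustration; the paper's feature selection (frequent and correlated code patterns) and its hierarchical clustering step are not reproduced here.

```python
import math


def binarize(record_features, vocabulary):
    """Encode a record as a 0/1 vector over a fixed feature vocabulary."""
    return [1 if feature in record_features else 0 for feature in vocabulary]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For binary vectors the dot product simply counts shared features, so two records are similar when they share many demographic values and code patterns relative to how many each has.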
8
DePriest KN, Shields TM, Curriero FC. Returning to our roots: The use of geospatial data for nurse-led community research. Res Nurs Health 2019; 42:467-475. [PMID: 31599459 DOI: 10.1002/nur.21984] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/30/2019] [Accepted: 09/23/2019] [Indexed: 12/22/2022]
Abstract
In the early 20th century, public health nurse Lillian Wald addressed the social determinants of health (SDOH) through her work in New York City and her advocacy to improve policy in workplace conditions, education, recreation, and housing. In the early 21st century, addressing the SDOH is a renewed priority and provides nurse researchers with an opportunity to return to our roots. The purpose of this methods paper is to examine how the incorporation of geospatial data and spatial methodologies in community research can enhance the analyses of the complex relationships between social determinants and health. Geospatial technologies, software for mapping and working with geospatial data, statistical methods, and unique considerations are discussed. An exemplar for using geospatial data is presented regarding associations between neighborhood greenspace, neighborhood violence, and children's asthma control. This innovative use of geospatial data illustrates a new frontier in investigating nontraditional connections between the environment and SDOH outcomes.
Affiliation(s)
- Kelli N DePriest
- School of Nursing, Johns Hopkins University, Baltimore, Maryland
- Timothy M Shields
- Department of Epidemiology, Spatial Science for Public Health Center, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
- Frank C Curriero
- Department of Epidemiology, Spatial Science for Public Health Center, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
9
Mandala J, Chandra Sekhara Rao M. Privacy preservation of data using crow search with adaptive awareness probability. Journal of Information Security and Applications 2019. [DOI: 10.1016/j.jisa.2018.12.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/27/2022]
10
Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes. Sci Data 2019; 6:180298. [PMID: 30620344 PMCID: PMC6326114 DOI: 10.1038/sdata.2018.298] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/29/2018] [Accepted: 11/26/2018] [Indexed: 12/19/2022] [Open access]
Abstract
We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources.
11
Arellano AM, Dai W, Wang S, Jiang X, Ohno-Machado L. Privacy Policy and Technology in Biomedical Data Science. Annu Rev Biomed Data Sci 2018; 1:115-129. [PMID: 31058261 PMCID: PMC6497413 DOI: 10.1146/annurev-biodatasci-080917-013416] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 02/04/2023]
Abstract
Privacy is an important consideration when sharing clinical data, which often contain sensitive information. Adequate protection to safeguard patient privacy and to increase public trust in biomedical research is paramount. This review covers topics in policy and technology in the context of clinical data sharing. We review policy articles related to (a) the Common Rule, HIPAA privacy and security rules, and governance; (b) patients' viewpoints and consent practices; and (c) research ethics. We identify key features of the revised Common Rule and the most notable changes since its previous version. We address data governance for research in addition to the increasing emphasis on ethical and social implications. Research ethics topics include data sharing best practices, use of data from populations of low socioeconomic status (SES), recent updates to institutional review board (IRB) processes to protect human subjects' data, and important concerns about the limitations of current policies to address data deidentification. In terms of technology, we focus on articles that have applicability in real-world health care applications: deidentification methods that comply with HIPAA, data anonymization approaches to satisfy well-acknowledged issues in deidentified data, encryption methods to safeguard data analyses, and privacy-preserving predictive modeling. The first two technology topics are mostly relevant to methodologies that attempt to sanitize structured or unstructured data. The third topic includes analysis on encrypted data. The last topic includes various mechanisms to build statistical models without sharing raw data.
Affiliation(s)
- April Moreno Arellano
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Wenrui Dai
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Shuang Wang
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Xiaoqian Jiang
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Lucila Ohno-Machado
- Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA