1
Gourabathina A, Wan Z, Brown JT, Yan C, Malin BA. PanDa Game: Optimized Privacy-Preserving Publishing of Individual-Level Pandemic Data Based on a Game Theoretic Model. IEEE Trans Nanobioscience 2023; 22:808-817. [PMID: 37289605 PMCID: PMC10702143 DOI: 10.1109/tnb.2023.3284092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Sharing individual-level pandemic data is essential for accelerating the understanding of a disease. For example, COVID-19 data have been widely collected to support public health surveillance and research. In the United States, these data are typically de-identified before publication to protect the privacy of the corresponding individuals. However, current data publishing approaches for this type of data, such as those adopted by the U.S. Centers for Disease Control and Prevention (CDC), have not adapted over time to account for the dynamic nature of infection rates. Thus, the policies generated by these strategies can either raise privacy risks or overprotect the data and impair its utility (or usability). To optimize the tradeoff between privacy risk and data utility, we introduce a game theoretic model that adaptively generates policies for the publication of individual-level COVID-19 data according to infection dynamics. We model the data publishing process as a two-player Stackelberg game between a data publisher and a data recipient and then search for the best strategy for the publisher. In this game, we consider 1) average performance of predicting future case counts; and 2) mutual information between the original data and the released data. We use COVID-19 case data from Vanderbilt University Medical Center from March 2020 to December 2021 to demonstrate the effectiveness of the new model. The results indicate that the game theoretic model outperforms all state-of-the-art baseline approaches, including those adopted by CDC, while maintaining low privacy risk. We further perform extensive sensitivity analyses to show that our findings are robust to order-of-magnitude parameter fluctuations.
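The abstract above names mutual information between the original and the released data as one of the quantities in the game. As a minimal sketch only, the empirical mutual information between a single original attribute and its generalized release could be computed as below. The function name `mutual_information`, the toy age values, and the decade-level generalization are assumptions made for illustration; they are not taken from the paper and do not reproduce its model.

```python
from collections import Counter
from math import log2


def mutual_information(original, released):
    """Empirical mutual information (in bits) between two equal-length sequences."""
    if len(original) != len(released):
        raise ValueError("sequences must have the same length")
    n = len(original)
    if n == 0:
        return 0.0
    joint = Counter(zip(original, released))  # counts of (x, y) pairs
    px = Counter(original)                    # marginal counts of x
    py = Counter(released)                    # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: exact ages and two candidate release policies.
ages = [23, 27, 34, 38, 45, 49, 52, 58]
by_decade = [a // 10 * 10 for a in ages]  # generalize to 10-year bins
suppressed = ["*"] * len(ages)            # suppress the attribute entirely

mi_decade = mutual_information(ages, by_decade)
mi_suppressed = mutual_information(ages, suppressed)
```

In this toy case the decade-level release retains 2 bits of the 3 bits carried by the exact ages, while full suppression retains none, which illustrates how coarser policies trade information for privacy.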
Affiliation(s)
- Abinitha Gourabathina
  - Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08540 USA
- Zhiyu Wan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- J. Thomas Brown
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Chao Yan
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
- Bradley A. Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
  - Department of Computer Science, Vanderbilt University, Nashville, TN 37212 USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203 USA
2
Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. FRONTIERS IN BIOINFORMATICS 2022; 2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] Open
Abstract
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's eye view of algorithms for SMHD anonymization. Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes. Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy-five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm, e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes. Conclusion: This study reviewed and categorized different anonymization approaches for SMHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches. Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
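The review above singles out k-anonymity, often paired with l-diversity, as the most common approach for demographic data. A minimal sketch of how the two properties might be measured on an already-generalized table follows. It uses the simplest "distinct values" reading of l-diversity, and the record layout, field names, and helper names (`k_anonymity_level`, `l_diversity_level`) are hypothetical and not drawn from any of the reviewed tools.

```python
from collections import defaultdict


def k_anonymity_level(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = defaultdict(int)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)] += 1
    return min(groups.values()) if groups else 0


def l_diversity_level(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in groups.values()) if groups else 0


# Hypothetical generalized table: age ranges and truncated ZIP codes.
table = [
    {"age": "30-39", "zip": "372**", "dx": "J45"},
    {"age": "30-39", "zip": "372**", "dx": "E11"},
    {"age": "40-49", "zip": "372**", "dx": "I10"},
    {"age": "40-49", "zip": "372**", "dx": "I10"},
]

k = k_anonymity_level(table, ["age", "zip"])
l = l_diversity_level(table, ["age", "zip"], "dx")
```

Here the table is 2-anonymous, yet the second group shares a single diagnosis code, so it is only 1-diverse. That gap is the usual reason the two criteria are applied together.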
Affiliation(s)
- Ali Sepas
  - Open Source Research Collaboration, Aalborg, Denmark
  - Department of Materials and Production, Aalborg University, Aalborg, Denmark
- Ali Haider Bangash
  - Open Source Research Collaboration, Aalborg, Denmark
  - STMU Shifa College of Medicine, Islamabad, Pakistan
- Omar Alraoui
  - Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Khaled El Emam
  - Canada Research Chair in Medical AI, University of Ottawa, Ottawa, ON, Canada
3
Zhong H, Loukides G, Pissis SP. Clustering demographics and sequences of diagnosis codes. IEEE J Biomed Health Inform 2021; 26:2351-2359. [PMID: 34797768 DOI: 10.1109/jbhi.2021.3129461] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
A Relational-Sequential dataset (or RS-dataset for short) contains records comprised of a patient's values in demographic attributes and their sequence of diagnosis codes. The task of clustering an RS-dataset is helpful for analyses ranging from pattern mining to classification. However, existing methods are not appropriate to perform this task. Thus, we initiate a study of how an RS-dataset can be clustered effectively and efficiently. We formalize the task of clustering an RS-dataset as an optimization problem. At the heart of the problem is a distance measure we design to quantify the pairwise similarity between records of an RS-dataset. Our measure uses a tree structure that encodes hierarchical relationships between records, based on their demographics, as well as an edit-distance-like measure that captures both the sequentiality and the semantic similarity of diagnosis codes. We also develop an algorithm which first identifies k representative records (centers), for a given k, and then constructs clusters, each containing one center and the records that are closer to the center compared to other centers. Experiments using two Electronic Health Record datasets demonstrate that our algorithm constructs compact and well-separated clusters, which preserve meaningful relationships between demographics and sequences of diagnosis codes, while being efficient and scalable.
4
Gagalova KK, Leon Elizalde MA, Portales-Casamar E, Görges M. What You Need to Know Before Implementing a Clinical Research Data Warehouse: Comparative Review of Integrated Data Repositories in Health Care Institutions. JMIR Form Res 2020; 4:e17687. [PMID: 32852280 PMCID: PMC7484778 DOI: 10.2196/17687] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/03/2020] [Revised: 06/09/2020] [Accepted: 07/17/2020] [Indexed: 12/23/2022] Open
Abstract
Background Integrated data repositories (IDRs), also referred to as clinical data warehouses, are platforms used for the integration of several data sources through specialized analytical tools that facilitate data processing and analysis. IDRs offer several opportunities for clinical data reuse, and the number of institutions implementing an IDR has grown steadily in the past decade. Objective The architectural choices of major IDRs are highly diverse and determining their differences can be overwhelming. This review aims to explore the underlying models and common features of IDRs, provide a high-level overview for those entering the field, and propose a set of guiding principles for small- to medium-sized health institutions embarking on IDR implementation. Methods We reviewed manuscripts published in peer-reviewed scientific literature between 2008 and 2020, and selected those that specifically describe IDR architectures. Of 255 shortlisted articles, we found 34 articles describing 29 different architectures. The different IDRs were analyzed for common features and classified according to their data processing and integration solution choices. Results Despite common trends in the selection of standard terminologies and data models, the IDRs examined showed heterogeneity in the underlying architecture design. We identified 4 common architecture models that use different approaches for data processing and integration. These different approaches were driven by a variety of features such as data sources, whether the IDR was for a single institution or a collaborative project, the intended primary data user, and purpose (research-only or including clinical or operational decision making). Conclusions IDR implementations are diverse and complex undertakings, which benefit from being preceded by an evaluation of requirements and definition of scope in the early planning stage. 
Factors such as data source diversity and intended users of the IDR influence data flow and synchronization, both of which are crucial factors in IDR architecture planning.
Affiliation(s)
- Kristina K Gagalova
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada
- M Angelica Leon Elizalde
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada
  - School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
- Elodie Portales-Casamar
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada
  - Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada
- Matthias Görges
  - Research Institute, BC Children's Hospital, Vancouver, BC, Canada
  - Department of Anesthesiology, Pharmacology and Therapeutics, University of British Columbia, Vancouver, BC, Canada
5
Song X, Waitman LR, Hu Y, Luo B, Li F, Liu M. The Impact of Medical Big Data Anonymization on Early Acute Kidney Injury Risk Prediction. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:617-625. [PMID: 32477684 PMCID: PMC7233037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Artificial intelligence enabled medical big data analysis has the potential to revolutionize medical practice from diagnosis and prediction of complex diseases to making recommendations and resource allocation decisions in an evidence-based manner. However, big data comes with big disclosure risks. To preserve privacy, excessive data anonymization is often necessary, leading to significant loss of data utility. In this paper, we develop a systematic data scrubbing procedure for large datasets when key variables are uncertain for re-identification risk assessment and assess the trade-off between anonymization of electronic health record data for sharing in support of open science and performance of machine learning models for early acute kidney injury risk prediction using the data. Results demonstrate that our proposed data scrubbing procedure can maintain good feature diversity and moderate data utility but raises concerns regarding its impact on knowledge discovery capability.
Affiliation(s)
- Xing Song
  - University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
- Lemuel R Waitman
  - University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
- Yong Hu
  - Jinan University, Big Data Decision Institute, Guangzhou, PRC
- Bo Luo
  - University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, KS, USA
- Fengjun Li
  - University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, KS, USA
- Mei Liu
  - University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
6
Clustering datasets with demographics and diagnosis codes. J Biomed Inform 2020; 102:103360. [PMID: 31904428 DOI: 10.1016/j.jbi.2019.103360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/23/2019] [Revised: 11/30/2019] [Accepted: 12/16/2019] [Indexed: 11/21/2022]
Abstract
Clustering data derived from Electronic Health Record (EHR) systems is important to discover relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks, such as classification. However, the heterogeneity of these data makes the application of existing clustering methods difficult and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprised of over 26,000 and 52,000 records, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
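The abstract above describes encoding each record as a binary feature vector and comparing records with cosine similarity. For binary vectors the measure reduces to the shared-feature count divided by the geometric mean of the two feature counts, which the sketch below computes on sets of active features. The feature strings and the name `binary_cosine` are illustrative assumptions; the paper's feature selection and its hierarchical clustering step are not reproduced here.

```python
from math import sqrt


def binary_cosine(a, b):
    """Cosine similarity between two binary vectors given as sets of active features."""
    if not a or not b:
        return 0.0  # convention chosen here for empty records
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical records: demographic values plus a diagnosis-code pattern.
rec1 = {"sex=F", "age=30-39", "dx:{250,401}"}
rec2 = {"sex=F", "age=30-39", "dx:{493}"}
rec3 = {"sex=M"}

sim_close = binary_cosine(rec1, rec2)
sim_far = binary_cosine(rec1, rec3)
```

The first pair shares two of three features and scores 2/3, while the pair with no shared feature scores 0. A pairwise matrix of such scores is the kind of input a hierarchical clustering routine would consume.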
7
Chevrier R, Foufi V, Gaudet-Blavignac C, Robert A, Lovis C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 2019; 21:e13484. [PMID: 31152528 PMCID: PMC6658290 DOI: 10.2196/13484] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/26/2019] [Indexed: 01/19/2023] Open
Abstract
Background The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients’ privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects’ privacy on one side, and the benefit of scientific advances on the other. Objective This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers. Methods Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently. 
Results After searching 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data. Conclusions Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several legislations, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. 
Using the definitions they provide could help address the variable use of these two concepts in the research community.
Affiliation(s)
- Raphaël Chevrier
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Vasiliki Foufi
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christophe Gaudet-Blavignac
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Arnaud Robert
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
8
Mandala J, Chandra Sekhara Rao M. Privacy preservation of data using crow search with adaptive awareness probability. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS 2019. [DOI: 10.1016/j.jisa.2018.12.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 10/27/2022]
9
Lantos JD. Ethical and Psychosocial Issues in Whole Genome Sequencing (WGS) for Newborns. Pediatrics 2019; 143:S1-S5. [PMID: 30600264 DOI: 10.1542/peds.2018-1099b] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Accepted: 07/03/2018] [Indexed: 11/24/2022] Open
Abstract
In this article, I review some of the ethical issues that have arisen in the past when genetic testing has been done in newborns. I then suggest how whole genome sequencing may raise a new set of issues. Finally, I introduce a series of other articles in which the authors address different controversies that arise when whole genome sequencing is used in the newborn period.
Affiliation(s)
- John D Lantos
  - Bioethics Center, Children's Mercy Hospital and University of Missouri - Kansas City, Kansas City, Missouri
10
Arellano AM, Dai W, Wang S, Jiang X, Ohno-Machado L. Privacy Policy and Technology in Biomedical Data Science. Annu Rev Biomed Data Sci 2018; 1:115-129. [PMID: 31058261 PMCID: PMC6497413 DOI: 10.1146/annurev-biodatasci-080917-013416] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 02/04/2023]
Abstract
Privacy is an important consideration when sharing clinical data, which often contain sensitive information. Adequate protection to safeguard patient privacy and to increase public trust in biomedical research is paramount. This review covers topics in policy and technology in the context of clinical data sharing. We review policy articles related to (a) the Common Rule, HIPAA privacy and security rules, and governance; (b) patients' viewpoints and consent practices; and (c) research ethics. We identify key features of the revised Common Rule and the most notable changes since its previous version. We address data governance for research in addition to the increasing emphasis on ethical and social implications. Research ethics topics include data sharing best practices, use of data from populations of low socioeconomic status (SES), recent updates to institutional review board (IRB) processes to protect human subjects' data, and important concerns about the limitations of current policies to address data deidentification. In terms of technology, we focus on articles that have applicability in real world health care applications: deidentification methods that comply with HIPAA, data anonymization approaches to satisfy well-acknowledged issues in deidentified data, encryption methods to safeguard data analyses, and privacy-preserving predictive modeling. The first two technology topics are mostly relevant to methodologies that attempt to sanitize structured or unstructured data. The third topic includes analysis on encrypted data. The last topic includes various mechanisms to build statistical models without sharing raw data.
Affiliation(s)
- April Moreno Arellano
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Wenrui Dai
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Shuang Wang
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Xiaoqian Jiang
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
- Lucila Ohno-Machado
  - Department of Biomedical Informatics, School of Medicine, University of California, San Diego, La Jolla, California 92093, USA
11
Meystre SM, Lovis C, Bürkle T, Tognola G, Budrionis A, Lehmann CU. Clinical Data Reuse or Secondary Use: Current Status and Potential Future Progress. Yearb Med Inform 2017; 26:38-52. [PMID: 28480475 PMCID: PMC6239225 DOI: 10.15265/iy-2017-007] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Received: 05/08/2017] [Indexed: 12/30/2022] Open
Abstract
Objective: To perform a review of recent research in clinical data reuse or secondary use, and envision future advances in this field. Methods: The review is based on a large literature search in MEDLINE (through PubMed), conference proceedings, and the ACM Digital Library, focusing only on research published between 2005 and early 2016. Each selected publication was reviewed by the authors, and a structured analysis and summarization of its content was developed. Results: The initial search produced 359 publications, reduced after a manual examination of abstracts and full publications. The following aspects of clinical data reuse are discussed: motivations and challenges, privacy and ethical concerns, data integration and interoperability, data models and terminologies, unstructured data reuse, structured data mining, clinical practice and research integration, and examples of clinical data reuse (quality measurement and learning healthcare systems). Conclusion: Reuse of clinical data is a fast-growing field recognized as essential to realize the potentials for high quality healthcare, improved healthcare management, reduced healthcare costs, population health management, and effective clinical research.
Affiliation(s)
- S. M. Meystre
  - Medical University of South Carolina, Charleston, SC, USA
- C. Lovis
  - Division of Medical Information Sciences, University Hospitals of Geneva, Switzerland
- T. Bürkle
  - University of Applied Sciences, Bern, Switzerland
- G. Tognola
  - Institute of Electronics, Computer and Telecommunication Engineering, Italian Natl. Research Council IEIIT-CNR, Milan, Italy
- A. Budrionis
  - Norwegian Centre for E-health Research, University Hospital of North Norway, Tromsø, Norway
- C. U. Lehmann
  - Departments of Biomedical Informatics and Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
12
Ozalp I, Gursoy ME, Nergiz ME, Saygin Y. Privacy-Preserving Publishing of Hierarchical Data. ACM TRANSACTIONS ON PRIVACY AND SECURITY 2016. [DOI: 10.1145/2976738] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
Abstract
Many applications today rely on storage and management of semi-structured information, for example, XML databases and document-oriented databases. These data often have to be shared with untrusted third parties, which makes individuals’ privacy a fundamental problem. In this article, we propose anonymization techniques for privacy-preserving publishing of hierarchical data. We show that the problem of anonymizing hierarchical data poses unique challenges that cannot be readily solved by existing mechanisms. We extend two standards for privacy protection in tabular data (k-anonymity and ℓ-diversity) and apply them to hierarchical data. We present utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values. To evaluate our algorithms and their heuristics, we experiment on synthetic and real datasets obtained from two universities. Our experiments show that we significantly outperform related methods that provide comparable privacy guarantees.
13
Poulis G, Loukides G, Skiadopoulos S, Gkoulalas-Divanis A. Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints. J Biomed Inform 2016; 65:76-96. [PMID: 27832965 DOI: 10.1016/j.jbi.2016.11.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 02/05/2016] [Revised: 10/22/2016] [Accepted: 11/01/2016] [Indexed: 10/20/2022]
Abstract
Publishing data about patients that contain both demographics and diagnosis codes is essential to perform large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss, incurred by anonymization, to preserve the utility of general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are not suitable for being used in this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce the requirement (i) by applying (k, k^m)-anonymity, a privacy principle that prevents re-identification from attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture the requirement (ii), we propose the concept of utility constraint for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii), by employing well-established information loss measures for demographics and for diagnosis codes. To realize our approach, we develop an algorithm that enforces (k, k^m)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and with minimal information loss, according to the measures. Our experiments with a large dataset containing more than 200,000 electronic health records show the effectiveness and efficiency of our algorithm.
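The privacy principle in the abstract above, conventionally written (k, k^m)-anonymity, targets an attacker who knows a patient's demographics and up to m of their diagnosis codes. One simplified reading of that condition, offered only as a sketch, is that every combination of demographics plus at most m codes occurring in some record must be matched by at least k records. The checker below tests that reading by brute force. The record layout and the name `satisfies_k_km` are hypothetical, and the paper's anonymization algorithm (generalization under utility constraints) is not reproduced.

```python
from collections import Counter
from itertools import combinations


def satisfies_k_km(records, k, m):
    """Check a simplified (k, k^m)-anonymity condition.

    records: iterable of (demographics_tuple, set_of_diagnosis_codes).
    Returns True if every (demographics, subset of <= m codes) seen in some
    record is supported by at least k records.
    """
    support = Counter()
    for demo, codes in records:
        unique_codes = sorted(set(codes))
        # size 0 covers an attacker who knows demographics only
        for size in range(min(m, len(unique_codes)) + 1):
            for subset in combinations(unique_codes, size):
                support[(demo, subset)] += 1
    return all(count >= k for count in support.values())


# Hypothetical toy dataset.
data = [
    (("F", "30-39"), {"250", "401"}),
    (("F", "30-39"), {"250", "401"}),
    (("M", "40-49"), {"493"}),
    (("M", "40-49"), {"493", "401"}),
]

ok_demographics_only = satisfies_k_km(data, k=2, m=0)
ok_one_code = satisfies_k_km(data, k=2, m=1)
```

With m = 0 the toy data pass, because each demographic group has two records. With m = 1 they fail, because only one male patient aged 40-49 has code 401. The brute-force enumeration grows quickly with m, so it is suited to small checks rather than to datasets of the size used in the paper.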
Affiliation(s)
- Giorgos Poulis
  - Department of Informatics and Telecommunications, University of the Peloponnese, Greece.
- Spiros Skiadopoulos
  - Department of Informatics and Telecommunications, University of the Peloponnese, Greece.
14
Kanbar LJ, Shalish W, Robles-Rubio CA, Precup D, Brown K, Sant'Anna GM, Kearney RE. Organizational principles of cloud storage to support collaborative biomedical research. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2016; 2015:1231-4. [PMID: 26736489 DOI: 10.1109/embc.2015.7318589] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
This paper describes organizational guidelines and an anonymization protocol for the management of sensitive information in interdisciplinary, multi-institutional studies with multiple collaborators. This protocol is flexible, automated, and suitable for use in cloud-based projects as well as for publication of supplementary information in journal papers. A sample implementation of the anonymization protocol is illustrated for an ongoing study dealing with Automated Prediction of EXtubation readiness (APEX).
|
15
|
Udtha M, Nomie K, Yu E, Sanner J. Novel and emerging strategies for longitudinal data collection. J Nurs Scholarsh 2014; 47:152-60. [PMID: 25490868 DOI: 10.1111/jnu.12116] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Accepted: 09/28/2014] [Indexed: 01/07/2023]
Abstract
PURPOSE: To describe novel and emerging strategies practiced globally in research to improve longitudinal data collection.
ORGANIZING CONSTRUCT: In research studies, numerous strategies such as telephone interviews, postal mailing, online questionnaires, and electronic mail are traditionally utilized in longitudinal data collection. However, due to technological advances, novel and emerging strategies have been applied to longitudinal data collection, such as two-way short message service, smartphone applications (or "apps"), retrieval capabilities applied to the electronic medical record, and an adapted cloud interface. In this review, traditional longitudinal data collection strategies are briefly described, emerging and novel strategies are detailed and explored, and information regarding the impact of novel methods on participant response rates, the timeliness of participant responses, and cost is provided. We further discuss how these novel and emerging strategies affect longitudinal data collection and advance research, specifically nursing research.
CONCLUSIONS: Evidence suggests that the novel and emerging longitudinal data collection strategies discussed in this review are valuable approaches to consider. These strategies facilitate collecting longitudinal research data to better understand a variety of health-related conditions. Future studies, including nursing research, should consider using novel and emerging strategies to advance longitudinal data collection.
CLINICAL RELEVANCE: A better understanding of novel and emerging longitudinal data collection strategies will ultimately improve longitudinal data collection as well as foster research efforts. Nurse researchers, along with all researchers, must be aware of and consider implementing novel and emerging strategies to ensure future healthcare research success.
Affiliation(s)
- Malini Udtha
- Lab and Research Coordinator of Nursing Systems, University of Texas Health Science Center at Houston School of Nursing, Houston, TX, USA
|
16
|
|
17
|
Secondary use of clinical data: the Vanderbilt approach. J Biomed Inform 2014; 52:28-35. [PMID: 24534443 DOI: 10.1016/j.jbi.2014.02.003] [Citation(s) in RCA: 174] [Impact Index Per Article: 17.4] [Received: 07/21/2013] [Revised: 12/21/2013] [Accepted: 02/04/2014] [Indexed: 01/04/2023]
Abstract
The last decade has seen an exponential growth in the quantity of clinical data collected nationwide, triggering an increase in opportunities to reuse the data for biomedical research. The Vanderbilt research data warehouse framework consists of identified and de-identified clinical data repositories, fee-for-service custom services, and tools built atop the data layer to assist researchers across the enterprise. Providing resources dedicated to research initiatives benefits not only the research community, but also clinicians, patients and institutional leadership. This work provides a summary of our approach to the secondary use of clinical data for research, including a description of key components and a list of lessons learned, designed to assist others assembling similar services and infrastructure.
|
18
|
Atreya RV, Smith JC, McCoy AB, Malin B, Miller RA. Reducing patient re-identification risk for laboratory results within research datasets. J Am Med Inform Assoc 2013; 20:95-101. [PMID: 22822040 PMCID: PMC3555327 DOI: 10.1136/amiajnl-2012-001026] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 04/13/2012] [Accepted: 07/02/2012] [Indexed: 01/08/2023]
Abstract
OBJECTIVE: To lower patient re-identification risks for biomedical research databases containing laboratory test results while also minimizing changes in clinical data interpretation.
MATERIALS AND METHODS: In our threat model, an attacker obtains 5-7 laboratory results from one patient and uses them as a search key to discover the corresponding record in a de-identified biomedical research database. To test our models, the existing Vanderbilt TIME database of 8.5 million Safe Harbor de-identified laboratory results from 61,280 patients was used. The uniqueness of unaltered laboratory results in the dataset was examined, and then two data perturbation models were applied: simple random offsets and an expert-derived clinical meaning-preserving model. A rank-based re-identification algorithm was used to mimic an attack. The re-identification risk and the retention of clinical meaning for each model's perturbed laboratory results were assessed.
RESULTS: Differences in re-identification rates between the algorithms were small despite substantial divergence in altered clinical meaning. The expert algorithm maintained the clinical meaning of laboratory results better (affecting up to 4% of test results) than simple perturbation (affecting up to 26%).
DISCUSSION AND CONCLUSION: With growing impetus for sharing clinical data for research, and in view of healthcare-related federal privacy regulation, methods to mitigate risks of re-identification are important. A practical, expert-derived perturbation algorithm that demonstrated potential utility was developed. Similar approaches might enable administrators to select data protection scheme parameters that meet their preferences in the trade-off between the protection of privacy and the retention of clinical meaning of shared data.
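To make the threat model in the abstract above concrete, the following is a minimal sketch of a random-offset perturbation and a rank-based matching attack. It is an illustration under stated assumptions, not the models evaluated in the paper: the uniform relative offset, the sum-of-absolute-differences distance, the function names `perturb` and `rank_candidates`, and all values are hypothetical.

```python
import random


def perturb(values, max_rel_offset, rng):
    """Apply an independent uniform relative offset to each laboratory value.

    Assumption: offsets are drawn from [-max_rel_offset, +max_rel_offset];
    the abstract does not specify the offset distribution.
    """
    return [v * (1 + rng.uniform(-max_rel_offset, max_rel_offset)) for v in values]


def rank_candidates(known_results, database):
    """Order patient ids from closest to farthest match to the attacker's key.

    Assumption: closeness is the sum of absolute differences across tests.
    """
    def distance(record):
        return sum(abs(a - b) for a, b in zip(known_results, record))

    return sorted(database, key=lambda pid: distance(database[pid]))


# Made-up records: five laboratory results per patient, in a fixed test order.
db = {
    "p1": [5.1, 140.0, 4.2, 98.0, 0.9],
    "p2": [6.3, 138.0, 3.9, 110.0, 1.1],
    "p3": [5.0, 141.0, 4.4, 95.0, 0.8],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
released = {pid: perturb(vals, 0.05, rng) for pid, vals in db.items()}

# The attacker holds p1's true results and searches the released data.
ranking = rank_candidates(db["p1"], released)
print(ranking)
```

The rank at which the true patient appears in `ranking` is one simple way to express re-identification risk: larger offsets tend to push the true record down the list, at the cost of altering the clinical meaning of the values.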
Affiliation(s)
- Ravi V Atreya
- Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37232-8340, USA.
|
19
|
|
20
|
Russell MW, Wilder NS. Getting personal: understanding how genetic variation affects clinical outcomes in patients with tetralogy of Fallot. Pediatr Res 2012; 72:334-6. [PMID: 23032507 PMCID: PMC3576875 DOI: 10.1038/pr.2012.104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
The work by Jeewa et al. is an important step toward “personalizing” or individualizing our approach to care of patients with tetralogy of Fallot. Although future studies will need to confirm the potential role of HIF1A-mediated signaling in right ventricular remodeling, it raises the possibility that modulation of the HIF1A signaling pathway or its downstream effectors such as TGF-β may allow better preservation of ventricular function in patients with TOF. Furthermore, directed genotyping for HIF1A and other genetic variants may help identify patients at risk for adverse outcomes. This study demonstrates the potential for genetics-of-outcomes studies to evaluate novel therapeutic targets and to identify at-risk populations that may require specific therapeutic considerations.
Affiliation(s)
- Mark W. Russell
- Division of Pediatric Cardiology, Department of Pediatrics and Communicable Diseases, University of Michigan, Ann Arbor, Michigan
- Nicole S. Wilder
- Division of Pediatric Anesthesiology, Department of Anesthesiology, University of Michigan, Ann Arbor, Michigan
|