1
Khatiwada P, Yang B, Lin JC, Blobel B. Patient-Generated Health Data (PGHD): Understanding, Requirements, Challenges, and Existing Techniques for Data Security and Privacy. J Pers Med 2024; 14:282. [PMID: 38541024 PMCID: PMC10971637 DOI: 10.3390/jpm14030282] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 01/30/2024] [Revised: 02/21/2024] [Accepted: 02/28/2024] [Indexed: 11/27/2024] Open
Abstract
The evolution of Patient-Generated Health Data (PGHD) represents a major shift in healthcare, fueled by technological progress. The advent of PGHD, with technologies such as wearable devices and home monitoring systems, extends data collection beyond clinical environments, enabling continuous monitoring and patient engagement in their health management. Despite the growing prevalence of PGHD, there is a lack of clear understanding among stakeholders about its meaning, along with concerns about data security, privacy, and accuracy. This article aims to thoroughly review and clarify PGHD by examining its origins, types, technological foundations, and the challenges it faces, especially in terms of privacy and security regulations. The review emphasizes the role of PGHD in transforming healthcare through patient-centric approaches, their understanding, and personalized care, while also exploring emerging technologies and addressing data privacy and security issues, offering a comprehensive perspective on the current state and future directions of PGHD. The methodology employed for this review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and used Rayyan, an AI-powered tool for systematic literature reviews. This approach ensures systematic and comprehensive coverage of the available literature on PGHD, focusing on the various aspects outlined in the objective. The review encompassed 36 peer-reviewed articles from various esteemed publishers and databases, reflecting a diverse range of methodologies, including interviews, regular articles, review articles, and empirical studies, to address three research questions (RQs) related to PGHD: exploratory, impact assessment, and solution-oriented. Additionally, to address the future-oriented fourth RQ for PGHD not covered in the above review, we have incorporated existing domain knowledge articles. This inclusion aims to provide answers encompassing both basic and advanced security measures for PGHD, thereby enhancing the depth and scope of our analysis.
Affiliation(s)
- Pankaj Khatiwada
  - Department of Information Security and Communication Technology (IIK), Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway
- Bian Yang
  - Department of Information Security and Communication Technology (IIK), Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway
- Jia-Chun Lin
  - Department of Information Security and Communication Technology (IIK), Norwegian University of Science and Technology (NTNU), 7034 Trondheim, Norway
- Bernd Blobel
  - Medical Faculty, University of Regensburg, 93053 Regensburg, Germany
2
Sarowar Sattar AHM, Li J, Liu J, Heatherly R, Malin B. A Probabilistic Approach to Mitigate Composition Attacks on Privacy in Non-Coordinated Environments. Knowl Based Syst 2015; 67:361-372. [PMID: 25598581 DOI: 10.1016/j.knosys.2014.04.019] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/24/2022]
Abstract
Organizations share data about individuals to drive business and comply with law and regulation. However, an adversary may expose confidential information by tracking an individual across disparate data publications using quasi-identifying attributes (e.g., age, geocode, and sex) associated with the records. Various studies have shown that well-established privacy protection models (e.g., k-anonymity and its extensions) fail to protect an individual's privacy against this "composition attack". This type of attack can be thwarted when organizations coordinate prior to data publication, but such a practice is not always feasible. In this paper, we introduce a probabilistic model called (d, α)-linkable, which mitigates composition attacks without coordination. The model ensures that d confidential values are associated with a quasi-identifying group with a likelihood of α. We realize this model through an efficient extension to k-anonymization and use extensive experiments to show that our strategy significantly reduces the likelihood of a successful composition attack and can preserve more utility than alternative privacy models, such as differential privacy.
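To make the "composition attack" described in this abstract concrete, the following is a minimal sketch of the attack itself, not of the (d, α)-linkable model. The two releases, the quasi-identifier tuple (age band, geocode, sex), and the diagnoses are hypothetical data invented for illustration. Each release on its own lists three distinct sensitive values for the group, yet intersecting them leaves a single candidate.

```python
from collections import defaultdict


def sensitive_values_by_group(records):
    """Map each quasi-identifier tuple to the set of sensitive values published with it."""
    groups = defaultdict(set)
    for quasi_id, sensitive in records:
        groups[quasi_id].add(sensitive)
    return groups


def compose(release_a, release_b, quasi_id):
    """Sensitive values an adversary cannot rule out for a person known to be in both releases."""
    a = sensitive_values_by_group(release_a)
    b = sensitive_values_by_group(release_b)
    return a[quasi_id] & b[quasi_id]


# Hypothetical, independently anonymized releases from two organizations.
group = ("30-39", "5095", "F")
hospital_a = [(group, "flu"), (group, "asthma"), (group, "diabetes")]
hospital_b = [(group, "diabetes"), (group, "hepatitis"), (group, "migraine")]

candidates = compose(hospital_a, hospital_b, group)
print(candidates)  # {'diabetes'}
```

The sketch only shows why uncoordinated publications can leak; how the paper's model bounds this likelihood is not specified in the abstract and is not reproduced here.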
Affiliation(s)
- A H M Sarowar Sattar
  - School of Information Technology and Mathematical Science, University of South Australia, Mawson Lakes, SA-5095, Australia
- Jiuyong Li
  - School of Information Technology and Mathematical Science, University of South Australia, Mawson Lakes, SA-5095, Australia
- Jixue Liu
  - School of Information Technology and Mathematical Science, University of South Australia, Mawson Lakes, SA-5095, Australia
- Raymond Heatherly
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
- Bradley Malin
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
3
Krishnamoorthy P, Gupta D, Chatterjee S, Huston J, Ryan JJ. A review of the role of electronic health record in genomic research. J Cardiovasc Transl Res 2014; 7:692-700. [PMID: 25119857 DOI: 10.1007/s12265-014-9586-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 06/05/2014] [Accepted: 08/05/2014] [Indexed: 12/15/2022]
Abstract
Electronic health record (EHR)-driven genomic research is a recent strategy used to answer research questions using EHR data linked to DNA samples. In models using EHR, after the subject's DNA is collected, a linkage between the DNA sample and the EHR data is maintained. This makes the EHR the paramount source of phenotypic information. The National Human Genome Research Institute-sponsored Electronic Medical Records and Genomics (eMERGE) network began in five sites in 2007 and was expanded to nine sites in 2012. This network has developed the methods and best practices for utilizing EHR as a tool for genomic research. Therefore, it is vital to understand the configuration of EHR used to capture data in clinical practice and the feasibility of integration with clinical genetic test results. We present a detailed review of the role and importance of EHR in the field of genomic research.
4
Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA, Sanderson SC, Kannry J, Zinberg R, Basford MA, Brilliant M, Carey DJ, Chisholm RL, Chute CG, Connolly JJ, Crosslin D, Denny JC, Gallego CJ, Haines JL, Hakonarson H, Harley J, Jarvik GP, Kohane I, Kullo IJ, Larson EB, McCarty C, Ritchie MD, Roden DM, Smith ME, Böttinger EP, Williams MS. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med 2013; 15:761-71. [PMID: 23743551 PMCID: PMC3795928 DOI: 10.1038/gim.2013.72] [Citation(s) in RCA: 528] [Impact Index Per Article: 44.0] [Received: 01/30/2013] [Accepted: 04/18/2013] [Indexed: 12/13/2022] Open
Abstract
The Electronic Medical Records and Genomics Network is a National Human Genome Research Institute–funded consortium engaged in the development of methods and best practices for using the electronic medical record as a tool for genomic research. Now in its sixth year and second funding cycle, and comprising nine research groups and a coordinating center, the network has played a major role in validating the concept that clinical data derived from electronic medical records can be used successfully for genomic research. Current work is advancing knowledge in multiple disciplines at the intersection of genomics and health-care informatics, particularly for electronic phenotyping, genome-wide association studies, genomic medicine implementation, and the ethical and regulatory issues associated with genomics research and returning results to study participants. Here, we describe the evolution, accomplishments, opportunities, and challenges of the network from its inception as a five-group consortium focused on genotype–phenotype associations for genomic discovery to its current form as a nine-group consortium pivoting toward the implementation of genomic medicine. Genet Med 15(10):761–771.
5
Erdal BS, Liu J, Ding J, Chen J, Marsh CB, Kamal J, Clymer BD. A database de-identification framework to enable direct queries on medical data for secondary use. Methods Inf Med 2012; 51:229-41. [PMID: 22311158 DOI: 10.3414/me11-01-0048] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/31/2011] [Accepted: 11/08/2011] [Indexed: 11/09/2022]
Abstract
OBJECTIVE: To qualify the use of patient clinical records as non-human-subject for research purposes, electronic medical record data must be de-identified so there is minimum risk of protected health information exposure. This study demonstrated a robust framework for structured data de-identification that can be applied to any relational data source that needs to be de-identified. METHODS: Using a real-world clinical data warehouse, a pilot implementation of limited subject areas was used to demonstrate and evaluate this new de-identification process. Query results and performance were compared between the source and target systems to validate data accuracy and usability. RESULTS: The combination of hashing, pseudonyms, and a session-dependent randomizer provides a rigorous de-identification framework to guard against 1) source identifier exposure; 2) an internal data analyst manually linking to source identifiers; and 3) identifier cross-linking among different researchers or multiple query sessions by the same researcher. In addition, a query rejection option is provided to refuse queries resulting in fewer than a preset number of subjects and total records, to prevent users from accidental subject identification due to a low volume of data. This framework does not prevent subject re-identification based on prior knowledge and sequence of events. Also, it does not deal with medical free-text de-identification, although text de-identification using natural language processing can be included due to its modular design. CONCLUSION: We demonstrated a framework resulting in HIPAA-compliant databases that can be directly queried by researchers. This technique can be augmented to facilitate inter-institutional research data sharing through existing middleware such as caGrid.
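The combination named in the RESULTS (hashing, pseudonyms, a session-dependent randomizer, and query rejection) can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the keyed HMAC-SHA256 construction, the names `pseudonym`, `run_query`, and `MIN_SUBJECTS`, the threshold value, and the row layout are all hypothetical, and only the subject-count check is shown (the abstract also mentions a total-record threshold).

```python
import hashlib
import hmac
import secrets

MIN_SUBJECTS = 5  # hypothetical rejection threshold, not taken from the paper


def pseudonym(patient_id: str, site_key: bytes, session_salt: bytes) -> str:
    """Keyed hash of the source identifier; the per-session salt changes the
    pseudonym between sessions so results cannot be cross-linked."""
    digest = hmac.new(site_key + session_salt, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def run_query(rows, site_key, session_salt, min_subjects=MIN_SUBJECTS):
    """Return de-identified rows, or None when too few distinct subjects match."""
    subjects = {row["patient_id"] for row in rows}
    if len(subjects) < min_subjects:
        return None  # reject low-volume results to avoid accidental identification
    return [
        {"pid": pseudonym(row["patient_id"], site_key, session_salt), "dx": row["dx"]}
        for row in rows
    ]


# Hypothetical usage: one secret site key, a fresh random salt per query session.
site_key = secrets.token_bytes(32)
session_1 = secrets.token_bytes(16)
session_2 = secrets.token_bytes(16)
rows = [{"patient_id": f"MRN{i:04d}", "dx": "asthma"} for i in range(6)]

result_1 = run_query(rows, site_key, session_1)
result_2 = run_query(rows, site_key, session_2)
rejected = run_query(rows[:2], site_key, session_1)
```

Within one session the same patient always maps to the same pseudonym, so joins still work; across sessions the pseudonyms differ, which is the property the abstract attributes to the session-dependent randomizer.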
Affiliation(s)
- B S Erdal
  - Information Warehouse, The Ohio State University Medical Center, Columbus, Ohio, USA
6
Airoldi EM, Bai X, Malin BA. An Entropy Approach to Disclosure Risk Assessment: Lessons from Real Applications and Simulated Domains. Decision Support Systems 2011; 51:10-20. [PMID: 21647242 PMCID: PMC3107517 DOI: 10.1016/j.dss.2010.11.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/30/2023]
Abstract
We live in an increasingly mobile world, which leads to the duplication of information across domains. Though organizations attempt to obscure the identities of their constituents when sharing information for worthwhile purposes, such as basic research, the uncoordinated nature of such environments can lead to privacy vulnerabilities. For instance, disparate healthcare providers can collect information on the same patient. Federal policy requires that such providers share "de-identified" sensitive data, such as biomedical (e.g., clinical and genomic) records. But at the same time, such providers can share identified information, devoid of sensitive biomedical data, for administrative functions. On a provider-by-provider basis, the biomedical and identified records appear unrelated; however, links can be established when multiple providers' databases are studied jointly. The problem, known as trail disclosure, is a generalized phenomenon and occurs because an individual's location access pattern can be matched across the shared databases. Due to technical and legal constraints, it is often difficult to coordinate between providers, and thus it is critical to assess the disclosure risk in distributed environments so that we can develop techniques to mitigate such risks. Research on privacy protection has so far focused on developing technologies to suppress or encrypt identifiers associated with sensitive information. There is a growing body of work on the formal assessment of the disclosure risk of database entries in publicly shared databases, but less attention has been paid to the distributed setting. In this research, we review the trail disclosure problem in several domains with known vulnerabilities and show that disclosure risk is influenced by the distribution of how people visit service providers. Based on empirical evidence, we propose an entropy metric for assessing such risk in shared databases prior to their release. This metric assesses risk by leveraging the statistical characteristics of a visit distribution, as opposed to person-level data. It is computationally efficient and superior to existing risk assessment methods, which rely on ad hoc assessments that are often computationally expensive and unreliable. We evaluate our approach on a range of location access patterns in simulated environments. Our results demonstrate that the approach is effective at estimating trail disclosure risks and that the amount of self-information contained in a distributed system is one of the main driving factors.
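As a rough illustration of an entropy computed from a visit distribution rather than from person-level records, the sketch below evaluates the Shannon entropy of aggregate visit counts per provider. The abstract does not give the paper's exact metric, so the function `shannon_entropy` and the example counts are assumptions used only to show the kind of quantity involved.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as non-negative counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical aggregate visit counts across four providers.
uniform_visits = shannon_entropy([25, 25, 25, 25])  # visits spread evenly
skewed_visits = shannon_entropy([97, 1, 1, 1])      # almost all visits at one provider

print(round(uniform_visits, 3), round(skewed_visits, 3))
```

Only the aggregate counts are needed, which is consistent with the abstract's point that the metric works from statistical characteristics of the distribution and can therefore be evaluated before release.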
Affiliation(s)
- Xue Bai
  - School of Business, University of Connecticut, Storrs, CT 06269, USA
- Bradley A. Malin
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37203, USA