1
|
Doetsch JN, Dias V, Indredavik MS, Reittu J, Devold RK, Teixeira R, Kajantie E, Barros H. Record linkage of population-based cohort data from minors with national register data: a scoping review and comparative legal analysis of four European countries. Open Res Eur 2021; 1:58. [PMID: 37645179 PMCID: PMC10445839 DOI: 10.12688/openreseurope.13689.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/20/2021] [Indexed: 08/31/2023]
Abstract
Background: The GDPR was implemented to build an overarching framework for personal data protection across the EU/EEA. Linkage of data directly collected from cohort participants, potentially a prominent tool for health research, must respect data protection rules and privacy rights. Our objective was to investigate the legal possibilities of linking cohort data of minors with routinely collected education and health data, comparing EU/EEA member states. Methods: A comparative legal analysis and scoping review was conducted of openly accessible published laws and regulations in EUR-Lex and national law databases on the GDPR's implementation in Portugal, Finland, Norway, and the Netherlands, and on the connected national regulations governing record linkage for health research that had been implemented up until April 30, 2021. Results: The GDPR does not ensure total uniformity in data protection legislation across member states, as it offers flexibility for national legislation. Exceptions to process personal data, e.g., public interest and scientific research, must be laid down in EU/EEA or national law. Differences in national interpretation caused obstacles in cross-national research and record linkage: Portugal requires written consent and ethical approval; Finland allows linkage mostly without consent through the national Social and Health Data Permit Authority; Norway allows linkage when it is based on a regional ethics committee's approval and adequate information technology safeguarding confidentiality; the Netherlands mainly bases linkage on the opt-out system and a Data Protection Impact Assessment. Conclusions: Though the GDPR is the most important legal framework, the execution of national legislation matters most when linking cohort data with routinely collected health and education data. As national interpretation varies, legal intervention balancing the individual right to informational self-determination and the public good is urgently needed for health research. More harmonization across the EU/EEA could be helpful but should not be detrimental in those member states that have already opened leeway for registries and research for the public good without explicit consent.
Affiliation(s)
- Julia Nadine Doetsch
  - Laboratory for Integrative and Translational Research in Population Health (ITR), Porto, 4050-600, Portugal
  - EPIUnit, Instituto de Saúde Pública da Universidade do Porto (ISPUP), Porto, 4050-600, Portugal
- Vasco Dias
  - INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Campus da Faculdade de Engenharia da Universidade do Porto, Porto, 4050-091, Portugal
- Marit S. Indredavik
  - Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, NTNU – Norwegian University of Science and Technology, Trondheim, NO-7491, Norway
- Jarkko Reittu
  - Finnish Institute for Health and Welfare, Legal Services, Helsinki, Finland
  - University of Helsinki, Faculty of Law, Helsinki, Finland
- Randi Kallar Devold
  - Faculty of Medicine and Health Sciences, NTNU – Norwegian University of Science and Technology, Trondheim, NO-7491, Norway
- Raquel Teixeira
  - Laboratory for Integrative and Translational Research in Population Health (ITR), Porto, 4050-600, Portugal
  - EPIUnit, Instituto de Saúde Pública da Universidade do Porto (ISPUP), Porto, 4050-600, Portugal
- Eero Kajantie
  - Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, NTNU – Norwegian University of Science and Technology, Trondheim, NO-7491, Norway
  - Finnish Institute for Health and Welfare, Population Health Unit, Helsinki and Oulu, Finland
  - PEDEGO Research Unit, MRC Oulu, University of Oulu and Oulu University Hospital, Oulu, Finland
  - Children’s Hospital, Helsinki University Hospital and University of Helsinki, Helsinki, Finland
- Henrique Barros
  - Laboratory for Integrative and Translational Research in Population Health (ITR), Porto, 4050-600, Portugal
  - EPIUnit, Instituto de Saúde Pública da Universidade do Porto (ISPUP), Porto, 4050-600, Portugal
  - Departamento de Ciências da Saúde Pública e Forenses e Educação Médica, Faculdade de Medicina, Universidade do Porto (FMUP), Porto, Portugal
|
2
|
Heidt CM, Hund H, Fegeler C. A Federated Record Linkage Algorithm for Secure Medical Data Sharing. Stud Health Technol Inform 2021; 278:142-9. [PMID: 34042887 DOI: 10.3233/SHTI210062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
The process of consolidating medical records from multiple institutions into one data set makes privacy-preserving record linkage (PPRL) a necessity. Most PPRL approaches, however, are only designed to link records from two institutions, and existing multi-party approaches tend to discard non-matching records, leading to incomplete result sets. In this paper, we propose a new algorithm for federated record linkage between multiple parties by a trusted third party using record-level bloom filters to preserve patient data privacy. We conduct a study to find optimal weights for linkage-relevant data fields and are able to achieve 99.5% linkage accuracy testing on the Febrl record linkage dataset. This approach is integrated into an end-to-end pseudonymization framework for medical data sharing.
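To make the idea of a record-level Bloom filter concrete, the following is a minimal sketch and not the authors' algorithm. It assumes bigram tokenisation, keyed double hashing, and per-field weighting expressed as the number of hash functions per bigram; the filter length, the shared key, the field names, and the weights are all hypothetical values chosen for illustration.

```python
import hashlib
import hmac

FILTER_BITS = 1024                 # assumed filter length
SECRET_KEY = b"shared-secret"      # hypothetical key shared by the data holders

# Hypothetical field weights: hash functions applied to each bigram of a field.
FIELD_HASHES = {"given_name": 4, "surname": 6, "date_of_birth": 8}


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased value into overlapping character pairs."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def keyed_hash(token: bytes, digest) -> int:
    """Keyed hash of a token, returned as an integer."""
    return int.from_bytes(hmac.new(SECRET_KEY, token, digest).digest(), "big")


def encode_record(record: dict[str, str]) -> set[int]:
    """Return the set bit positions of one Bloom filter for the whole record."""
    bits: set[int] = set()
    for field, k in FIELD_HASHES.items():
        value = record.get(field, "")
        if not value:
            continue  # empty fields contribute no bits
        for gram in bigrams(value):
            token = f"{field}:{gram}".encode()
            h1 = keyed_hash(token, hashlib.sha256)
            h2 = keyed_hash(token, hashlib.sha512)
            for i in range(k):  # double hashing: position_i = h1 + i * h2
                bits.add((h1 + i * h2) % FILTER_BITS)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit sets; 1.0 means identical filters."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


rec_a = {"given_name": "John", "surname": "Smith", "date_of_birth": "1980-01-01"}
rec_b = {"given_name": "Jon", "surname": "Smith", "date_of_birth": "1980-01-01"}
rec_c = {"given_name": "Maria", "surname": "Garcia", "date_of_birth": "1975-06-30"}

print(dice(encode_record(rec_a), encode_record(rec_b)))  # near-duplicate pair
print(dice(encode_record(rec_a), encode_record(rec_c)))  # unrelated pair
```

In a setup like the one described, a linkage unit would receive only the bit sets and compare them with a similarity threshold, so plaintext identifiers never leave the contributing institutions.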
|
3
|
Shan M, Thomas KS, Gutman R. A multiple imputation procedure for record linkage and causal inference to estimate the effects of home-delivered meals. Ann Appl Stat 2021; 15:412-436. [PMID: 35755005 PMCID: PMC9222523 DOI: 10.1214/20-aoas1397] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/04/2024]
Abstract
Causal analysis of observational studies requires data that comprise a set of covariates, a treatment assignment indicator, and the observed outcomes. However, data confidentiality restrictions or the nature of data collection may distribute these variables across two or more datasets. In the absence of unique identifiers to link records across files, probabilistic record linkage algorithms can be leveraged to merge the datasets. Current applications of record linkage are concerned with estimating associations between variables that are exclusive to one file, not causal relationships. We propose a Bayesian framework for record linkage and causal inference where one file comprises all the covariate and observed outcome information, and the second file consists of a list of all individuals who receive the active treatment. Under certain ignorability assumptions, the procedure properly propagates the error in the record linkage process, resulting in valid statistical inferences. To estimate the causal effects, we devise a two-stage procedure. The first stage of the procedure performs Bayesian record linkage to multiply impute the treatment assignment for all individuals in the first file, while adjustments for covariates' imbalance and imputation of missing potential outcomes are performed in the second stage. This procedure is used to evaluate the effect of Meals on Wheels services on mortality and healthcare utilization among homebound older adults in Rhode Island. In addition, an interpretable sensitivity analysis is developed to assess potential violations of the ignorability assumptions.
|
4
|
Maratea A, Ciaramella A, Cianci GP. Record linkage of banks and municipalities through multiple criteria and neural networks. PeerJ Comput Sci 2020; 6:e258. [PMID: 33816910 PMCID: PMC7924437 DOI: 10.7717/peerj-cs.258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2019] [Accepted: 01/21/2020] [Indexed: 06/12/2023]
Abstract
Record linkage aims to identify records from multiple data sources that refer to the same real-world entity. It is a well-known data quality process studied since the second half of the last century, with an established pipeline and a rich literature of case studies mainly covering census, administrative, or health domains. In this paper, a method to recognize matching records from real municipalities and banks through multiple similarity criteria and a Neural Network classifier is proposed: starting from a labeled subset of the available data, first several similarity measures are combined and weighted to build a feature vector, then a Multi-Layer Perceptron (MLP) network is trained and tested to find matching pairs. For validation, seven real datasets have been used (three from banks and four from municipalities), purposely chosen in the same geographical area to increase the probability of matches. The training only involved two municipalities, while testing involved all sources (municipalities vs. municipalities, banks vs. banks, and municipalities vs. banks). The proposed method scored remarkable results in terms of both precision and recall, clearly outperforming threshold-based competitors.
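The step of combining several similarity measures into a feature vector can be sketched as follows. This is an illustration under stated assumptions, not the paper's feature set: the field names (`name`, `address`, `postcode`) and the three measures are hypothetical, and the classifier itself is omitted. The resulting vectors are the kind of input an MLP would be trained on, with matching and non-matching pairs as labels.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib's matching blocks."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two strings."""
    tokens_a, tokens_b = set(normalise(a).split()), set(normalise(b).split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def feature_vector(rec_a: dict[str, str], rec_b: dict[str, str]) -> list[float]:
    """One similarity score per criterion for a candidate record pair."""
    same_postcode = normalise(rec_a["postcode"]) == normalise(rec_b["postcode"])
    return [
        edit_similarity(rec_a["name"], rec_b["name"]),
        token_jaccard(rec_a["address"], rec_b["address"]),
        1.0 if same_postcode else 0.0,
    ]


bank = {"name": "Rossi Mario", "address": "Via Roma 12 Napoli", "postcode": "80133"}
town = {"name": "Mario Rossi", "address": "via Roma 12", "postcode": "80133"}
print(feature_vector(bank, town))
```

A threshold-based matcher would compare a weighted sum of these scores against a fixed cut-off, whereas a trained classifier learns the decision boundary from labeled pairs.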
Affiliation(s)
- Antonio Maratea
  - Department of Science and Technology, University of Naples “Parthenope”, Naples, Italy
- Angelo Ciaramella
  - Department of Science and Technology, University of Naples “Parthenope”, Naples, Italy
- Giuseppe Pio Cianci
  - Department of Science and Technology, University of Naples “Parthenope”, Naples, Italy
|
5
|
Abstract
OBJECTIVES Differences in the availability of a Social Security Number (SSN) by race/ethnicity could affect the ability to link with death certificate data in passive follow-up studies and possibly bias mortality disparities reported with linked data. Using 1989-2009 National Health Interview Survey (NHIS) data linked with the National Death Index (NDI) through 2011, we compared the availability of a SSN by race/ethnicity, estimated the percent of links likely missed due to lack of SSNs, and assessed whether these estimated missed links affect race/ethnicity disparities reported in the NHIS-linked mortality data. METHODS We used preventive fraction methods based on race/ethnicity-specific Cox proportional hazards models of the relationship between availability of SSN and mortality based on observed links, adjusted for survey year, sex, age, respondent-rated health, education, and US nativity. RESULTS Availability of a SSN and observed percent linked were significantly lower for Hispanic and Asian/Pacific Islander (PI) participants compared with White non-Hispanic participants. We estimated that more than 18% of expected links were missed due to lack of SSNs among Hispanic and Asian/PI participants compared with about 10% among White non-Hispanic participants. However, correcting the observed links for expected missed links appeared to have only a modest impact on mortality disparities by race/ethnicity. CONCLUSIONS Researchers conducting analyses of mortality disparities using the NDI or other linked death records need to be cognizant of the potential for differential linkage to contribute to their results.
Affiliation(s)
- Eric A Miller
  - National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, Maryland, United States
- Frances A McCarty
  - National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, Maryland, United States
- Jennifer D Parker
  - National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, Maryland, United States
|
6
|
Yigzaw KY, Michalas A, Bellika JG. Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation. BMC Med Inform Decis Mak 2017; 17:1. [PMID: 28049465 PMCID: PMC5209873 DOI: 10.1186/s12911-016-0389-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 07/27/2016] [Accepted: 11/10/2016] [Indexed: 11/17/2022]
Abstract
Background: Techniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step. Methods: We designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network. Results: The security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem. Conclusions: The proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians. Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0389-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kassaye Yitbarek Yigzaw
  - Department of Computer Science, UiT The Arctic University of Norway, 9037, Tromsø, Norway
  - Norwegian Centre for E-health Research, University Hospital of North Norway, 9019, Tromsø, Norway
- Antonis Michalas
  - Department of Computer Science, University of Westminster, 115 New Cavendish Street, London, W1W 6UW, UK
- Johan Gustav Bellika
  - Norwegian Centre for E-health Research, University Hospital of North Norway, 9019, Tromsø, Norway
  - Department of Clinical Medicine, UiT The Arctic University of Norway, 9037, Tromsø, Norway
|
7
|
Ramsay S, Grundy E, O'Reilly D. The relationship between informal caregiving and mortality: an analysis using the ONS Longitudinal Study of England and Wales. J Epidemiol Community Health 2013; 67:655-60. [PMID: 23737544 DOI: 10.1136/jech-2012-202237] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/04/2022]
Abstract
BACKGROUND Many studies have suggested that caregiving has a detrimental impact on health. However, these conclusions are challenged by research which finds evidence of a comparative survivorship advantage, as well as work which controls for group differences in the demand for care. METHODS We use a large record linkage study of England and Wales to investigate the mortality risks of carers identified in the 2001 Census. The analysis focuses on individuals aged 35-74 living with others in private households and a distinction is made between those providing 1-19 and 20 or more hours of care per week. Logit models identify differences in carers' health at baseline and postcensal survival is analysed using Cox proportional hazards models. RESULTS 12.2% of study members reported providing 1-19 h of care and 5.4% reported providing 20 or more hours. While carers were significantly more likely to report poorer health at baseline, survival analyses suggested that they were at a significantly lower risk of dying. This comparative advantage also held when the analyses were restricted to individuals living with at least one person with poor health. CONCLUSIONS The comparative mortality advantage revealed in this analysis challenges common characterisations of carers' health and draws attention to important differences in the way carers are defined in existing analyses. The survival results are consistent with work using similar data for Northern Ireland. However, the study also affords more uniform conclusions about carers' baseline health and this provides grounds for questioning existing hypotheses about the reasons for this advantage.
Affiliation(s)
- Susan Ramsay
  - Department for Epidemiology and Public Health, University College London, London, UK
|
8
|
Abstract
End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled efficient calculations despite the large dataset (≈1.7 million cases). The procedure generates m datasets in which the matches between the two files are imputed. The m datasets can be analyzed independently and results combined using Rubin's multiple imputation rules. Our approach can be applied in other file linking applications.
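The final combining step mentioned above, Rubin's multiple imputation rules, can be sketched in a few lines. This is a generic illustration of the standard rules for a scalar estimate, not code from the study; the function name and the example numbers are hypothetical. Each of the m imputed datasets yields a point estimate and its variance, and the pooled variance adds the between-imputation spread to the average within-imputation variance.

```python
from statistics import mean, variance


def pool_rubin(estimates: list[float], variances: list[float]) -> tuple[float, float]:
    """Pool m point estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 3 imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The factor (1 + 1/m) accounts for using a finite number of imputations; as m grows, the total variance approaches the sum of the within and between components.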
Affiliation(s)
- Roee Gutman
  - Department of Biostatistics, Brown University, Providence, RI 02912
|