1
Shan M, Thomas KS, Gutman R. A Bayesian MultiLayer Record Linkage Procedure to Analyze Post-Acute Care Recovery of Patients with Traumatic Brain Injury. Biostatistics 2023; 24:743-759. [PMID: 35579386 PMCID: PMC10345988 DOI: 10.1093/biostatistics/kxac016] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/08/2021] [Revised: 04/11/2022] [Accepted: 04/18/2022] [Indexed: 07/20/2023] Open
Abstract
Understanding associations between injury severity and postacute care recovery for patients with traumatic brain injury (TBI) is crucial to improving care. Estimating these associations requires information on patients' injury, demographics, and healthcare utilization, which are dispersed across multiple data sets. Because of privacy regulations, unique identifiers are not available to link records across these data sets. Record linkage methods identify records that represent the same patient across data sets in the absence of unique identifiers. With a large number of records, these methods may result in many false links. Health providers are a natural grouping scheme for patients, because only records of patients who received care from the same provider can represent the same patient. In some cases, providers are defined within each data set, but they are not uniquely identified across data sets. We propose a Bayesian record linkage procedure that simultaneously links providers and patients. The procedure improves the accuracy of the estimated links compared to current methods. We use this procedure to merge a trauma registry with Medicare claims to estimate the association between TBI patients' injury severity and postacute care recovery.
Affiliation(s)
- Mingyang Shan
  - Department of Biostatistics, Brown University, 121 South Main Street, Box G-S121-7, Providence, RI 02912, USA
- Kali S Thomas
  - Department of Health Services, Policy and Practice, Brown University Box G-S121(6), Providence, RI 02912, USA
- Roee Gutman
  - Department of Biostatistics, Brown University, 121 South Main Street, Box G-S121-7, Providence, RI 02912, USA

2
Cardinal RN, Moore A, Burchell M, Lewis JR. De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation. BMC Med Inform Decis Mak 2023; 23:85. [PMID: 37147600 PMCID: PMC10163749 DOI: 10.1186/s12911-023-02176-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Accepted: 04/21/2023] [Indexed: 05/07/2023] Open
Abstract
BACKGROUND Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier. METHODS We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage. RESULTS The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language. CONCLUSIONS Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.
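The abstract names a consideration threshold θ and a leader advantage threshold δ but does not spell out the decision rule. The sketch below is one plausible reading, not the authors' implementation: a match is declared only if the best candidate's log odds reach θ and exceed the runner-up's by at least δ. The function name `pick_match` and the numeric values in the usage note are hypothetical.

```python
from typing import Optional, Sequence


def pick_match(log_odds: Sequence[float], theta: float, delta: float) -> Optional[int]:
    """Return the index of the declared match, or None if no match is declared.

    Assumed rule (not taken from the paper): the leader must reach the
    consideration threshold theta and lead the runner-up by at least delta.
    """
    if not log_odds:
        return None
    # Rank candidate indices from highest to lowest log odds.
    ranked = sorted(range(len(log_odds)), key=lambda i: log_odds[i], reverse=True)
    leader = ranked[0]
    if log_odds[leader] < theta:
        return None  # nobody is plausible enough to consider
    if len(ranked) > 1 and log_odds[leader] - log_odds[ranked[1]] < delta:
        return None  # leader is not clearly ahead of the runner-up
    return leader
```

For instance, with the made-up settings `theta=2.0` and `delta=3.0`, candidates with log odds `[5.0, 1.0]` yield a match on index 0, while `[5.0, 4.0]` yields no match because the leader's advantage is too small.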
Affiliation(s)
- Rudolf N. Cardinal
  - Department of Psychiatry, University of Cambridge, Cambridge Biomedical Campus, Clifford Allbutt Building, Bay 13, Cambridge, CB2 0AH UK
  - Cambridgeshire & Peterborough NHS Foundation Trust, Fulbourn Hospital, Cambridge, CB21 5EF UK
- Anna Moore
  - Department of Psychiatry, University of Cambridge, Cambridge Biomedical Campus, Clifford Allbutt Building, Bay 13, Cambridge, CB2 0AH UK
  - Cambridgeshire & Peterborough NHS Foundation Trust, Fulbourn Hospital, Cambridge, CB21 5EF UK
- Martin Burchell
  - Department of Psychiatry, University of Cambridge, Cambridge Biomedical Campus, Clifford Allbutt Building, Bay 13, Cambridge, CB2 0AH UK
- Jonathan R. Lewis
  - Cambridgeshire & Peterborough NHS Foundation Trust, Fulbourn Hospital, Cambridge, CB21 5EF UK

3
Sato J, Mitsutake N, Yamada H, Kitsuregawa M, Goda K. Virtual patient identifier (vPID): Improving patient traceability using anonymized identifiers in Japanese healthcare insurance claims database. Heliyon 2023; 9:e16209. [PMID: 37234615 PMCID: PMC10205637 DOI: 10.1016/j.heliyon.2023.e16209] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/31/2023] [Revised: 05/09/2023] [Accepted: 05/10/2023] [Indexed: 05/28/2023] Open
Abstract
Objective Japan's national-level healthcare insurance claims database (NDB) is a collective database that contains information on all healthcare services provided to all citizens. However, the existing anonymized identifiers (ID1 and ID2) trace patients' claims poorly, hindering longitudinal analyses. This study presents a virtual patient identifier (vPID), developed on top of these existing identifiers, to improve patient traceability. Methods vPID is a new composite identifier that consolidates ID1 and ID2 values co-occurring in the same claim, so that each patient's claims can be collected even when the patient's ID1 or ID2 changes because of life events or clerical errors. We conducted a verification test with prefecture-level datasets of healthcare insurance claims and enrollee history records, which allowed us to compare vPID with the ground truth in terms of an identifiability score (the capability of distinguishing one patient's claims from another patient's claims) and a traceability score (the capability of collecting all claims of the same patient). Results The verification test showed that vPID offers significantly higher traceability scores (0.994, Mie; 0.997, Gifu) than ID1 (0.863, Mie; 0.884, Gifu) and ID2 (0.602, Mie; 0.839, Gifu), and comparable (0.996, Mie) and lower (0.979, Gifu) identifiability scores. Discussion vPID appears useful for a wide spectrum of analytic studies, unless they focus on cases that are sensitive to its design limitations, such as patients who marry and change jobs simultaneously, and same-sex twin children. Conclusion vPID successfully improves patient traceability, providing an opportunity for longitudinal analyses that used to be practically impossible for NDB. Further exploration is also necessary, in particular for mitigating identification errors.
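The consolidation idea in the Methods can be illustrated with a connected-components sketch. This is an assumption about how co-occurring identifiers might be grouped, not the published vPID algorithm: every claim links its ID1 to its ID2, and all identifiers reachable from one another share a group label. The function name `assign_vpid` and the toy identifiers are hypothetical.

```python
def assign_vpid(claims):
    """Group claims whose ID1/ID2 values are transitively connected.

    claims: list of (id1, id2) tuples, one per claim.
    Returns one integer group label per claim, in input order.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag identifiers with their type so an ID1 never collides with an ID2.
    for id1, id2 in claims:
        union(("ID1", id1), ("ID2", id2))

    labels, seen_roots = [], {}
    for id1, _ in claims:
        root = find(("ID1", id1))
        labels.append(seen_roots.setdefault(root, len(seen_roots)))
    return labels


# Toy data: ID1 changes from "a" to "b" while ID2 "y" stays the same.
toy_claims = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")]
toy_labels = assign_vpid(toy_claims)
```

A grouping like this also shows why the identifiability score can drop: two different patients who happen to share one identifier value would be merged into the same group.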
Affiliation(s)
- Jumpei Sato
  - Institute of Industrial Science, The University of Tokyo, Meguro-ku, Tokyo, Japan
- Hiroyuki Yamada
  - Institute of Industrial Science, The University of Tokyo, Meguro-ku, Tokyo, Japan
- Masaru Kitsuregawa
  - Institute of Industrial Science, The University of Tokyo, Meguro-ku, Tokyo, Japan
- Kazuo Goda
  - Institute of Industrial Science, The University of Tokyo, Meguro-ku, Tokyo, Japan

4
Smith D, Elliot M, Sakshaug JW. To Link or Synthesize? An Approach to Data Quality Comparison. ACM Journal of Data and Information Quality 2023. [DOI: 10.1145/3580487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2023]
Abstract
Linking administrative data to produce more informative data for subsequent analysis has become an increasingly common practice. However, there might be concomitant risks of disclosing sensitive information about individuals. One practice that reduces these risks is data synthesis. In data synthesis the data are used to fit a model from which synthetic data are then generated. The synthetic data are then released to end users. There are some scenarios where an end user might have the option of using linked data, or accepting synthesized data. However, linkage and synthesis are susceptible to errors that could limit their usefulness. Here, we investigate the problem of comparing the quality of linked data to synthesized data and demonstrate through simulations how the problem might be approached. These comparisons are important when considering how an end user can be supplied with the highest quality data, and in situations where one must consider risk / utility trade-offs.
Affiliation(s)
- Mark Elliot
  - The University of Manchester, United Kingdom
- Joseph W. Sakshaug
  - Institute for Employment Research & Ludwig Maximilian University of Munich, Germany

5
Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]

7
Tuoto T, Di Cecco D, Tancredi A. Bayesian analysis of one-inflated models for elusive population size estimation. Biom J 2022; 64:912-933. [PMID: 35534439 PMCID: PMC9314905 DOI: 10.1002/bimj.202100187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Revised: 01/11/2022] [Accepted: 02/05/2022] [Indexed: 12/04/2022]
Abstract
The identification and treatment of “one‐inflation” in estimating the size of an elusive population has received increasing attention in capture–recapture literature in recent years. The phenomenon occurs when the number of units captured exactly once clearly exceeds the expectation under a baseline count distribution. Ignoring one‐inflation has serious consequences for estimation of the population size, which can be drastically overestimated. In this paper we propose a Bayesian approach for Poisson, geometric, and negative binomial one‐inflated count distributions. Posterior inference for the population size is obtained with a Gibbs sampler. We also provide a Bayesian approach to model selection. We illustrate the proposed methodology with simulated and real data and propose a new application in official statistics to estimate the number of people implicated in the exploitation of prostitution in Italy.
Affiliation(s)
- Tiziana Tuoto
  - Istat - Istituto nazionale di statistica, Rome, Italy
  - Department of Methods and Models for Economics Territory and Finance, Sapienza University of Rome, Rome, Italy
- Davide Di Cecco
  - Istat - Istituto nazionale di statistica, Rome, Italy
  - Department of Methods and Models for Economics Territory and Finance, Sapienza University of Rome, Rome, Italy
- Andrea Tancredi
  - Department of Methods and Models for Economics Territory and Finance, Sapienza University of Rome, Rome, Italy

8
Improving Wildlife Population Inference Using Aerial Imagery and Entity Resolution. Journal of Agricultural, Biological and Environmental Statistics 2022. [DOI: 10.1007/s13253-021-00484-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]

9
On the consistent estimation of linkage errors without training data. Japanese Journal of Statistics and Data Science 2022. [DOI: 10.1007/s42081-022-00153-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]

10
Optimizing the Retrieval of the Vital Status of Cancer Patients for Health Data Warehouses by Using Open Government Data in France. International Journal of Environmental Research and Public Health 2022; 19:ijerph19074272. [PMID: 35409956 PMCID: PMC8998644 DOI: 10.3390/ijerph19074272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/09/2022] [Revised: 03/22/2022] [Accepted: 03/30/2022] [Indexed: 02/06/2023]
Abstract
Electronic Medical Records (EMR) and Electronic Health Records (EHR) are often missing critical information about the death of a patient, although it is an essential metric for medical research in oncology to assess survival outcomes, particularly for evaluating the efficacy of new therapeutic approaches. We used open government data in France from 1970 to September 2021 to identify deceased patients and match them with patient data collected from the Institut de Cancérologie de l’Ouest (ICO) data warehouse (Integrated Center of Oncology, the third largest cancer center in France) between January 2015 and November 2021. To meet our objective, we evaluated two algorithms for deterministic record linkage: an exact matching algorithm and a fuzzy matching algorithm. Because we lacked reference data, we assessed the algorithms by estimating the number of homonyms that could lead to false links, using the same open dataset of deceased persons in France. The exact matching algorithm allowed us to double the number of dates of death in the ICO data warehouse, and the fuzzy matching algorithm tripled it. Studying homonyms assured us that there was a low risk of misidentification, with precision values of 99.96% for the exact matching and 99.68% for the fuzzy matching. However, estimating the number of false negatives proved more difficult than anticipated. Nevertheless, using open government data can be a highly interesting way to improve the completeness of the date of death variable for oncology patients in data warehouses.
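The contrast between exact and fuzzy deterministic matching can be sketched as follows. The field names (`surname`, `forename`, `dob`), the use of `difflib.SequenceMatcher`, and the 0.85 similarity cutoff are all assumptions made for illustration; the abstract does not describe the rules the authors actually used.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case a name and strip accents and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.strip().lower()


def exact_match(a: dict, b: dict) -> bool:
    """All three fields must agree after normalization."""
    return (
        normalize(a["surname"]) == normalize(b["surname"])
        and normalize(a["forename"]) == normalize(b["forename"])
        and a["dob"] == b["dob"]
    )


def fuzzy_match(a: dict, b: dict, min_ratio: float = 0.85) -> bool:
    """Date of birth must agree; names only need to be similar enough."""
    if a["dob"] != b["dob"]:
        return False
    name_a = normalize(a["surname"] + " " + a["forename"])
    name_b = normalize(b["surname"] + " " + b["forename"])
    return SequenceMatcher(None, name_a, name_b).ratio() >= min_ratio


# Made-up records: one warehouse patient and two death-register entries.
patient = {"surname": "Lefèvre", "forename": "Hélène", "dob": "1950-03-02"}
death_same = {"surname": "LEFEVRE", "forename": "Helene", "dob": "1950-03-02"}
death_typo = {"surname": "Lefevre", "forename": "Helen", "dob": "1950-03-02"}
```

Here `death_same` matches under both rules, while `death_typo` is found only by the fuzzy rule, which is the mechanism by which a fuzzy rule recovers more dates of death at the cost of somewhat lower precision.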

11
Kaplan A, Betancourt B, Steorts RC. A Practical Approach to Proper Inference with Linked Data. Am Stat 2022. [DOI: 10.1080/00031305.2022.2041482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Affiliation(s)
- Andee Kaplan
  - Department of Statistics, Colorado State University, Fort Collins, CO
- Rebecca C. Steorts
  - Departments of Statistical Science and Computer Science, Duke University, Durham, NC

12
Aleshin-Guendel S, Sadinle M. Multifile Partitioning for Record Linkage and Duplicate Detection. J Am Stat Assoc 2022; 118:1786-1795. [PMID: 37771512 PMCID: PMC10530869 DOI: 10.1080/01621459.2021.2013242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/20/2020] [Accepted: 11/28/2021] [Indexed: 10/19/2022]
Abstract
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations.
Affiliation(s)
- Mauricio Sadinle
  - Department of Biostatistics, University of Washington, Seattle, WA 98195

13
Shan M, Thomas KS, Gutman R. A Multiple Imputation Procedure for Record Linkage and Causal Inference to Estimate the Effects of Home-Delivered Meals. Ann Appl Stat 2021; 15:412-436. [PMID: 35755005 PMCID: PMC9222523 DOI: 10.1214/20-aoas1397] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/04/2024]
Abstract
Causal analysis of observational studies requires data that comprise a set of covariates, a treatment assignment indicator, and the observed outcomes. However, data confidentiality restrictions or the nature of data collection may distribute these variables across two or more datasets. In the absence of unique identifiers to link records across files, probabilistic record linkage algorithms can be leveraged to merge the datasets. Current applications of record linkage are concerned with estimation of associations between variables that are exclusive to one file and not causal relationships. We propose a Bayesian framework for record linkage and causal inference where one file comprises all the covariate and observed outcome information, and the second file consists of a list of all individuals who receive the active treatment. Under certain ignorability assumptions, the procedure properly propagates the error in the record linkage process, resulting in valid statistical inferences. To estimate the causal effects, we devise a two-stage procedure. The first stage of the procedure performs Bayesian record linkage to multiply impute the treatment assignment for all individuals in the first file, while adjustments for covariates' imbalance and imputation of missing potential outcomes are performed in the second stage. This procedure is used to evaluate the effect of Meals on Wheels services on mortality and healthcare utilization among homebound older adults in Rhode Island. In addition, an interpretable sensitivity analysis is developed to assess potential violations of the ignorability assumptions.

14
Marchant NG, Kaplan A, Elazar DN, Rubinstein BIP, Steorts RC. d-blink: Distributed End-to-End Bayesian Entity Resolution. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1825451] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/23/2022]
Affiliation(s)
- Neil G. Marchant
  - School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
- Andee Kaplan
  - Department of Statistics, Colorado State University, Fort Collins, CO
- Daniel N. Elazar
  - Methodology Division, Australian Bureau of Statistics, Belconnen, ACT, Australia
- Rebecca C. Steorts
  - Department of Statistical Science and Computer Science, Duke University, Durham, NC
  - Principal Mathematical Statistician, United States Census Bureau (DRB #: CBDRB-FY20-309)

15
Affiliation(s)
- Giacomo Zanella
  - Department of Decision Sciences, Bocconi University, BIDSA and IGIER, Milan, Italy
- Rebecca C. Steorts
  - Department of Statistical Science and Computer Science, Duke University, Durham, NC

17
Xu H, Li X, Shen C, Hui SL, Grannis S. Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter? Ann Appl Stat 2019. [DOI: 10.1214/19-aoas1256] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]

18
Affiliation(s)
- Giacomo Zanella
  - Department of Decision Sciences, BIDSA and IGIER, Bocconi University, Milan, Italy

19
Affiliation(s)
- Ying Han
  - Department of Mathematics and Joint Program of Survey Methodology, University of Maryland, College Park, MD, USA
- Partha Lahiri
  - Department of Mathematics, University of Maryland, College Park, MD, USA

20
Dalzell NM, Reiter JP. Regression Modeling and File Matching Using Possibly Erroneous Matching Variables. J Comput Graph Stat 2018. [DOI: 10.1080/10618600.2018.1458624] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/17/2022]
Affiliation(s)
- Nicole M. Dalzell
  - Department of Mathematics and Statistics, Wake Forest University, Winston-Salem, NC

21
Hurley PD, Oliver S, Mehta A. Creating longitudinal datasets and cleaning existing data identifiers in a cystic fibrosis registry using a novel Bayesian probabilistic approach from astronomy. PLoS One 2018; 13:e0199815. [PMID: 29985939 PMCID: PMC6037350 DOI: 10.1371/journal.pone.0199815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/03/2017] [Accepted: 06/14/2018] [Indexed: 11/18/2022] Open
Abstract
Patient registry data are commonly collected as annual snapshots that need to be amalgamated to understand the longitudinal progress of each patient. However, patient identifiers can either change or may not be available for legal reasons when longitudinal data are collated from patients living in different countries. Here, we apply astronomical statistical matching techniques to link individual patient records that can be used where identifiers are absent or to validate uncertain identifiers. We adopt a Bayesian model framework used for probabilistically linking records in astronomy. We adapt this and validate it across blinded, annually collected data. This is a high-quality (Danish) sub-set of data held in the European Cystic Fibrosis Society Patient Registry (ECFSPR). Our initial experiments achieved a precision of 0.990 at a recall value of 0.987. However, detailed investigation of the discrepancies uncovered typing errors in 27 of the identifiers in the original Danish sub-set. After fixing these errors to create a new gold standard our algorithm correctly linked individual records across years achieving a precision of 0.997 at a recall value of 0.987 without recourse to identifiers. Our Bayesian framework provides the probability of whether a pair of records belong to the same patient. Unlike other record linkage approaches, our algorithm can also use physical models, such as body mass index curves, as prior information for record linkage. We have shown our framework can create longitudinal samples where none existed and validate pre-existing patient identifiers. We have demonstrated that in this specific case this automated approach is better than the existing identifiers.
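The precision and recall figures quoted in this abstract compare predicted links with a gold standard. The sketch below shows the standard definitions, assuming links are represented as pairs of record identifiers; the function name and the toy pairs are hypothetical and unrelated to the registry data.

```python
def precision_recall(predicted_links, true_links):
    """Compute precision and recall for a set of predicted record links.

    Both arguments are iterables of hashable pairs, e.g. (record_a, record_b).
    Precision = correct links / predicted links.
    Recall    = correct links / true links.
    """
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy example: three predicted links, two of which are in the gold standard.
toy_predicted = {(1, 10), (2, 20), (3, 31)}
toy_truth = {(1, 10), (2, 20), (3, 30), (4, 40)}
toy_precision, toy_recall = precision_recall(toy_predicted, toy_truth)
```

Correcting errors in the gold standard, as the authors did for 27 identifiers, changes `true_links` and can therefore raise measured precision without any change to the algorithm's output.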
Affiliation(s)
- Peter Donald Hurley
  - Department of Physics and Astronomy, University of Sussex, Brighton, United Kingdom
- Seb Oliver
  - Department of Physics and Astronomy, University of Sussex, Brighton, United Kingdom
- Anil Mehta
  - Division of Medical Sciences, University of Dundee, Dundee, United Kingdom

22
Chen B, Shrivastava A, Steorts RC. Unique entity estimation with application to the Syrian conflict. Ann Appl Stat 2018. [DOI: 10.1214/18-aoas1163] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]

23
Sadinle M. Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations. Ann Appl Stat 2018. [DOI: 10.1214/18-aoas1178] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]

24
Briscolini D, Di Consiglio L, Liseo B, Tancredi A, Tuoto T. New methods for small area estimation with linkage uncertainty. Int J Approx Reason 2018. [DOI: 10.1016/j.ijar.2017.12.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/18/2022]

25
Affiliation(s)
- Michel H. Hof
  - Department of Clinical Epidemiology, Biostatistics and Bioinformatics, University of Amsterdam, Amsterdam, The Netherlands
- Anita C. Ravelli
  - Department of Clinical Informatics, University of Amsterdam, Amsterdam, The Netherlands
- Aeilko H. Zwinderman
  - Department of Clinical Epidemiology, Biostatistics and Bioinformatics, University of Amsterdam, Amsterdam, The Netherlands

26
Goldstein H, Harron K, Cortina-Borja M. A scaling approach to record linkage. Stat Med 2017; 36:2514-2521. [PMID: 28303597 PMCID: PMC6205620 DOI: 10.1002/sim.7287] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 08/04/2016] [Accepted: 03/02/2017] [Indexed: 11/10/2022]
Abstract
With increasing availability of large datasets derived from administrative and other sources, there is an increasing demand for the successful linking of these to provide rich sources of data for further analysis. Variation in the quality of identifiers used to carry out linkage means that existing approaches are often based upon 'probabilistic' models, which are based on a number of assumptions, and can make heavy computational demands. In this paper, we suggest a new approach to classifying record pairs in linkage, based upon weights (scores) derived using a scaling algorithm. The proposed method does not rely on training data, is computationally fast, requires only moderate amounts of storage and has intuitive appeal. Copyright © 2017 John Wiley & Sons, Ltd.
Affiliation(s)
- Harvey Goldstein
  - University of Bristol, Bristol, U.K.
  - University College London, London, U.K.
- Katie Harron
  - London School of Hygiene and Tropical Medicine, London, U.K.

27
Affiliation(s)
- Mauricio Sadinle
  - Department of Statistical Science, Duke University, Durham, NC, and the National Institute of Statistical Sciences (NISS), Research Triangle Park, NC

28
Steorts RC, Hall R, Fienberg SE. A Bayesian Approach to Graphical Record Linkage and Deduplication. J Am Stat Assoc 2017. [DOI: 10.1080/01621459.2015.1105807] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 10/22/2022]
Affiliation(s)
- Rebecca C. Steorts
  - Departments of Statistical Science and Computer Science, Duke University, Durham, NC, USA
- Stephen E. Fienberg
  - Department of Statistics, Machine Learning Department, Heinz College, and Cylab, Carnegie Mellon University, Pittsburgh, PA, USA

29
McClintock BT, Bailey LL, Dreher BP, Link WA. Probit models for capture–recapture data subject to imperfect detection, individual heterogeneity and misidentification. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas783] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]

30
Sadinle M. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas779] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022]

32
Kum HC, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc 2013; 21:212-20. [PMID: 24201028 DOI: 10.1136/amiajnl-2013-002165] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high quality record linkage requires a human-machine hybrid system to safely manage uncertainty in the ever changing streams of chaotic Big Data. METHODS In the computer science literature, private record linkage is the most published area. It investigates how to apply a known linkage function safely when linking two tables. However, in practice, the linkage function is rarely known. Thus, there are many data linkage centers whose main role is to be the trusted third party to determine the linkage function manually and link data for research via a master population list for a designated region. Recently, a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on: (1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake data) and universe manipulation; and (3) minimum information disclosure via recoding. RESULTS We synthesize this literature to formalize a new framework for privacy preserving interactive record linkage (PPIRL) with tractable privacy and utility properties and then analyze the literature using this framework. CONCLUSIONS Human-based third-party linkage centers for privacy preserving record linkage are the accepted norm internationally. We find that a computer-based third-party platform that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process, is an effective human-machine hybrid system that significantly improves on the linkage center model both in terms of privacy and utility.
Affiliation(s)
- Hye-Chung Kum
  - Population Informatics Research Group, Department of Computer Science, UNC-CH & Department of Health Policy and Management, Texas A&M Health Science Center, USA

34
Xu H, Hui SL, Grannis S. Optimal two-phase sampling design for comparing accuracies of two binary classification rules. Stat Med 2013; 33:500-13. [DOI: 10.1002/sim.5946] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/13/2012] [Accepted: 07/22/2013] [Indexed: 11/11/2022]
Affiliation(s)
- Huiping Xu
  - Department of Biostatistics, Indiana University School of Public Health and School of Medicine, Indianapolis, IN, U.S.A.
- Siu L. Hui
  - Department of Biostatistics, Indiana University School of Public Health and School of Medicine, Indianapolis, IN, U.S.A.
  - Regenstrief Institute, Inc., Indianapolis, IN, U.S.A.
- Shaun Grannis
  - Regenstrief Institute, Inc., Indianapolis, IN, U.S.A.
  - Department of Family Medicine, Indiana University School of Public Health and School of Medicine, Indianapolis, IN, U.S.A.

35
Sadinle M, Fienberg SE. A Generalized Fellegi–Sunter Framework for Multiple Record Linkage With Application to Homicide Record Systems. J Am Stat Assoc 2013. [DOI: 10.1080/01621459.2012.757231] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 10/27/2022]

36
Gutman R, Afendulis CC, Zaslavsky AM. A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs. J Am Stat Assoc 2013; 108:34-47. [PMID: 23645944 DOI: 10.1080/01621459.2012.726889] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 10/27/2022]
Abstract
End-of-life medical expenses are a significant proportion of all health care expenditures. These costs were studied using costs of services from Medicare claims and cause of death (CoD) from death certificates. In the absence of a unique identifier linking the two datasets, common variables identified unique matches for only 33% of deaths. The remaining cases formed cells with multiple cases (32% in cells with an equal number of cases from each file and 35% in cells with an unequal number). We sampled from the joint posterior distribution of model parameters and the permutations that link cases from the two files within each cell. The linking models included the regression of location of death on CoD and other parameters, and the regression of cost measures with a monotone missing data pattern on CoD and other demographic characteristics. Permutations were sampled by enumerating the exact distribution for small cells and by the Metropolis algorithm for large cells. Sparse matrix data structures enabled efficient calculations despite the large dataset (≈1.7 million cases). The procedure generates m datasets in which the matches between the two files are imputed. The m datasets can be analyzed independently and results combined using Rubin's multiple imputation rules. Our approach can be applied in other file linking applications.
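The final step mentioned in this abstract, combining results from the m imputed datasets with Rubin's multiple imputation rules, can be sketched for a scalar estimate as follows. The function name and the numbers are illustrative only and are not taken from the paper.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    estimates: point estimate from each of the m imputed datasets.
    variances: squared standard error from each imputed dataset.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)          # average of the m estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from m = 3 imputed linkages.
toy_estimate, toy_total_var = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries linkage uncertainty into the final standard error: if every imputed linkage gave the same estimate, the total variance would reduce to the within-imputation average.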
Affiliation(s)
- Roee Gutman
  - Department of Biostatistics, Brown University, Providence, RI 02912