1
Xu H, Li X, Zhang Z, Grannis S. Variable selection for latent class analysis in the presence of missing data with application to record linkage. Stat Methods Med Res 2024:9622802241242317. [PMID: 38592341 DOI: 10.1177/09622802241242317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
The Fellegi-Sunter model is a latent class model widely used in probabilistic linkage to identify records that belong to the same entity. Record linkage practitioners typically employ all available matching fields in the model with the premise that more fields convey greater information about the true match status and hence result in improved match performance. In the context of model-based clustering, it is well known that such a premise is incorrect and the inclusion of noisy variables could compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly due to the ubiquity of the missing data in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for missing data common in record linkage. Through simulation studies, our proposed method is shown to select the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the use of our proposed selection procedure to identify informative matching fields for probabilistic record linkage algorithms.
Affiliation(s)
- Huiping Xu
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA
- Xiaochun Li
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, IN, USA
2
Moretti A, Shlomo N. Improving Probabilistic Record Linkage Using Statistical Prediction Models. Int Stat Rev 2022. [DOI: 10.1111/insr.12535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
Affiliation(s)
- Angelo Moretti
- Department of Methodology and Statistics, Utrecht University, Sjoerd Groenmangebouw, Padualaan 14, 3584 CH Utrecht, The Netherlands
- Natalie Shlomo
- Social Statistics Department, University of Manchester, Oxford Road, Manchester M13 9PL, UK
3
Xu H, Li X, Zhang Z, Grannis S. Score test for assessing the conditional dependence in latent class models and its application to record linkage. J R Stat Soc Ser C Appl Stat 2022. [DOI: 10.1111/rssc.12590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Huiping Xu
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, Indiana, USA
- Xiaochun Li
- Department of Biostatistics and Health Data Science, Indiana University, Indianapolis, Indiana, USA
4
Sosa J, Rodríguez A. A Bayesian approach for de-duplication in the presence of relational data. J Appl Stat 2022; 51:197-215. [PMID: 38283048 PMCID: PMC10810674 DOI: 10.1080/02664763.2022.2118678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 08/18/2022] [Indexed: 10/14/2022]
Abstract
In this paper, we study the impact of combining profile and network data in solving record de-duplication problems. We also assess the influence of a range of prior distributions on the linkage structure, and explore the use of stochastic gradient Hamiltonian Monte Carlo methods as a faster alternative to obtain samples from the posterior distribution for network parameters. Our methodology is evaluated using the RLdata500 data, which is a popular dataset in the record linkage literature.
Affiliation(s)
- Juan Sosa
- Departamento de Estadística, Universidad Nacional de Colombia, Bogotá, Colombia
- Abel Rodríguez
- Department of Statistics, University of Washington, Seattle, WA, USA
5
6
Kaplan A, Betancourt B, Steorts RC. A Practical Approach to Proper Inference with Linked Data. Am Stat 2022. [DOI: 10.1080/00031305.2022.2041482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Affiliation(s)
- Andee Kaplan
- Department of Statistics, Colorado State University, Fort Collins, CO
- Rebecca C. Steorts
- Departments of Statistical Science and Computer Science, Duke University, Durham, NC
7
Aleshin-Guendel S, Sadinle M. Multifile Partitioning for Record Linkage and Duplicate Detection. J Am Stat Assoc 2022; 118:1786-1795. [PMID: 37771512 PMCID: PMC10530869 DOI: 10.1080/01621459.2021.2013242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/20/2020] [Accepted: 11/28/2021] [Indexed: 10/19/2022]
Abstract
Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations.
Affiliation(s)
- Mauricio Sadinle
- Department of Biostatistics, University of Washington, Seattle, WA 98195
8
Ali A, Emran NA, Asmai SA. Missing values compensation in duplicates detection using hot deck method. Journal of Big Data 2021; 8:112. [DOI: 10.1186/s40537-021-00502-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2021] [Accepted: 08/08/2021] [Indexed: 09/01/2023]
Abstract
Duplicate record is a common problem within data sets especially in huge volume databases. The accuracy of duplicate detection determines the efficiency of duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within the records where during the clustering and matching process, missing values can cause records deemed similar to be inserted into the wrong group, hence, leading to undetected duplicates. In this paper, duplicate detection improvement was proposed despite the presence of missing values within a data set through Duplicate Detection within the Incomplete Data set (DDID) method. The missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. The results were analyzed, then, the performance of duplicate detection was evaluated by using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that by using Hot Deck, duplicate detection performance would be improved. Furthermore, the DDID performance was compared to an early duplicate detection method namely DuDe, in terms of its accuracy and speed. The findings yielded that even though the data sets were incomplete, DDID was able to offer a better accuracy and faster duplicate detection as compared to DuDe. The results of this study offer insights into constraints of duplicate detection within incomplete data sets.
9
Xu H, Li X, Grannis S. A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage. J Appl Stat 2021; 49:2789-2804. [PMID: 35909667 DOI: 10.1080/02664763.2021.1922615] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Abstract
The widely used Fellegi-Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi-Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields with diminished benefit as the discriminating power of matching fields increases.
Affiliation(s)
- Huiping Xu
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xiaochun Li
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
10
Marchant NG, Kaplan A, Elazar DN, Rubinstein BIP, Steorts RC. d-blink: Distributed End-to-End Bayesian Entity Resolution. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1825451] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/23/2022]
Affiliation(s)
- Neil G. Marchant
- School of Computing and Information Systems, University of Melbourne, Parkville, VIC, Australia
- Andee Kaplan
- Department of Statistics, Colorado State University, Fort Collins, CO
- Daniel N. Elazar
- Methodology Division, Australian Bureau of Statistics, Belconnen, ACT, Australia
- Rebecca C. Steorts
- Department of Statistical Science and Computer Science, Duke University, Durham, NC
- Principal Mathematical Statistician, United States Census Bureau (DRB #: CBDRB-FY20-309)
11
Affiliation(s)
- Giacomo Zanella
- Department of Decision Sciences, Bocconi University, BIDSA and IGIER, Milan, Italy
- Rebecca C. Steorts
- Department of Statistical Science and Computer Science, Duke University, Durham, NC
12
Xu H, Li X, Shen C, Hui SL, Grannis S. Incorporating conditional dependence in latent class models for probabilistic record linkage: Does it matter? Ann Appl Stat 2019. [DOI: 10.1214/19-aoas1256] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
13
Hurley PD, Oliver S, Mehta A. Creating longitudinal datasets and cleaning existing data identifiers in a cystic fibrosis registry using a novel Bayesian probabilistic approach from astronomy. PLoS One 2018; 13:e0199815. [PMID: 29985939 PMCID: PMC6037350 DOI: 10.1371/journal.pone.0199815] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/03/2017] [Accepted: 06/14/2018] [Indexed: 11/18/2022]
Abstract
Patient registry data are commonly collected as annual snapshots that need to be amalgamated to understand the longitudinal progress of each patient. However, patient identifiers can either change or may not be available for legal reasons when longitudinal data are collated from patients living in different countries. Here, we apply astronomical statistical matching techniques to link individual patient records that can be used where identifiers are absent or to validate uncertain identifiers. We adopt a Bayesian model framework used for probabilistically linking records in astronomy. We adapt this and validate it across blinded, annually collected data. This is a high-quality (Danish) sub-set of data held in the European Cystic Fibrosis Society Patient Registry (ECFSPR). Our initial experiments achieved a precision of 0.990 at a recall value of 0.987. However, detailed investigation of the discrepancies uncovered typing errors in 27 of the identifiers in the original Danish sub-set. After fixing these errors to create a new gold standard our algorithm correctly linked individual records across years achieving a precision of 0.997 at a recall value of 0.987 without recourse to identifiers. Our Bayesian framework provides the probability of whether a pair of records belong to the same patient. Unlike other record linkage approaches, our algorithm can also use physical models, such as body mass index curves, as prior information for record linkage. We have shown our framework can create longitudinal samples where none existed and validate pre-existing patient identifiers. We have demonstrated that in this specific case this automated approach is better than the existing identifiers.
Affiliation(s)
- Peter Donald Hurley
- Department of Physics and Astronomy, University of Sussex, Brighton, United Kingdom
- Seb Oliver
- Department of Physics and Astronomy, University of Sussex, Brighton, United Kingdom
- Anil Mehta
- Division of Medical Sciences, University of Dundee, Dundee, United Kingdom
14
Johndrow JE, Lum K, Dunson DB. Theoretical limits of microclustering for record linkage. Biometrika 2018; 105:431-446. [PMID: 29880978 PMCID: PMC5963577 DOI: 10.1093/biomet/asy003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/23/2017] [Indexed: 11/12/2022]
Abstract
There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.
Affiliation(s)
- J E Johndrow
- Department of Statistics, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, California 94305, U.S.A
- K Lum
- Human Rights Data Analysis Group, San Francisco, California 94110, U.S.A
- D B Dunson
- Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A
15
Chen B, Shrivastava A, Steorts RC. Unique entity estimation with application to the Syrian conflict. Ann Appl Stat 2018. [DOI: 10.1214/18-aoas1163] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
16
Sadinle M. Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations. Ann Appl Stat 2018. [DOI: 10.1214/18-aoas1178] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
17
Affiliation(s)
- Michel H. Hof
- Department of Clinical Epidemiology, Biostatistics and Bioinformatics, University of Amsterdam, Amsterdam, The Netherlands
- Anita C. Ravelli
- Department of Clinical Informatics, University of Amsterdam, Amsterdam, The Netherlands
- Aeilko H. Zwinderman
- Department of Clinical Epidemiology, Biostatistics and Bioinformatics, University of Amsterdam, Amsterdam, The Netherlands
18
Affiliation(s)
- Mauricio Sadinle
- Department of Statistical Science, Duke University, Durham, NC, and the National Institute of Statistical Sciences—NISS, Research Triangle Park, NC
19
Steorts RC, Hall R, Fienberg SE. A Bayesian Approach to Graphical Record Linkage and Deduplication. J Am Stat Assoc 2017. [DOI: 10.1080/01621459.2015.1105807] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 10/22/2022]
Affiliation(s)
- Rebecca C. Steorts
- Departments of Statistical Science and Computer Science, Duke University, Durham, NC, USA
- Stephen E. Fienberg
- Department of Statistics, Machine Learning Department, Heinz College, and Cylab, Carnegie Mellon University, Pittsburgh, PA, USA
20
Zhang G, Parker JD, Schenker N. Multiple imputation for missingness due to nonlinkage and program characteristics: a case study of the National Health Interview Survey linked to Medicare claims. Journal of Survey Statistics and Methodology 2016; 4:316-338. [PMID: 30949519 PMCID: PMC6444366 DOI: 10.1093/jssam/smw002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Record linkage is a valuable and efficient tool for connecting information from different data sources. The National Center for Health Statistics (NCHS) has linked its population-based health surveys with administrative data, including Medicare enrollment and claims records. However, the linked NCHS-Medicare files are subject to missing data; first, not all survey participants agree to record linkage, and second, Medicare claims data are only consistently available for beneficiaries enrolled in the Fee-for-Service (FFS) program, not in Medicare Advantage (MA) plans. In this research, we examine the usefulness of multiple imputation for handling missing data in linked National Health Interview Survey (NHIS)-Medicare files. The motivating example is a study of mammography status from 1999 to 2004 among women aged 65 years and older enrolled in the FFS program. In our example, mammography screening status and FFS/MA plan type are missing for NHIS survey participants who were not linkage eligible. Mammography status is also missing for linked participants in an MA plan. We explore three imputation approaches: (i) imputing screening status first, (ii) imputing FFS/MA plan type first, (iii) and imputing the two longitudinal processes simultaneously. We conduct simulation studies to evaluate these methods and compare them using the linked NHIS-Medicare files. The imputation procedures described in our paper would also be applicable to other public health-related research using linked data files with missing data issues arising from program characteristics (e.g., intermittent enrollment or data collection) reflected in administrative data and linkage eligibility by survey participants.
Affiliation(s)
- Guangyu Zhang
- National Center for Health Statistics, Hyattsville, MD 20782, USA
21
Ventura SL, Nugent R, Fuchs ER. Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records. Research Policy 2015. [DOI: 10.1016/j.respol.2014.12.010] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 10/24/2022]
22
Sadinle M. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann Appl Stat 2014. [DOI: 10.1214/14-aoas779] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022]