1
|
FIRLA: A Fast Incremental Record Linkage Algorithm. J Biomed Inform 2022; 130:104094. [PMID: 35550929 DOI: 10.1016/j.jbi.2022.104094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Revised: 05/02/2022] [Accepted: 05/04/2022] [Indexed: 11/23/2022]
Abstract
Record linkage is an important problem studied widely in many domains including biomedical informatics. A standard version of this problem is to cluster records from several datasets, such that each cluster has records pertinent to just one individual. Typically, datasets are huge in size. Hence, existing record linkage algorithms take a very long time. It is thus essential to develop novel fast algorithms for record linkage. The incremental version of this problem is to link previously clustered records with new records added to the input datasets. A novel algorithm has been created to efficiently perform standard and incremental record linkage. This algorithm leverages a set of efficient techniques that significantly restrict the number of record pair comparisons and distance computations. Our algorithm shows an average speed-up of 2.4x (up to 4x) for the standard linkage problem as compared to the state-of-the-art, without any drop in linkage performance at all. On average, our algorithm can incrementally link records in just 33% of the time required for linking them from scratch. Our algorithms achieve comparable or superior linkage performance and outperform the state-of-the-art in terms of linking time in all cases where the number of comparison attributes is greater than two. In practice, more than two comparison attributes are quite common. The proposed algorithm is very efficient and could be used in practice for record linkage applications especially when records are being added over time and linkage output needs to be updated frequently.
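As an illustration of the incremental setting described in this abstract, the following Python sketch compares each new record only against records that share a blocking key and merges matches with a union-find structure, so earlier clusters are reused rather than rebuilt. It is not the FIRLA algorithm: the class name IncrementalLinker, the blocking key (first letter of the last name), the edit-distance threshold, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class IncrementalLinker:
    """Hypothetical linker: blocking plus union-find (not FIRLA itself)."""

    def __init__(self, threshold=1):
        self.threshold = threshold        # assumed total edit-distance cutoff
        self.records = []                 # (first_name, last_name) tuples
        self.parent = []                  # union-find parent pointers
        self.blocks = defaultdict(list)   # blocking key -> record indices

    def _find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def add(self, record):
        """Link one new record against its block only; return its cluster root."""
        idx = len(self.records)
        self.records.append(record)
        self.parent.append(idx)
        key = record[1][:1].lower()       # assumed blocking key
        for other in self.blocks[key]:
            dist = sum(edit_distance(x.lower(), y.lower())
                       for x, y in zip(record, self.records[other]))
            if dist <= self.threshold:
                self.parent[self._find(idx)] = self._find(other)
        self.blocks[key].append(idx)
        return self._find(idx)

    def clusters(self):
        out = defaultdict(list)
        for i, rec in enumerate(self.records):
            out[self._find(i)].append(rec)
        return sorted(out.values())


linker = IncrementalLinker()
for rec in [("John", "Smith"), ("Jon", "Smith"), ("Mary", "Jones")]:
    linker.add(rec)
linker.add(("Mary", "Jonas"))   # arrives later; only the "j" block is compared
result = linker.clusters()
```

Blocking is what restricts the number of pair comparisons in this sketch: the late-arriving record is never compared with the two "Smith" records.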
|
2
|
Domingues MAP, Camacho R, Rodrigues PP. CMIID: A comprehensive medical information identifier for clinical search harmonization in Data Safe Havens. J Biomed Inform 2020; 114:103669. [PMID: 33359111 DOI: 10.1016/j.jbi.2020.103669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2020] [Revised: 11/28/2020] [Accepted: 12/16/2020] [Indexed: 11/27/2022]
Abstract
Over the last decades clinical research has been driven by informatics changes nourished by distinct research endeavors. Inherent to this evolution, several issues have been the focus of a variety of studies: multi-location patient data access, interoperability between terminological and classification systems, and harmonization of clinical practice and records. Having these problems in mind, the Data Safe Haven paradigm emerged to promote a new architecture, better reasoning, and safe and easy access to distinct Clinical Data Repositories. The aim of this study is to present a novel solution for clinical search harmonization within a safe environment, making use of a hybrid coding taxonomy that enables researchers to collect information from multiple repositories based on a clinical domain query definition. Results show that it is possible to query multiple repositories using a single query definition based on clinical domains and the capabilities of the Unified Medical Language System, although it leads to deterioration of the framework response times. Participants of a Focus Group and a System Usability Scale questionnaire rated the framework with a median value of 72.5, indicating the hybrid coding taxonomy could be enriched with additional metadata to further improve the refinement of the results and enable the possibility of using this system as a data quality tagging mechanism.
Affiliation(s)
- Rui Camacho
- Faculty of Engineering of the University of Porto, Portugal; LIAAD-INESC TEC, Porto, Portugal
- Pedro Pereira Rodrigues
- CINTESIS - Center for Health Technology and Services Research, Portugal; Faculty of Medicine of the University of Porto, Portugal
|
3
|
McManus BM, Richardson Z, Schenkman M, Murphy NJ, Everhart RM, Hambidge S, Morrato E. Child characteristics and early intervention referral and receipt of services: a retrospective cohort study. BMC Pediatr 2020; 20:84. [PMID: 32087676 PMCID: PMC7036184 DOI: 10.1186/s12887-020-1965-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 05/23/2019] [Accepted: 02/07/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Early Intervention (EI) is a federally mandated, state-administered system of care for children with developmental delays and disabilities under the age of three. Gaps exist in the process of accessing EI through pediatric primary care, and low rates of EI access are well documented and disproportionately affect poor and minority children. The aims of this paper are to examine child characteristics associated with gaps in EI (1) referral, (2) access and (3) service use. To our knowledge, this is the first study to leverage linked safety net health system pediatric primary care and EI records data to follow EI-referred children longitudinally to understand EI service use gaps from EI referral to EI service utilization. METHODS In a retrospective cohort design (14,710 children with developmental disability or delay), we linked pediatric primary care records between a large, integrated safety net health system in metro Denver and its corresponding EI program (2014-2016). Using adjusted marginal effects [ME (95% CI)], we estimated gaps in EI referral, access, and service type (i.e., physical [PT], occupational [OT], speech therapy [ST] and developmental intervention [DI]). Analyses accounted for child characteristics including socio-demographics, diagnosis, condition severity, and baseline function. RESULTS Only 18.7% of EI-eligible children (N = 2726) received a referral; 26% of those (N = 722) received services for a net enrollment rate of 5% among EI-eligible children. Having the most severe developmental condition was positively associated with EI referral [ME = 0.334 (0.249, 0.420)] and Individualized Family Services Plan (IFSP) receipt [ME = 0.156 (0.088, 0.223)]. Children less likely to be EI-referred were Black, non-Hispanic (BNH) [ME = -0.029 (-0.054, -0.004)] and had a diagnosed condition [ME = -0.046 (-0.087, -0.005)]. Children with a diagnosis and those with higher income were more likely to receive PT or OT. Higher baseline cognitive and adaptive skills were associated with lower likelihood of PT [ME = -0.029 (-0.054, -0.004)], OT [ME = -0.029 (-0.054, -0.004)], and ST [ME = -0.029 (-0.054, -0.004)]. CONCLUSIONS We identified and characterized gaps in EI referral, access, and service use in an urban safety-net population of children with high rates of developmental delay. Interventions are needed to improve integrated systems of care affecting primary care and EI processes and coordination.
Affiliation(s)
- Beth M McManus
- Department of Health Systems, Management and Policy, Colorado School of Public Health, 13001 E 17th Place, MS B119, Aurora, Colorado, 80045, USA.
- Zachary Richardson
- Department of Health Systems, Management and Policy, Colorado School of Public Health, 13001 E 17th Place, MS B119, Aurora, Colorado, 80045, USA
- Margaret Schenkman
- Physical Therapy Program, University of Colorado School of Medicine, 13121 East 17th Ave. Mail Stop C244, Aurora, Colorado, 80045, USA
- Natalie J Murphy
- Physical Therapy Program, University of Colorado School of Medicine, 13121 East 17th Ave. Mail Stop C244, Aurora, Colorado, 80045, USA
- Rachel M Everhart
- Ambulatory Care Services Data and Analytics Denver Health, 777 Bannock St., Denver, Colorado, 80204, USA
- Simon Hambidge
- Denver Community Health Services, 777 Bannock St., Denver, Colorado, 80204, USA
- Elaine Morrato
- Department of Health Systems, Management and Policy, Colorado School of Public Health, 13001 E 17th Place, MS B119, Aurora, Colorado, 80045, USA
|
4
|
Lattar H, Salem AB, Ben Ghezala HH. Does data cleaning improve heart disease prediction? Procedia Comput Sci 2020. [DOI: 10.1016/j.procs.2020.09.109] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/23/2022]
|
5
|
Chin EL, Simmons G, Bouzid YY, Kan A, Burnett DJ, Tagkopoulos I, Lemay DG. Nutrient Estimation from 24-Hour Food Recalls Using Machine Learning and Database Mapping: A Case Study with Lactose. Nutrients 2019; 11:E3045. [PMID: 31847188 PMCID: PMC6950225 DOI: 10.3390/nu11123045] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/30/2019] [Revised: 11/30/2019] [Accepted: 12/06/2019] [Indexed: 01/03/2023] Open
Abstract
The Automated Self-Administered 24-Hour Dietary Assessment Tool (ASA24) is a free dietary recall system that outputs fewer nutrients than the Nutrition Data System for Research (NDSR). NDSR uses the Nutrition Coordinating Center (NCC) Food and Nutrient Database, both of which require a license. Manual lookup of ASA24 foods into NDSR is time-consuming but currently the only way to acquire NCC-exclusive nutrients. Using lactose as an example, we evaluated machine learning and database matching methods to estimate this NCC-exclusive nutrient from ASA24 reports. ASA24-reported foods were manually looked up into NDSR to obtain lactose estimates and split into training (n = 378) and test (n = 189) datasets. Nine machine learning models were developed to predict lactose from the nutrients common between ASA24 and the NCC database. Database matching algorithms were developed to match NCC foods to an ASA24 food using only nutrients ("Nutrient-Only") or the nutrient and food descriptions ("Nutrient + Text"). For both methods, the lactose values were compared to the manual curation. Among machine learning models, the XGB-Regressor model performed best on held-out test data (R² = 0.33). For the database matching method, Nutrient + Text matching yielded the best lactose estimates (R² = 0.76), a vast improvement over the status quo of no estimate. These results suggest that computational methods can successfully estimate an NCC-exclusive nutrient for foods reported in ASA24.
Affiliation(s)
- Elizabeth L Chin
- Western Human Nutrition Research Center, USDA ARS, Davis, CA 95616, USA
- Genome Center, University of California Davis, Davis, CA 95616, USA
- Gabriel Simmons
- Department of Mechanical Engineering, University of California Davis, Davis, CA 95616, USA
- Yasmine Y Bouzid
- Western Human Nutrition Research Center, USDA ARS, Davis, CA 95616, USA
- Department of Nutrition, University of California Davis, Davis, CA 95616, USA
- Annie Kan
- Western Human Nutrition Research Center, USDA ARS, Davis, CA 95616, USA
- Department of Nutrition, University of California Davis, Davis, CA 95616, USA
- Dustin J Burnett
- Western Human Nutrition Research Center, USDA ARS, Davis, CA 95616, USA
- Department of Nutrition, University of California Davis, Davis, CA 95616, USA
- Ilias Tagkopoulos
- Genome Center, University of California Davis, Davis, CA 95616, USA
- Department of Computer Science, University of California Davis, Davis, CA 95616, USA
- Danielle G Lemay
- Western Human Nutrition Research Center, USDA ARS, Davis, CA 95616, USA
- Genome Center, University of California Davis, Davis, CA 95616, USA
- Department of Nutrition, University of California Davis, Davis, CA 95616, USA
|
6
|
Transfusion Safety: The Nature and Outcomes of Errors in Patient Registration. Transfus Med Rev 2019; 33:78-83. [DOI: 10.1016/j.tmrv.2018.11.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/10/2018] [Revised: 11/18/2018] [Accepted: 11/28/2018] [Indexed: 11/23/2022]
|
7
|
Agopian AJ, Salemi JL, Tanner JP, Kirby RS. Using birth defects surveillance programs for population-based estimation of sibling recurrence risks. Birth Defects Res 2018; 110:1383-1387. [PMID: 30338928 DOI: 10.1002/bdr2.1387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2018] [Revised: 07/30/2018] [Accepted: 08/02/2018] [Indexed: 11/06/2022]
Affiliation(s)
- A J Agopian
- Department of Epidemiology, Human Genetics, and Environmental Sciences, UTHealth School of Public Health, Houston, Texas
- Jason L Salemi
- Department of Family and Community Medicine, Baylor College of Medicine, Houston, Texas
- Jean Paul Tanner
- Birth Defects Surveillance Program, Department of Community and Family Health, College of Public Health, University of South Florida, Tampa, Florida
- Russell S Kirby
- Birth Defects Surveillance Program, Department of Community and Family Health, College of Public Health, University of South Florida, Tampa, Florida
|
8
|
de Paula AA, Pires DF, Filho PA, de Lemos KRV, Barçante E, Pacheco AG. A comparison of accuracy and computational feasibility of two record linkage algorithms in retrieving vital status information from HIV/AIDS patients registered in Brazilian public databases. Int J Med Inform 2018; 114:45-51. [PMID: 29673602 DOI: 10.1016/j.ijmedinf.2018.03.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/09/2017] [Revised: 03/19/2018] [Accepted: 03/19/2018] [Indexed: 11/19/2022]
Abstract
BACKGROUND AND OBJECTIVE While cross-referencing information from people living with HIV/AIDS (PLWHA) to the official mortality database is a critical step in monitoring the HIV/AIDS epidemic in Brazil, the accuracy of the linkage routine may compromise the validity of the final database, leading to biased epidemiological estimates. We compared the accuracy and the total runtime of two linkage algorithms applied to retrieve vital status information from PLWHA in Brazilian public databases. METHODS Nominally identified records from PLWHA were obtained from three distinct government databases. Linkage routines included an algorithm in Python language (PLA) and Reclink software (RlS), a probabilistic software largely utilized in Brazil. Records from PLWHA known to be alive were added to those from patients reported as deceased. Data were then searched in the mortality system. Scenarios in which 5% and 50% of patients were actually dead were simulated, considering both complete cases and 20% missing maternal names. RESULTS When complete information was available both algorithms had comparable accuracies. In the scenario of 20% missing maternal names, PLA and RlS had sensitivities of 94.5% and 94.6% (p > 0.5), respectively; after manual review, PLA sensitivity increased to 98.4% (96.6-100.0), exceeding that for RlS (p < 0.01). PLA had a higher positive predictive value in the 5% death proportion scenario. RlS intrinsically required manual review for up to 14% of records of people actually dead, whereas the corresponding proportion ranged from 1.5% to 2% for PLA. The lack of manual inspection did not alter PLA sensitivity when complete information was available. When data were incomplete, PLA sensitivity increased from 94.5% to 98.4%, thus exceeding that of RlS (94.6%, p < 0.05). RlS required considerably less processing time than PLA. CONCLUSION Both linkage algorithms showed comparable accuracy in retrieving vital status data from PLWHA. RlS had a considerably shorter runtime but intrinsically required manual review of a burdensome proportion of the matched records. On the other hand, PLA required considerably more runtime but spared manual review at no cost to accuracy.
Affiliation(s)
- Pedro Alves Filho
- Rio de Janeiro State Health Secretariat, Rua México, 128, Rio de Janeiro, Brazil.
- Eduardo Barçante
- DataUERJ/UERJ, Rua São Francisco Xavier, 524, Rio de Janeiro, Brazil.
|
9
|
Stausberg J, Nasseh D. Evaluation of a Binary Semi-supervised Classification Technique for Probabilistic Record Linkage. Methods Inf Med 2018; 55:136-43. [DOI: 10.3414/me14-01-0087] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/25/2014] [Accepted: 03/25/2015] [Indexed: 11/09/2022]
Abstract
Summary. Background: The process of merging data of different data sources is referred to as record linkage. A medical environment with increased preconditions on privacy protection demands the transformation of clear-text attributes like first name or date of birth into one-way encrypted pseudonyms. When performing an automated or privacy preserving record linkage there might be the need of a binary classification deciding whether two records should be classified as the same entity. Classification is the last of the four main phases of the record linkage process: preprocessing, indexing, matching and classification. The choice of binary classification techniques depending on project specifications, in particular data quality, has not yet been studied extensively. Objectives: The aim of this work is the introduction and evaluation of an automatable semi-supervised binary classification system applied within the field of record linkage capable of competing with or even surpassing advanced automated techniques of the domain of unsupervised classification. Methods: This work describes the rationale leading to the model and the final implementation of an automatable semi-supervised binary classification system and the comparison of its classification performance to an advanced active learning approach out of the domain of unsupervised learning. The performance of both systems has been measured on a broad variety of artificial test sets (n = 400), based on real patient data, with distinct and unique characteristics. Results: While the classification performance for both methods measured as F-measure was relatively close on test sets with maximum defined data quality, 0.996 for semi-supervised classification, 0.993 for unsupervised classification, it incrementally diverged for test sets of worse data quality dropping to 0.964 for semi-supervised classification and 0.803 for unsupervised classification. Conclusions: Aside from supplying a viable model for semi-supervised classification for automated probabilistic record linkage, the tests conducted on a large number of test sets suggest that semi-supervised techniques might generally be capable of outperforming unsupervised techniques especially on data with lower levels of data quality.
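The F-measure used in this abstract is, in its usual form, the harmonic mean of precision and recall over classified record pairs. A minimal Python sketch follows, assuming the balanced F1 variant (the abstract does not state which weighting was used); the function name f_measure and the counts are hypothetical and are not taken from the paper.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1) from pair-level classification counts.

    Assumption: precision and recall are weighted equally; the abstract
    above does not specify the weighting.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)   # share of predicted matches that are real
    recall = true_pos / (true_pos + false_neg)      # share of real matches that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct matches, 20 false matches, 20 missed matches.
score = f_measure(true_pos=80, false_pos=20, false_neg=20)
```

With these counts precision and recall are both 0.8, so the harmonic mean is also 0.8.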
|
10
|
Corradi JP, Chhabra J, Mather JF, Waszynski CM, Dicks RS. Analysis of multi-dimensional contemporaneous EHR data to refine delirium assessments. Comput Biol Med 2016; 75:267-74. [PMID: 27340924 DOI: 10.1016/j.compbiomed.2016.06.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 04/27/2016] [Revised: 06/10/2016] [Accepted: 06/13/2016] [Indexed: 12/16/2022]
Abstract
Delirium is a potentially lethal condition of altered mental status, attention, and level of consciousness with an acute onset and fluctuating course. Its causes are multi-factorial, and its pathophysiology is not well understood; therefore clinical focus has been on prevention strategies and early detection. One patient evaluation technique in routine use is the Confusion Assessment Method (CAM): a relatively simple test resulting in 'positive', 'negative' or 'unable-to-assess' (UTA) ratings. Hartford Hospital nursing staff use the CAM regularly on all non-critical care units, and a high frequency of UTA was observed after reviewing several years of records. In addition, patients with UTA ratings displayed poor outcomes such as in-hospital mortality, longer lengths of stay, and discharge to acute and long term care facilities. We sought to better understand the use of UTA, especially outside of critical care environments, in order to improve delirium detection throughout the hospital. An unsupervised clustering approach was used with additional, concurrent assessment data available in the EHR to categorize patient visits with UTA CAMs. The results yielded insights into the most common situations in which the UTA rating was used (e.g. impaired verbal communication, dementia), suggesting potentially inappropriate ratings that could be refined with further evaluation and remedied with updated clinical training. Analysis of the patient clusters also suggested that unrecognized delirium may contribute to the poor outcomes associated with the use of UTA. This method of using temporally related high dimensional EHR data to illuminate a dynamic medical condition could have wider applicability.
Affiliation(s)
- John P Corradi
- Research Department, Hartford Hospital, 80 Seymour Street, Hartford, CT 06102 USA.
- Jyoti Chhabra
- Research Department, Hartford Hospital, 80 Seymour Street, Hartford, CT 06102 USA
- Jeffrey F Mather
- Research Department, Hartford Hospital, 80 Seymour Street, Hartford, CT 06102 USA
- Christine M Waszynski
- Division of Geriatric Medicine, Hartford Hospital, 80 Seymour Street, Hartford, CT 06102 USA
- Robert S Dicks
- Division of Geriatric Medicine, Hartford Hospital, 80 Seymour Street, Hartford, CT 06102 USA
|
11
|
Zech J, Husk G, Moore T, Shapiro JS. Measuring the Degree of Unmatched Patient Records in a Health Information Exchange Using Exact Matching. Appl Clin Inform 2016; 7:330-40. [PMID: 27437044 DOI: 10.4338/aci-2015-11-ra-0158] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 12/01/2015] [Accepted: 02/26/2016] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Health information exchange (HIE) facilitates the exchange of patient information across different healthcare organizations. To match patient records across sites, HIEs usually rely on a master patient index (MPI), a database responsible for determining which medical records at different healthcare facilities belong to the same patient. A single patient's records may be improperly split across multiple profiles in the MPI. OBJECTIVES We investigated how often two individuals shared the same first name, last name, and date of birth in the Social Security Death Master File (SSDMF), a US government database containing over 85 million individuals, to determine the feasibility of using exact matching as a split record detection tool. We demonstrated how a method based on exact record matching could be used to partially measure the degree of probable split patient records in the MPI of an HIE. METHODS We calculated the percentage of individuals who were uniquely identified in the SSDMF using first name, last name, and date of birth. We defined a measure consisting of the average number of unique identifiers associated with a given first name, last name, and date of birth. We calculated a reference value for this measure on a subsample of SSDMF data. We compared this measure value to data from a functioning HIE. RESULTS We found that it was unlikely for two individuals to share the same first name, last name, and date of birth in a large US database including over 85 million individuals. 98.81% of individuals were uniquely identified in this dataset using only these three items. We compared the value of our measure on a subsample of Social Security data (1.00089) to that of HIE data (1.1238) and found a significant difference (t-test p-value < 0.001). CONCLUSIONS This method may assist HIEs in detecting split patient records.
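The measure defined in the METHODS of this abstract (the average number of unique identifiers per first name, last name, and date of birth combination) can be sketched in a few lines of Python. The function name split_record_measure, the tuple layout, the case and whitespace normalization, and the sample records are assumptions for illustration; the abstract does not describe how names were normalized.

```python
from collections import defaultdict


def split_record_measure(records):
    """Average number of distinct identifiers per (first, last, dob) key.

    `records` is an iterable of (identifier, first_name, last_name, dob).
    A value near 1.0 means each key maps to one identifier; larger values
    suggest that one person may be split across several profiles.
    """
    ids_by_key = defaultdict(set)
    for identifier, first, last, dob in records:
        # Assumed normalization: trim whitespace and ignore letter case.
        key = (first.strip().lower(), last.strip().lower(), dob)
        ids_by_key[key].add(identifier)
    if not ids_by_key:
        raise ValueError("no records supplied")
    return sum(len(ids) for ids in ids_by_key.values()) / len(ids_by_key)


# Hypothetical sample: "Ann Lee" appears under two identifiers.
sample = [
    ("A1", "Ann", "Lee", "1980-01-01"),
    ("A2", "ann", "Lee ", "1980-01-01"),
    ("B1", "Bob", "Ray", "1975-05-05"),
    ("C1", "Cy", "Fox", "1990-09-09"),
]
measure = split_record_measure(sample)   # 4 identifiers over 3 keys
```

Comparing such a value against a reference population, as the abstract does with the SSDMF subsample, is what turns the raw average into a signal of probable split records.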
Affiliation(s)
- John Zech
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Jason S Shapiro
- Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
12
|
[Completeness assessment of the Breton registry of congenital abnormalities: A checking tool based on hospital discharge data]. Rev Epidemiol Sante Publique 2015; 63:223-35. [PMID: 26119557 DOI: 10.1016/j.respe.2015.04.012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/02/2014] [Revised: 03/18/2015] [Accepted: 04/08/2015] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Exhaustiveness is required for registries. In the Breton registry of congenital abnormalities, cases are recorded at the source. We use hospital discharge data in order to verify the completeness of the registry. In this paper, we present a computerized tool for completeness assessment applied to the Breton registry. METHODS All the medical information departments were solicited once a year, asking for infant medical stays for newborns alive at one year old and for mother's stays if not. Files were transmitted by secure messaging and data were processed on a secure server. An identity-matching algorithm was applied and a similarity score calculated. When the record was not linked automatically or manually, the medical record had to be consulted. The exhaustiveness rate was assessed using the capture-recapture method and the proportion of cases matched manually was used to assess the identity-matching algorithm. RESULTS The computerized tool has been used in common practice since June 2012 by the registry investigators. The results presented concern the years 2011 and 2012. There were 470 potential cases identified from the hospital discharge data in 2011 and 538 in 2012; 35 new cases were detected in 2011 (32 children born alive and 3 stillborn), and 33 in 2012 (children born alive). There were respectively 85 and 137 false-positive cases. The theoretical exhaustiveness rate reached 91% for both years. The rate of exact matching amounted to 68%; 6% of the potential cases were linked manually. CONCLUSION Hospital discharge databases contribute to the quality of the registry even though reports are made at the source. The implemented tool facilitates the investigator's work. In the future, use of the national identifying number, when allowed, should facilitate linkage between registry data and hospital discharge data.
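The capture-recapture idea mentioned in this abstract can be illustrated with the simplest two-source form, the Lincoln-Petersen estimator, which estimates the true number of cases from two overlapping lists. This is a sketch under stated assumptions: the abstract does not say which estimator the authors used, the function names are hypothetical, and the counts below are invented for the example rather than taken from the paper.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Estimated total cases from two independent sources.

    n_first  : cases found by the first source (e.g. the registry)
    n_second : cases found by the second source (e.g. discharge data)
    n_both   : cases found by both sources after record linkage
    """
    if n_both <= 0:
        raise ValueError("the two sources must share at least one case")
    return n_first * n_second / n_both


def completeness(n_first, n_second, n_both):
    """Share of the estimated total that the first source captured."""
    return n_first / lincoln_petersen(n_first, n_second, n_both)


# Hypothetical counts: registry 90 cases, discharge data 50, overlap 45.
estimated_total = lincoln_petersen(90, 50, 45)
registry_completeness = completeness(90, 50, 45)
```

The estimate depends on how many cases are counted as found by both sources, which is why the quality of the identity-matching step feeds directly into the exhaustiveness rate.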
|
13
|
Rudniy A, Song M, Geller J. Mapping biological entities using the longest approximately common prefix method. BMC Bioinformatics 2014; 15:187. [PMID: 24928653 PMCID: PMC4086698 DOI: 10.1186/1471-2105-15-187] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/26/2013] [Accepted: 05/29/2014] [Indexed: 11/24/2022] Open
Abstract
Background The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation. Results This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method. Conclusions The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix outperforms these nine approximate string matching methods in its Maximum F1 measure when evaluated on three out of the four datasets, and in its average precision on two of the four datasets.
Affiliation(s)
- Min Song
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seoul 120-749, Korea.
|
14
|
Kum HC, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc 2013; 21:212-20. [PMID: 24201028 DOI: 10.1136/amiajnl-2013-002165] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high quality record linkage requires a human-machine hybrid system to safely manage uncertainty in the ever changing streams of chaotic Big Data. METHODS In the computer science literature, private record linkage is the most published area. It investigates how to apply a known linkage function safely when linking two tables. However, in practice, the linkage function is rarely known. Thus, there are many data linkage centers whose main role is to be the trusted third party to determine the linkage function manually and link data for research via a master population list for a designated region. Recently, a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on: (1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake data) and universe manipulation; and (3) minimum information disclosure via recoding. RESULTS We synthesize this literature to formalize a new framework for privacy preserving interactive record linkage (PPIRL) with tractable privacy and utility properties and then analyze the literature using this framework. CONCLUSIONS Human-based third-party linkage centers for privacy preserving record linkage are the accepted norm internationally. We find that a computer-based third-party platform that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process, is an effective human-machine hybrid system that significantly improves on the linkage center model both in terms of privacy and utility.
Affiliation(s)
- Hye-Chung Kum
- Population Informatics Research Group, Department of Computer Science, UNC-CH & Department of Health Policy and Management, Texas A&M Health Science Center, USA
|
15
|
Cox S, Martin R, Somaia P, Smith K. The development of a data-matching algorithm to define the ‘case patient’. AUST HEALTH REV 2013; 37:54-9. [DOI: 10.1071/ah11161] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 03/19/2012] [Accepted: 07/02/2012] [Indexed: 11/23/2022]
Abstract
Objectives. To describe a model that matches electronic patient care records within a given case to one or more patients within that case. Method. This retrospective study included data from all metropolitan Ambulance Victoria electronic patient care records (n = 445 576) for the time period 1 January 2009–31 May 2010. Data were captured via VACIS (Ambulance Victoria, Melbourne, Vic., Australia), an in-field electronic data capture system linked to an integrated data warehouse database. The case patient algorithm included ‘Jaro–Winkler’, ‘Soundex’ and ‘weight matching’ conditions. Results. The case patient matching algorithm has a sensitivity of 99.98%, a specificity of 99.91% and an overall accuracy of 99.98%. Conclusions. The case patient algorithm provides Ambulance Victoria with a sophisticated, efficient and highly accurate method of matching patient records within a given case. This method has applicability to other emergency services where unique identifiers are case based rather than patient based. What is known about the topic? Accurate pre-hospital data that can be linked to patient outcomes is widely accepted as critical to support pre-hospital patient care and system performance. What does this paper add? There is a paucity of literature describing electronic matching of patient care records at the patient level rather than the case level. Ambulance Victoria has developed a complex yet efficient and highly accurate method for electronically matching patient records, in the absence of a patient-specific unique identifier. Linkage of patient information from multiple patient care records to determine if the records are for the same individual defines the ‘case patient’. What are the implications for practitioners? This paper describes a model of record linkage where patients are matched within a given case at the patient level as opposed to the case level. This methodology is applicable to other emergency services where unique identifiers are case based.
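As a rough illustration of how a Soundex condition and a weight-matching condition can be combined, the sketch below implements standard American Soundex and a simple additive field-weight score. It is not the Ambulance Victoria algorithm: the field names, the weights and the threshold are hypothetical, and the Jaro–Winkler condition named in the abstract is left out.

```python
# Standard American Soundex letter-to-digit table.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


# Hypothetical field weights and threshold, chosen only for illustration.
WEIGHTS = {"surname_soundex": 4, "given_name": 3, "dob": 5, "sex": 1}
THRESHOLD = 9


def match_weight(a, b):
    """Sum the weights of the fields on which two records agree."""
    score = 0
    code_a, code_b = soundex(a["surname"]), soundex(b["surname"])
    if code_a and code_a == code_b:
        score += WEIGHTS["surname_soundex"]
    if a["given_name"].lower() == b["given_name"].lower():
        score += WEIGHTS["given_name"]
    if a["dob"] == b["dob"]:
        score += WEIGHTS["dob"]
    if a["sex"] == b["sex"]:
        score += WEIGHTS["sex"]
    return score


def same_patient(a, b):
    """Declare a match when the total weight reaches the threshold."""
    return match_weight(a, b) >= THRESHOLD
```

In this sketch a phonetic surname match alone is not enough to link two records; agreement on date of birth carries the most weight, which is one plausible choice when records within the same case must be told apart.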
|
16
|
Finney JM, Walker AS, Peto TEA, Wyllie DH. An efficient record linkage scheme using graphical analysis for identifier error detection. BMC Med Inform Decis Mak 2011; 11:7. [PMID: 21284874 PMCID: PMC3039555 DOI: 10.1186/1472-6947-11-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 06/14/2010] [Accepted: 02/01/2011] [Indexed: 11/10/2022] Open
Abstract
Background Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone. Methods We describe a two-step record linkage algorithm in which identifiers with high cardinality are identified or generated and used to perform an initial linkage based on exact matching. Subsequently, the resulting clusters are studied and, if appropriate, partitioned using a graph-based algorithm that detects erroneous identifiers. Results The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters, of which about 99.8% contain, with high likelihood, records from one patient. Although the algorithm is computationally efficient, its requirement that at least one identifier of each record exactly match that of another record for cluster formation may be a limitation in some databases containing records of low identifier quality. Conclusions The technique described offers a simple, fast and highly efficient two-step method for large-scale initial linkage of records commonly found in the UK's National Health Service.
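The first step described above, clustering records that share an exact identifier value, can be sketched with a union-find (disjoint-set) structure. This is a minimal illustration rather than the authors' implementation: the record layout (a dict of identifier fields, with None for a missing value) and the function name are assumptions, and the second step, graph-based detection of erroneous identifiers, is not shown.

```python
from collections import defaultdict


def cluster_by_shared_identifiers(records):
    """Group record indices that are connected by any exactly matching identifier."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # (field, value) -> index of first record carrying it
    for idx, rec in enumerate(records):
        for field, value in rec.items():
            if value is None:
                continue  # a missing identifier links nothing
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical records with two identifier fields.
records = [
    {"nhs": "A", "mrn": "1"},
    {"nhs": None, "mrn": "1"},
    {"nhs": "B", "mrn": "2"},
    {"nhs": "B", "mrn": None},
    {"nhs": "C", "mrn": "3"},
]
clusters = cluster_by_shared_identifiers(records)
```

Each record is visited once and each identifier is looked up in a hash map, so no pairwise comparison of records is needed. A single wrong identifier can merge two patients into one cluster, which is the situation a later partitioning step would have to detect.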
Affiliation(s)
- John M Finney
- NIHR Biomedical Research Centre, John Radcliffe Hospital, Oxford, UK
|
17
|
Silveira DPD, Artmann E. Accuracy of probabilistic record linkage applied to health databases: systematic review. Rev Saude Publica 2009; 43:875-82. [PMID: 19784456 DOI: 10.1590/s0034-89102009005000060] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 10/12/2008] [Accepted: 04/15/2009] [Indexed: 11/21/2022] Open
Abstract
OBJECTIVE To analyze the national and international literature on the validity of record linkage procedures for health databases, focusing on quality assessment of results. METHODS A systematic review of cohort, case-control, and cross-sectional studies that evaluated the quality of probabilistic record linkage of health databases was conducted. Cochrane methodology for systematic reviews was used. The following databases were widely searched: Medline, LILACS, Scopus, SciELO and Scirus. No time filter was applied, and articles were searched in the following languages: Portuguese, Spanish, French and English. RESULTS Summary measures of the quality of probabilistic record linkage were sensitivity, specificity, and positive predictive value. A total of 202 studies were identified and, after applying the inclusion criteria, 33 articles were reviewed. Only six had complete data on the summary measures of interest. The main limitations were: no reviewer to evaluate titles and abstracts; and no blinding of the article's authors in the review process. Most scientific publications in this field were from the United States, United Kingdom, and New Zealand. Overall, the accuracy of probabilistic record linkage of databases ranged from 74% to 98% sensitivity and 99% to 100% specificity. CONCLUSIONS Probabilistic record linkage of health databases has notably been characterized by high sensitivity and greater flexibility of the procedure's sensitivity, indicating concern with data accuracy. The positive predictive value in studies shows a high proportion of truly positive record pairs. The quality assessment of these procedures has proved essential for validating the results obtained in these studies, and can also help improve the large health databases available in Brazil.
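The three summary measures named above follow from the counts of record pairs classified against a gold standard. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from any reviewed study.

```python
def linkage_quality(tp, fp, tn, fn):
    """Compute summary measures from pair-level counts.

    tp: true matches declared as links      fp: non-matches declared as links
    tn: non-matches left unlinked           fn: true matches left unlinked
    """
    def ratio(numerator, denominator):
        # Return NaN rather than raising when a measure is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true matches found
        "specificity": ratio(tn, tn + fp),  # share of non-matches rejected
        "ppv": ratio(tp, tp + fp),          # share of declared links that are true
    }


# Hypothetical counts for illustration only.
quality = linkage_quality(tp=90, fp=10, tn=990, fn=10)
```

Because non-matching pairs vastly outnumber matching pairs in most linkage tasks, specificity tends to sit close to 100%, so sensitivity and positive predictive value are usually the more informative of the three.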
Affiliation(s)
- Daniele Pinto da Silveira
- Programa de Pós-Graduação em Saúde Pública, Escola Nacional de Saúde Pública Sergio Arouca, Fundação Oswaldo Cruz, Rio de Janeiro, RJ, Brasil
|