1
Barreto ML, Ichihara MY, Pescarini JM, Ali MS, Borges GL, Fiaccone RL, Ribeiro-Silva RDC, Teles CA, Almeida D, Sena S, Carreiro RP, Cabral L, Almeida BA, Barbosa GCG, Pita R, Barreto ME, Mendes AAF, Ramos DO, Brickley EB, Bispo N, Machado DB, Paixao ES, Rodrigues LC, Smeeth L. Cohort Profile: The 100 Million Brazilian Cohort. Int J Epidemiol 2022; 51:e27-e38. [PMID: 34922344 PMCID: PMC9082797 DOI: 10.1093/ije/dyab213] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 04/07/2021] [Accepted: 09/17/2021] [Indexed: 11/16/2022]
Affiliation(s)
- Mauricio L Barreto
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Institute of Collective Health, Federal University of Bahia (UFBA), Salvador, Brazil
- Maria Yury Ichihara
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Institute of Collective Health, Federal University of Bahia (UFBA), Salvador, Brazil
- Julia M Pescarini
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
- M Sanni Ali
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
  - Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Center for Statistics in Medicine, University of Oxford, Oxford, UK
- Gabriela L Borges
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Rosemeire L Fiaccone
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Department of Statistics, Federal University of Bahia, Salvador, Brazil
- Rita de Cássia Ribeiro-Silva
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Department of Nutrition, Federal University of Bahia, Salvador, Brazil
- Carlos A Teles
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Daniela Almeida
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Samila Sena
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Roberto P Carreiro
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Liliana Cabral
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Bethania A Almeida
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- George C G Barbosa
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Robespierre Pita
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Marcos E Barreto
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Department of Statistics, London School of Economics and Political Science, London, UK
- Andre A F Mendes
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Dandara O Ramos
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Institute of Collective Health, Federal University of Bahia (UFBA), Salvador, Brazil
- Elizabeth B Brickley
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
- Nivea Bispo
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Department of Statistics, Federal University of Bahia, Salvador, Brazil
- Daiane B Machado
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
- Enny S Paixao
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
- Laura C Rodrigues
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fundação Oswaldo Cruz, Salvador, Brazil
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
- Liam Smeeth
  - Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK

2
Barbosa GCG, Ali MS, Araujo B, Reis S, Sena S, Ichihara MYT, Pescarini J, Fiaccone RL, Amorim LD, Pita R, Barreto ME, Smeeth L, Barreto ML. CIDACS-RL: a novel indexing search and scoring-based record linkage system for huge datasets with high accuracy and scalability. BMC Med Inform Decis Mak 2020; 20:289. [PMID: 33167998 PMCID: PMC7654019 DOI: 10.1186/s12911-020-01285-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/21/2019] [Accepted: 10/11/2020] [Indexed: 12/13/2022]
Abstract
Background: Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. Although many open source and commercial data linkage tools exist, the volume and complexity of the datasets currently available for linkage pose a major challenge; an efficient linkage tool with reasonable accuracy and scalability is therefore required.
Methods: We developed CIDACS-RL (Centre for Data and Knowledge Integration for Health – Record Linkage), a novel iterative deterministic record linkage algorithm based on a combination of indexing search and scoring algorithms (provided by Apache Lucene). We described how the algorithm works and compared its performance with four open source linkage tools (AtyImo, Febrl, FRIL and RecLink) in terms of sensitivity and positive predictive value using a gold standard dataset. We also evaluated its accuracy and scalability using a case study, and its scalability and execution time using a simulated cohort in serial (single core) and multi-core (eight core) computation settings.
Results: Overall, the CIDACS-RL algorithm performed best on both positive predictive value (99.93% versus AtyImo 99.30%, RecLink 99.5%, Febrl 98.86%, and FRIL 96.17%) and sensitivity (99.87% versus AtyImo 98.91%, RecLink 73.75%, Febrl 90.58%, and FRIL 74.66%). In the case study, using a ROC curve to choose the most appropriate cut-off value (0.896), the obtained metrics were: sensitivity = 92.5% (95% CI 92.07–92.99), specificity = 93.5% (95% CI 93.08–93.8) and area under the curve (AUC) = 97% (95% CI 96.97–97.35). The multi-core computation was about four times faster (150 seconds) than the serial setting (550 seconds) when using a dataset of 20 million records.
Conclusion: The CIDACS-RL algorithm is an innovative linkage tool for huge datasets, with higher accuracy, improved scalability, and substantially shorter execution time compared to other existing linkage tools. In addition, CIDACS-RL can be deployed on standard computers without the need for high-speed processors and distributed infrastructures.
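
The general idea of combining an index search with a scoring step can be illustrated with a minimal Python sketch. This is not the CIDACS-RL implementation and does not use Apache Lucene: the inverted index on name tokens, the `difflib`-based similarity, the equal field weights, the field names, the toy records and the 0.85 cut-off are all assumptions chosen for illustration. The last function shows how sensitivity and positive predictive value could be computed against a gold standard.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(records):
    """Inverted index: name token -> ids of the records containing that token."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for token in rec["name"].lower().split():
            index[token].add(rec_id)
    return index


def similarity(a, b):
    """String similarity in [0, 1]; a simple stand-in for a search engine's score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(query, candidate):
    """Mean similarity over the shared attributes (equal weights are an assumption)."""
    fields = ("name", "birth_date", "municipality")
    return sum(similarity(query[f], candidate[f]) for f in fields) / len(fields)


def link(queries, targets, cutoff=0.85):
    """Return {query_id: target_id} for the best candidate scoring >= cutoff."""
    index = build_index(targets)
    links = {}
    for q_id, q in queries.items():
        # Index search: only records sharing at least one name token are scored,
        # which avoids comparing every query with every target.
        candidates = set()
        for token in q["name"].lower().split():
            candidates |= index.get(token, set())
        if not candidates:
            continue
        best = max(sorted(candidates), key=lambda t_id: score(q, targets[t_id]))
        if score(q, targets[best]) >= cutoff:
            links[q_id] = best
    return links


def sensitivity_and_ppv(links, truth):
    """truth maps every query id that has a true match to its target id."""
    true_pos = sum(1 for q_id, t_id in links.items() if truth.get(q_id) == t_id)
    sensitivity = true_pos / len(truth) if truth else 0.0
    ppv = true_pos / len(links) if links else 0.0
    return sensitivity, ppv


# Hypothetical toy data, invented for this example.
targets = {
    "t1": {"name": "Maria Souza Lima", "birth_date": "1985-03-12", "municipality": "Salvador"},
    "t2": {"name": "Maria Santos", "birth_date": "1990-07-01", "municipality": "Recife"},
    "t3": {"name": "Ana Pereira", "birth_date": "1979-11-30", "municipality": "Salvador"},
}
queries = {
    "q1": {"name": "Maria Sousa Lima", "birth_date": "1985-03-12", "municipality": "Salvador"},
    "q2": {"name": "Ana Pereira", "birth_date": "1979-11-30", "municipality": "Salvador"},
    "q3": {"name": "Joana Ribeiro", "birth_date": "2001-01-15", "municipality": "Manaus"},
}
gold_standard = {"q1": "t1", "q2": "t3"}  # q3 has no true match

links = link(queries, targets)
sens, ppv = sensitivity_and_ppv(links, gold_standard)
print(links, sens, ppv)
```

In this sketch the index plays the role of a blocking step: a query with a misspelt name token (`q1`) still reaches its true match through the tokens it shares, while a query with no shared token (`q3`) is never scored.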
Affiliation(s)
- George C G Barbosa
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- M Sanni Ali
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
  - Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
  - NDORMS, Center for Statistics in Medicine, University of Oxford, Oxford, UK
- Bruno Araujo
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Sandra Reis
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Samila Sena
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Maria Y T Ichihara
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Julia Pescarini
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Rosemeire L Fiaccone
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
  - Department of Statistics, Federal University of Bahia (UFBA), Salvador, Brazil
- Leila D Amorim
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
  - Department of Statistics, Federal University of Bahia (UFBA), Salvador, Brazil
- Robespierre Pita
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
- Marcos E Barreto
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
  - Computer Science Department, Federal University of Bahia (UFBA), Salvador, Brazil
  - Department of Statistics, London School of Economics and Political Science (LSE), London, UK
- Liam Smeeth
  - Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Mauricio L Barreto
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Parque Tecnológico da Bahia, Edf. Tecnocentro, sala 315, Rua Mundo, no 121, Salvador, 41301-110, Brazil
  - Institute of Public Health, Federal University of Bahia (UFBA), Salvador, Brazil

3
Almeida D, Gorender D, Ichihara MY, Sena S, Menezes L, Barbosa GCG, Fiaccone RL, Paixão ES, Pita R, Barreto ML. Examining the quality of record linkage process using nationwide Brazilian administrative databases to build a large birth cohort. BMC Med Inform Decis Mak 2020; 20:173. [PMID: 32711532 PMCID: PMC7382864 DOI: 10.1186/s12911-020-01192-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/22/2020] [Accepted: 07/17/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND: Research using linked routine population-based data collected for non-research purposes has increased in recent years because such data are a rich and detailed source of information. The objective of this study is to present an approach to prepare and link data from administrative sources in a middle-income country, to estimate its quality and to identify potential sources of bias by comparing linked and non-linked individuals.
METHODS: We linked two Brazilian administrative datasets covering the period 2001 to 2015: the live birth information system and the 100 Million Brazilian Cohort (created using administrative records from over 114 million individuals whose families applied for social assistance via the Unified Register for Social Programmes). Linkage used maternal attributes (name, age, date of birth, and municipality of residence) and an in-house linkage tool, CIDACS-RL. We then estimated the proportion of highly probable links and examined the characteristics of missed matches to identify any potential source of bias.
RESULTS: A total of 27,699,891 live births were submitted to linkage with maternal information recorded in the baseline of the 100 Million Brazilian Cohort dataset; of those, 16,447,414 (59.4%) children were found registered in the 100 Million Brazilian Cohort dataset. The proportion of highly probable links ranged from 39.3% in 2001 to 82.1% in 2014. A substantial improvement in the linkage was observed after the introduction of the maternal date of birth attribute in 2011. Our analyses indicated a slightly higher proportion of missing data among missed matches, and a higher proportion of people living in an urban area and self-declared as Caucasian among linked pairs when compared with non-linked sets.
DISCUSSION: We demonstrated that CIDACS-RL is capable of performing high quality linkage even with a limited number of common attributes, using indexation as a blocking strategy in large routine databases from a middle-income country. However, residual records were more frequent among people living under worse conditions. The results presented in this study reinforce the need to evaluate linkage quality and, when necessary, to take linkage error into account in the analyses of any generated dataset.
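
The two checks described in the abstract, the share of records linked and a comparison of characteristics between linked and non-linked records, can be sketched in a few lines of Python. The overall figure uses the counts reported above (which give the reported 59.4%); the function names, the record layout and the toy records are hypothetical and only illustrate the shape of such a comparison, not the authors' analysis.

```python
def linked_percentage(n_linked, n_submitted):
    """Percentage of submitted records for which a link was found."""
    return 100.0 * n_linked / n_submitted


# Counts taken from the abstract: linked children / live births submitted.
overall = linked_percentage(16_447_414, 27_699_891)


def proportion_by_link_status(records, attribute, value):
    """Share of records with attribute == value, for linked and non-linked groups.

    Returns a dict keyed by link status (True for linked, False for non-linked).
    """
    shares = {}
    for status in (True, False):
        group = [r for r in records if r["linked"] is status]
        hits = sum(1 for r in group if r.get(attribute) == value)
        shares[status] = hits / len(group) if group else float("nan")
    return shares


# Hypothetical toy records, invented for this example only.
toy_records = [
    {"linked": True, "area": "urban"},
    {"linked": True, "area": "urban"},
    {"linked": True, "area": "rural"},
    {"linked": False, "area": "urban"},
    {"linked": False, "area": "rural"},
    {"linked": False, "area": None},  # missing value
]

shares = proportion_by_link_status(toy_records, "area", "urban")
print(round(overall, 1), shares)
```

A marked difference between the two groups on an attribute such as area of residence, or on the share of missing values, is the kind of signal that would suggest linkage error is not random and may need to be accounted for in later analyses.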
Affiliation(s)
- Daniela Almeida
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- David Gorender
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- Maria Yury Ichihara
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- Samila Sena
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- Luan Menezes
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- George C G Barbosa
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
  - University of Arizona, Computer Science Department, Tucson, Arizona, USA
- Rosimeire L Fiaccone
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
  - Department of Statistics, Federal University of Bahia (UFBA), Salvador, Brazil
- Enny S Paixão
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
  - Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
- Robespierre Pita
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil
- Mauricio L Barreto
  - Centre for Data and Knowledge Integration for Health (CIDACS), Fiocruz Bahia, Salvador, Brazil