1
Lotspeich SC, Shepherd BE, Kariuki MA, Wools-Kaloustian K, McGowan CC, Musick B, Semeere A, Crabtree Ramírez BE, Mkwashapi DM, Cesar C, Ssemakadde M, Machado DM, Ngeresa A, Ferreira FF, Lwali J, Marcelin A, Cardoso SW, Luque MT, Otero L, Cortés CP, Duda SN. Lessons learned from over a decade of data audits in international observational HIV cohorts in Latin America and East Africa. J Clin Transl Sci 2023; 7:e245. [PMID: 38033704 PMCID: PMC10685260 DOI: 10.1017/cts.2023.659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 10/13/2023] [Accepted: 10/16/2023] [Indexed: 12/02/2023]
Abstract
Introduction Routine patient care data are increasingly used for biomedical research, but such "secondary use" data have known limitations, including their quality. When leveraging routine care data for observational research, developing audit protocols that can maximize informational return and minimize costs is paramount. Methods For more than a decade, the Latin America and East Africa regions of the International epidemiology Databases to Evaluate AIDS (IeDEA) consortium have been auditing the observational data drawn from participating human immunodeficiency virus clinics. Since our earliest audits, where external auditors used paper forms to record audit findings from paper medical records, we have streamlined our protocols to obtain more efficient and informative audits that keep up with advancing technology while reducing travel obligations and associated costs. Results We present five key lessons learned from conducting data audits of secondary-use data from resource-limited settings for more than 10 years and share eight recommendations for other consortia looking to implement data quality initiatives. Conclusion After completing multiple audit cycles in both the Latin America and East Africa regions of the IeDEA consortium, we have established a rich reference for data quality in our cohorts, as well as large, audited analytical datasets that can be used to answer important clinical questions with confidence. By sharing our audit processes and how they have been adapted over time, we hope that others can develop protocols informed by our lessons learned from more than a decade of experience in these large, diverse cohorts.
Affiliation(s)
- Sarah C. Lotspeich
  - Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kara Wools-Kaloustian
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Catherine C. McGowan
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Beverly Musick
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Aggrey Semeere
  - Infectious Diseases Institute, Makerere University, Kampala, Uganda
- Brenda E. Crabtree Ramírez
  - Department of Infectious Diseases, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Denna M. Mkwashapi
  - Sexual and Reproductive Health Program, National Institute for Medical Research Mwanza, United Republic of Tanzania, Mwanza, Tanzania
- Daisy Maria Machado
  - Departamento de Pediatria, Universidade Federal de São Paulo, São Paulo, Brazil
- Antony Ngeresa
  - Academic Model Providing Access to Health Care (AMPATH), Eldoret, Kenya
- Jerome Lwali
  - Tumbi Hospital HIV Care and Treatment Clinic, United Republic of Tanzania, Kibaha, Tanzania
- Adias Marcelin
  - Le Groupe Haïtien d’Etude du Sarcome de Kaposi et des Infections Opportunistes, Port-au-Prince, Haiti
- Marco Tulio Luque
  - Instituto Hondureño de Seguridad Social and Hospital Escuela Universitario, Tegucigalpa, Honduras
- Larissa Otero
  - Instituto de Medicina Tropical Alexander von Humboldt, Universidad Peruana Cayetano Heredia, Lima, Peru
  - School of Medicine, Universidad Peruana Cayetano Heredia, Lima, Peru
- Stephany N. Duda
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
2
Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol 2021; 21:234. [PMID: 34706667 PMCID: PMC8549408 DOI: 10.1186/s12874-021-01416-5] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 07/02/2021] [Accepted: 09/28/2021] [Indexed: 11/10/2022]
Abstract
Electronic health records (EHRs) are widely used in epidemiological research, but the validity of the results is dependent upon the assumptions made about the healthcare system, the patient, and the provider. In this review, we identify four overarching challenges in using EHR-based data for epidemiological analysis, with a particular emphasis on threats to validity. These challenges include representativeness of the EHR to a target population, the availability and interpretability of clinical and non-clinical data, and missing data at both the variable and observation levels. Each challenge reveals layers of assumptions that the epidemiologist is required to make, from the point of patient entry into the healthcare system, to the provider documenting the results of the clinical exam and follow-up of the patient longitudinally; all with the potential to bias the results of analysis of these data. Understanding the extent of as well as remediating potential biases requires a variety of methodological approaches, from traditional sensitivity analyses and validation studies, to newer techniques such as natural language processing. Beyond methods to address these challenges, it will remain crucial for epidemiologists to engage with clinicians and informaticians at their institutions to ensure data quality and accessibility by forming multidisciplinary teams around specific research projects.
Affiliation(s)
- Milena A Gianfrancesco
  - Division of Rheumatology, University of California School of Medicine, San Francisco, CA, USA
- Neal D Goldstein
  - Department of Epidemiology and Biostatistics, Drexel University Dornsife School of Public Health, 3215 Market St., Philadelphia, PA, 19104, USA
3
Boe LA, Tinker LF, Shaw PA. An approximate quasi-likelihood approach for error-prone failure time outcomes and exposures. Stat Med 2021; 40:5006-5024. [PMID: 34519082 PMCID: PMC8963256 DOI: 10.1002/sim.9108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2020] [Revised: 04/21/2021] [Accepted: 06/03/2021] [Indexed: 11/08/2022]
Abstract
Measurement error arises commonly in clinical research settings that rely on data from electronic health records or large observational cohorts. In particular, self-reported outcomes are typical in cohort studies for chronic diseases such as diabetes in order to avoid the burden of expensive diagnostic tests. Dietary intake, which is also commonly collected by self-report and subject to measurement error, is a major factor linked to diabetes and other chronic diseases. These errors can bias exposure-disease associations that ultimately can mislead clinical decision-making. We have extended an existing semiparametric likelihood-based method for handling error-prone, discrete failure time outcomes to also address covariate error. We conduct an extensive numerical study to compare the proposed method to the naive approach that ignores measurement error in terms of bias and efficiency in the estimation of the regression parameter of interest. In all settings considered, the proposed method showed minimal bias and maintained coverage probability, thus outperforming the naive analysis which showed extreme bias and low coverage. This method is applied to data from the Women's Health Initiative to assess the association between energy and protein intake and the risk of incident diabetes mellitus. Our results show that correcting for errors in both the self-reported outcome and dietary exposures leads to considerably different hazard ratio estimates than those from analyses that ignore measurement error, which demonstrates the importance of correcting for both outcome and covariate error.
Affiliation(s)
- Lillian A. Boe
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Lesley F. Tinker
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
4
Tao R, Lotspeich SC, Amorim G, Shaw PA, Shepherd BE. Efficient semiparametric inference for two-phase studies with outcome and covariate measurement errors. Stat Med 2021; 40:725-738. [PMID: 33145800 PMCID: PMC8214478 DOI: 10.1002/sim.8799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2020] [Revised: 09/07/2020] [Accepted: 10/20/2020] [Indexed: 11/07/2022]
Abstract
In modern observational studies using electronic health records or other routinely collected data, both the outcome and covariates of interest can be error-prone and their errors often correlated. A cost-effective solution is the two-phase design, under which the error-prone outcome and covariates are observed for all subjects during the first phase and that information is used to select a validation subsample for accurate measurements of these variables in the second phase. Previous research on two-phase measurement error problems largely focused on scenarios where there are errors in covariates only or the validation sample is a simple random sample of study subjects. Herein, we propose a semiparametric approach to general two-phase measurement error problems with a quantitative outcome, allowing for correlated errors in the outcome and covariates and arbitrary second-phase selection. We devise a computationally efficient and numerically stable expectation-maximization algorithm to maximize the nonparametric likelihood function. The resulting estimators possess desired statistical properties. We demonstrate the superiority of the proposed methods over existing approaches through extensive simulation studies, and we illustrate their use in an observational HIV study.
Affiliation(s)
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee
- Sarah C. Lotspeich
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Gustavo Amorim
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
5
Oh EJ, Shepherd BE, Lumley T, Shaw PA. Raking and regression calibration: Methods to address bias from correlated covariate and time-to-event error. Stat Med 2021; 40:631-649. [PMID: 33140432 PMCID: PMC7874496 DOI: 10.1002/sim.8793] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/17/2020] [Revised: 08/05/2020] [Accepted: 10/11/2020] [Indexed: 11/11/2022]
Abstract
Medical studies that depend on electronic health records (EHR) data are often subject to measurement error, as the data are not collected to support research questions under study. These data errors, if not accounted for in study analyses, can obscure or cause spurious associations between patient exposures and disease risk. Methodology to address covariate measurement error has been well developed; however, time-to-event error has also been shown to cause significant bias, but methods to address it are relatively underdeveloped. More generally, it is possible to observe errors in both the covariate and the time-to-event outcome that are correlated. We propose regression calibration (RC) estimators to simultaneously address correlated error in the covariates and the censored event time. Although RC can perform well in many settings with covariate measurement error, it is biased for nonlinear regression models, such as the Cox model. Thus, we additionally propose raking estimators which are consistent estimators of the parameter defined by the population estimating equation. Raking can improve upon RC in certain settings with failure-time data, require no explicit modeling of the error structure, and can be utilized under outcome-dependent sampling designs. We discuss features of the underlying estimation problem that affect the degree of improvement the raking estimator has over the RC approach. Detailed simulation studies are presented to examine the performance of the proposed estimators under varying levels of signal, error, and censoring. The methodology is illustrated on observational EHR data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic.
Affiliation(s)
- Eric J. Oh
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
6
Shaw PA, He J, Shepherd BE. Regression calibration to correct correlated errors in outcome and exposure. Stat Med 2021; 40:271-286. [PMID: 33086428 PMCID: PMC8670514 DOI: 10.1002/sim.8773] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/04/2019] [Revised: 08/31/2020] [Accepted: 09/25/2020] [Indexed: 11/07/2022]
Abstract
Measurement error arises through a variety of mechanisms. A rich literature exists on the bias introduced by covariate measurement error and on methods of analysis to address this bias. By comparison, less attention has been given to errors in outcome assessment and nonclassical covariate measurement error. We consider an extension of the regression calibration method to settings with errors in a continuous outcome, where the errors may be correlated with prognostic covariates or with covariate measurement error. This method adjusts for the measurement error in the data and can be applied with either a validation subset, on which the true data are also observed (eg, a study audit), or a reliability subset, where a second observation of the error-prone measurements is available. For each case, we provide conditions under which the proposed method is identifiable and leads to consistent estimates of the regression parameter. When the second measurement on the reliability subset has no error or classical unbiased measurement error, the proposed method is consistent even when the primary outcome and exposures of interest are subject to both systematic and random error. We examine the performance of the method with simulations for a variety of measurement error scenarios and sizes of the reliability subset. We illustrate the method's application using data from the Women's Health Initiative Dietary Modification Trial.
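For orientation, the classical form of regression calibration with a validation subset can be sketched on simulated data. This is not the extension proposed in the abstract above: the sketch assumes error in the exposure only, an error-free outcome, and a simple random validation sample. All variable names, sample sizes, and error variances are illustrative assumptions.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(1)          # fixed seed so the sketch is reproducible
n, n_validated = 20_000, 2_000  # hypothetical cohort and audit sizes
true_slope = 2.0

x = [rng.gauss(0, 1) for _ in range(n)]        # true exposure
x_star = [xi + rng.gauss(0, 1) for xi in x]    # error-prone exposure
y = [0.5 + true_slope * xi + rng.gauss(0, 1) for xi in x]

# Naive analysis: regress the outcome on the error-prone exposure.
_, naive_slope = fit_line(x_star, y)

# Calibration model, fitted only on records whose true exposure was "audited".
validated = rng.sample(range(n), n_validated)
a, b = fit_line([x_star[i] for i in validated], [x[i] for i in validated])

# Replace the error-prone exposure with its calibrated prediction for everyone,
# then refit the outcome model.
x_hat = [a + b * xs for xs in x_star]
_, rc_slope = fit_line(x_hat, y)

print(f"naive slope: {naive_slope:.2f}, calibrated slope: {rc_slope:.2f}")
```

Under these assumptions the naive slope is attenuated toward roughly half the true value, while the calibrated slope should land close to the true slope of 2.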
Affiliation(s)
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Jiwei He
  - Office of Biostatistics, Office of Translational Sciences, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee
7
Shepherd BE, Shaw PA. Errors in multiple variables in human immunodeficiency virus (HIV) cohort and electronic health record data: statistical challenges and opportunities. Statistical Communications in Infectious Diseases 2020; 12:20190015. [PMID: 35880997 PMCID: PMC9204761 DOI: 10.1515/scid-2019-0015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2019] [Accepted: 08/21/2020] [Indexed: 06/15/2023]
Abstract
Objectives: Observational data derived from patient electronic health records (EHR) data are increasingly used for human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) research. There are challenges to using these data, in particular with regards to data quality; some are recognized, some unrecognized, and some recognized but ignored. There are great opportunities for the statistical community to improve inference by incorporating validation subsampling into analyses of EHR data. Methods: Methods to address measurement error, misclassification, and missing data are relevant, as are sampling designs such as two-phase sampling. However, many of the existing statistical methods for measurement error, for example, only address relatively simple settings, whereas the errors seen in these datasets span multiple variables (both predictors and outcomes), are correlated, and even affect who is included in the study. Results/Conclusion: We will discuss some preliminary methods in this area with a particular focus on time-to-event outcomes and outline areas of future research.
Affiliation(s)
- Bryan E. Shepherd
  - Biostatistics, Vanderbilt University, 2525 West End, Suite 11000, Nashville, Tennessee 37203, USA
- Pamela A. Shaw
  - Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
8
Giganti MJ, Shaw PA, Chen G, Bebawy SS, Turner MM, Sterling TR, Shepherd BE. Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples, and multiple imputation. Ann Appl Stat 2020; 14:1045-1061. [PMID: 32999698 PMCID: PMC7523695 DOI: 10.1214/20-aoas1343] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 11/19/2022]
Abstract
Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially impact estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. Using EHR data from 4217 patients, the hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at ART initiation was 0.74 (95% CI: 0.68-0.80) using unvalidated data and 0.60 (95% CI: 0.53-0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with those from the unvalidated dataset and the corresponding subsample-only dataset for various subsample sizes. By incorporating reasonably sized validated subsamples and appropriate imputation models, our approach had improved estimation over both the naive analysis and the analysis using only the validation subsample.
Affiliation(s)
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin
9
Lotspeich SC, Giganti MJ, Maia M, Vieira R, Machado DM, Succi RC, Ribeiro S, Pereira MS, Rodriguez MF, Julmiste G, Luque MT, Caro-Vega Y, Mejia F, Shepherd BE, McGowan CC, Duda SN. Self-audits as alternatives to travel-audits for improving data quality in the Caribbean, Central and South America network for HIV epidemiology. J Clin Transl Sci 2020; 4:125-132. [PMID: 32313702 PMCID: PMC7159809 DOI: 10.1017/cts.2019.442] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/14/2019] [Revised: 11/19/2019] [Accepted: 11/25/2019] [Indexed: 11/25/2022]
Abstract
INTRODUCTION Audits play a critical role in maintaining the integrity of observational cohort data. While previous work has validated the audit process, sending trained auditors to sites ("travel-audits") can be costly. We investigate the efficacy of training sites to conduct "self-audits." METHODS In 2017, eight research groups in the Caribbean, Central, and South America network for HIV Epidemiology each audited a subset of their patient records randomly selected by the data coordinating center at Vanderbilt. Designated investigators at each site compared abstracted research data to the original clinical source documents and captured audit findings electronically. Additionally, two Vanderbilt investigators performed on-site travel-audits at three randomly selected sites (one adult and two pediatric) in late summer 2017. RESULTS Self- and travel-auditors, respectively, reported that 93% and 92% of 8919 data entries, captured across 28 unique clinical variables on 65 patients, were entered correctly. Across all entries, 8409 (94%) received the same assessment from self- and travel-auditors (7988 correct and 421 incorrect). Of 421 entries mutually assessed as "incorrect," 304 (72%) were corrected by both self- and travel-auditors and 250 of these (82%) received the same corrections. Reason for changing antiretroviral therapy (ART) regimen, ART end date, viral load value, CD4%, and HIV diagnosis date had the most mismatched corrections. CONCLUSIONS With similar overall error rates, findings suggest that data audits conducted by trained local investigators could provide an alternative to on-site audits by external auditors to ensure continued data quality. However, discrepancies observed between corrections illustrate challenges in determining correct values even with audits.
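The agreement percentages in this abstract can be recomputed from the counts it quotes. The short check below is only a worked-arithmetic sketch: the variable names and the `pct` helper are illustrative, and percentages are rounded to whole numbers.

```python
# Counts quoted in the abstract above.
total_entries = 8919        # data entries assessed by both auditor types
same_assessment = 8409      # entries given the same assessment by both
both_correct = 7988         # assessed "correct" by both
both_incorrect = 421        # assessed "incorrect" by both
corrected_by_both = 304     # of those, corrected by both auditor types
same_correction = 250       # of those, given identical corrections

def pct(numerator, denominator):
    """Return a percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)

# The two agreement categories should add up to the overall agreement count.
assert both_correct + both_incorrect == same_assessment

agreement = pct(same_assessment, total_entries)       # overall agreement
corrected = pct(corrected_by_both, both_incorrect)    # corrected by both
matching = pct(same_correction, corrected_by_both)    # identical corrections

print(agreement, corrected, matching)  # prints: 94 72 82
```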
Affiliation(s)
- Sarah C. Lotspeich
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Mark J. Giganti
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Marcelle Maia
  - Departamento de Pediatria, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Renalice Vieira
  - Departamento de Pediatria, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Daisy Maria Machado
  - Departamento de Pediatria, Universidade Federal de São Paulo, São Paulo, Brazil
- Regina Célia Succi
  - Departamento de Pediatria, Universidade Federal de São Paulo, São Paulo, Brazil
- Sayonara Ribeiro
  - Instituto Nacional de Infectologia Evandro Chagas, Rio de Janeiro, Brazil
- Gaetane Julmiste
  - Le Groupe Haïtien d’Etude du Sarcome de Kaposi et des Infections Opportunistes, Port-au-Prince, Haiti
- Marco Tulio Luque
  - Instituto Hondureño de Seguridad Social and Hospital Escuela Universitario, Tegucigalpa, Honduras
- Yanink Caro-Vega
  - Departamento de Enfermedades Infecciosas, El Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Fernando Mejia
  - Instituto de Medicina Tropical Alexander von Humboldt, Universidad Peruana Cayetano Heredia, Lima, Peru
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Catherine C. McGowan
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Stephany N. Duda
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
10
Giganti MJ, Shepherd BE, Caro-Vega Y, Luz PM, Rebeiro PF, Maia M, Julmiste G, Cortes C, McGowan CC, Duda SN. The impact of data quality and source data verification on epidemiologic inference: a practical application using HIV observational data. BMC Public Health 2019; 19:1748. [PMID: 31888571 PMCID: PMC6937856 DOI: 10.1186/s12889-019-8105-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/02/2019] [Accepted: 12/17/2019] [Indexed: 01/23/2023]
Abstract
BACKGROUND Data audits are often evaluated soon after completion, even though the identification of systematic issues may lead to additional data quality improvements in the future. In this study, we assess the impact of the entire data audit process on subsequent statistical analyses. METHODS We conducted on-site audits of datasets from nine international HIV care sites. Error rates were quantified for key demographic and clinical variables among a subset of records randomly selected for auditing. Based on audit results, some sites were tasked with targeted validation of high-error-rate variables resulting in a post-audit dataset. We estimated the times from antiretroviral therapy initiation until death and first AIDS-defining event using the pre-audit data, the audit data, and the post-audit data. RESULTS The overall discrepancy rate between pre-audit and audit data (n = 250) across all audited variables was 17.1%. The estimated probability of mortality and an AIDS-defining event over time was higher in the audited data relative to the pre-audit data. Among patients represented in both the post-audit and pre-audit cohorts (n = 18,999), AIDS and mortality estimates also were higher in the post-audit data. CONCLUSION Though some changes may have occurred independently, our findings suggest that improved data quality following the audit may impact epidemiological inferences.
Affiliation(s)
- Yanink Caro-Vega
  - Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Paula M. Luz
  - Instituto Nacional de Infectologia Evandro Chagas, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
- Marcelle Maia
  - Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Claudia Cortes
  - Fundación Arriarán, University of Chile School of Medicine, Santiago, Chile
11
Gustafson P, Karim ME. When exposure is subject to nondifferential misclassification, are validation data helpful in testing for an exposure–disease association? Can J Stat 2019. [DOI: 10.1002/cjs.11490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Paul Gustafson
  - Department of Statistics, University of British Columbia, Vancouver, Canada
- Mohammad Ehsanul Karim
  - School of Population and Public Health, University of British Columbia, Vancouver, Canada
  - Centre for Health Evaluation and Outcome Sciences, Providence Health Care, Vancouver, Canada
12
Oh EJ, Shepherd BE, Lumley T, Shaw PA. Considerations for analysis of time-to-event outcomes measured with error: Bias and correction with SIMEX. Stat Med 2018; 37:1276-1289. [PMID: 29193180 PMCID: PMC5810403 DOI: 10.1002/sim.7554] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 10/10/2016] [Revised: 08/10/2017] [Accepted: 10/06/2017] [Indexed: 11/09/2022]
Abstract
For time-to-event outcomes, a rich literature exists on the bias introduced by covariate measurement error in regression models, such as the Cox model, and methods of analysis to address this bias. By comparison, less attention has been given to understanding the impact or addressing errors in the failure time outcome. For many diseases, the timing of an event of interest (such as progression-free survival or time to AIDS progression) can be difficult to assess or reliant on self-report and therefore prone to measurement error. For linear models, it is well known that random errors in the outcome variable do not bias regression estimates. With nonlinear models, however, even random error or misclassification can introduce bias into estimated parameters. We compare the performance of 2 common regression models, the Cox and Weibull models, in the setting of measurement error in the failure time outcome. We introduce an extension of the SIMEX method to correct for bias in hazard ratio estimates from the Cox model and discuss other analysis options to address measurement error in the response. A formula to estimate the bias induced into the hazard ratio by classical measurement error in the event time for a log-linear survival model is presented. Detailed numerical studies are presented to examine the performance of the proposed SIMEX method under varying levels and parametric forms of the error in the outcome. We further illustrate the method with observational data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic.
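The abstract's statement about linear models can be illustrated with a small simulation. The sketch below covers only the linear-regression case, contrasting random error in the outcome with random error in the covariate. It does not implement SIMEX or any Cox or Weibull model, and the sample size, seed, and error variances are assumptions chosen for illustration.

```python
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 50_000
true_slope = 2.0

x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + true_slope * xi + rng.gauss(0, 1) for xi in x]

# Classical (additive, mean-zero, independent) measurement error.
y_err = [yi + rng.gauss(0, 2) for yi in y]   # error-prone outcome
x_err = [xi + rng.gauss(0, 1) for xi in x]   # error-prone covariate

slope_clean = ols_slope(x, y)
slope_outcome_error = ols_slope(x, y_err)     # noisier, but not biased
slope_covariate_error = ols_slope(x_err, y)   # attenuated toward zero

print(f"error-free: {slope_clean:.2f}")
print(f"outcome error: {slope_outcome_error:.2f}")
print(f"covariate error: {slope_covariate_error:.2f}")
```

With these settings the outcome-error slope should stay near the true value of 2, while the covariate-error slope shrinks by the factor var(x) / (var(x) + var(error)) = 0.5. The abstract's point is that this protection for outcome error does not carry over to nonlinear models such as the Cox model.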
Affiliation(s)
- Eric J. Oh
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, U.S.A
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University School of Medicine, Vanderbilt University, Nashville, Tennessee, U.S.A
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, U.S.A
13
Wang LE, Shaw PA, Mathelier HM, Kimmel SE, French B. Evaluating risk-prediction models using data from electronic health records. Ann Appl Stat 2016; 10:286-304. [PMID: 27158296 DOI: 10.1214/15-aoas891] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 12/20/2022]
Abstract
The availability of data from electronic health records facilitates the development and evaluation of risk-prediction models, but estimation of prediction accuracy could be limited by outcome misclassification, which can arise if events are not captured. We evaluate the robustness of prediction accuracy summaries, obtained from receiver operating characteristic curves and risk-reclassification methods, if events are not captured (i.e., "false negatives"). We derive estimators for sensitivity and specificity if misclassification is independent of marker values. In simulation studies, we quantify the potential for bias in prediction accuracy summaries if misclassification depends on marker values. We compare the accuracy of alternative prognostic models for 30-day all-cause hospital readmission among 4548 patients discharged from the University of Pennsylvania Health System with a primary diagnosis of heart failure. Simulation studies indicate that if misclassification depends on marker values, then the estimated accuracy improvement is also biased, but the direction of the bias depends on the direction of the association between markers and the probability of misclassification. In our application, 29% of the 1143 readmitted patients were readmitted to a hospital elsewhere in Pennsylvania, which reduced prediction accuracy. Outcome misclassification can result in erroneous conclusions regarding the accuracy of risk-prediction models.
Affiliation(s)
- L E Wang
- Department of Biostatistics and Epidemiology, University of Pennsylvania, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Pamela A Shaw
- Department of Biostatistics and Epidemiology, University of Pennsylvania, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Hansie M Mathelier
- Department of Medicine, University of Pennsylvania, 51 N 39th Street, Philadelphia, Pennsylvania 19104, USA
- Stephen E Kimmel
- Department of Biostatistics and Epidemiology, University of Pennsylvania, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
- Benjamin French
- Department of Biostatistics and Epidemiology, University of Pennsylvania, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA
14
Shepherd BE, Shaw PA, Dodd LE. Using audit information to adjust parameter estimates for data errors in clinical trials. Clin Trials 2012; 9:721-9. [PMID: 22848072 DOI: 10.1177/1740774512450100] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/15/2022]
Abstract
BACKGROUND Audits are often performed to assess the quality of clinical trial data, but beyond detecting fraud or sloppiness, the audit data are generally ignored. In an earlier study, using data from a nonrandomized study, Shepherd and Yu developed statistical methods to incorporate audit results into study estimates and demonstrated that audit data could be used to eliminate bias. PURPOSE In this article, we examine the usefulness of audit-based error-correction methods in clinical trial settings where a continuous outcome is of primary interest. METHODS We demonstrate the bias of multiple linear regression estimates in general settings with an outcome that may have errors and a set of covariates for which some may have errors and others, including treatment assignment, are recorded correctly for all subjects. We study this bias under different assumptions, including independence between treatment assignment, covariates, and data errors (conceivable in a double-blinded randomized trial) and independence between treatment assignment and covariates but not data errors (possible in an unblinded randomized trial). We review moment-based estimators to incorporate the audit data and propose new multiple imputation estimators. The performance of estimators is studied in simulations. RESULTS When treatment is randomized and unrelated to data errors, estimates of the treatment effect using the original error-prone data (i.e., ignoring the audit results) are unbiased. In this setting, both moment and multiple imputation estimators incorporating audit data are more variable than standard analyses using the original data. In contrast, in settings where treatment is randomized but correlated with data errors and in settings where treatment is not randomized, standard treatment-effect estimates will be biased. And in all settings, parameter estimates for the original, error-prone covariates will be biased. The treatment and covariate effect estimates can be corrected by incorporating audit data using either the multiple imputation or moment-based approaches. Bias, precision, and coverage of confidence intervals improve as the audit size increases. LIMITATIONS The extent of bias and the performance of methods depend on the extent and nature of the error as well as the size of the audit. This study only considers methods for the linear model. Settings much different than those considered here need further study. CONCLUSIONS In randomized trials with continuous outcomes and treatment assignment independent of data errors, standard analyses of treatment effects will be unbiased and are recommended. However, if treatment assignment is correlated with data errors or other covariates, naive analyses may be biased. In these settings, and when covariate effects are of interest, approaches for incorporating audit results should be considered.
Affiliation(s)
- Bryan E Shepherd
- Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN 37232-2158, USA.
15
Duda SN, Shepherd BE, Gadd CS, Masys DR, McGowan CC. Measuring the quality of observational study data in an international HIV research network. PLoS One 2012; 7:e33908. [PMID: 22493676 PMCID: PMC3320898 DOI: 10.1371/journal.pone.0033908] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 08/10/2011] [Accepted: 02/19/2012] [Indexed: 11/29/2022]
Abstract
Observational studies of health conditions and outcomes often combine clinical care data from many sites without explicitly assessing the accuracy and completeness of these data. In order to improve the quality of data in an international multi-site observational cohort of HIV-infected patients, the authors conducted on-site, Good Clinical Practice-based audits of the clinical care datasets submitted by participating HIV clinics. Discrepancies between data submitted for research and data in the clinical records were categorized using the audit codes published by the European Organization for the Research and Treatment of Cancer. Five of seven sites had error rates >10% in key study variables, notably laboratory data, weight measurements, and antiretroviral medications. All sites had significant discrepancies in medication start and stop dates. Clinical care data, particularly antiretroviral regimens and associated dates, are prone to substantial error. Verifying data against source documents through audits will improve the quality of databases and research and can be a technique for retraining staff responsible for clinical data collection. The authors recommend that all participants in observational cohorts use data audits to assess and improve the quality of data and to guide future data collection and abstraction efforts at the point of care.
Affiliation(s)
- Stephany N Duda
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America.