1
Lotspeich SC, Shepherd BE, Kariuki MA, Wools-Kaloustian K, McGowan CC, Musick B, Semeere A, Crabtree Ramírez BE, Mkwashapi DM, Cesar C, Ssemakadde M, Machado DM, Ngeresa A, Ferreira FF, Lwali J, Marcelin A, Cardoso SW, Luque MT, Otero L, Cortés CP, Duda SN. Lessons learned from over a decade of data audits in international observational HIV cohorts in Latin America and East Africa. J Clin Transl Sci 2023; 7:e245. [PMID: 38033704 PMCID: PMC10685260 DOI: 10.1017/cts.2023.659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 10/13/2023] [Accepted: 10/16/2023] [Indexed: 12/02/2023]
Abstract
Introduction: Routine patient care data are increasingly used for biomedical research, but such "secondary use" data have known limitations, including their quality. When leveraging routine care data for observational research, developing audit protocols that can maximize informational return and minimize costs is paramount.
Methods: For more than a decade, the Latin America and East Africa regions of the International epidemiology Databases to Evaluate AIDS (IeDEA) consortium have been auditing the observational data drawn from participating human immunodeficiency virus clinics. Since our earliest audits, where external auditors used paper forms to record audit findings from paper medical records, we have streamlined our protocols to obtain more efficient and informative audits that keep up with advancing technology while reducing travel obligations and associated costs.
Results: We present five key lessons learned from conducting data audits of secondary-use data from resource-limited settings for more than 10 years and share eight recommendations for other consortia looking to implement data quality initiatives.
Conclusion: After completing multiple audit cycles in both the Latin America and East Africa regions of the IeDEA consortium, we have established a rich reference for data quality in our cohorts, as well as large, audited analytical datasets that can be used to answer important clinical questions with confidence. By sharing our audit processes and how they have been adapted over time, we hope that others can develop protocols informed by our lessons learned from more than a decade of experience in these large, diverse cohorts.
Affiliation(s)
- Sarah C. Lotspeich
  - Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kara Wools-Kaloustian
  - Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Catherine C. McGowan
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Beverly Musick
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Aggrey Semeere
  - Infectious Diseases Institute, Makerere University, Kampala, Uganda
- Brenda E. Crabtree Ramírez
  - Department of Infectious Diseases, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Denna M. Mkwashapi
  - Sexual and Reproductive Health Program, National Institute for Medical Research Mwanza, United Republic of Tanzania, Mwanza, Tanzania
- Daisy Maria Machado
  - Departamento de Pediatria, Universidade Federal de São Paulo, São Paulo, Brazil
- Antony Ngeresa
  - Academic Model Providing Access to Health Care (AMPATH), Eldoret, Kenya
- Jerome Lwali
  - Tumbi Hospital HIV Care and Treatment Clinic, United Republic of Tanzania, Kibaha, Tanzania
- Adias Marcelin
  - Le Groupe Haïtien d’Etude du Sarcome de Kaposi et des Infections Opportunistes, Port-au-Prince, Haiti
- Marco Tulio Luque
  - Instituto Hondureño de Seguridad Social and Hospital Escuela Universitario, Tegucigalpa, Honduras
- Larissa Otero
  - Instituto de Medicina Tropical Alexander von Humboldt, Universidad Peruana Cayetano Heredia, Lima, Peru
  - School of Medicine, Universidad Peruana Cayetano Heredia, Lima, Peru
- Stephany N. Duda
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
2
Shepherd BE, Han K, Chen T, Bian A, Pugh S, Duda SN, Lumley T, Heerman WJ, Shaw PA. Multiwave validation sampling for error-prone electronic health records. Biometrics 2023; 79:2649-2663. [PMID: 35775996 PMCID: PMC10525037 DOI: 10.1111/biom.13713] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/01/2022] [Accepted: 06/16/2022] [Indexed: 11/29/2022]
Abstract
Electronic health record (EHR) data are increasingly used for biomedical research, but these data have recognized data quality challenges. Data validation is necessary to use EHR data with confidence, but limited resources typically make complete data validation impossible. Using EHR data, we illustrate prospective, multiwave, two-phase validation sampling to estimate the association between maternal weight gain during pregnancy and the risks of her child developing obesity or asthma. The optimal validation sampling design depends on the unknown efficient influence functions of regression coefficients of interest. In the first wave of our multiwave validation design, we estimate the influence function using the unvalidated (phase 1) data to determine our validation sample; then in subsequent waves, we re-estimate the influence function using validated (phase 2) data and update our sampling. For efficiency, estimation combines obesity and asthma sampling frames while calibrating sampling weights using generalized raking. We validated 996 of 10,335 mother-child EHR dyads in six sampling waves. Estimated associations between childhood obesity/asthma and maternal weight gain, as well as other covariates, are compared to naïve estimates that only use unvalidated data. In some cases, estimates markedly differ, underscoring the importance of efficient validation sampling to obtain accurate estimates incorporating validated data.
Affiliation(s)
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
- Kyunghee Han
  - Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago
- Tong Chen
  - Department of Statistics, University of Auckland
- Aihua Bian
  - Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
- Shannon Pugh
  - Department of Emergency Medicine, Vanderbilt University Medical Center
- Stephany N. Duda
  - Department of Biomedical Informatics, Vanderbilt University Medical Center
- Pamela A. Shaw
  - Biostatistics Unit, Kaiser Permanente Washington Health Research Institute
3
Lotspeich SC, Amorim GGC, Shaw PA, Tao R, Shepherd BE. Optimal multiwave validation of secondary use data with outcome and exposure misclassification. Can J Stat 2023. [DOI: 10.1002/cjs.11772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
4
Rivera-Rodriguez C, Haneuse S, Sauer S. Optimal sampling allocation for outcome-dependent designs in cluster-correlated data settings. Stat Methods Med Res 2022; 31:2400-2414. [PMID: 36039539 PMCID: PMC10897940 DOI: 10.1177/09622802221122423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
Abstract
In clinical and public health studies, it is often the case that some variables relevant to the analysis are too difficult or costly to measure for all individuals in the population of interest. Rather, a subsample of these individuals must be identified for additional data collection. A sampling scheme that incorporates readily-available information for the entire target population at the design stage can increase the statistical efficiency of the intended analysis. While there is no universally optimal sampling design, under certain principles and restrictions, a well-designed and efficient sampling strategy can be implemented. In two-phase designs, efficiency can be gained by stratifying on the outcome and/or auxiliary information that is known at phase I. Additional gains in efficiency can be obtained by determining the optimal allocation of the sample sizes across the strata, which depends on the quantity that is being estimated. In this paper, the inference is concerned with one or multiple regression parameter(s) where the study units are naturally clustered and, thus, exhibit correlation in outcomes. We propose several allocation strategies within the framework of two-phase designs for the estimation of the regression parameter(s) obtained from weighted generalized estimating equations. The proposed methods extend existing theory to address the objective of estimating regression parameters in cluster-correlated data settings by minimizing the asymptotic variance of the estimator subject to a fixed sample size. Through a comprehensive simulation study, we show that the proposed allocation schemes have the potential to yield substantial efficiency gains over alternative strategies.
Affiliation(s)
- Sebastien Haneuse
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sara Sauer
  - Department of Global Health and Social Medicine, Harvard Medical School, Boston, MA, USA
5
Chen T, Lumley T. Optimal sampling for design-based estimators of regression models. Stat Med 2022; 41:1482-1497. [PMID: 34989429 PMCID: PMC8918008 DOI: 10.1002/sim.9300] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/15/2021] [Revised: 12/02/2021] [Accepted: 12/10/2021] [Indexed: 11/05/2022]
Abstract
Two-phase designs measure variables of interest on a subcohort where the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resource availability, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analyses by design-based estimators. Generalized raking estimators are an efficient class of design-based estimators, and they improve on the inverse-probability weighted (IPW) estimator by adjusting weights based on the auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients from generalized raking estimators. We compare it with the optimal design for analysis via the IPW estimator and other two-phase designs in measurement-error settings. We consider general two-phase designs where the outcome variable and variables of interest can be continuous or discrete. Our results show that the optimal designs for analyses by the two classes of design-based estimators can be very different. The optimal design for analysis via the IPW estimator is optimal for IPW estimation and typically gives near-optimal efficiency for generalized raking estimation, though we show there is potential improvement in some settings.
Affiliation(s)
- Tong Chen
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
6
Amorim G, Tao R, Lotspeich S, Shaw PA, Lumley T, Shepherd BE. Two-Phase Sampling Designs for Data Validation in Settings with Covariate Measurement Error and Continuous Outcome. Journal of the Royal Statistical Society. Series A (Statistics in Society) 2021; 184:1368-1389. [PMID: 34975235 PMCID: PMC8715909 DOI: 10.1111/rssa.12689] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/14/2023]
Abstract
Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement error, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error-prone variables, which can be highly correlated to the unknown truth. Applying and extending ideas from the two-phase sampling literature, we propose optimal and nearly-optimal designs for selecting the validation sample in the classical measurement-error framework. We target designs to improve the efficiency of model-based and design-based estimators, and show how the resulting designs compare to each other. Our results suggest that sampling schemes that extract more information from the error-prone data are substantially more efficient than SRS, for both design- and model-based estimators. The optimal procedure, however, depends on the analysis method, and can differ substantially. This is supported by theory and simulations. We illustrate the various designs using data from an HIV cohort study.
Affiliation(s)
- Gustavo Amorim
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Sarah Lotspeich
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, PA, USA
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
7
Greenland S. Invited Commentary: Dealing With the Inevitable Deficiencies of Bias Analysis-and All Analyses. Am J Epidemiol 2021; 190:1617-1621. [PMID: 33778862 DOI: 10.1093/aje/kwab069] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/26/2021] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 12/22/2022]
Abstract
Lash et al. (Am J Epidemiol. 2021;190(8):1604-1612) have presented detailed critiques of 3 bias analyses that they identify as "suboptimal." This identification raises the question of what "optimal" means for bias analysis, because it is practically impossible to do statistically optimal analyses of typical population studies-with or without bias analysis. At best the analysis can only attempt to satisfy practice guidelines and account for available information both within and outside the study. One should not expect a full accounting for all sources of uncertainty; hence, interval estimates and distributions for causal effects should never be treated as valid uncertainty assessments-they are instead only example analyses that follow from collections of often questionable assumptions. These observations reinforce those of Lash et al. and point to the need for more development of methods for judging bias-parameter distributions and utilization of available information.
8
Oh EJ, Shepherd BE, Lumley T, Shaw PA. Improved generalized raking estimators to address dependent covariate and failure-time outcome error. Biom J 2021; 63:1006-1027. [PMID: 33709462 PMCID: PMC8211389 DOI: 10.1002/bimj.202000187] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/13/2020] [Revised: 10/05/2020] [Accepted: 01/05/2021] [Indexed: 11/12/2022]
Abstract
Biomedical studies that use electronic health records (EHR) data for inference are often subject to bias due to measurement error. The measurement error present in EHR data is typically complex, consisting of errors of unknown functional form in covariates and the outcome, which can be dependent. To address the bias resulting from such errors, generalized raking has recently been proposed as a robust method that yields consistent estimates without the need to model the error structure. We provide rationale for why these previously proposed raking estimators can be expected to be inefficient in failure-time outcome settings involving misclassification of the event indicator. We propose raking estimators that utilize multiple imputation, to impute either the target variables or auxiliary variables, to improve the efficiency. We also consider outcome-dependent sampling designs and investigate their impact on the efficiency of the raking estimators, either with or without multiple imputation. We present an extensive numerical study to examine the performance of the proposed estimators across various measurement error settings. We then apply the proposed methods to our motivating setting, in which we seek to analyze HIV outcomes in an observational cohort with EHR data from the Vanderbilt Comprehensive Care Clinic.
Affiliation(s)
- Eric J. Oh
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University, Nashville, TN, USA
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA