1
Bassler JR, Cagle I, Crear D, Kay ES, Long DM, Mugavero MJ, Nassel AF, Ostrenga L, Parman M, Preg S, Wang X, Batey DS, Rana A, Levitan EB. Development and implementation of a distributed data network between an academic institution and state health departments to investigate variation in time to HIV viral suppression in the Deep South. BMC Public Health 2023; 23:937. [PMID: 37226199 DOI: 10.1186/s12889-023-15924-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 05/18/2023] [Indexed: 05/26/2023]
Abstract
BACKGROUND Achieving early and sustained viral suppression (VS) following diagnosis of HIV infection is critical to improving outcomes for persons with HIV (PWH). The Deep South of the United States (US) is a region that is disproportionately impacted by the domestic HIV epidemic. Time to VS, defined as time from diagnosis to initial VS, is substantially longer in the South than other regions of the US. We describe the development and implementation of a distributed data network between an academic institution and state health departments to investigate variation in time to VS in the Deep South. METHODS Representatives of state health departments, the Centers for Disease Control and Prevention (CDC), and the academic partner met to establish core objectives and procedures at the beginning of the project. Importantly, this project used the CDC-developed Enhanced HIV/AIDS Reporting System (eHARS) through a distributed data network model that maintained the confidentiality and integrity of the data. Software programs to build datasets and calculate time to VS were written by the academic partner and shared with each public health partner. To develop spatial elements of the eHARS data, health departments geocoded residential addresses of each newly diagnosed individual in eHARS between 2012 and 2019, supported by the academic partner. Health departments conducted all analyses within their own systems. Aggregate results were combined across states using meta-analysis techniques. Additionally, we created a synthetic eHARS data set for code development and testing. RESULTS The collaborative structure and distributed data network have allowed us to refine the study questions and analytic plans to conduct investigations into variation in time to VS for both research and public health practice. Additionally, a synthetic eHARS data set has been created and is publicly available for researchers and public health practitioners.
CONCLUSIONS These efforts have leveraged the practice expertise and surveillance data within state health departments and the analytic and methodologic expertise of the academic partner. This study could serve as an illustrative example of effective collaboration between academic institutions and public health agencies and provides resources to facilitate future use of the US HIV surveillance system for research and public health practice.
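The abstract states that aggregate results were combined across states using meta-analysis techniques but does not name the estimator. As a minimal sketch, assuming each health department shares only a point estimate and its standard error, and assuming a fixed-effect inverse-variance estimator (one common choice, not necessarily the authors'), the pooling step could look like the following. The function name and the numbers are hypothetical.

```python
import math

def pool_fixed_effect(site_results):
    """Inverse-variance (fixed-effect) pooling of site-level estimates.

    site_results: list of (estimate, standard_error) pairs, one per site.
    Returns (pooled_estimate, pooled_standard_error).
    """
    # Each site is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for _, se in site_results]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, site_results)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical log hazard ratios and standard errors from two states.
pooled, se = pool_fixed_effect([(0.2, 0.1), (0.4, 0.2)])
print(round(pooled, 3), round(se, 4))
```

Only the two numbers per site cross institutional boundaries here, which is the property a distributed data network relies on; a random-effects estimator would be a natural alternative when between-state heterogeneity is expected.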
Affiliation(s)
- John R Bassler: Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Izza Cagle: Office of HIV Prevention and Care, Alabama Department of Public Health, Montgomery, AL, USA
- Danita Crear: Vaccine-Preventable Diseases and Immunization Program, Tennessee Department of Health, Union City, TN, USA
- Emma S Kay: Magic City Research Institute, Birmingham AIDS Outreach, Birmingham, AL, USA
- Dustin M Long: Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Michael J Mugavero: Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Ariann F Nassel: University of Alabama at Birmingham, Lister Hill Center for Health Policy, Birmingham, AL, USA
- Mariel Parman: Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Summer Preg: Office of HIV Prevention and Care, Alabama Department of Public Health, Montgomery, AL, USA
- Xueyuan Wang: STD/HIV Office, Mississippi State Department of Health, Jackson, MS, USA
- D Scott Batey: School of Social Work, Tulane University, New Orleans, LA, USA
- Aadia Rana: Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Emily B Levitan: Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA
2
Edmondson MJ, Luo C, Nazmul Islam M, Sheils NE, Buresh J, Chen Z, Bian J, Chen Y. Distributed Quasi-Poisson regression algorithm for modeling multi-site count outcomes in distributed data networks. J Biomed Inform 2022; 131:104097. [PMID: 35643272 DOI: 10.1016/j.jbi.2022.104097] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/28/2021] [Revised: 04/20/2022] [Accepted: 05/20/2022] [Indexed: 10/18/2022]
Abstract
BACKGROUND Observational studies incorporating real-world data from multiple institutions facilitate study of rare outcomes or exposures and improve generalizability of results. Due to privacy concerns surrounding patient-level data sharing across institutions, methods for performing regression analyses distributively are desirable. Meta-analysis of institution-specific estimates is commonly used, but has been shown to produce biased estimates in certain settings. While distributed regression methods are increasingly available, methods for analyzing count outcomes are currently limited. Count data in practice are commonly subject to overdispersion, exhibiting greater variability than expected under a given statistical model. OBJECTIVE We propose a novel computational method, a one-shot distributed algorithm for quasi-Poisson regression (ODAP), to distributively model count outcomes while accounting for overdispersion. METHODS ODAP incorporates a surrogate likelihood approach to perform distributed quasi-Poisson regression without requiring patient-level data sharing, only requiring sharing of aggregate data from each participating institution. ODAP requires at most three rounds of non-iterative communication among institutions to generate coefficient estimates and corresponding standard errors. In simulations, we evaluate ODAP under several data scenarios possible in multi-site analyses, comparing ODAP and meta-analysis estimates in terms of error relative to pooled regression estimates, considered the gold standard. In a proof-of-concept real-world data analysis, we similarly compare ODAP and meta-analysis in terms of relative error to pooled estimation using data from the OneFlorida Clinical Research Consortium, modeling length of stay in COVID-19 patients as a function of various patient characteristics.
In a second proof-of-concept analysis, using the same outcome and covariates, we incorporate data from the UnitedHealth Group Clinical Discovery Database together with the OneFlorida data in a distributed analysis to compare estimates produced by ODAP and meta-analysis. RESULTS In simulations, ODAP exhibited negligible error relative to pooled regression estimates across all settings explored. Meta-analysis estimates, while largely unbiased, were increasingly variable as heterogeneity in the outcome increased across institutions. When baseline expected count was 0.2, relative error for meta-analysis was above 5% in 25% of iterations (250/1000), while the largest relative error for ODAP in any iteration was 3.59%. In our proof-of-concept analysis using only OneFlorida data, ODAP estimates were closer to pooled regression estimates than those produced by meta-analysis for all 15 covariates. In our distributed analysis incorporating data from both OneFlorida and the UnitedHealth Group Clinical Discovery Database, ODAP and meta-analysis estimates were largely similar, while some differences in estimates (as large as 13.8%) could be indicative of bias in meta-analytic estimates. CONCLUSIONS ODAP performs privacy-preserving, communication-efficient distributed quasi-Poisson regression to analyze count outcomes using data stored within multiple institutions. Our method produces estimates nearly matching pooled regression estimates and sometimes more accurate than meta-analysis estimates, most notably in settings with relatively low counts and high outcome heterogeneity across institutions.
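To make the idea of fitting an overdispersed count model from aggregate data concrete, the sketch below fits an intercept-only quasi-Poisson model from per-site summaries alone. It is not ODAP, which uses a surrogate likelihood and handles covariates; it only illustrates that, in this simplest case, the pooled mean and a Pearson-based dispersion estimate can be recovered without sharing patient-level rows. All function names and counts are hypothetical.

```python
import math

def site_summary(counts):
    """Aggregate statistics a site would share: n, sum(y), sum(y^2)."""
    return len(counts), sum(counts), sum(y * y for y in counts)

def pooled_quasi_poisson(summaries):
    """Intercept-only quasi-Poisson fit from site aggregates.

    Returns (mean, dispersion, se_log_mean), where dispersion is the
    Pearson chi-square divided by the residual degrees of freedom (n - 1).
    """
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mu = total / n
    # sum((y - mu)^2) equals sum(y^2) - n * mu^2 when mu is the sample mean.
    pearson = (total_sq - n * mu * mu) / mu
    phi = pearson / (n - 1)
    # Poisson variance of the log-mean is 1 / (n * mu); quasi-Poisson scales it by phi.
    se_log_mu = math.sqrt(phi / (n * mu))
    return mu, phi, se_log_mu

site_a = [0, 1, 2, 5]  # hypothetical length-of-stay counts at one institution
site_b = [3, 0, 1]     # hypothetical counts at a second institution
mu, phi, se = pooled_quasi_poisson([site_summary(site_a), site_summary(site_b)])
print(mu, phi, se)
```

A dispersion estimate above 1 signals overdispersion relative to the Poisson model, which is the situation the quasi-Poisson standard errors are meant to correct for.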
Affiliation(s)
- Mackenzie J Edmondson: Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Chongliang Luo: Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- John Buresh: Optum Labs at UnitedHealth Group, Minnetonka, MN, USA
- Zhaoyi Chen: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yong Chen: Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
3
Khalid S, Yang C, Blacketer C, Duarte-Salles T, Fernández-Bertolín S, Kim C, Park RW, Park J, Schuemie MJ, Sena AG, Suchard MA, You SC, Rijnbeek PR, Reps JM. A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data. Comput Methods Programs Biomed 2021; 211:106394. [PMID: 34560604 PMCID: PMC8420135 DOI: 10.1016/j.cmpb.2021.106394] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 03/18/2021] [Accepted: 08/30/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE As a response to the ongoing COVID-19 pandemic, several prediction models in the existing literature were rapidly developed, with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code). METHODS We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?' We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA. RESULTS Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination.
L1-regularized logistic regression models were well calibrated. CONCLUSION Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers from all around the world.
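Discrimination and calibration, the two properties this abstract compares across models, can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not the OHDSI pipeline's own code: the function names are hypothetical, discrimination is measured as the area under the ROC curve (AUC), and calibration is summarized only by the observed-to-expected event ratio.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def observed_expected_ratio(labels, scores):
    """Calibration-in-the-large: observed events divided by the sum of predicted risks."""
    return sum(labels) / sum(scores)

# Hypothetical outcomes (1 = death within 30 days) and predicted risks.
labels = [0, 0, 1, 1]
risks = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, risks), observed_expected_ratio(labels, risks))
```

An AUC near 1 indicates good ranking of patients by risk, while an observed-to-expected ratio near 1 indicates that predicted risks match event frequency on average; external validation repeats both checks on data from other databases.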
Affiliation(s)
- Sara Khalid: Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK
- Cynthia Yang: Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Clair Blacketer: Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
- Talita Duarte-Salles: Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Sergio Fernández-Bertolín: Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Chungsoo Kim: Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Rae Woong Park: Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Jimyung Park: Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Martijn J Schuemie: Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
- Anthony G Sena: Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
- Marc A Suchard: Departments of Biomathematics, University of California, Los Angeles, USA
- Seng Chan You: Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Republic of Korea
- Peter R Rijnbeek: Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Jenna M Reps: Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
4
Forrest CB, McTigue KM, Hernandez AF, Cohen LW, Cruz H, Haynes K, Kaushal R, Kho AN, Marsolo KA, Nair VP, Platt R, Puro JE, Rothman RL, Shenkman EA, Waitman LR, Williams NA, Carton TW. PCORnet® 2020: current state, accomplishments, and future directions. J Clin Epidemiol 2020; 129:60-67. [PMID: 33002635 PMCID: PMC7521354 DOI: 10.1016/j.jclinepi.2020.09.036] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Received: 04/28/2020] [Revised: 09/01/2020] [Accepted: 09/22/2020] [Indexed: 11/18/2022]
Abstract
OBJECTIVE To describe PCORnet, a clinical research network developed for patient-centered outcomes research on a national scale. STUDY DESIGN AND SETTING Descriptive study of the current state and future directions for PCORnet. We conducted cross-sectional analyses of the health systems and patient populations of the 9 Clinical Research Networks and 2 Health Plan Research Networks that are part of PCORnet. RESULTS Within the Clinical Research Networks, electronic health data are currently collected from 337 hospitals, 169,695 physicians, 3,564 primary care practices, 338 emergency departments, and 1,024 community clinics. Patients can be recruited for prospective studies from any of these clinical sites. The Clinical Research Networks have accumulated data from 80 million patients with at least one visit from 2009 to 2018. The PCORnet Health Plan Research Network population of individuals with a valid enrollment segment from 2009 to 2019 exceeds 60 million individuals, who on average have 2.63 years of follow-up. CONCLUSION PCORnet's infrastructure comprises clinical data from a diverse cohort of patients and has the capacity to rapidly access these patient populations for pragmatic clinical trials, epidemiological research, and patient-centered research on rare diseases.
Affiliation(s)
- Christopher B Forrest: Applied Clinical Research Center, Children's Hospital of Philadelphia, 2716 South St., Suite 11-473, Philadelphia, PA 19146, USA
- Kathleen M McTigue: Department of Medicine, University of Pittsburgh, 230 McKee Place, Suite 600, Pittsburgh, PA 15213, USA
- Adrian F Hernandez: Duke Clinical Research Institute, Duke University School of Medicine, 200 Trent Drive, Durham, NC 27710, USA
- Lauren W Cohen: Duke Clinical Research Institute, Duke University School of Medicine, 200 Trent Drive, Durham, NC 27710, USA
- Henry Cruz: Weill Cornell Medicine and New York-Presbyterian Hospital, 515 E 71st St, New York, NY 10021, USA
- Kevin Haynes: Scientific Affairs, HealthCore Inc., 123 Justison St, Wilmington, DE 19801, USA
- Rainu Kaushal: Weill Cornell Medicine and New York-Presbyterian Hospital, 515 E 71st St, New York, NY 10021, USA
- Abel N Kho: Center for Health Information Partnerships, Feinberg School of Medicine, 625 N. Michigan Ave, Chicago, IL 60611, USA
- Keith A Marsolo: Duke Clinical Research Institute, Duke University School of Medicine, 200 Trent Drive, Durham, NC 27710, USA
- Vinit P Nair: PRACnet, 15 South Main Street, Sharon, MA 02067, USA
- Richard Platt: Harvard Medical School Department of Population Medicine, Harvard Pilgrim Health Care Institute, 401 Park Drive, Boston, MA 02215, USA
- Jon E Puro: OCHIN, Inc., 1881 SW Naito Pkwy, Portland, OR 97201, USA
- Russell L Rothman: Institute for Medicine and Public Health, Vanderbilt University Medical Center, 1161 21st Ave S, Nashville, TN 37232, USA
- Elizabeth A Shenkman: Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, 1600 SW Archer Rd, Gainesville, FL 32610, USA
- Lemuel Russell Waitman: Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, 3901 Rainbow Blvd, Kansas City, KS 66160, USA
- Neely A Williams: Institute for Medicine and Public Health, Vanderbilt University Medical Center, 1161 21st Ave S, Nashville, TN 37232, USA
- Thomas W Carton: Louisiana Public Health Institute, 1515 Poydras St, New Orleans, LA 70112, USA
5
Abstract
BACKGROUND Data confidentiality and shared use of research data are two desirable but sometimes conflicting goals in research with multi-center studies and distributed data. While a single dataset that includes covariate information of all participants would be ideal for straightforward analysis, confidentiality restrictions forbid its creation. Current approaches such as aggregate data sharing, distributed regression, meta-analysis and score-based methods can have important limitations. METHODS We propose a novel application of an existing epidemiologic tool, specimen pooling, to enable confidentiality-preserving analysis of data arising from a matched case-control, multi-center design. Instead of pooling specimens prior to assay, we apply the methodology to virtually pool (aggregate) covariates within nodes. Such virtual pooling retains most of the information used in an analysis with individual data, and since individual participant data are not shared externally, within-node virtual pooling preserves data confidentiality. We show that aggregated covariate levels can be used in a conditional logistic regression model to estimate individual-level odds ratios of interest. RESULTS The parameter estimates from the standard conditional logistic regression are compared to the estimates based on a conditional logistic regression model with aggregated data. The parameter estimates are shown to be similar to those without pooling and to have comparable standard errors and confidence interval coverage. CONCLUSIONS Virtual data pooling can be used to maintain confidentiality of data from a multi-center study and can be particularly useful in research with large-scale distributed data.
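A minimal sketch of within-node virtual pooling follows, under assumptions that go beyond the abstract: 1:1 matched pairs, a single covariate, a fixed pool size, and a pooled conditional likelihood in which each pool contributes exp(beta * S_case) / (exp(beta * S_case) + exp(beta * S_control)), where S is the within-pool covariate sum. Function names and data are hypothetical; the node shares only the pooled differences, never individual values.

```python
import math

def virtual_pool(pairs, pool_size):
    """Within one node, sum covariates over groups of matched pairs.

    pairs: list of (case_x, control_x). Returns one value per complete pool:
    (sum of case covariates) - (sum of control covariates).
    An incomplete trailing group is dropped.
    """
    diffs = []
    for start in range(0, len(pairs) - pool_size + 1, pool_size):
        group = pairs[start:start + pool_size]
        diffs.append(sum(c for c, _ in group) - sum(k for _, k in group))
    return diffs

def score(beta, diffs):
    """Derivative of the conditional log-likelihood sum(log sigmoid(beta * d))."""
    return sum(d * (1.0 - 1.0 / (1.0 + math.exp(-beta * d))) for d in diffs)

def fit_beta(diffs, iterations=50):
    """Newton-Raphson estimate of the log odds ratio from pooled differences only."""
    beta = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-beta * d)) for d in diffs]
        hessian = -sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        beta -= score(beta, diffs) / hessian
    return beta

# Hypothetical matched pairs held by one node: (case value, control value).
pairs = [(2.0, 1.0), (1.5, 2.0), (0.5, 1.5), (1.0, 1.0), (2.5, 2.0), (1.0, 0.0)]
diffs = virtual_pool(pairs, pool_size=2)  # the only values shared externally
beta_hat = fit_beta(diffs)
print(diffs, beta_hat)
```

Because each pooled term reduces to a logistic function of the difference in covariate sums, the coordinating center can estimate the log odds ratio without seeing any individual covariate; larger pools hide more but carry less information per matched set.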
Affiliation(s)
- P. Saha-Chaudhuri: Department of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, QC, Canada
- C.R. Weinberg: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, 111 T.W. Alexander Drive, RTP, Durham, NC, USA