1
|
Cardoso P, McDonald TJ, Patel KA, Pearson ER, Hattersley AT, Shields BM, McKinley TJ. Comparison of Bayesian approaches for developing prediction models in rare disease: application to the identification of patients with Maturity-Onset Diabetes of the Young. BMC Med Res Methodol 2024; 24:128. [PMID: 38834992 PMCID: PMC11149229 DOI: 10.1186/s12874-024-02239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Accepted: 05/06/2024] [Indexed: 06/06/2024]
Abstract
BACKGROUND Clinical prediction models can help identify high-risk patients and facilitate timely interventions. However, developing such models for rare diseases presents challenges due to the scarcity of affected patients for developing and calibrating models. Methods that pool information from multiple sources can help with these challenges. METHODS We compared three approaches for developing clinical prediction models for population screening based on an example of discriminating a rare form of diabetes (Maturity-Onset Diabetes of the Young - MODY) in insulin-treated patients from the more common Type 1 diabetes (T1D). Two datasets were used: a case-control dataset (278 T1D, 177 MODY) and a population-representative dataset (1418 patients, 96 MODY tested with biomarker testing, 7 MODY positive). To build a population-level prediction model, we compared three methods for recalibrating models developed in case-control data. These were prevalence adjustment ("offset"), shrinkage recalibration in the population-level dataset ("recalibration"), and a refitting of the model to the population-level dataset ("re-estimation"). We then developed a Bayesian hierarchical mixture model combining shrinkage recalibration with additional informative biomarker information only available in the population-representative dataset. We developed a method for dealing with missing biomarker and outcome information using prior information from the literature and other data sources to ensure the clinical validity of predictions for certain biomarker combinations. RESULTS The offset, re-estimation, and recalibration methods showed good calibration in the population-representative dataset. The offset and recalibration methods displayed the lowest predictive uncertainty due to borrowing information from the fitted case-control model. 
We demonstrate the potential of a mixture model for incorporating informative biomarkers, which significantly enhanced the model's predictive accuracy, reduced uncertainty, and showed higher stability in all ranges of predictive outcome probabilities. CONCLUSION We have compared several approaches that could be used to develop prediction models for rare diseases. Our findings highlight the recalibration mixture model as the optimal strategy if a population-level dataset is available. This approach offers the flexibility to incorporate additional predictors and informed prior probabilities, contributing to enhanced prediction accuracy for rare diseases. It also allows predictions without these additional tests, providing additional information on whether a patient should undergo further biomarker testing before genetic testing.
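The prevalence adjustment ("offset") idea named in this abstract can be sketched as an intercept shift on the log-odds scale. This is a generic textbook-style correction, not necessarily the authors' implementation; the function names and the 1% target prevalence are hypothetical, and only the case-control counts (177 MODY, 278 T1D) come from the abstract.

```python
import math


def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def offset_adjust(p_case_control, sample_case_fraction, population_prevalence):
    """Shift a probability predicted by a case-control model to a target prevalence.

    The shift is the difference between the population log-odds of disease
    and the log-odds implied by the case-control sampling fraction.
    """
    shift = logit(population_prevalence) - logit(sample_case_fraction)
    return expit(logit(p_case_control) + shift)


# Case fraction in the case-control dataset described in the abstract.
sample_fraction = 177 / (177 + 278)

# Hypothetical population prevalence, chosen only for illustration.
assumed_prevalence = 0.01

adjusted = offset_adjust(0.5, sample_fraction, assumed_prevalence)
print(f"case-control probability 0.50 -> adjusted {adjusted:.4f}")
```

A prediction equal to the sampling fraction maps exactly to the assumed prevalence, which is a quick sanity check on the direction of the shift.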
Affiliation(s)
- Pedro Cardoso
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK
- Timothy J McDonald
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK
- Kashyap A Patel
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK
- Ewan R Pearson
- University of Dundee. Address: Division of Population Health & Genomics, Ninewells Hospital and Medical School, University of Dundee, Dundee, DD1 9SY, UK
- Andrew T Hattersley
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK
- Beverley M Shields
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK
- Trevelyan J McKinley
- University of Exeter Medical School. Address: Clinical and Biomedical Sciences, RILD Building, Royal Devon & Exeter Hospital, Barrack Road, Exeter, EX2 5DW, UK.
|
2
|
Deng D, Chinchilli VM, Feng H, Chen C, Wang M. Robust integration of secondary outcomes information into primary outcome analysis in the presence of missing data. Stat Methods Med Res 2024:9622802241254195. [PMID: 38767214 DOI: 10.1177/09622802241254195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
In clinical and observational studies, secondary outcomes are frequently collected alongside the primary outcome for each subject, yet their potential to improve the analysis efficiency remains underutilized. Moreover, missing data, commonly encountered in practice, can introduce bias to estimates if not appropriately addressed. This article presents an innovative approach that enhances the empirical likelihood-based information borrowing method by integrating missing-data techniques, ensuring robust data integration. We introduce a plug-in inverse probability weighting estimator to handle missingness in the primary analysis, demonstrating its equivalence to the standard joint estimator under mild conditions. To address potential bias from missing secondary outcomes, we propose a uniform mapping strategy, imputing incomplete secondary outcomes into a unified space. Extensive simulations highlight the effectiveness of our method, showing consistent, efficient, and robust estimators under various scenarios involving missing data and/or misspecified secondary models. Finally, we apply our proposal to the Uniform Data Set from the National Alzheimer's Coordinating Center, exemplifying its practical application.
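As a point of reference for the inverse probability weighting mentioned in this abstract, the sketch below shows a plain normalized IPW estimate of a mean. It is not the paper's plug-in empirical-likelihood estimator; the function name and the assumption that observation probabilities are already estimated are illustrative only.

```python
def ipw_mean(values, observed, obs_prob):
    """Normalized inverse-probability-weighted mean.

    values   : outcome per subject (ignored where observed is False)
    observed : True if the outcome was recorded for that subject
    obs_prob : estimated probability that the outcome is observed
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for y, r, p in zip(values, observed, obs_prob):
        if r:
            # Each observed subject stands in for 1/p subjects like them.
            weighted_sum += y / p
            weight_total += 1.0 / p
    return weighted_sum / weight_total


# Toy data: the second observed subject is under-represented (p = 0.25),
# so it receives a larger weight than the first (p = 0.5).
estimate = ipw_mean(
    values=[1.0, None, 3.0, None],
    observed=[True, False, True, False],
    obs_prob=[0.5, 0.5, 0.25, 0.25],
)
print(estimate)
```

With equal observation probabilities the estimate reduces to the ordinary mean of the observed values.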
Affiliation(s)
- Daxuan Deng
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, USA
- Vernon M Chinchilli
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, USA
- Hao Feng
- Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine, Cleveland, OH, USA
- Chixiang Chen
- Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Ming Wang
- Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine, Cleveland, OH, USA
|
3
|
Gu T, Taylor JM, Mukherjee B. A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics 2023; 79:3831-3845. [PMID: 36876883 PMCID: PMC10480346 DOI: 10.1111/biom.13852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 02/24/2023] [Indexed: 03/07/2023]
Abstract
There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population, uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.
Affiliation(s)
- Tian Gu
- Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A
|
4
|
Chhoa H, Chabriat H, Anato AJ, Bamba M, Zittoun F, Chevret S, Biard L. Improvement of an External Predictive Model Based on New Information Using a Synthetic Data Approach: Application to CADASIL. Neurol Genet 2023; 9:e200091. [PMID: 38235365 PMCID: PMC10691224 DOI: 10.1212/nxg.0000000000200091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 06/07/2023] [Indexed: 01/19/2024]
Abstract
Background and Objectives Cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL) is the most frequent hereditary cerebral small vessel disease. It is caused by mutations of the NOTCH3 gene. The disease evolves progressively over decades leading to stroke, disability, cognitive decline, and functional dependency. The course and clinical severity of CADASIL seem heterogeneous. Predictive models are thus needed to improve prognostic evaluation and inform future clinical trials. A predictive model of the 3-year variation in the Mattis Dementia Rating Scale (MDRS), which reflects the global cognitive performance of patients with CADASIL, was previously proposed. This model made predictions based on demographic, clinical, and MRI data. We aimed to improve this existing predictive model by integrating a new potential factor, the location of the genetic mutation in the different epidermal growth factor (EGFr) domains of the NOTCH3 gene, dichotomized into EGFr domains 1 to 6 or 7 to 34. Methods We used a new synthetic data approach to improve the initial predictive model by incorporating additional genetic information. This method combined the predicted outcomes from the previous model and 5 "synthetic" data sets with the observed outcome in a new data set. We then applied a multiple imputation method for missing data on the mutation location. Results The new data set included 367 patients who were followed up for 30 to 42 months. In the multivariable model with synthetic data, patients with NOTCH3 mutations in EGFr domains 7 to 34 had an additional average decrease of -1.4 points (standard error 0.67, p = 0.035) in their MDRS score variation over 3 years compared with patients with mutations located in EGFr domains 1 to 6. Cross-validation results highlighted the improved predictive performance of the enhanced model. Moreover, the model estimation was found to be more robust than fitting a model without synthetic data. 
Discussion The use of synthetic data improved the predictive model of MDRS change over 3 years in CADASIL. The predictive performance and estimation robustness of the predictive model were enhanced using this approach, whether or not genetic information was used. A statistically significant association between the location of the mutation in the NOTCH3 gene and the 3-year MDRS score variation was detected.
Affiliation(s)
- Henri Chhoa
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Hugues Chabriat
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Adelina Joanita Anato
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Mamadou Bamba
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Florent Zittoun
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Sylvie Chevret
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Lucie Biard
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
|
5
|
Han P, Taylor JM, Mukherjee B. Integrating Information from Existing Risk Prediction Models with No Model Details. Can J Stat 2023; 51:355-374. [PMID: 37346757 PMCID: PMC10281716 DOI: 10.1002/cjs.11701] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/14/2020] [Accepted: 12/16/2021] [Indexed: 11/07/2022]
Abstract
Consider the setting where (i) individual-level data are collected to build a regression model for the association between an event of interest and certain covariates, and (ii) some risk calculators predicting the risk of the event using less detailed covariates are available, possibly as algorithmic black boxes with little information available about how they were built. We propose a general empirical-likelihood-based framework to integrate the rich auxiliary information contained in the calculators into fitting the regression model, to make the estimation of regression parameters more efficient. Two methods are developed, one using working models to extract the calculator information and one making a direct use of calculator predictions without working models. Theoretical and numerical investigations show that the calculator information can substantially reduce the variance of regression parameter estimation. As an application, we study the dependence of the risk of high grade prostate cancer on both conventional risk factors and newly identified molecular biomarkers by integrating information from the Prostate Biopsy Collaborative Group (PBCG) risk calculator, which was built based on conventional risk factors alone.
Affiliation(s)
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Jeremy M.G. Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
|
6
|
Fu S, Deng L, Zhang H, Qin J, Yu K. Integrative analysis of individual-level data and high-dimensional summary statistics. Bioinformatics 2023; 39:7085950. [PMID: 36964712 PMCID: PMC10361352 DOI: 10.1093/bioinformatics/btad156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 03/19/2023] [Accepted: 03/22/2023] [Indexed: 04/23/2023]
Abstract
MOTIVATION Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers' marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters. RESULTS We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers. AVAILABILITY AND IMPLEMENTATION R package is available at https://github.com/fushengstat/MetaGIM.
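The pooling step described in this abstract can be illustrated, in its simplest form, by inverse-variance weighting of estimates from separate fits. For a scalar parameter with independent estimates this is the minimizer of a variance-weighted squared distance; the paper's minimum distance procedure may handle correlated estimates and vector parameters, which this sketch does not. The function name is hypothetical and is not part of the MetaGIM package.

```python
def pool_inverse_variance(estimates, variances):
    """Pool independent scalar estimates of the same parameter.

    Minimizes sum_k (estimate_k - theta)^2 / variance_k over theta,
    which gives the inverse-variance weighted average.
    Returns the pooled estimate and its variance.
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Two hypothetical parallel fits of the same coefficient; the second is
# three times noisier, so it contributes a third of the weight.
pooled_estimate, pooled_variance = pool_inverse_variance([1.0, 3.0], [1.0, 3.0])
print(pooled_estimate, pooled_variance)
```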
Affiliation(s)
- Sheng Fu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
- Lu Deng
- School of Statistics and Data Science, Nankai University, Tianjin 300071, China
- Han Zhang
- Information Management Services, Inc, Bethesda, MD 20892, USA
- Jing Qin
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Kai Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
|
7
|
Preference-driven multi-objective GP search for regression models with new dominance principle and performance indicators. Appl Intell 2022. [DOI: 10.1007/s10489-022-03228-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
8
|
Zhai Y, Han P. Data Integration with Oracle Use of External Information from Heterogeneous Populations. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2050248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
Affiliation(s)
- Yuqi Zhai
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
|
9
|
Chen C, Han P, He F. Improving main analysis by borrowing information from auxiliary data. Stat Med 2021; 41:567-579. [PMID: 34796519 DOI: 10.1002/sim.9252] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/15/2020] [Revised: 07/22/2021] [Accepted: 10/21/2021] [Indexed: 12/24/2022]
Abstract
In many clinical and observational studies, auxiliary data from the same subjects, such as repeated measurements or surrogate variables, will be collected in addition to the data of main interest. Not directly related to the main study, these auxiliary data in practice are rarely incorporated into the main analysis, though they may carry extra information that can help improve the estimation in the main analysis. Under the setting where part of or all subjects have auxiliary data available, we propose an effective weighting approach to borrow the auxiliary information by building a working model for the auxiliary data, where improvement of estimation precision over the main analysis is guaranteed regardless of the specification of the working model. An information index is also constructed to assess how well the selected working model works to improve the main analysis. Both theoretical and numerical studies show the excellent and robust performance of the proposed method in comparison to estimation without using the auxiliary data. Finally, we utilize the Atherosclerosis Risk in Communities study for illustration.
Affiliation(s)
- Chixiang Chen
- Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Fan He
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State College of Medicine, Hershey, Pennsylvania, USA
|
10
|
Ghosh D, Sabel MS. A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling. Stat Biosci 2021. [DOI: 10.1007/s12561-021-09325-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
11
|
Zhang H, Deng L, Wheeler W, Qin J, Yu K. Integrative analysis of multiple case-control studies. Biometrics 2021; 78:1080-1091. [PMID: 33768525 DOI: 10.1111/biom.13461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2020] [Revised: 02/23/2021] [Accepted: 03/12/2021] [Indexed: 11/28/2022]
Abstract
It is often challenging to share detailed individual-level data among studies due to various informatics and privacy constraints. However, it is relatively easy to pool together aggregated summary level data, such as the ones required for standard meta-analyses. Focusing on data generated from case-control studies, we present a flexible inference procedure that integrates individual-level data collected from an "internal" study with summary data borrowed from "external" studies. This procedure is built on a retrospective empirical likelihood framework to account for the sampling bias in case-control studies. It can incorporate summary statistics extracted from various working models adopted by multiple independent or overlapping external studies. It also allows for external studies to be conducted in a population that is different from the internal study population. We show both theoretically and numerically its efficiency advantage over several competing alternatives.
Affiliation(s)
- Han Zhang
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Lu Deng
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- William Wheeler
- Information Management Services, Silver Spring, Maryland, USA
- Jing Qin
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
- Kai Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
|
12
|
Liang J, Xue Y, Wang J. Bi-objective memetic GP with dispersion-keeping Pareto evaluation for real-world regression. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.05.136] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
|
13
|
Kundu P, Tang R, Chatterjee N. Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 2019; 106:567-585. [PMID: 31427822 PMCID: PMC6690173 DOI: 10.1093/biomet/asz030] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 10/22/2017] [Indexed: 01/23/2023]
Abstract
Meta-analysis is widely popular for synthesizing information on common parameters of interest across multiple studies because of its logistical convenience and statistical efficiency. We develop a generalized meta-analysis approach to combining information on multivariate regression parameters across multiple studies that have varying levels of covariate information. Using algebraic relationships among regression parameters in different dimensions, we specify a set of moment equations for estimating parameters of a maximal model through information available from sets of parameter estimates for a series of reduced models from the different studies. The specification of the equations requires a reference dataset for estimating the joint distribution of the covariates. We propose to solve these equations using the generalized method of moments approach, with the optimal weighting of the equations taking into account uncertainty associated with estimates of the parameters of the reduced models. We describe extensions of the iterated reweighted least-squares algorithm for fitting generalized linear regression models using the proposed framework. Based on the same moment equations, we also develop a diagnostic test for detecting violations of underlying model assumptions, such as those arising from heterogeneity in the underlying study populations. The proposed methods are illustrated with extensive simulation studies and a real-data example involving the development of a breast cancer risk prediction model using disparate risk factor information from multiple studies.
Affiliation(s)
- Prosenjit Kundu
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland, U.S.A
- Runlong Tang
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland, U.S.A
- Nilanjan Chatterjee
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland, U.S.A
|
14
|
Gu T, Taylor JMG, Cheng W, Mukherjee B. Synthetic data method to incorporate external information into a current study. Can J Stat 2019; 47:580-603. [PMID: 32773922 DOI: 10.1002/cjs.11513] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/21/2022]
Abstract
We consider the situation where there is a known regression model that can be used to predict an outcome, Y, from a set of predictor variables X. A new variable B is expected to enhance the prediction of Y. A dataset of size n containing Y, X and B is available, and the challenge is to build an improved model for Y|X,B that uses both the available individual-level data and some summary information obtained from the known model for Y|X. We propose a synthetic data approach, which consists of creating m additional synthetic data observations, and then analyzing the combined dataset of size n+m to estimate the parameters of the Y|X,B model. This combined dataset of size n+m now has missing values of B for m of the observations, and is analyzed using methods that can handle missing data (e.g. multiple imputation). We present simulation studies and illustrate the method using data from the Prostate Cancer Prevention Trial. Though the synthetic data method is applicable to a general regression context, to provide some justification, we show in two special cases that the asymptotic variances of the parameter estimates in the Y|X,B model are identical to those from an alternative constrained maximum likelihood estimation approach. This correspondence in special cases and the method's broad applicability make it appealing for use across diverse scenarios.
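The data-construction step in this abstract (n observed rows plus m synthetic rows with B missing) can be sketched as follows. The sketch assumes a linear known model with Gaussian noise and draws synthetic X values by resampling the observed ones; those choices, the coefficients, and all names are hypothetical, and the subsequent missing-data analysis (e.g. multiple imputation) is not shown.

```python
import random


def make_synthetic_rows(observed_rows, known_model, noise_sd, m, seed=0):
    """Create m synthetic rows whose outcome comes from the known Y|X model.

    X is resampled from the observed rows, Y is drawn from the known model
    plus Gaussian noise, and B is left missing (None) because the known
    model carries no information about it.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(m):
        x = rng.choice(observed_rows)["x"]
        y = known_model(x) + rng.gauss(0.0, noise_sd)
        synthetic.append({"x": x, "b": None, "y": y})
    return synthetic


def known_model(x):
    """Hypothetical known model for E[Y|X]; coefficients are illustrative."""
    return 0.5 + 2.0 * x


# Toy observed dataset of size n = 3 containing Y, X and B.
observed = [
    {"x": 0.0, "b": 1.0, "y": 0.4},
    {"x": 1.0, "b": 0.0, "y": 2.7},
    {"x": 2.0, "b": 1.0, "y": 4.9},
]

# Combined dataset of size n + m, with B missing for the m synthetic rows.
combined = observed + make_synthetic_rows(observed, known_model, noise_sd=1.0, m=6)
print(len(combined), sum(row["b"] is None for row in combined))
```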
Affiliation(s)
- Tian Gu
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
- Wenting Cheng
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
|