1
Han P, Li H, Park SK, Mukherjee B, Taylor JMG. Improving prediction of linear regression models by integrating external information from heterogeneous populations: James-Stein estimators. Biometrics 2024; 80:ujae072. [PMID: 39101548 PMCID: PMC11299067 DOI: 10.1093/biomtc/ujae072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 06/19/2024] [Accepted: 07/18/2024] [Indexed: 08/06/2024]
Abstract
We consider the setting where (1) an internal study builds a linear regression model for prediction based on individual-level data, (2) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (3) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that are no worse and are oftentimes better in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
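The shrinkage idea named in this abstract can be illustrated with the textbook positive-part James-Stein rule. The sketch below is not the estimator proposed in the paper (which works with reduced-model coefficients and prediction mean squared error); it assumes a full-length external target vector and a known covariance matrix for the internal estimate. The function name `positive_part_js` and its arguments are hypothetical.

```python
import numpy as np

def positive_part_js(beta_hat, target, cov):
    """Shrink an internal estimate toward an external target.

    Textbook positive-part James-Stein rule (assumption: beta_hat is
    approximately normal with known covariance `cov`, dimension p >= 3).
    Returns the shrunken estimate and the weight kept on beta_hat.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    target = np.asarray(target, dtype=float)
    diff = beta_hat - target
    p = diff.size
    # Squared Mahalanobis distance between internal estimate and target.
    d2 = float(diff @ np.linalg.solve(cov, diff))
    if d2 == 0.0:
        # Estimate and target coincide; nothing to shrink.
        return target.copy(), 0.0
    # Weight near 1 when the target looks incompatible (large d2),
    # weight 0 when the target is within noise of the estimate.
    w = max(0.0, 1.0 - (p - 2) / d2)
    return target + w * diff, w
```

For example, with an identity covariance, p = 4, and an internal estimate that differs from the target by 10 in every coordinate, the squared distance is 400 and the retained weight is 1 - 2/400 = 0.995, so a badly mismatched external target is almost ignored.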
Affiliation(s)
- Peisong Han
- Biostatistics Innovation Group, Gilead Sciences, 333 Lakeside Drive, Foster City, CA 94404, United States
- Haoyue Li
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Sung Kyun Park
- Department of Epidemiology, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
2
Choi K, Taylor JMG, Han P. Robust data integration from multiple external sources for generalized linear models with binary outcomes. Biometrics 2024; 80:ujad005. [PMID: 38364808 PMCID: PMC10873565 DOI: 10.1093/biomtc/ujad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 08/02/2023] [Accepted: 10/12/2023] [Indexed: 02/18/2024]
Abstract
We aim to estimate parameters in a generalized linear model (GLM) for a binary outcome when, in addition to the raw data from the internal study, more than 1 external study provides summary information in the form of parameter estimates from fitting GLMs with varying subsets of the internal study covariates. We propose an adaptive penalization method that exploits the external summary information and gains efficiency for estimation, and that is both robust and computationally efficient. The robust property comes from exploiting the relationship between parameters of a GLM and parameters of a GLM with omitted covariates and from downweighting external summary information that is less compatible with the internal data through a penalization. The computational burden associated with searching for the optimal tuning parameter for the penalization is reduced by using adaptive weights and by using an information criterion when searching for the optimal tuning parameter. Simulation studies show that the proposed estimator is robust against various types of population distribution heterogeneity and also gains efficiency compared to direct maximum likelihood estimation. The method is applied to improve a logistic regression model that predicts high-grade prostate cancer making use of parameter estimates from 2 external models.
Affiliation(s)
- Kyuseong Choi
- Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853, United States
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
3
Gu T, Taylor JM, Mukherjee B. A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics 2023; 79:3831-3845. [PMID: 36876883 PMCID: PMC10480346 DOI: 10.1111/biom.13852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 02/24/2023] [Indexed: 03/07/2023]
Abstract
There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.
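The three steps named in this abstract (synthetic outcomes, stacked imputation, weighted regression) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: a single external linear model of Y on X, one internally measured extra covariate B, synthetic covariates obtained by resampling internal rows, normal working models for imputation, and a weight of 1/n_imp on each imputed copy. All function and argument names are hypothetical.

```python
import numpy as np

def weighted_ls(Z, y, w):
    """Weighted least squares: solve (Z' W Z) beta = Z' W y."""
    Zw = Z * w[:, None]
    return np.linalg.solve(Zw.T @ Z, Zw.T @ y)

def synthetic_fit(X, B, y, ext_coef, m=200, n_imp=5, seed=0):
    """Fit y ~ 1 + X + B using internal data plus synthetic external data.

    X: (n, p) covariates used by both models; B: (n,) covariate measured
    only internally; ext_coef: (p + 1,) external coefficients for y ~ 1 + X.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    ones = np.ones(n)
    # Working imputation model for B, fitted on internal data: B ~ 1 + X + y.
    Zb = np.column_stack([ones, X, y])
    gam = np.linalg.lstsq(Zb, B, rcond=None)[0]
    sd_b = np.std(B - Zb @ gam)
    # Residual scale for synthetic outcomes, borrowed from internal y ~ 1 + X.
    Zx = np.column_stack([ones, X])
    sd_y = np.std(y - Zx @ np.linalg.lstsq(Zx, y, rcond=None)[0])
    # Start the stacked dataset with the internal rows at weight 1.
    rows_Z = [np.column_stack([ones, X, B])]
    rows_y = [y]
    rows_w = [np.ones(n)]
    for _ in range(n_imp):
        # Synthetic covariates (assumption: resampled from internal X).
        Xs = X[rng.integers(0, n, size=m)]
        Zs = np.column_stack([np.ones(m), Xs])
        # Synthetic outcomes generated from the external reduced model.
        ys = Zs @ ext_coef + rng.normal(0.0, sd_y, m)
        # Impute the covariate that the external model never saw.
        Bs = np.column_stack([Zs, ys]) @ gam + rng.normal(0.0, sd_b, m)
        rows_Z.append(np.column_stack([Zs, Bs]))
        rows_y.append(ys)
        rows_w.append(np.full(m, 1.0 / n_imp))
    # Weighted regression on the stacked internal + imputed synthetic rows.
    return weighted_ls(np.vstack(rows_Z), np.concatenate(rows_y), np.concatenate(rows_w))
```

The synthetic sample size `m` controls how strongly the external model pulls on the final fit; choosing it, and allowing external covariate effects to differ from internal ones, is where the published method goes beyond this sketch.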
Affiliation(s)
- Tian Gu
- Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A
4
Chhoa H, Chabriat H, Anato AJ, Bamba M, Zittoun F, Chevret S, Biard L. Improvement of an External Predictive Model Based on New Information Using a Synthetic Data Approach: Application to CADASIL. Neurol Genet 2023; 9:e200091. [PMID: 38235365 PMCID: PMC10691224 DOI: 10.1212/nxg.0000000000200091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 06/07/2023] [Indexed: 01/19/2024]
Abstract
Background and Objectives: Cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL) is the most frequent hereditary cerebral small vessel disease. It is caused by mutations of the NOTCH3 gene. The disease evolves progressively over decades, leading to stroke, disability, cognitive decline, and functional dependency. The course and clinical severity of CADASIL seem heterogeneous. Predictive models are thus needed to improve prognostic evaluation and inform future clinical trials. A predictive model of the 3-year variation in the Mattis Dementia Rating Scale (MDRS), which reflects the global cognitive performance of patients with CADASIL, was previously proposed. This model made predictions based on demographic, clinical, and MRI data. We aimed to improve this existing predictive model by integrating a new potential factor, the location of the genetic mutation in the different epidermal growth factor (EGFr) domains of the NOTCH3 gene, dichotomized into EGFr domains 1 to 6 or 7 to 34.
Methods: We used a new synthetic data approach to improve the initial predictive model by incorporating additional genetic information. This method combined the predicted outcomes from the previous model and 5 "synthetic" data sets with the observed outcome in a new data set. We then applied a multiple imputation method for missing data on the mutation location.
Results: The new data set included 367 patients who were followed up for 30 to 42 months. In the multivariable model with synthetic data, patients with NOTCH3 mutations in EGFr domains 7 to 34 had an additional average change of -1.4 points (standard error 0.67, p = 0.035) in their MDRS score variation over 3 years compared with patients with mutations located in EGFr domains 1 to 6. Cross-validation results highlighted the improved predictive performance of the enhanced model. Moreover, the model estimation was found to be more robust than fitting a model without synthetic data.
Discussion: The use of synthetic data improved the predictive model of MDRS change over 3 years in CADASIL. The predictive performance and estimation robustness of the predictive model were enhanced using this approach, whether or not genetic information was used. A statistically significant association between the location of the mutation in the NOTCH3 gene and the 3-year MDRS score variation was detected.
Affiliation(s)
- Henri Chhoa
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Hugues Chabriat
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Adelina Joanita Anato
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Mamadou Bamba
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Florent Zittoun
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Sylvie Chevret
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
- Lucie Biard
- From the ECSTRRA Team (H. Chhoa, S.C., L.B.), Université Paris-Cité, UMR1153, INSERM; Translational Neurovascular Centre (H. Chabriat), GH Saint-Louis-Lariboisière, Assistance Publique des Hôpitaux de Paris APHP, Université Paris-Cité and DHU NeuroVasc Sorbonne Paris-Cité; UMR 1161 (H. Chabriat), INSERM; and ENSAI (A.J.A., M.B., F.Z.), Ecole d'ingénieur statistique, data science et big data, Bruz, France
5
Han P, Taylor JM, Mukherjee B. Integrating Information from Existing Risk Prediction Models with No Model Details. Can J Stat 2023; 51:355-374. [PMID: 37346757 PMCID: PMC10281716 DOI: 10.1002/cjs.11701] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/14/2020] [Accepted: 12/16/2021] [Indexed: 11/07/2022]
Abstract
Consider the setting where (i) individual-level data are collected to build a regression model for the association between an event of interest and certain covariates, and (ii) some risk calculators predicting the risk of the event using less detailed covariates are available, possibly as algorithmic black boxes with little information available about how they were built. We propose a general empirical-likelihood-based framework to integrate the rich auxiliary information contained in the calculators into fitting the regression model, to make the estimation of regression parameters more efficient. Two methods are developed, one using working models to extract the calculator information and one making a direct use of calculator predictions without working models. Theoretical and numerical investigations show that the calculator information can substantially reduce the variance of regression parameter estimation. As an application, we study the dependence of the risk of high grade prostate cancer on both conventional risk factors and newly identified molecular biomarkers by integrating information from the Prostate Biopsy Collaborative Group (PBCG) risk calculator, which was built based on conventional risk factors alone.
Affiliation(s)
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Jeremy M.G. Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
6
Taylor JMG, Choi K, Han P. Data integration: exploiting ratios of parameter estimates from a reduced external model. Biometrika 2023; 110:119-134. [PMID: 36798840 PMCID: PMC9919493 DOI: 10.1093/biomet/asac022] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 03/22/2021] [Indexed: 11/12/2022]
Abstract
We consider the situation of estimating the parameters in a generalized linear prediction model, from an internal dataset, where the outcome variable [Formula: see text] is binary and there are two sets of covariates, [Formula: see text] and [Formula: see text]. We have information from an external study that provides parameter estimates for a generalized linear model of [Formula: see text] on [Formula: see text]. We propose a method that makes limited assumptions about the similarity of the distributions in the two study populations. The method involves orthogonalizing the [Formula: see text] variables and then borrowing information about the ratio of the coefficients from the external model. The method is justified based on a new result relating the parameters in a generalized linear model to the parameters in a generalized linear model with omitted covariates. The method is applicable if the regression coefficients in the [Formula: see text] given [Formula: see text] model are similar in the two populations, up to an unknown scalar constant. This type of transportability between populations is something that can be checked from the available data. The asymptotic variance of the proposed method is derived. The method is evaluated in a simulation study and shown to gain efficiency compared to simple analysis of the internal dataset, and is robust compared to an alternative method of incorporating external information.
Affiliation(s)
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A
- Kyuseong Choi
- Department of Statistics and Data Science, Cornell University, 1198 Comstock Hall, 129 Garden Ave., Ithaca, New York 14853, U.S.A
- Peisong Han
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A
7
Gu T, Lee PH, Duan R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J Biomed Inform 2023; 137:104243. [PMID: 36403757 PMCID: PMC9868117 DOI: 10.1016/j.jbi.2022.104243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 09/20/2022] [Accepted: 11/06/2022] [Indexed: 11/19/2022]
Abstract
OBJECTIVES: We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites.
METHODS: We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data are not shareable in some sites, COMMUTE requests only the locally trained models from these sites and uses them to generate heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity.
RESULTS: Simulation studies show that COMMUTE outperforms methods without adjusting for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) around 0.80, which outperforms other benchmark methods with AUC ranging from 0.51 to 0.70.
CONCLUSION: COMMUTE improves the risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations are highly different from the target. In a federated setting, it is highly communication efficient as it only requires each site to share model parameter estimates once, and no iterative communication or higher-order terms are needed.
Affiliation(s)
- Tian Gu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Phil H Lee
- Department of Psychiatry, Harvard Medical School, Boston, MA, United States; Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
8
Zhai Y, Han P. Data Integration with Oracle Use of External Information from Heterogeneous Populations. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2050248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
Affiliation(s)
- Yuqi Zhai
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
9
Chen C, Han P, He F. Improving main analysis by borrowing information from auxiliary data. Stat Med 2021; 41:567-579. [PMID: 34796519 DOI: 10.1002/sim.9252] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/15/2020] [Revised: 07/22/2021] [Accepted: 10/21/2021] [Indexed: 12/24/2022]
Abstract
In many clinical and observational studies, auxiliary data from the same subjects, such as repeated measurements or surrogate variables, will be collected in addition to the data of main interest. Not directly related to the main study, these auxiliary data in practice are rarely incorporated into the main analysis, though they may carry extra information that can help improve the estimation in the main analysis. Under the setting where part of or all subjects have auxiliary data available, we propose an effective weighting approach to borrow the auxiliary information by building a working model for the auxiliary data, where improvement of estimation precision over the main analysis is guaranteed regardless of the specification of the working model. An information index is also constructed to assess how well the selected working model works to improve the main analysis. Both theoretical and numerical studies show the excellent and robust performance of the proposed method in comparison to estimation without using the auxiliary data. Finally, we utilize the Atherosclerosis Risk in Communities study for illustration.
Affiliation(s)
- Chixiang Chen
- Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Fan He
- Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Penn State College of Medicine, Hershey, Pennsylvania, USA
10
Gu T, Taylor JMG, Cheng W, Mukherjee B. Authors' corrigenda/corrections des auteurs on synthetic data method to incorporate external information into a current study. Can J Stat 2021. [DOI: 10.1002/cjs.11657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Affiliation(s)
- Tian Gu
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
- Wenting Cheng
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48105, U.S.A
11
Gu T, Taylor JMG, Mukherjee B. A meta-inference framework to integrate multiple external models into a current study. Biostatistics 2021; 24:406-424. [PMID: 34269371 PMCID: PMC10102901 DOI: 10.1093/biostatistics/kxab017] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/19/2020] [Revised: 04/04/2021] [Accepted: 04/16/2021] [Indexed: 11/14/2022]
Abstract
It is becoming increasingly common for researchers to consider incorporating external information from large studies to improve the accuracy of statistical inference instead of relying on a modestly sized data set collected internally. With some new predictors only available internally, we aim to build improved regression models based on individual-level data from an "internal" study while incorporating summary-level information from "external" models. We propose a meta-analysis framework along with two weighted estimators as the composite of empirical Bayes estimators, which combines the estimates from different external models. The proposed framework is flexible and robust in the ways that (i) it is capable of incorporating external models that use a slightly different set of covariates; (ii) it is able to identify the most relevant external information and diminish the influence of information that is less compatible with the internal data; and (iii) it nicely balances the bias-variance trade-off while preserving the most efficiency gain. The proposed estimators are more efficient than the naïve analysis of the internal data and other naïve combinations of external estimators.
Affiliation(s)
- Tian Gu
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA