1
Han P, Taylor JM, Mukherjee B. Integrating information from existing risk prediction models with no model details. Can J Stat 2023; 51:355-374. [PMID: 37346757] [PMCID: PMC10281716] [DOI: 10.1002/cjs.11701] [Received: 09/14/2020] [Accepted: 12/16/2021]
Abstract
Consider the setting where (i) individual-level data are collected to build a regression model for the association between an event of interest and certain covariates, and (ii) some risk calculators predicting the risk of the event using less detailed covariates are available, possibly as algorithmic black boxes with little information available about how they were built. We propose a general empirical-likelihood-based framework to integrate the rich auxiliary information contained in the calculators into fitting the regression model, to make the estimation of regression parameters more efficient. Two methods are developed, one using working models to extract the calculator information and one making a direct use of calculator predictions without working models. Theoretical and numerical investigations show that the calculator information can substantially reduce the variance of regression parameter estimation. As an application, we study the dependence of the risk of high grade prostate cancer on both conventional risk factors and newly identified molecular biomarkers by integrating information from the Prostate Biopsy Collaborative Group (PBCG) risk calculator, which was built based on conventional risk factors alone.
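A minimal sketch of the empirical-likelihood ingredient behind this kind of framework, in its simplest scalar form: reweight the sample so that an auxiliary moment with mean zero (in the paper's setting, a discrepancy constructed from the calculator's predictions) holds exactly in-sample. This is generic one-constraint EL machinery, not the paper's full constrained-estimation procedure:

```python
def el_weights(g, tol=1e-12):
    """Empirical-likelihood weights for one scalar moment constraint:
    maximize sum(log p_i) subject to sum(p_i * g_i) = 0, sum(p_i) = 1.
    Solution: p_i = 1 / (n * (1 + t * g_i)), where t solves
    f(t) = sum(g_i / (1 + t * g_i)) = 0 (found here by bisection;
    f is strictly decreasing in t)."""
    n = len(g)
    lo_g, hi_g = min(g), max(g)
    assert lo_g < 0.0 < hi_g, "0 must lie inside the convex hull of g"
    # 1 + t*g_i > 0 for all i restricts t to the open interval below
    lo, hi = -1.0 / hi_g, -1.0 / lo_g
    lo += (hi - lo) * 1e-9
    hi -= (hi - lo) * 1e-9

    def f(t):
        return sum(gi / (1.0 + t * gi) for gi in g)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:   # f decreasing: root lies to the right
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + t * gi)) for gi in g]
```

When the sample mean of the moment is already zero, the weights stay uniform; otherwise mass shifts until the constraint holds exactly.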
Affiliation(s)
- Peisong Han
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Jeremy M.G. Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
2
van Nee MM, Wessels LFA, van de Wiel MA. Flexible co-data learning for high-dimensional prediction. Stat Med 2021; 40:5910-5925. [PMID: 34438466] [PMCID: PMC9292202] [DOI: 10.1002/sim.9162] [Received: 10/09/2020] [Revised: 05/18/2021] [Accepted: 07/29/2021]
Abstract
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data are high-dimensional, but additional information, like domain knowledge and previously published studies, may help improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or P-values from external studies. We use multiple and various co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalized linear and Cox models. Available group-adaptive methods primarily target settings with few groups, and therefore likely overfit for non-informative, correlated, or numerous groups, and do not account for known structure at the group level. To handle these issues, our method combines empirical Bayes estimation of the hyperparameters with an extra level of flexible shrinkage. This renders a uniquely flexible framework, as any type of shrinkage can be used on the group level. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalized covariates and posterior variable selection. For three cancer genomics applications we demonstrate improvements compared to other models in terms of performance, variable selection stability and validation.
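The core computation, ridge with one penalty per (here non-overlapping) covariate group, can be sketched in a few lines. This takes the group penalties as given and omits the paper's empirical Bayes estimation, hypershrinkage, and overlapping-group machinery:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def group_ridge(X, y, groups, lam):
    """Multi-group ridge: beta = (X'X + diag(lam[groups[j]]))^{-1} X'y,
    i.e. covariate j is penalized by the penalty of its group groups[j]."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    for j in range(p):
        XtX[j][j] += lam[groups[j]]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(XtX, Xty)
```

With a larger penalty on a group, its coefficients are shrunk harder, which is exactly the lever the co-data-driven hyperparameters control.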
Affiliation(s)
- Mirrelijn M van Nee
- Epidemiology & Data Science
- Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Lodewyk F A Wessels
- Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands; Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands; Intelligent Systems, Delft University of Technology, Delft, The Netherlands
- Mark A van de Wiel
- Epidemiology & Data Science
- Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands; MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
3
Song X, Dobbin KK. Evaluating biomarkers for treatment selection from reproducibility studies. Biostatistics 2020; 23:173-188. [PMID: 32424421] [DOI: 10.1093/biostatistics/kxaa018] [Received: 01/30/2019] [Revised: 03/20/2020] [Accepted: 03/25/2020]
Abstract
We consider evaluating new or more accurately measured predictive biomarkers for treatment selection based on a previous clinical trial involving standard biomarkers. Instead of rerunning the clinical trial with the new biomarkers, we propose a more efficient approach which requires only conducting a reproducibility study in which the new biomarkers and standard biomarkers are both measured on a set of patient samples, or adopting replicated measures of the error-contaminated standard biomarkers in the original study. This approach is easier to conduct and much less expensive than studies that require new samples from patients randomized to the intervention. In addition, it allows clinical performance to be estimated quickly, since there is no need to wait for events to occur as would be the case with prospective validation. The treatment selection is assessed via a working model, but the proposed estimator of the mean restricted lifetime is valid even if the working model is misspecified. The proposed approach is assessed through simulation studies and applied to a cancer study.
Affiliation(s)
- Xiao Song
- Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA 30602, USA
- Kevin K Dobbin
- Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA 30602, USA
4
Abstract
The era of big data has witnessed an increasing availability of multiple data sources for statistical analyses. We consider estimation of causal effects combining big main data with unmeasured confounders and smaller validation data with supplementary information on these confounders. Under the unconfoundedness assumption with completely observed confounders, the smaller validation data allow for constructing consistent estimators for causal effects, but the big main data can only give error-prone estimators in general. However, by leveraging the information in the big main data in a principled way, we can improve estimation efficiency while preserving the consistency of the initial estimators based solely on the validation data. Our framework applies to asymptotically normal estimators, including the commonly used regression imputation, weighting, and matching estimators, and does not require a correct specification of the model relating the unmeasured confounders to the observed variables. We also propose appropriate bootstrap procedures, which make our method straightforward to implement using software routines for existing estimators. Supplementary materials for this article are available online.
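A scalar sketch of the adjustment that drives the efficiency gain: the consistent validation-data estimator is corrected by the difference between the error-prone estimator computed on the validation data and on the main data, a difference that converges to zero. The covariance and variance inputs are assumed available (in the paper's framework they would come from the proposed bootstrap):

```python
def combine(theta_val, theta_ep_val, theta_ep_main, cov, var_diff):
    """Adjust the consistent validation-only estimator theta_val using
    the error-prone estimators theta_ep_val (validation data) and
    theta_ep_main (main data). Because their difference vanishes
    asymptotically, subtracting gamma * diff keeps consistency while
    reducing variance when the two estimators are correlated.
    cov: Cov(theta_val, diff); var_diff: Var(diff)."""
    diff = theta_ep_val - theta_ep_main
    gamma = cov / var_diff  # variance-minimizing coefficient
    return theta_val - gamma * diff
```

The same control-variate shape applies componentwise to vector-valued estimators, with gamma becoming a matrix.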
Affiliation(s)
- Shu Yang
- Department of Statistics, North Carolina State University, Raleigh, NC
- Peng Ding
- Department of Statistics, University of California, Berkeley, CA
6
Boonstra PS, Mukherjee B, Taylor JMG. A small-sample choice of the tuning parameter in ridge regression. Stat Sin 2015; 25:1185-1206. [PMID: 26985140] [DOI: 10.5705/ss.2013.284]
Abstract
We propose new approaches for choosing the shrinkage parameter in ridge regression, a penalized likelihood method for regularizing linear regression coefficients, when the number of observations is small relative to the number of parameters. Existing methods may lead to extreme choices of this parameter, which will either not shrink the coefficients enough or shrink them by too much. Within this "small-n, large-p" context, we suggest a correction to the common generalized cross-validation (GCV) method that preserves the asymptotic optimality of the original GCV. We also introduce the notion of a "hyperpenalty", which shrinks the shrinkage parameter itself, and make a specific recommendation regarding the choice of hyperpenalty that empirically works well in a broad range of scenarios. A simple algorithm jointly estimates the shrinkage parameter and regression coefficients in the hyperpenalized likelihood. In a comprehensive simulation study of small-sample scenarios, our proposed approaches offer superior prediction over nine other existing methods.
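As background for the correction the paper proposes, standard GCV for ridge regression can be sketched in the single-covariate, centered-data case, where the hat-matrix trace has a closed form. This is the uncorrected criterion, not the paper's small-sample correction or hyperpenalty:

```python
def gcv_ridge_1d(x, y, lams):
    """Choose the ridge penalty for one centered covariate by minimizing
    GCV(lam) = (RSS/n) / (1 - df/n)^2, where
    beta(lam) = Sxy / (Sxx + lam)  (shrunken least-squares slope) and
    df(lam)   = Sxx / (Sxx + lam)  (trace of the hat matrix)."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    best = None
    for lam in lams:
        beta = sxy / (sxx + lam)
        rss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
        df = sxx / (sxx + lam)
        gcv = (rss / n) / (1.0 - df / n) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam, beta)
    return best  # (gcv value, chosen lam, fitted beta)
```

When n is small relative to the number of parameters, df/n approaches 1 and the denominator blows up, which is the instability the paper's corrected GCV addresses.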
Affiliation(s)
- Philip S Boonstra
- Department of Biostatistics, University of Michigan, Ann Arbor 48109
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor 48109
- Jeremy M G Taylor
- Department of Biostatistics, University of Michigan, Ann Arbor 48109
7
Wey A, Connett J, Rudser K. Combining parametric, semi-parametric, and non-parametric survival models with stacked survival models. Biostatistics 2015; 16:537-549. [PMID: 25662068] [DOI: 10.1093/biostatistics/kxv001] [Received: 05/18/2014] [Accepted: 01/05/2015]
Abstract
For estimating conditional survival functions, non-parametric estimators can be preferred to parametric and semi-parametric estimators due to relaxed assumptions that enable robust estimation. Yet, even when misspecified, parametric and semi-parametric estimators can possess better operating characteristics in small sample sizes due to smaller variance than non-parametric estimators. Fundamentally, this is a bias-variance trade-off situation in that the sample size is not large enough to take advantage of the low bias of non-parametric estimation. Stacked survival models estimate an optimally weighted combination of models that can span parametric, semi-parametric, and non-parametric models by minimizing prediction error. An extensive simulation study demonstrates that stacked survival models consistently perform well across a wide range of scenarios by adaptively balancing the strengths and weaknesses of individual candidate survival models. In addition, stacked survival models perform as well as or better than the model selected through cross-validation. Finally, stacked survival models are applied to a well-known German breast cancer study.
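The stacking step in its simplest form chooses a convex combination of candidate predictions by minimizing prediction error. This toy version uses squared error on fully observed outcomes with two candidates; actual stacked survival models weight many candidate survival-function estimates under a censoring-adjusted (e.g. Brier-type) loss:

```python
def stack_two(y, pred_a, pred_b, grid=101):
    """Grid-search the weight w in [0, 1] minimizing
    sum((w * a + (1 - w) * b - y)^2): the best convex combination
    of two candidate predictors on held-out outcomes."""
    best_w, best_loss = 0.0, float("inf")
    for k in range(grid):
        w = k / (grid - 1)
        loss = sum((w * a + (1.0 - w) * b - yi) ** 2
                   for yi, a, b in zip(y, pred_a, pred_b))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss
```

The adaptivity described in the abstract comes from exactly this mechanism: when one candidate dominates, its weight approaches 1; when candidates err in different regions, intermediate weights beat any single model.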
Affiliation(s)
- Andrew Wey
- University of Hawaii, Honolulu, HI 96815, USA; University of Minnesota, Minneapolis, MN 55455, USA
- John Connett
- University of Hawaii, Honolulu, HI 96815, USA; University of Minnesota, Minneapolis, MN 55455, USA
- Kyle Rudser
- University of Hawaii, Honolulu, HI 96815, USA; University of Minnesota, Minneapolis, MN 55455, USA
9
Zhan X, Ghosh D. Incorporating auxiliary information for improved prediction using combination of kernel machines. Stat Methodol 2015; 22:47-57. [PMID: 25419198] [DOI: 10.1016/j.stamet.2014.08.001]
Abstract
With evolving genomic technologies, it is possible to get different measures of the same underlying biological phenomenon using different technologies. The goal of this paper is to build a prediction model for an outcome variable Y from covariates X. Besides X, we have surrogate covariates W which are related to X. We want to utilize the information in W to boost the prediction for Y using X. In this paper, we propose a kernel machine-based method that improves the prediction of Y from X by incorporating the auxiliary information W. By combining single kernel machines, we also propose a hybrid kernel machine predictor, which can yield a smaller prediction error than its constituents. The prediction error of our kernel machine predictors is evaluated using simulations. We also apply our method to a lung cancer dataset and an Alzheimer's disease dataset.
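The hybrid predictor can be sketched as kernel ridge regression on a weighted combination of two kernel matrices; the mixing weight w and ridge penalty lam are taken as given here rather than tuned as in the paper:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def hybrid_kernel_ridge(K1, K2, y, w, lam):
    """Kernel ridge on the hybrid kernel K = w*K1 + (1-w)*K2:
    alpha = (K + lam*I)^{-1} y, fitted values = K alpha. K1 could be
    built from covariates X and K2 from surrogates W."""
    n = len(y)
    K = [[w * K1[i][j] + (1.0 - w) * K2[i][j] for j in range(n)]
         for i in range(n)]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve(A, y)
    fitted = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    return alpha, fitted
```

A weighted sum of valid kernels is itself a valid kernel, so the hybrid machine stays within the same estimation framework as its constituents.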
Affiliation(s)
- Xiang Zhan
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Debashis Ghosh
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A.; Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, U.S.A.
10
Environmental risk score as a new tool to examine multi-pollutants in epidemiologic research: an example from the NHANES study using serum lipid levels. PLoS One 2014; 9:e98632. [PMID: 24901996] [PMCID: PMC4047033] [DOI: 10.1371/journal.pone.0098632] [Received: 01/31/2014] [Accepted: 05/05/2014]
Abstract
Objective: A growing body of evidence suggests that environmental pollutants, such as heavy metals, persistent organic pollutants and plasticizers, play an important role in the development of chronic diseases. Most epidemiologic studies have examined environmental pollutants individually, but in real life we are exposed to multi-pollutants and pollution mixtures, not single pollutants. Although multi-pollutant approaches have been recognized recently, challenges exist, such as how to estimate the risk of adverse health responses from multi-pollutants. We propose an “Environmental Risk Score (ERS)” as a new simple tool to examine the risk of exposure to multi-pollutants in epidemiologic research. Methods and Results: We examined 134 environmental pollutants in relation to serum lipids (total cholesterol, high-density lipoprotein cholesterol (HDL), low-density lipoprotein cholesterol (LDL) and triglycerides) using data from the National Health and Nutrition Examination Survey between 1999 and 2006. Using a two-stage approach, stage-1 for discovery (n = 10818) and stage-2 for validation (n = 4615), we identified 13 associated pollutants for total cholesterol, 9 for HDL, 5 for LDL and 27 for triglycerides, with adjustment for sociodemographic factors, body mass index and serum nutrient levels. Using the regression coefficients (weights) from joint analyses of the combined data and exposure concentrations, ERS were computed as a weighted sum of the pollutant levels. We computed ERS for multiple lipid outcomes examined individually (single-phenotype approach) or together (multi-phenotype approach). Although the contributions of ERS to overall risk predictions for lipid outcomes were modest, we found relatively stronger associations between ERS and lipid outcomes than with individual pollutants. The magnitudes of the observed associations for ERS were comparable to or stronger than those for socio-demographic factors or BMI.
Conclusions: This study suggests ERS is a promising tool for characterizing disease risk from multi-pollutant exposures. This new approach supports the need for moving from a single-pollutant to a multi-pollutant framework.
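The ERS itself is just a weighted sum of pollutant concentrations, with weights equal to the regression coefficients from the joint analysis. A minimal sketch with made-up coefficient values for illustration:

```python
def environmental_risk_score(concentrations, weights):
    """ERS_i = sum_j beta_j * x_ij: each subject's pollutant levels
    weighted by the regression coefficients (beta_j) estimated in the
    joint, combined-data model."""
    return [sum(b * x for b, x in zip(weights, person))
            for person in concentrations]

# hypothetical coefficients for three pollutants, two subjects
weights = [0.20, -0.05, 0.10]
subjects = [[1.0, 2.0, 0.5],
            [0.0, 1.0, 3.0]]
scores = environmental_risk_score(subjects, weights)
```

Subjects with higher concentrations of positively weighted pollutants get higher scores, so the single ERS number summarizes the joint multi-pollutant burden for downstream association analyses.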
11
Boonstra PS, Mukherjee B, Taylor JM. Bayesian shrinkage methods for partially observed data with many predictors. Ann Appl Stat 2013; 7:2272-2292. [PMID: 24436727] [DOI: 10.1214/13-aoas668]
Abstract
Motivated by the increasing use of and rapid changes in array technologies, we consider the prediction problem of fitting a linear regression relating a continuous outcome Y to a large number of covariates X, e.g., measurements from current, state-of-the-art technology. For most of the samples, only the outcome Y and surrogate covariates, W, are available. These surrogates may be data from prior studies using older technologies. Owing to the dimension of the problem and the large fraction of missing information, a critical issue is appropriate shrinkage of model parameters for an optimal bias-variance tradeoff. We discuss a variety of fully Bayesian and empirical Bayes algorithms which account for uncertainty in the missing data and adaptively shrink parameter estimates for superior prediction. These methods are evaluated via a comprehensive simulation study. In addition, we apply our methods to a lung cancer dataset, predicting survival time (Y) using qRT-PCR (X) and microarray (W) measurements.