1
Xie P, Ding J, Wang X. Leveraging External Aggregated Information for the Marginal Accelerated Failure Time Model. Stat Med 2024. [PMID: 39379012 DOI: 10.1002/sim.10224] [Received: 05/15/2023] [Revised: 03/29/2024] [Accepted: 09/06/2024] [Indexed: 10/10/2024]
Abstract
It is becoming increasingly common for researchers to consider leveraging information from external sources to enhance the analysis of small-scale studies. While much attention has focused on univariate survival data, correlated survival data are prevalent in epidemiological investigations. In this article, we propose a unified framework to improve the estimation of the marginal accelerated failure time model with correlated survival data by integrating additional information given in the form of covariate effects evaluated in a reduced accelerated failure time model. Such auxiliary information can be summarized by using valid estimating equations and hence can then be combined with the internal linear rank-estimating equations via the generalized method of moments. We investigate the asymptotic properties of the proposed estimator and show that it is more efficient than the conventional estimator using internal data only. When population heterogeneity exists, we revise the proposed estimation procedure and present a shrinkage estimator to protect against bias and loss of efficiency. Moreover, the proposed estimation procedure can be further refined to accommodate the non-negligible uncertainty in the auxiliary information, leading to more trustable inference conclusions. Simulation results demonstrate the finite sample performance of the proposed methods, and empirical application on the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial substantiates its practical relevance.
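The combination step described in this abstract can be written generically as a generalized method of moments (GMM) problem. The display below is only a sketch in assumed notation, since none of these symbols appear in the abstract: $U_n(\beta)$ stands for the internal linear rank-estimating function, $\Psi_n(\beta;\tilde{\theta})$ for the estimating function that encodes the external reduced-model estimate $\tilde{\theta}$, and $W_n$ for a positive-definite weight matrix.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; g_n(\beta)^{\top} W_n\, g_n(\beta),
\qquad
g_n(\beta) \;=\;
\begin{pmatrix}
  U_n(\beta) \\
  \Psi_n(\beta;\tilde{\theta})
\end{pmatrix}
```

Because the stacked vector $g_n$ has more components than $\beta$, the equations generally cannot all be solved exactly, so GMM minimizes a weighted quadratic form instead. The extra, over-identifying equations are what can yield an efficiency gain over using $U_n$ alone.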
Affiliation(s)
- Ping Xie
  - School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
- Jie Ding
  - School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
- Xiaoguang Wang
  - School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
2
Chen Z, Shen Y, Qin J, Ning J. Likelihood adaptively incorporated external aggregate information with uncertainty for survival data. Biometrics 2024; 80:ujae120. [PMID: 39468742 PMCID: PMC11518850 DOI: 10.1093/biomtc/ujae120] [Received: 10/22/2023] [Revised: 09/02/2024] [Accepted: 09/24/2024] [Indexed: 10/30/2024]
Abstract
Population-based cancer registry databases are critical resources to bridge the information gap that results from a lack of sufficient statistical power from primary cohort data with small to moderate sample size. Although comprehensive data associated with tumor biomarkers often remain either unavailable or inconsistently measured in these registry databases, aggregate survival information sourced from these repositories has been well documented and publicly accessible. An appealing option is to integrate the aggregate survival information from the registry data with the primary cohort to enhance the evaluation of treatment impacts or prediction of survival outcomes across distinct tumor subtypes. Nevertheless, for rare types of cancer, even the sample sizes of cancer registries remain modest. The variability linked to the aggregated statistics could be non-negligible compared with the sample variation of the primary cohort. In response, we propose an externally informed likelihood approach, which facilitates the linkage between the primary cohort and external aggregate data, with consideration of the variation from aggregate information. We establish the asymptotic properties of the estimators and evaluate the finite sample performance via simulation studies. Through the application of our proposed method, we integrate data from the cohort of inflammatory breast cancer (IBC) patients at the University of Texas MD Anderson Cancer Center with aggregate survival data from the National Cancer Data Base, enabling us to appraise the effect of tri-modality treatment on survival across various tumor subtypes of IBC.
Affiliation(s)
- Ziqi Chen
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai 200062, China
- Yu Shen
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Jing Qin
  - National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, United States
- Jing Ning
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
3
Ding J, Li J, Wang X. Renewable risk assessment of heterogeneous streaming time-to-event cohorts. Stat Med 2024; 43:3761-3777. [PMID: 38897797 DOI: 10.1002/sim.10146] [Received: 10/17/2023] [Revised: 05/03/2024] [Accepted: 06/06/2024] [Indexed: 06/21/2024]
Abstract
The analysis of streaming time-to-event cohorts has garnered significant research attention. Most existing methods require observed cohorts from a study sequence to be independent and identically sampled from a common model. This assumption may be easily violated in practice. Our methodology operates within the framework of online data updating, where risk estimates for each cohort of interest are continuously refreshed using the latest observations and historical summary statistics. At each streaming stage, we introduce parameters to quantify the potential discrepancy between batch-specific effects from adjacent cohorts. We then employ penalized estimation techniques to identify nonzero discrepancy parameters, allowing us to adaptively adjust risk estimates based on current data and historical trends. We illustrate our proposed method through extensive empirical simulations and a lung cancer data analysis.
Affiliation(s)
- Jie Ding
  - School of Mathematical Sciences, Dalian University of Technology, Liaoning, China
- Jialiang Li
  - Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore
  - Duke University-NUS Graduate Medical School, National University of Singapore, Singapore, Singapore
- Xiaoguang Wang
  - School of Mathematical Sciences, Dalian University of Technology, Liaoning, China
4
Chen C, Han P, Chen S, Shardell M, Qin J. Integrating external summary information in the presence of prior probability shift: an application to assessing essential hypertension. Biometrics 2024; 80:ujae090. [PMID: 39248121 PMCID: PMC11381951 DOI: 10.1093/biomtc/ujae090] [Received: 09/13/2023] [Revised: 07/05/2024] [Accepted: 08/22/2024] [Indexed: 09/10/2024]
Abstract
Recent years have witnessed a rise in the popularity of information integration without sharing of raw data. By leveraging and incorporating summary information from external sources, internal studies can achieve enhanced estimation efficiency and prediction accuracy. However, a noteworthy challenge in utilizing summary-level information is accommodating the inherent heterogeneity across diverse data sources. In this study, we delve into the issue of prior probability shift between two cohorts, wherein the difference between the two data distributions depends on the outcome. We introduce a novel semi-parametric constrained optimization-based approach to integrate information within this framework, which has not been extensively explored in the existing literature. Our proposed method tackles the prior probability shift by introducing an outcome-dependent selection function and effectively addresses the estimation uncertainty associated with summary information from the external source. Our approach facilitates valid inference even in the absence of a known variance-covariance estimate from the external source. Through extensive simulation studies, we observe the superiority of our method over existing ones, showcasing minimal estimation bias and reduced variance for both binary and continuous outcomes. We further demonstrate the utility of our method through its application in investigating risk factors related to essential hypertension, where reduced estimation variability is observed after integrating summary information from an external data source.
Affiliation(s)
- Chixiang Chen
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, MD 21201, United States
  - Department of Neurosurgery, University of Maryland School of Medicine, Baltimore, MD 21201, United States
  - University of Maryland Institute for Health Computing, Bethesda, MD 20852, United States
- Peisong Han
  - Biostatistics Innovation Group, Gilead Sciences, Foster City, CA 94404, United States
- Shuo Chen
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, MD 21201, United States
- Michelle Shardell
  - Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, MD 21201, United States
  - Institute of Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, United States
- Jing Qin
  - Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, United States
5
Han P, Li H, Park SK, Mukherjee B, Taylor JMG. Improving prediction of linear regression models by integrating external information from heterogeneous populations: James-Stein estimators. Biometrics 2024; 80:ujae072. [PMID: 39101548 PMCID: PMC11299067 DOI: 10.1093/biomtc/ujae072] [Received: 08/11/2023] [Revised: 06/19/2024] [Accepted: 07/18/2024] [Indexed: 08/06/2024]
Abstract
We consider the setting where (1) an internal study builds a linear regression model for prediction based on individual-level data, (2) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (3) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that are no worse and are oftentimes better in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
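To make the shrinkage idea concrete, the sketch below implements the classical positive-part James-Stein rule, which pulls an internal coefficient estimate toward a target estimate by an amount that depends on how far apart the two are. It is not the estimator proposed in the paper. The function name, the arguments, and the choice to treat the externally informed target as fixed with a known internal covariance are all assumptions made for illustration.

```python
import numpy as np


def james_stein_shrink(beta_internal, beta_target, cov_internal):
    """Positive-part James-Stein shrinkage of beta_internal toward beta_target.

    Hypothetical helper: beta_target plays the role of an externally informed
    estimate and is treated as fixed, which is a simplifying assumption.
    """
    beta_internal = np.asarray(beta_internal, dtype=float)
    beta_target = np.asarray(beta_target, dtype=float)
    diff = beta_internal - beta_target
    p = diff.size
    if p < 3:
        raise ValueError("the classical rule needs at least 3 coefficients")
    # Squared distance between the estimates, scaled by internal uncertainty.
    dist = float(diff @ np.linalg.solve(cov_internal, diff))
    if dist == 0.0:
        return beta_target.copy()
    # Weight on the internal estimate; the positive part keeps it in [0, 1).
    weight = max(0.0, 1.0 - (p - 2) / dist)
    return beta_target + weight * diff


# Usage: a large disagreement keeps the result close to the internal estimate,
# while a small disagreement collapses it onto the target.
far = james_stein_shrink([10.0, 0.0, 0.0, 0.0], np.zeros(4), np.eye(4))
near = james_stein_shrink([1.0, 0.0, 0.0, 0.0], np.zeros(4), np.eye(4))
print(far, near)
```

The weight adapts to the data: when the internal and target estimates disagree strongly, which suggests heterogeneous populations, almost no shrinkage is applied, in line with the abstract's aim of protecting prediction accuracy regardless of the degree of heterogeneity.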
Affiliation(s)
- Peisong Han
  - Biostatistics Innovation Group, Gilead Sciences, 333 Lakeside Drive, Foster City, CA 94404, United States
- Haoyue Li
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Sung Kyun Park
  - Department of Epidemiology, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Bhramar Mukherjee
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
- Jeremy M G Taylor
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, United States
6
Choi K, Taylor JMG, Han P. Robust data integration from multiple external sources for generalized linear models with binary outcomes. Biometrics 2024; 80:ujad005. [PMID: 38364808 PMCID: PMC10873565 DOI: 10.1093/biomtc/ujad005] [Received: 11/11/2022] [Revised: 08/02/2023] [Accepted: 10/12/2023] [Indexed: 02/18/2024]
Abstract
We aim to estimate parameters in a generalized linear model (GLM) for a binary outcome when, in addition to the raw data from the internal study, more than 1 external study provides summary information in the form of parameter estimates from fitting GLMs with varying subsets of the internal study covariates. We propose an adaptive penalization method that exploits the external summary information and gains efficiency for estimation, and that is both robust and computationally efficient. The robust property comes from exploiting the relationship between parameters of a GLM and parameters of a GLM with omitted covariates and from downweighting external summary information that is less compatible with the internal data through a penalization. The computational burden associated with searching for the optimal tuning parameter for the penalization is reduced by using adaptive weights and by using an information criterion when searching for the optimal tuning parameter. Simulation studies show that the proposed estimator is robust against various types of population distribution heterogeneity and also gains efficiency compared to direct maximum likelihood estimation. The method is applied to improve a logistic regression model that predicts high-grade prostate cancer making use of parameter estimates from 2 external models.
Affiliation(s)
- Kyuseong Choi
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY 14853, United States
- Jeremy M G Taylor
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
- Peisong Han
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, United States
7
Han W, Zhang S, Gao H, Bu D. Clustering on hierarchical heterogeneous data with prior pairwise relationships. BMC Bioinformatics 2024; 25:40. [PMID: 38262930 PMCID: PMC10807103 DOI: 10.1186/s12859-024-05652-6] [Received: 09/14/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024]
Abstract
BACKGROUND: Clustering is a fundamental problem in statistics and has broad applications in various areas. Traditional clustering methods treat features equally and ignore the potential structure arising from differences in the characteristics of features. In cancer diagnosis and treatment especially, several types of biological features are collected and analyzed together. Treating these features equally fails to identify the heterogeneity of both the data structure and cancer itself, which leads to incompleteness and inefficacy of current anti-cancer therapies. OBJECTIVES: In this paper, we propose a clustering framework based on hierarchical heterogeneous data with prior pairwise relationships. The proposed clustering method fully characterizes the differences among features and identifies potential hierarchical structure through rough and refined clusters. RESULTS: The refined clustering further divides the clusters obtained by the rough clustering into different subtypes. It thus provides deeper insight into cancer that cannot be obtained with existing clustering methods. The proposed method is also flexible with respect to prior information: additional pairwise relationships between samples can be incorporated to improve clustering performance. Finally, statistical consistency properties of our proposed method are rigorously established, including the accurate estimation of parameters and determination of clustering structures. CONCLUSIONS: Our proposed method achieves better clustering performance than other methods in simulation studies, and the clustering accuracy increases when prior information is incorporated. Meaningful biological findings are obtained in the analysis of lung adenocarcinoma with clinical imaging data and omics data, showing that the hierarchical structure produced by rough and refined clustering is necessary and reasonable.
Affiliation(s)
- Wei Han
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Hailong Gao
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China
- Deliang Bu
  - School of Statistics, Capital University of Economics and Business, Beijing, China
8
Huang Y, Huang CY, Kim MO. Simultaneous selection and incorporation of consistent external aggregate information. Stat Med 2023; 42:5630-5645. [PMID: 37788982 DOI: 10.1002/sim.9929] [Received: 10/25/2022] [Revised: 08/24/2023] [Accepted: 09/21/2023] [Indexed: 10/05/2023]
Abstract
Interest has grown in synthesizing participant-level data of a study with relevant external aggregate information. Several efficient and flexible procedures have been developed under the assumption that the internal study and the external sources concern the same population. This homogeneity condition, although commonly imposed, is hard to check because the external information is available only in limited, aggregate form. Bias may be introduced when the assumption is violated. In this article, we propose a penalized likelihood approach that avoids undesirable bias by simultaneously selecting and synthesizing consistent external aggregate information. The proposed approach provides a general framework that incorporates consistent external information from heterogeneous study populations as long as the conditional distribution of the dependent variable under investigation is the same and differences in the independent variable distributions are properly accounted for via a semi-parametric density ratio model. The proposed approach also properly accounts for the sampling errors in the external information. A two-step estimator and an optimization algorithm are proposed for computation. We establish the selection and estimation consistency and the asymptotic normality of the two-step estimator. The proposed approach is illustrated with an analysis of gestational weight gain management studies.
Affiliation(s)
- Yunxiang Huang
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
- Chiung-Yu Huang
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
- Mi-Ok Kim
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
9
Gu T, Taylor JM, Mukherjee B. A synthetic data integration framework to leverage external summary-level information from heterogeneous populations. Biometrics 2023; 79:3831-3845. [PMID: 36876883 PMCID: PMC10480346 DOI: 10.1111/biom.13852] [Received: 06/01/2022] [Accepted: 02/24/2023] [Indexed: 03/07/2023]
Abstract
There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithms they use to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.
Affiliation(s)
- Tian Gu
  - Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A.
- Bhramar Mukherjee
  - Department of Biostatistics, University of Michigan, Ann Arbor, U.S.A.
10
Cheng YJ, Liu YC, Tsai CY, Huang CY. Semiparametric estimation of the transformation model by leveraging external aggregate data in the presence of population heterogeneity. Biometrics 2023; 79:1996-2009. [PMID: 36314375 DOI: 10.1111/biom.13778] [Received: 03/11/2022] [Accepted: 10/05/2022] [Indexed: 09/13/2023]
Abstract
Leveraging information in aggregate data from external sources to improve estimation efficiency and prediction accuracy with smaller scale studies has drawn a great deal of attention in recent years. Yet, conventional methods often either ignore uncertainty in the external information or fail to account for the heterogeneity between internal and external studies. This article proposes an empirical likelihood-based framework to improve the estimation of the semiparametric transformation models by incorporating information about the t-year subgroup survival probability from external sources. The proposed estimation procedure incorporates an additional likelihood component to account for uncertainty in the external information and employs a density ratio model to characterize population heterogeneity. We establish the consistency and asymptotic normality of the proposed estimator and show that it is more efficient than the conventional pseudopartial likelihood estimator without combining information. Simulation studies show that the proposed estimator yields little bias and outperforms the conventional approach even in the presence of information uncertainty and heterogeneity. The proposed methodologies are illustrated with an analysis of a pancreatic cancer study.
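For readers unfamiliar with the term, a density ratio model is commonly specified through exponential tilting, as sketched below. The notation is assumed rather than taken from the abstract: $F_{\mathrm{int}}$ and $F_{\mathrm{ext}}$ denote the internal and external distributions of a variable $x$, $h(x)$ is a user-chosen vector of functions, and $(\alpha, \gamma)$ are unknown parameters. Which variables enter $h$ in the paper, and which population serves as the baseline, are not stated here.

```latex
\frac{dF_{\mathrm{ext}}(x)}{dF_{\mathrm{int}}(x)}
  \;=\; \exp\{\alpha + \gamma^{\top} h(x)\},
\qquad
\int \exp\{\alpha + \gamma^{\top} h(x)\}\, dF_{\mathrm{int}}(x) \;=\; 1
```

Under this form, $\gamma = 0$ forces $\alpha = 0$ through the normalizing constraint and corresponds to homogeneous populations, so the size of $\gamma$ measures how far the external population departs from the internal one.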
Affiliation(s)
- Yu-Jen Cheng
  - Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan
- Yen-Chun Liu
  - Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan
- Chang-Yu Tsai
  - Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan
- Chiung-Yu Huang
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
11
Fu S, Purdue MP, Zhang H, Qin J, Song L, Berndt SI, Yu K. Improve the model of disease subtype heterogeneity by leveraging external summary data. PLoS Comput Biol 2023; 19:e1011236. [PMID: 37437002 DOI: 10.1371/journal.pcbi.1011236] [Received: 11/01/2022] [Accepted: 06/02/2023] [Indexed: 07/14/2023]
Abstract
Researchers are often interested in understanding the disease subtype heterogeneity by testing whether a risk exposure has the same level of effect on different disease subtypes. The polytomous logistic regression (PLR) model provides a flexible tool for such an evaluation. Disease subtype heterogeneity can also be investigated with a case-only study that uses a case-case comparison procedure to directly assess the difference between risk effects on two disease subtypes. Motivated by a large consortium project on the genetic basis of non-Hodgkin lymphoma (NHL) subtypes, we develop PolyGIM, a procedure to fit the PLR model by integrating individual-level data with summary data extracted from multiple studies under different designs. The summary data consist of coefficient estimates from working logistic regression models established by external studies. Examples of the working model include the case-case comparison model and the case-control comparison model, which compares the control group with a subtype group or a broad disease group formed by merging several subtypes. PolyGIM efficiently evaluates risk effects and provides a powerful test for disease subtype heterogeneity in situations when only summary data, instead of individual-level data, is available from external studies due to various informatics and privacy constraints. We investigate the theoretic properties of PolyGIM and use simulation studies to demonstrate its advantages. Using data from eight genome-wide association studies within the NHL consortium, we apply it to study the effect of the polygenic risk score defined by a lymphoid malignancy on the risks of four NHL subtypes. These results show that PolyGIM can be a valuable tool for pooling data from multiple sources for a more coherent evaluation of disease subtype heterogeneity.
Affiliation(s)
- Sheng Fu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Mark P Purdue
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Han Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Jing Qin
  - National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Lei Song
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Sonja I Berndt
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America
12
Fu S, Deng L, Zhang H, Qin J, Yu K. Integrative analysis of individual-level data and high-dimensional summary statistics. Bioinformatics 2023; 39:7085950. [PMID: 36964712 PMCID: PMC10361352 DOI: 10.1093/bioinformatics/btad156] [Received: 12/08/2022] [Revised: 03/19/2023] [Accepted: 03/22/2023] [Indexed: 04/23/2023]
Abstract
MOTIVATION: Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers' marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters. RESULTS: We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers. AVAILABILITY AND IMPLEMENTATION: An R package is available at https://github.com/fushengstat/MetaGIM.
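The pooling step mentioned in the abstract can be illustrated with a simplified minimum distance calculation. The sketch below assumes that each parallel job returns an estimate of the same parameter vector together with a covariance matrix, and that the job-level estimates are independent. That independence is a simplifying assumption that would not hold exactly when every job reuses the same individual-level data, so this is not the procedure implemented in MetaGIM. The function name `pool_min_distance` is hypothetical.

```python
import numpy as np


def pool_min_distance(estimates, covariances):
    """Pool estimates of a common parameter by minimum distance.

    Minimizes sum_k (b_k - theta)' V_k^{-1} (b_k - theta) over theta, which
    has the closed-form precision-weighted solution computed below.
    Assumes independent b_k (a simplification; see the lead-in text).
    """
    precisions = [np.linalg.inv(np.asarray(V, dtype=float)) for V in covariances]
    total_precision = sum(precisions)
    weighted_sum = sum(
        P @ np.asarray(b, dtype=float) for P, b in zip(precisions, estimates)
    )
    pooled = np.linalg.solve(total_precision, weighted_sum)
    pooled_cov = np.linalg.inv(total_precision)
    return pooled, pooled_cov


# Usage: two block-level fits of the same two-dimensional parameter.
theta_hat, theta_cov = pool_min_distance(
    estimates=[[1.0, 2.0], [3.0, 4.0]],
    covariances=[np.eye(2), np.eye(2)],
)
print(theta_hat)
print(theta_cov)
```

Weighting by precision gives noisier block-level fits less influence on the pooled estimate, which is the usual motivation for minimum distance pooling over a plain average.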
Affiliation(s)
- Sheng Fu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
- Lu Deng
  - School of Statistics and Data Science, Nankai University, Tianjin 300071, China
- Han Zhang
  - Information Management Services, Inc, Bethesda, MD 20892, USA
- Jing Qin
  - National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
13
Gao F, Chan KCG. Noniterative adjustment to regression estimators with population-based auxiliary information for semiparametric models. Biometrics 2023; 79:140-150. [PMID: 34693991 DOI: 10.1111/biom.13585] [Received: 12/20/2020] [Revised: 10/06/2021] [Accepted: 10/08/2021] [Indexed: 12/14/2022]
Abstract
Disease registries, surveillance data, and other datasets with extremely large sample sizes are becoming increasingly available, providing population-based information on disease incidence, survival probability, or other important public health characteristics. Such information can be leveraged in studies that collect detailed measurements but with smaller sample sizes. In contrast to recent proposals that formulate additional information as constraints in optimization problems, we develop a general framework to construct simple estimators that update the usual regression estimators with some functionals of data that incorporate the additional information. We consider general settings that incorporate nuisance parameters in the auxiliary information, non-i.i.d. data such as those from case-control studies, and semiparametric models with infinite-dimensional parameters common in survival analysis. Details of several important data and sampling settings are provided with numerical examples.
Affiliation(s)
- Fei Gao
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- K C G Chan
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA

14
Ding J, Li J, Han Y, McKeague IW, Wang X. Fitting additive risk models using auxiliary information. Stat Med 2023; 42:894-916. [PMID: 36599810 DOI: 10.1002/sim.9649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 10/06/2022] [Accepted: 11/09/2022] [Indexed: 01/06/2023]
Abstract
There has been a growing interest in incorporating auxiliary summary information from external studies into the analysis of internal individual-level data. In this paper, we propose an adaptive estimation procedure for an additive risk model to integrate auxiliary subgroup survival information via a penalized method of moments technique. Our approach can accommodate information from heterogeneous data. Parameters to quantify the magnitude of potential incomparability between internal data and external auxiliary information are introduced in our framework while nonzero components of these parameters suggest a violation of the homogeneity assumption. We further develop an efficient computational algorithm to solve the numerical optimization problem by profiling out the nuisance parameters. In an asymptotic sense, our method can be as efficient as if all the incomparable auxiliary information is accurately acknowledged and has been automatically excluded from consideration. The asymptotic normality of the proposed estimator of the regression coefficients is established, with an explicit formula for the asymptotic variance-covariance matrix that can be consistently estimated from the data. Simulation studies show that the proposed method yields a substantial gain in statistical efficiency over the conventional method using the internal data only, and reduces estimation biases when the given auxiliary survival information is incomparable. We illustrate the proposed method with a lung cancer survival study.
Affiliation(s)
- Jie Ding
  - School of Mathematical Sciences, Dalian University of Technology, Liaoning, China
- Jialiang Li
  - Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore
  - Duke University-NUS Graduate Medical School, National University of Singapore, Singapore, Singapore
- Yang Han
  - Department of Mathematics, University of Manchester, Manchester, United Kingdom
- Ian W McKeague
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, USA
- Xiaoguang Wang
  - School of Mathematical Sciences, Dalian University of Technology, Liaoning, China

15
Gu T, Lee PH, Duan R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J Biomed Inform 2023; 137:104243. [PMID: 36403757 PMCID: PMC9868117 DOI: 10.1016/j.jbi.2022.104243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 09/20/2022] [Accepted: 11/06/2022] [Indexed: 11/19/2022]
Abstract
OBJECTIVES: We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites. METHODS: We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data are not shareable in some sites, COMMUTE requests only the locally trained models from these sites, with which, COMMUTE generates heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity. RESULTS: Simulation studies show that COMMUTE outperforms methods without adjusting for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) around 0.80, which outperforms other benchmark methods with AUC ranging from 0.51 to 0.70. CONCLUSION: COMMUTE improves the risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations are highly different from the target. In a federated setting, it is highly communication efficient as it only requires each site to share model parameter estimates once, and no iterative communication or higher-order terms are needed.
Affiliation(s)
- Tian Gu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Phil H Lee
  - Department of Psychiatry, Harvard Medical School, Boston, MA, United States
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Rui Duan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States

16
Zheng J, Dong X, Newton CC, Hsu L. A Generalized Integration Approach to Association Analysis with Multi-category Outcome: An Application to a Tumor Sequencing Study of Colorectal Cancer and Smoking. J Am Stat Assoc 2022; 118:29-42. [PMID: 37193510 PMCID: PMC10168026 DOI: 10.1080/01621459.2022.2105703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2021] [Revised: 07/06/2022] [Accepted: 07/14/2022] [Indexed: 10/16/2022]
Abstract
Cancer is a heterogeneous disease, and rapid progress in sequencing and -omics technologies has enabled researchers to characterize tumors comprehensively. This has stimulated an intensive interest in studying how risk factors are associated with various tumor heterogeneous features. The Cancer Prevention Study-II (CPS-II) cohort is one of the largest prospective studies, particularly valuable for elucidating associations between cancer and risk factors. In this paper, we investigate the association of smoking with novel colorectal tumor markers obtained from targeted sequencing. However, due to cost and logistic difficulties, only a limited number of tumors can be assayed, which limits our capability for studying these associations. Meanwhile, there are extensive studies for assessing the association of smoking with overall cancer risk and established colorectal tumor markers. Importantly, such summary information is readily available from the literature. By linking this summary information to parameters of interest with proper constraints, we develop a generalized integration approach for polytomous logistic regression model with outcome characterized by tumor features. The proposed approach gains the efficiency through maximizing the joint likelihood of individual-level tumor data and external summary information under the constraints that narrow the parameter searching space. We apply the proposed method to the CPS-II data and identify the association of smoking with colorectal cancer risk differing by the mutational status of APC and RNF43 genes, neither of which is identified by the conventional analysis of CPS-II individual data only. These results help better understand the role of smoking in the etiology of colorectal cancer.
Affiliation(s)
- Jiayin Zheng
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
- Xinyuan Dong
  - Department of Biostatistics, University of Washington, Seattle, WA
- Li Hsu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA
  - Department of Biostatistics, University of Washington, Seattle, WA

17
Zhai Y, Han P. Data Integration with Oracle Use of External Information from Heterogeneous Populations. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2050248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
Affiliation(s)
- Yuqi Zhai
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA
- Peisong Han
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109-2029, USA

18
Yuan M, Li P, Wu C. Semiparametric empirical likelihood inference with estimating equations under density ratio models. Electron J Stat 2022. [DOI: 10.1214/22-ejs2069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Meng Yuan
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Pengfei Li
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Changbao Wu
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

19
Gu T, Taylor JMG, Mukherjee B. A meta-inference framework to integrate multiple external models into a current study. Biostatistics 2021; 24:406-424. [PMID: 34269371 PMCID: PMC10102901 DOI: 10.1093/biostatistics/kxab017] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/19/2020] [Revised: 04/04/2021] [Accepted: 04/16/2021] [Indexed: 11/14/2022]
Abstract
It is becoming increasingly common for researchers to consider incorporating external information from large studies to improve the accuracy of statistical inference instead of relying on a modestly sized data set collected internally. With some new predictors only available internally, we aim to build improved regression models based on individual-level data from an "internal" study while incorporating summary-level information from "external" models. We propose a meta-analysis framework along with two weighted estimators as the composite of empirical Bayes estimators, which combines the estimates from different external models. The proposed framework is flexible and robust in the ways that (i) it is capable of incorporating external models that use a slightly different set of covariates; (ii) it is able to identify the most relevant external information and diminish the influence of information that is less compatible with the internal data; and (iii) it nicely balances the bias-variance trade-off while preserving the most efficiency gain. The proposed estimators are more efficient than the naïve analysis of the internal data and other naïve combinations of external estimators.
Affiliation(s)
- Tian Gu
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Jeremy M G Taylor
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Bhramar Mukherjee
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA

20
Sheng Y, Sun Y, Huang CY, Kim MO. Synthesizing external aggregated information in the penalized Cox regression under population heterogeneity. Stat Med 2021; 40:4915-4930. [PMID: 34134178 DOI: 10.1002/sim.9101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/29/2020] [Revised: 03/29/2021] [Accepted: 05/27/2021] [Indexed: 11/06/2022]
Abstract
Synthesizing external aggregated information has been proven useful in improving estimation efficiency when conducting statistical analysis using a limited amount of data. In this paper, we develop a unified framework for combining information from high-dimensional individual-level data and potentially low-dimensional external aggregate data under the Cox model. We summarize various forms of external aggregated information by population estimating equations and propose a penalized empirical likelihood approach to borrow information from these estimating equations. The proposed methods possess the flexibility to handle the case where individual-level data and external aggregate data are from heterogeneous populations. Specifically, a penalized empirical likelihood ratio test is developed to check for the potential heterogeneity, and a semiparametric density ratio model is postulated to account for the heterogeneity. Moreover, we study the impact of uncertainty in the auxiliary information on the efficiency gain and propose a modified variance estimator to adjust for the uncertainty. The proposed estimators enjoy the oracle property and are asymptotically more efficient than the penalized partial likelihood estimator that does not exploit the external aggregated information. Simulation studies show improvement in both estimation efficiency and variable selection over the competitors. The proposed approaches are applied to the analysis of a pediatric kidney transplant study for illustration.
Affiliation(s)
- Ying Sheng
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
- Yifei Sun
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, USA
- Chiung-Yu Huang
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California at San Francisco, San Francisco, California, USA
- Mi-Ok Kim
  - Department of Epidemiology & Biostatistics, University of California at San Francisco, San Francisco, California, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California at San Francisco, San Francisco, California, USA

21
Jiang Z, Yang B, Qin J, Zhou Y. Enhanced empirical likelihood estimation of incubation period of COVID-19 by integrating published information. Stat Med 2021; 40:4252-4268. [PMID: 33973260 PMCID: PMC8242591 DOI: 10.1002/sim.9026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/30/2020] [Revised: 04/21/2021] [Accepted: 04/23/2021] [Indexed: 12/05/2022]
Abstract
Since the outbreak of the new coronavirus disease (COVID‐19), a large number of scientific studies and data analysis reports have been published in the International Journal of Medicine and Statistics. Taking the estimation of the incubation period as an example, we propose a low‐cost method to integrate external research results and available internal data together. By using empirical likelihood method, we can effectively incorporate summarized information even if it may be derived from a misspecified model. Taking the possible uncertainty in summarized information into account, we augment a logarithm of the normal density in the log empirical likelihood. We show that the augmented log‐empirical likelihood can produce enhanced estimates for the underlying parameters compared with the method without utilizing auxiliary information. Moreover, the Wilks' theorem is proved to be true. We illustrate our methodology by analyzing a COVID‐19 incubation period data set retrieved from Zhejiang Province and summarized information from a similar study in Shenzhen, China.
Affiliation(s)
- Zhongfeng Jiang
  - Academy of Mathematics and System Sciences, Chinese Academy of Science, Beijing, China
- Baoying Yang
  - Department of Statistics, College of Mathematics, Southwest Jiaotong University, Chengdu, China
- Jing Qin
  - National Institute of Allergy and Infectious Diseases, National Institute of Health, Bethesda, Maryland, USA
- Yong Zhou
  - Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, Shanghai, China
  - Academy of Statistics and Interdisciplinary Sciences, East China Normal University, Shanghai, China

22
Zhang H, Deng L, Wheeler W, Qin J, Yu K. Integrative analysis of multiple case-control studies. Biometrics 2021; 78:1080-1091. [PMID: 33768525 DOI: 10.1111/biom.13461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2020] [Revised: 02/23/2021] [Accepted: 03/12/2021] [Indexed: 11/28/2022]
Abstract
It is often challenging to share detailed individual-level data among studies due to various informatics and privacy constraints. However, it is relatively easy to pool together aggregated summary level data, such as the ones required for standard meta-analyses. Focusing on data generated from case-control studies, we present a flexible inference procedure that integrates individual-level data collected from an "internal" study with summary data borrowed from "external" studies. This procedure is built on a retrospective empirical likelihood framework to account for the sampling bias in case-control studies. It can incorporate summary statistics extracted from various working models adopted by multiple independent or overlapping external studies. It also allows for external studies to be conducted in a population that is different from the internal study population. We show both theoretically and numerically its efficiency advantage over several competing alternatives.
Affiliation(s)
- Han Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Lu Deng
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- William Wheeler
  - Information Management Services, Silver Spring, Maryland, USA
- Jing Qin
  - National Institute of Allergy and Infectious Diseases, National Institute of Health, Bethesda, Maryland, USA
- Kai Yu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA

23
Duan R, Ning Y, Chen Y. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika 2021. [DOI: 10.1093/biomet/asab007] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/13/2022]
Abstract
Summary
In multicentre research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed. The existing distributed algorithms usually assume the data are homogeneously distributed across sites. This assumption ignores the important fact that the data collected at different sites may come from various subpopulations and environments, which can lead to heterogeneity in the distribution of the data. Ignoring the heterogeneity may lead to erroneous statistical inference. We propose distributed algorithms which account for the heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach (Wang et al. 2017; Jordan et al. 2018) to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms maintain the same communication cost as existing communication-efficient algorithms. We establish a nonasymptotic risk bound for the proposed distributed estimator and its limiting distribution in the two-index asymptotic setting, which allows both sample size per site and the number of sites to go to infinity. In addition, we show that the asymptotic variance of the estimator attains the Cramér–Rao lower bound when the number of sites is smaller in rate than the sample size at each site. Finally, we use simulation studies and a real data application to demonstrate the validity and feasibility of the proposed methods.