1
Chen LW, Fine JP, Bair E, Ritter VS, McElrath TF, Cantonwine DE, Meeker JD, Ferguson KK, Zhao S. Semiparametric analysis of a generalized linear model with multiple covariates subject to detection limits. Stat Med 2022; 41:4791-4808. [PMID: 35909228 PMCID: PMC9588684 DOI: 10.1002/sim.9536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2020] [Revised: 07/10/2022] [Accepted: 07/11/2022] [Indexed: 11/10/2022]
Abstract
Studies on the health effects of environmental mixtures face the challenge of limit of detection (LOD) in multiple correlated exposure measurements. Conventional approaches to deal with covariates subject to LOD, including complete-case analysis, substitution methods, and parametric modeling of covariate distribution, are feasible but may result in efficiency loss or bias. With a single covariate subject to LOD, a flexible semiparametric accelerated failure time (AFT) model to accommodate censored measurements has been proposed. We generalize this approach by considering a multivariate AFT model for the multiple correlated covariates subject to LOD and a generalized linear model for the outcome. A two-stage procedure based on semiparametric pseudo-likelihood is proposed for estimating the effects of these covariates on health outcome. Consistency and asymptotic normality of the estimators are derived for an arbitrary fixed dimension of covariates. Simulation studies demonstrate good large sample performance of the proposed methods vs conventional methods in realistic scenarios. We illustrate the practical utility of the proposed method with the LIFECODES birth cohort data, where we compare our approach to existing approaches in an analysis of multiple urinary trace metals in association with oxidative stress in pregnant women.
Affiliation(s)
- Ling-Wan Chen
- Biostatistics & Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Jason P. Fine
- Departments of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Victor S. Ritter
- Biostatistics & Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- John D. Meeker
- Department of Environmental Health Sciences, University of Michigan School of Public Health, Ann Arbor, MI, USA
- Kelly K. Ferguson
- Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Shanshan Zhao
- Biostatistics & Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
2
Chen Y, Liang KY, Tong P, Beaty TH, Barnes KC, Linda Kao WH. A pseudolikelihood approach for assessing genetic association in case-control studies with unmeasured population structure. Stat Methods Med Res 2020; 29:3153-3165. [PMID: 32393154 DOI: 10.1177/0962280220921212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The case-control study design is one of the main tools for detecting associations between genetic markers and diseases. It is well known that population substructure can lead to spurious association between disease status and a genetic marker if the prevalence of disease and the marker allele frequency vary across subpopulations. In this paper, we propose a novel statistical method to estimate the association in case-control studies with unmeasured population substructure. The proposed method takes two steps. First, the information on genomic markers and disease status is used to infer the population substructure; second, the association between the disease and the test marker adjusting for the population substructure is modeled and estimated parametrically through polytomous logistic regression. The performance of the proposed method, relative to the existing methods, on bias, coverage probability and computational time, is assessed through simulations. The method is applied to an end-stage renal disease study in an African American population.
Affiliation(s)
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, USA
- Pan Tong
- Department of Bioinformatics & Computational Biology, University of Texas, Houston, USA
- Terri H Beaty
- Department of Epidemiology, Johns Hopkins University, Baltimore, USA
- Kathleen C Barnes
- University of Colorado Denver - Anschutz Medical Campus, Aurora, USA
- W H Linda Kao
- Department of Epidemiology, Johns Hopkins University, Baltimore, USA
3
Abstract
In biomedical cohort studies for assessing the association between an outcome variable and a set of covariates, usually, some covariates can only be measured on a subgroup of study subjects. An important design question is which subjects to select into the subgroup to increase statistical efficiency. When the outcome is binary, one may adopt a case-control sampling design or a balanced case-control design where cases and controls are further matched on a small number of complete discrete covariates. While the latter achieves success in estimating odds ratio (OR) parameters for the matching covariates, similar two-phase design options have not been explored for the remaining covariates, especially the incompletely collected ones. This is of great importance in studies where the covariates of interest cannot be completely collected. To this end, assuming that an external model is available to relate the outcome and complete covariates, we propose a novel sampling scheme that oversamples cases and controls with worse goodness-of-fit based on the external model and further matches them on complete covariates similarly to the balanced design. We develop a pseudolikelihood method for estimating OR parameters. Through simulation studies and explorations in a real-cohort study, we find that our design generally leads to reduced asymptotic variances of the OR estimates and the reduction for the matching covariates is comparable to that of the balanced design.
Affiliation(s)
- Le Wang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Mathematics and Statistics, Villanova University, Villanova, PA 19085, USA
- Matthew L Williams
- Division of Cardiovascular Surgery, Department of Surgery, University of Pennsylvania, Philadelphia, PA 19104, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jinbo Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
4
Cai Y, Huang J, Ning J, Lee MLT, Rosner B, Chen Y. Two-sample test for correlated data under outcome-dependent sampling with an application to self-reported weight loss data. Stat Med 2019; 38:4999-5009. [PMID: 31489699 PMCID: PMC6800790 DOI: 10.1002/sim.8346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2017] [Revised: 07/07/2019] [Accepted: 07/17/2019] [Indexed: 11/09/2022]
Abstract
Standard methods for two-sample tests such as the t-test and Wilcoxon rank sum test may lead to incorrect type I errors when applied to longitudinal or clustered data. Recent alternatives of two-sample tests for clustered data often require certain assumptions on the correlation structure and/or noninformative cluster size. In this paper, based on a novel pseudolikelihood for correlated data, we propose a score test without knowledge of the correlation structure or assuming data missingness at random. The proposed score test can capture differences in the mean and variance between two groups simultaneously. We use projection theory to derive the limiting distribution of the test statistic, in which the covariance matrix can be empirically estimated. We conduct simulation studies to evaluate the proposed test and compare it with existing methods. To illustrate the usefulness of the proposed test, we use it to compare self-reported weight loss data in a friends' referral group with the data from the Internet self-joining group.
Affiliation(s)
- Yi Cai
- AT&T Services, Inc., Plano, TX 75247, USA
- Jing Huang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jing Ning
- Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA
- Mei-Ling Ting Lee
- Department of Epidemiology and Biostatistics, The University of Maryland School of Public Health, College Park, MD 20742, USA
- Bernard Rosner
- Department of Biostatistics, Harvard Medical School, MA 02115, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
5
Lobach I, Sampson J, Lobach S, Zhang L. Gene-environment interactions in case-control studies with silent disease. Genet Epidemiol 2018; 42:551-558. [PMID: 29896809 DOI: 10.1002/gepi.22135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/09/2018] [Revised: 03/31/2018] [Accepted: 05/10/2018] [Indexed: 12/30/2022]
Abstract
Genome-wide association studies (GWAS) often measure gene-environment interactions (G × E). We consider the problem of accurately estimating a G × E in a case-control GWAS when a subset of the controls have silent, or undiagnosed, disease and the frequency of the silent disease varies by the environmental variable. We show that using case-control status without accounting for misdiagnosis can lead to biased estimates of the G × E. We further propose a pseudolikelihood approach to remove the bias and accurately estimate how the relationship between the genetic variant and the true disease status varies by the environmental variable. We demonstrate our method in extensive simulations and apply our method to a GWAS of prostate cancer.
Affiliation(s)
- Iryna Lobach
- Department of Epidemiology and Biostatistics, University of California, San Francisco, California
- Joshua Sampson
- National Cancer Institute, National Institutes of Health, Bethesda, Maryland
- Siarhei Lobach
- Applied Mathematics and Computer Science Department, Belarusian State University, Minsk, Belarus
- Li Zhang
- Department of Epidemiology and Biostatistics, University of California, San Francisco, California; Department of Medicine, University of California, San Francisco, California
6
Li L, Brumback BA, Weppelmann TA, Morris JG, Ali A. Adjusting for unmeasured confounding due to either of two crossed factors with a logistic regression model. Stat Med 2016; 35:3179-88. [PMID: 26892025 DOI: 10.1002/sim.6916] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/11/2015] [Revised: 11/19/2015] [Accepted: 01/29/2016] [Indexed: 11/12/2022]
Abstract
Motivated by an investigation of the effect of surface water temperature on the presence of Vibrio cholerae in water samples collected from different fixed surface water monitoring sites in Haiti in different months, we investigated methods to adjust for unmeasured confounding due to either of the two crossed factors site and month. In the process, we extended previous methods that adjust for unmeasured confounding due to one nesting factor (such as site, which nests the water samples from different months) to the case of two crossed factors. First, we developed a conditional pseudolikelihood estimator that eliminates fixed effects for the levels of each of the crossed factors from the estimating equation. Using the theory of U-Statistics for independent but non-identically distributed vectors, we show that our estimator is consistent and asymptotically normal, but that its variance depends on the nuisance parameters and thus cannot be easily estimated. Consequently, we apply our estimator in conjunction with a permutation test, and we investigate use of the pigeonhole bootstrap and the jackknife for constructing confidence intervals. We also incorporate our estimator into a diagnostic test for a logistic mixed model with crossed random effects and no unmeasured confounding. For comparison, we investigate between-within models extended to two crossed factors. These generalized linear mixed models include covariate means for each level of each factor in order to adjust for the unmeasured confounding. We conduct simulation studies, and we apply the methods to the Haitian data. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Li Li
- Department of Biostatistics, College of Public Health and Health Professions, College of Medicine, University of Florida, Gainesville, 32611, FL, U.S.A
- Babette A Brumback
- Department of Biostatistics, College of Public Health and Health Professions, College of Medicine, University of Florida, Gainesville, 32611, FL, U.S.A
- Thomas A Weppelmann
- Emerging Pathogens Institute, University of Florida, Gainesville, 32611, FL, U.S.A
- J Glenn Morris
- Emerging Pathogens Institute, University of Florida, Gainesville, 32611, FL, U.S.A
- Afsar Ali
- Emerging Pathogens Institute, University of Florida, Gainesville, 32611, FL, U.S.A; Department of Environmental and Global Health, College of Public Health and Health Professions, University of Florida, Gainesville, 32611, FL, U.S.A
7
Chen Y, Hong C, Riley RD. An alternative pseudolikelihood method for multivariate random-effects meta-analysis. Stat Med 2014; 34:361-80. [PMID: 25363629 PMCID: PMC4305202 DOI: 10.1002/sim.6350] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 03/14/2013] [Revised: 10/01/2014] [Accepted: 10/07/2014] [Indexed: 12/19/2022]
Abstract
Recently, multivariate random-effects meta-analysis models have received a great deal of attention, despite their greater complexity compared to univariate meta-analyses. One of their advantages is the ability to account for the within-study and between-study correlations. However, the standard inference procedures, such as the maximum likelihood or restricted maximum likelihood inference, require the within-study correlations, which are usually unavailable. In addition, the standard inference procedures suffer from the problem of a singular estimated covariance matrix. In this paper, we propose a pseudolikelihood method to overcome the aforementioned problems. The pseudolikelihood method does not require within-study correlations and is not prone to the singular covariance matrix problem. In addition, it can properly estimate the covariance between pooled estimates for different outcomes, which enables valid inference on functions of pooled estimates, and can be applied to meta-analysis where some studies have outcomes missing completely at random. Simulation studies show that the pseudolikelihood method provides unbiased estimates for functions of pooled estimates, well-estimated standard errors, and confidence intervals with good coverage probability. Furthermore, the pseudolikelihood method is found to maintain high relative efficiency compared to that of the standard inferences with known within-study correlations. We illustrate the proposed method through three meta-analyses for comparison of prostate cancer treatment, for the association between paraoxonase 1 activities and coronary heart disease, and for the association between homocysteine level and coronary heart disease. © 2014 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Affiliation(s)
- Yong Chen
- Division of Biostatistics, University of Texas School of Public Health, 1200 Pressler St, Houston, Texas 77030, U.S.A
8
Breslow NE, Amorim G, Pettinger MB, Rossouw J. Using the Whole Cohort in the Analysis of Case-Control Data: Application to the Women's Health Initiative. Stat Biosci 2013; 5:10.1007/s12561-013-9080-2. [PMID: 24363785 PMCID: PMC3865808 DOI: 10.1007/s12561-013-9080-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 10/27/2022]
Abstract
Standard analyses of data from case-control studies that are nested in a large cohort ignore information available for cohort members not sampled for the sub-study. This paper reviews several methods designed to increase estimation efficiency by using more of the data, treating the case-control sample as a two or three phase stratified sample. When applied to a study of coronary heart disease among women in the hormone trials of the Women's Health Initiative, modest but increasing gains in precision of regression coefficients were observed depending on the amount of cohort information used in the analysis. The gains were particularly evident for pseudo- or maximum likelihood estimates whose validity depends on the assumed model being correct. Larger standard errors were obtained for coefficients estimated by inverse probability weighted methods that are more robust to model misspecification. Such misspecification may have been responsible for an important difference in one key regression coefficient estimated using the weighted compared with the more efficient methods.
Affiliation(s)
- Norman E. Breslow
- Department of Biostatistics, University of Washington, Seattle, WA, USA, Tel.: +1-206-543-1044, Fax: +1-206-616-2724
- Gustavo Amorim
- Department of Statistics, University of Auckland, Auckland, NZ
- Mary B. Pettinger
- WHI Clinical Coordinating Center, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Jacques Rossouw
- Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, Bethesda, MD, USA
9
Birkner M, Blath J, Eldon B. Statistical properties of the site-frequency spectrum associated with lambda-coalescents. Genetics 2013; 195:1037-53. [PMID: 24026094 PMCID: PMC3813835 DOI: 10.1534/genetics.113.156612] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 05/26/2013] [Accepted: 08/28/2013] [Indexed: 11/18/2022]
Abstract
Statistical properties of the site-frequency spectrum associated with Λ-coalescents are our objects of study. In particular, we derive recursions for the expected value, variance, and covariance of the spectrum, extending earlier results of Fu (1995) for the classical Kingman coalescent. Estimating coalescent parameters introduced by certain Λ-coalescents for data sets too large for full-likelihood methods is our focus. The recursions for the expected values we obtain can be used to find the parameter values that give the best fit to the observed frequency spectrum. The expected values are also used to approximate the probability a (derived) mutation arises on a branch subtending a given number of leaves (DNA sequences), allowing us to apply a pseudolikelihood inference to estimate coalescence parameters associated with certain subclasses of Λ-coalescents. The properties of the pseudolikelihood approach are investigated on simulated as well as real mtDNA data sets for the high-fecundity Atlantic cod (Gadus morhua). Our results for two subclasses of Λ-coalescents show that one can distinguish these subclasses from the Kingman coalescent, as well as between the Λ-subclasses, even for a moderate (maybe a few hundred) sample size.
Affiliation(s)
- Matthias Birkner
- Institut für Mathematik, Johannes-Gutenberg-Universität, 55099 Mainz, Germany
- Jochen Blath
- Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany
- Bjarki Eldon
- Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany
10
Abstract
Many statistical models arising in applications contain non- and weakly-identified parameters. Due to identifiability concerns, tests concerning the parameters of interest may not be able to use conventional theories and it may not be clear how to assess statistical significance. This paper extends the literature by developing a testing procedure that can be used to evaluate hypotheses under non- and weakly-identifiable semiparametric models. The test statistic is constructed from a general estimating function of a finite dimensional parameter model representing the population characteristics of interest, but other characteristics which may be described by infinite dimensional parameters, and viewed as nuisance, are left completely unspecified. We derive the limiting distribution of this statistic and propose theoretically justified resampling approaches to approximate its asymptotic distribution. The methodology's practical utility is illustrated in simulations and an analysis of quality-of-life outcomes from a longitudinal study on breast cancer.
Affiliation(s)
- Guanqun Cao
- Department of Statistics and Probability, Michigan State University