1
Kim S, Beane Freeman LE, Albert PS. A latent functional approach for modeling the effects of multidimensional exposures on disease risk. Stat Med 2023; 42:4776-4793. [PMID: 37635131 DOI: 10.1002/sim.9888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/28/2023] [Accepted: 08/14/2023] [Indexed: 08/29/2023]
Abstract
Understanding the relationships between exposure and disease incidence is an important problem in environmental epidemiology. Typically, a large number of these exposures are measured, and it is found either that a few exposures transmit risk or that each exposure transmits a small amount of risk, but, taken together, these may pose a substantial disease risk. Further, these exposure effects can be nonlinear. We develop a latent functional approach, which assumes that the individual effect of each exposure can be characterized as one of a series of unobserved functions, where the number of latent functions is less than or equal to the number of exposures. We propose Bayesian methodology to fit models with a large number of exposures and show that existing Bayesian group LASSO approaches are a special case of the proposed model. An efficient Markov chain Monte Carlo sampling algorithm is developed for carrying out Bayesian inference. The deviance information criterion is used to choose an appropriate number of nonlinear latent functions. We demonstrate the good properties of the approach using simulation studies. Further, we show that complex exposure relationships can be represented with only a few latent functional curves. The proposed methodology is illustrated with an analysis of the effect of cumulative pesticide exposure on cancer risk in a large cohort of farmers.
Affiliation(s)
- Sungduk Kim
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Laura E Beane Freeman
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
- Paul S Albert
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, USA
2
Chang YHH, Buras MR, Davis JM, Crowson CS. Avoiding Blunders When Analyzing Correlated Data, Clustered Data, or Repeated Measures. J Rheumatol 2023; 50:1269-1272. [PMID: 37188383 PMCID: PMC10543393 DOI: 10.3899/jrheum.2022-1109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/09/2023] [Indexed: 05/17/2023]
Abstract
Rheumatology research often involves correlated and clustered data. A common error when analyzing these data occurs when instead we treat these data as independent observations. This can lead to incorrect statistical inference. The data used are a subset of the 2017 study from Raheel et al consisting of 633 patients with rheumatoid arthritis (RA) between 1988 and 2007. RA flare and the number of swollen joints served as our binary and continuous outcomes, respectively. Generalized linear models (GLM) were fitted for each, while adjusting for rheumatoid factor (RF) positivity and sex. Additionally, a generalized linear mixed model with a random intercept and a generalized estimating equation were used to model RA flare and the number of swollen joints, respectively, to take additional correlation into account. The GLM's β coefficients and their 95% confidence intervals (CIs) are then compared to their mixed-effects equivalents. The β coefficients compared between methodologies are very similar. However, their standard errors increase when correlation is accounted for. As a result, if the additional correlations are not considered, the standard error can be underestimated. This results in an overestimated effect size, narrower CIs, increased type I error, and a smaller P value, thus potentially producing misleading results. It is important to model the additional correlation that occurs in correlated data.
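The abstract's main point, that ignoring within-cluster correlation understates standard errors, can be sketched with the textbook design-effect formula for equal-sized clusters under an exchangeable correlation. This is an illustration only, not the GLMM or GEE analysis the authors ran; the function names and the numbers (100 clusters of 5 observations, intraclass correlation 0.5) are hypothetical.

```python
import math


def naive_se(sigma, n_clusters, cluster_size):
    """Standard error of the overall mean if every observation were independent."""
    return sigma / math.sqrt(n_clusters * cluster_size)


def design_effect(cluster_size, icc):
    """Variance inflation 1 + (m - 1) * rho for clusters of equal size m."""
    return 1.0 + (cluster_size - 1) * icc


def clustered_se(sigma, n_clusters, cluster_size, icc):
    """Standard error of the overall mean after allowing for within-cluster correlation."""
    return naive_se(sigma, n_clusters, cluster_size) * math.sqrt(
        design_effect(cluster_size, icc)
    )


# Hypothetical setting: 100 patients, 5 visits each, ICC of 0.5, unit variance.
se_independent = naive_se(1.0, 100, 5)
se_clustered = clustered_se(1.0, 100, 5, 0.5)
print(round(se_independent, 3), round(se_clustered, 3))
```

With these hypothetical inputs the naive standard error is about 0.045 and the correlation-adjusted one about 0.077, so a confidence interval built from the naive value would be too narrow.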
Affiliation(s)
- Yu-Hui H Chang
  - Y.H.H. Chang, PhD, MS, M.R. Buras, MS, Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, Arizona
- Matthew R Buras
  - Y.H.H. Chang, PhD, MS, M.R. Buras, MS, Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, Arizona
- John M Davis
  - J.M. Davis III, MD, MS, Division of Rheumatology, Mayo Clinic, Rochester, Minnesota
- Cynthia S Crowson
  - C.S. Crowson, PhD, Division of Rheumatology, and Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA.
3
Machida R, Sakamaki K, Kuchiba A. Clinical trial design and analysis for comparing three treatments with intra-individual right- and left-hand data. Clin Trials 2023; 20:203-210. [PMID: 36651336 DOI: 10.1177/17407745221150281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
Abstract
BACKGROUND Chemotherapy-induced peripheral neuropathy can occur in the right and left hand. Studies on prevention treatments for chemotherapy-induced peripheral neuropathy have largely adopted either self-controlled designs or parallel designs to compare two preventive treatments. When three treatment options (two experimental treatments and a control treatment) are available, both designs can be extended. However, no clinical trials have adopted a self-controlled design to compare three prevention treatments for chemotherapy-induced peripheral neuropathy. The incomplete block crossover design for more than two treatments can be extended to compare three treatments in the self-controlled design. In simple extension, some of the participants receive two experimental treatments in both hands; however, it may be difficult to administer different experimental treatments in both hands for practical reasons, such as a concern for the different types of unexpected adverse events. This study proposes a design and analysis method appropriate for the situation where only one experimental treatment is provided to each participant. METHODS We assume clinical trials to compare each of the two experimental treatments (E1 and E2) with the control treatment (C) and between two experimental treatments only when both experimental treatments are superior to the control treatment. We propose a self-controlled design, which equally randomizes to four arms to adjust for the dominant hand effect: Arm 1: E1 for right hand, C for left hand; Arm 2: C for right hand, E1 for left hand; Arm 3: E2 for right hand, C for left hand; and Arm 4: C for right hand, E2 for left hand. We compare operating characteristics of the proposed design with the three-arm parallel design in which the same treatment is performed in both hands by participants. 
We also assess three proposed analysis methods for comparisons between experimental treatments in the self-controlled design under several conditions of correlations between right and left hands using simulation studies. RESULTS The simulation studies showed that the proposed design was more powerful than the three-arm parallel design when correlation was 0.3 or higher. For comparisons between experimental treatments, the methods based on the regression model, including the outcome of hands with C as a covariate, had the highest power under modest to high correlation among the analysis methods in the self-controlled design. CONCLUSION The proposed design can improve the power for comparing between two experimental treatments and the control treatment. Our design is useful in situations where it is undesirable for participants to receive different experimental treatments in both hands for practical reasons.
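The four-arm allocation described in METHODS can be sketched as follows. The arm definitions come from the abstract; the use of permuted blocks of four, the seed, and the function name `randomize` are assumptions made for illustration, since the abstract only says that participants are equally randomized to the four arms.

```python
import random

# Arm definitions as listed in the abstract: each participant receives one
# experimental treatment (E1 or E2) on one hand and control (C) on the other.
ARMS = {
    1: {"right": "E1", "left": "C"},
    2: {"right": "C", "left": "E1"},
    3: {"right": "E2", "left": "C"},
    4: {"right": "C", "left": "E2"},
}


def randomize(n_participants, seed=0):
    """Assign participants to arms 1-4 using permuted blocks of four (assumed scheme)."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = list(ARMS)  # [1, 2, 3, 4]
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]


allocation = randomize(40)
counts = {arm: allocation.count(arm) for arm in ARMS}
print(counts)
```

Because every block contains each arm once, any sample size that is a multiple of four yields equal allocation, and the dominant-hand effect is balanced because E1 and E2 each appear equally often on the right and left hands.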
Affiliation(s)
- Ryunosuke Machida
  - Biostatistics Division, Center for Research Administration and Support, National Cancer Center, Tokyo, Japan
- Kentaro Sakamaki
  - Center for Data Science, Yokohama City University, Yokohama, Japan
- Aya Kuchiba
  - Biostatistics Division, Center for Research Administration and Support, National Cancer Center, Tokyo, Japan
  - Graduate School of Health Innovation, Kanagawa University of Human Services, Kanagawa, Japan
4
Bay C, Glynn RJ, Seddon JM, Lee MLT, Rosner B. Evaluation of Risk Prediction with Hierarchical Data: Dependency Adjusted Confidence Intervals for the AUC. Stats (Basel) 2023; 6:526-538. [PMID: 37920864 PMCID: PMC10621602 DOI: 10.3390/stats6020034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2023] Open
Abstract
The area under the true ROC curve (AUC) is routinely used to determine how strongly a given model discriminates between the levels of a binary outcome. Standard inference with the AUC requires that outcomes be independent of each other. To overcome this limitation, a method was developed for the estimation of the variance of the AUC in the setting of two-level hierarchical data using probit-transformed prediction scores generated from generalized estimating equation models, thereby allowing for the application of inferential methods. This manuscript presents an extension of this approach so that inference for the AUC may be performed in a three-level hierarchical data setting (e.g., eyes nested within persons and persons nested within families). A method that accounts for the effect of tied prediction scores on inference is also described. The performance of 95% confidence intervals around the AUC was assessed through the simulation of three-level clustered data in multiple settings, including ones with tied data and variable cluster sizes. Across all settings, the actual 95% confidence interval coverage varied from 0.943 to 0.958, and the ratio of the theoretical variance to the empirical variance of the AUC varied from 0.920 to 1.013. The results are better than those from existing methods. Two examples of applying the proposed methodology are presented.
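For orientation, the point estimate of the AUC in the presence of tied prediction scores is commonly computed with the Mann-Whitney statistic, counting each tied case-control pair as one half. The sketch below shows only that point estimate; it does not implement the dependency-adjusted variance or the tie adjustment to inference described in the abstract, and the function name and scores are hypothetical.

```python
def auc_with_ties(pos_scores, neg_scores):
    """Mann-Whitney estimate of the AUC; a tied pair contributes one half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each outcome group")
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


# Two cases and two controls, with one tied pair at 0.5.
auc = auc_with_ties([0.9, 0.5], [0.5, 0.1])
print(auc)  # 0.875
```

Of the four case-control pairs, three are concordant and one is tied, giving (3 + 0.5) / 4 = 0.875.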
Affiliation(s)
- Camden Bay
  - Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, 02115, USA
- Robert J Glynn
  - Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, 02115, USA
- Johanna M Seddon
  - University of Massachusetts Chan Medical School, Worcester, MA, 01655, USA
- Mei-Ling Ting Lee
  - University of Maryland School of Public Health, Department of Epidemiology and Biostatistics, College Park, MD, 20742, USA
- Bernard Rosner
  - Harvard Medical School, Brigham and Women’s Hospital, Boston, MA, 02115, USA
5
Chien LC, Chang LY, Shen CW. A model selection criterion for clustered survival analysis with informative cluster size. Pharm Stat 2023; 22:79-95. [PMID: 36054538 DOI: 10.1002/pst.2261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 07/24/2022] [Accepted: 08/04/2022] [Indexed: 02/01/2023]
Abstract
We propose a model selection criterion for correlated survival data when the cluster size is informative to the outcome. This approach, called Resampling Cluster Survival Information Criterion (RCSIC), uses the Cox proportional hazards model that is weighted with the inverse of the cluster size. The RCSIC, based on the within-cluster resampling idea, takes into account the possible variability of the within-cluster subsampling and the possible informativeness of cluster sizes. The RCSIC allows easy execution of the within-cluster resampling idea without a large number of resamples of the data. In contrast with the traditional model selection method in survival analysis, the RCSIC has an additional penalization for the within-cluster subsampling variability. Our simulations show satisfactory results, with the RCSIC providing more robust power for variable selection in clustered survival analysis, regardless of whether informative cluster size exists or not. Applying the RCSIC method to a periodontal disease study, we identify the risk factors associated with tooth loss in patients: Age, Filled Tooth, Molar, Crown, Decayed Tooth, and Smoking Status.
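The inverse-cluster-size weighting mentioned in the abstract can be illustrated on a simple weighted mean. This is a minimal sketch of the weighting idea only, not the RCSIC criterion or a weighted Cox fit; the function names and the toy data are hypothetical.

```python
from collections import Counter


def inverse_cluster_size_weights(cluster_ids):
    """Give each observation weight 1 / (size of its cluster)."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data: cluster "a" has four observations, cluster "b" has one.
cluster_ids = ["a", "a", "a", "a", "b"]
values = [1, 1, 1, 1, 0]

weights = inverse_cluster_size_weights(cluster_ids)
print(sum(values) / len(values))       # unweighted: large cluster dominates
print(weighted_mean(values, weights))  # weighted: each cluster counts once
```

The unweighted mean is 0.8 because the larger cluster contributes four of the five observations, whereas the weighted mean is 0.5 because each cluster carries a total weight of one. When cluster size is related to the outcome, the two can differ substantially.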
Affiliation(s)
- Li-Chu Chien
  - Center for Fundamental Science, Kaohsiung Medical University, Kaohsiung, Taiwan, ROC
- Li-Ying Chang
  - Department of Mathematics, National Chung Cheng University, Chia-Yi, Taiwan, ROC
- Chung-Wei Shen
  - Department of Mathematics, National Chung Cheng University, Chia-Yi, Taiwan, ROC
6
Herber R, Graehlert X, Raiskup F, Veselá M, Pillunat LE, Spoerl E. Statistical Evaluation of Correlated Measurement Data in Longitudinal Setting Based on Bilateral Corneal Cross-Linking. Curr Eye Res 2022; 47:995-1002. [PMID: 35354347 DOI: 10.1080/02713683.2022.2052105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Abstract
PURPOSE In ophthalmology, data from both eyes of a person are frequently included in the statistical evaluation. This violates the requirement of data independence for classical statistical tests (e.g. t-Test or analysis of variance (ANOVA)) because it is correlated data. Linear mixed models (LMM) were used as a possibility to include the data of both eyes in the statistical evaluation. METHODS The LMM is available for a variety of statistical software such as SPSS or R. The application was applied to a retrospective longitudinal analysis of an accelerated corneal cross-linking (ACXL (9*10)) treatment in progressive keratoconus (KC) with a follow-up period of 36 months. Forty eyes of 20 patients were included, whereas sequential bilateral CXL treatment was performed within 12 months. LMM and ANOVA for repeated measurements were used for statistical evaluation of topographical and tomographical data measured by Pentacam (Oculus, Wetzlar, Germany). RESULTS Both eyes were classified into a worse and better eye concerning corneal topography. Visual acuity, keratometric values and minimal corneal thickness were statistically significant between them at baseline (p < 0.05). A significant correlation between worse and better eye was shown (p < 0.05). Therefore, analyzing the data at each follow-up visit using ANOVA partially led to an overestimation of the statistical effect that could be avoided by using LMM. After 36 months, ACXL has significantly improved BCVA and flattened the cornea. CONCLUSION The evaluation of data of both eyes without considering their correlation using classical statistical tests leads to an overestimation of the statistical effect, which can be avoided by using the LMM.
Affiliation(s)
- Robert Herber
  - Department of Ophthalmology, University Hospital Carl Gustav Carus, TU Dresden, Germany
- Xina Graehlert
  - Coordination Center for Clinical Studies - KKS Dresden, Faculty of Medicine Carl Gustav Carus, TU Dresden, Germany
- Frederik Raiskup
  - Department of Ophthalmology, University Hospital Carl Gustav Carus, TU Dresden, Germany
- Martina Veselá
  - Department of Ophthalmology, Faculty of Medicine Hradec Králové, Charles University, Prague, Czech Republic
- Lutz E Pillunat
  - Department of Ophthalmology, University Hospital Carl Gustav Carus, TU Dresden, Germany
- Eberhard Spoerl
  - Department of Ophthalmology, University Hospital Carl Gustav Carus, TU Dresden, Germany
7
Nevalainen J, Datta S, Toppari J, Ilonen J, Hyöty H, Veijola R, Knip M, Virtanen SM. Frailty modeling under a selective sampling protocol: an application to type 1 diabetes related autoantibodies. Stat Med 2021; 40:6410-6420. [PMID: 34496070 DOI: 10.1002/sim.9190] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/12/2021] [Revised: 08/12/2021] [Accepted: 08/23/2021] [Indexed: 02/01/2023]
Abstract
In studies following selective sampling protocols for secondary outcomes, conventional analyses regarding their appearance could provide misguided information. In the large type 1 diabetes prevention and prediction (DIPP) cohort study monitoring type 1 diabetes-associated autoantibodies, we propose to model their appearance via a multivariate frailty model, which incorporates a correlation component that is important for unbiased estimation of the baseline hazards under the selective sampling mechanism. As further advantages, the frailty model allows for systematic evaluation of the association and the differences in regression parameters among the autoantibodies. We demonstrate the properties of the model by a simulation study and the analysis of the autoantibodies and their association with background factors in the DIPP study, in which we found that high genetic risk is associated with the appearance of all the autoantibodies, whereas the association with sex and urban municipality was evident for IA-2A and IAA autoantibodies.
Affiliation(s)
- Jaakko Nevalainen
  - Health Sciences, Faculty of Social Sciences, Tampere University, Tampere, Finland
- Somnath Datta
  - Department of Biostatistics, University of Florida, Gainesville, Florida, USA
- Jorma Toppari
  - Institute of Biomedicine, University of Turku, Turku, Finland
  - Department of Pediatrics, Turku University Hospital, Turku, Finland
- Jorma Ilonen
  - Institute of Biomedicine, University of Turku, Turku, Finland
- Heikki Hyöty
  - Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- Riitta Veijola
  - Department of Pediatrics, Oulu University Hospital and University of Oulu, Oulu, Finland
- Mikael Knip
  - Children's Hospital, Helsinki University Hospital and University of Helsinki, Helsinki, Finland
- Suvi M Virtanen
  - Health Sciences, Faculty of Social Sciences, Tampere University, Tampere, Finland
  - Public Health and Welfare Department, Finnish Institute for Health and Welfare, Helsinki, Finland
  - Research, Development and Innovation Centre, and Center for Child Health Research, Tampere University and University Hospital, Tampere, Finland
8
Basagaña X, Barrera-Gómez J. Reflection on modern methods: visualizing the effects of collinearity in distributed lag models. Int J Epidemiol 2021; 51:334-344. [PMID: 34458914 DOI: 10.1093/ije/dyab179] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Accepted: 08/09/2021] [Indexed: 11/12/2022] Open
Abstract
Collinearity can be a problem in regression models. When examining the effects of an exposure at different time points, constrained distributed lag models can alleviate some of the problems caused by collinearity. Still, some consequences of collinearity may remain and they are often unexplored. We aimed to illustrate the effects of collinearity in the context of distributed lag models, and to provide a tool to assess whether the results of a study could be influenced by collinearity. We used simulations under different scenarios of hypothesized effects of an exposure to visualize the resulting curves of lagged effects. We analysed three real datasets: a cohort study looking for windows of vulnerability to air pollution, a time series study examining the linear association of air pollution with hospital admissions, and a time series study examining the non-linear association between temperature and mortality. We showed that collinearity could be the explanation for some unexpected results, e.g. for statistically significant associations in the opposite direction from that expected, or for wrongly suggesting that some periods are more important than others. We implemented the collin R package to explore the potential consequences of collinearity in the context of distributed lag models. Our visual tool can be a useful way to assess if the results of an analysis may be influenced by collinearity.
Affiliation(s)
- Xavier Basagaña
  - ISGlobal, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - CIBER Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
- Jose Barrera-Gómez
  - ISGlobal, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - CIBER Epidemiología y Salud Pública (CIBERESP), Madrid, Spain
9
Zabriskie BN, Corcoran C, Senchaudhuri P. A permutation-based approach for heterogeneous meta-analyses of rare events. Stat Med 2021; 40:5587-5604. [PMID: 34328659 DOI: 10.1002/sim.9142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2020] [Revised: 05/21/2021] [Accepted: 06/30/2021] [Indexed: 11/08/2022]
Abstract
The increasingly widespread use of meta-analysis has led to growing interest in meta-analytic methods for rare events and sparse data. Conventional approaches tend to perform very poorly in such settings. Recent work in this area has provided options for sparse data, but these are still often hampered when heterogeneity across the available studies differs based on treatment group. We propose a permutation-based approach based on conditional logistic regression that accommodates this common contingency, providing more reliable statistical tests when such patterns of heterogeneity are observed. We find that commonly used methods can yield highly inflated Type I error rates, low confidence interval coverage, and bias when events are rare and non-negligible heterogeneity is present. Our method often produces much lower Type I error rates and higher confidence interval coverage than traditional methods in these circumstances. We illustrate the utility of our method by comparing it to several other methods via a simulation study and analyzing an example data set, which assess the use of antibiotics to prevent acute rheumatic fever.
Affiliation(s)
- Chris Corcoran
  - Department of Data Analytics and Information Systems, Utah State University, Logan, Utah, USA
10
Coley RY, Walker RL, Cruz M, Simon GE, Shortreed SM. Clinical risk prediction models and informative cluster size: Assessing the performance of a suicide risk prediction algorithm. Biom J 2021; 63:1375-1388. [PMID: 34031916 DOI: 10.1002/bimj.202000199] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/26/2020] [Revised: 02/01/2021] [Accepted: 02/04/2021] [Indexed: 11/11/2022]
Abstract
Clinical visit data are clustered within people, which complicates prediction modeling. Cluster size is often informative because people receiving more care are less healthy and at higher risk of poor outcomes. We used data from seven health systems on 1,518,968 outpatient mental health visits from January 1, 2012 to June 30, 2015 to predict suicide attempt within 90 days. We evaluated true performance of prediction models using a prospective validation set of 4,286,495 visits from October 1, 2015 to September 30, 2017. We examined dividing clustered data on the person or visit level for model training and cross-validation and considered a within cluster resampling approach for model estimation. We evaluated optimism by comparing estimated performance from a left-out testing dataset to performance in the prospective dataset. We used two prediction methods, logistic regression with least absolute shrinkage and selection operator (LASSO) and random forest. The random forest model using a visit-level split for model training and testing was optimistic; it overestimated discrimination (area under the curve, AUC = 0.95 in testing versus 0.84 in prospective validation) and classification accuracy (sensitivity = 0.48 in testing versus 0.19 in prospective validation, 95th percentile cut-off). Logistic regression and random forest models using a person-level split performed well, accurately estimating prospective discrimination and classification: estimated AUCs ranged from 0.85 to 0.87 in testing versus 0.85 in prospective validation, and sensitivity ranged from 0.15 to 0.20 in testing versus 0.17 to 0.19 in prospective validation. Within cluster resampling did not improve performance. We recommend dividing clustered data on the person level, rather than visit level, to ensure strong performance in prospective use and accurate estimation of future performance at the time of model development.
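The recommended person-level split can be sketched as below: people, not visits, are assigned to the training or testing set, so all visits from one person land on the same side. The record layout (`person_id`, `visit`), the 30% test fraction, the seed, and the function name are assumptions for illustration and do not come from the study.

```python
import random


def person_level_split(visits, test_fraction=0.3, seed=0):
    """Split visit records so that no person appears in both train and test."""
    rng = random.Random(seed)
    persons = sorted({v["person_id"] for v in visits})
    rng.shuffle(persons)
    n_test = max(1, round(len(persons) * test_fraction))
    test_ids = set(persons[:n_test])
    train = [v for v in visits if v["person_id"] not in test_ids]
    test = [v for v in visits if v["person_id"] in test_ids]
    return train, test


# Hypothetical data: person p has p + 1 visits, so cluster sizes vary.
visits = [{"person_id": p, "visit": k} for p in range(10) for k in range(p + 1)]
train, test = person_level_split(visits)

train_people = {v["person_id"] for v in train}
test_people = {v["person_id"] for v in test}
print(len(train_people), len(test_people), train_people & test_people)
```

A visit-level split would instead shuffle the individual records, which lets a model see other visits from the same person during training; that leakage is the mechanism behind the optimistic test-set performance the abstract reports for the visit-level random forest.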
Affiliation(s)
- Rebecca Yates Coley
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Rod L Walker
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Maricela Cruz
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Gregory E Simon
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Susan M Shortreed
  - Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
11
Thompson JA, Hemming K, Forbes A, Fielding K, Hayes R. Comparison of small-sample standard-error corrections for generalised estimating equations in stepped wedge cluster randomised trials with a binary outcome: A simulation study. Stat Methods Med Res 2021; 30:425-439. [PMID: 32970526 PMCID: PMC8008420 DOI: 10.1177/0962280220958735] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 11/15/2022]
Abstract
Generalised estimating equations with the sandwich standard-error estimator provide a promising method of analysis for stepped wedge cluster randomised trials. However, they have inflated type-one error when used with a small number of clusters, which is common for stepped wedge cluster randomised trials. We present a large simulation study of binary outcomes comparing bias-corrected standard errors from Fay and Graubard; Mancl and DeRouen; Kauermann and Carroll; Morel, Bokossa, and Neerchal; and Mackinnon and White with an independent and exchangeable working correlation matrix. We constructed 95% confidence intervals using a t-distribution with degrees of freedom including clusters minus parameters (DFC-P), cluster periods minus parameters, and estimators from Fay and Graubard (DFFG), and Pan and Wall. Fay and Graubard and an approximation to Kauermann and Carroll (with simpler matrix inversion) were unbiased in a wide range of scenarios with an independent working correlation matrix and more than 12 clusters. They gave confidence intervals with close to 95% coverage with DFFG with 12 or more clusters, and DFC-P with 18 or more clusters. Both standard errors were conservative with fewer clusters. With an exchangeable working correlation matrix, approximated Kauermann and Carroll and Fay and Graubard had a small degree of under-coverage.
Affiliation(s)
- JA Thompson
  - Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
- K Hemming
  - Institute of Applied Health Research, University of Birmingham, Birmingham, UK
- A Forbes
  - Biostatistics Unit, Monash University, Melbourne, Australia
- K Fielding
  - Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
- R Hayes
  - Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
12
Wang X, Lim E, Liu CT, Sung YJ, Rao DC, Morrison AC, Boerwinkle E, Manning AK, Chen H. Efficient gene-environment interaction tests for large biobank-scale sequencing studies. Genet Epidemiol 2020; 44:908-923. [PMID: 32864785 DOI: 10.1002/gepi.22351] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/13/2020] [Revised: 07/22/2020] [Accepted: 08/08/2020] [Indexed: 01/01/2023]
Abstract
Complex human diseases are affected by genetic and environmental risk factors and their interactions. Gene-environment interaction (GEI) tests for aggregate genetic variant sets have been developed in recent years. However, existing statistical methods become rate limiting for large biobank-scale sequencing studies with correlated samples. We propose efficient Mixed-model Association tests for GEne-Environment interactions (MAGEE), for testing GEI between an aggregate variant set and environmental exposures on quantitative and binary traits in large-scale sequencing studies with related individuals. Joint tests for the aggregate genetic main effects and GEI effects are also developed. A null generalized linear mixed model adjusting for covariates but without any genetic effects is fit only once in a whole genome GEI analysis, thereby vastly reducing the overall computational burden. Score tests for variant sets are performed as a combination of genetic burden and variance component tests by accounting for the genetic main effects using matrix projections. The computational complexity is dramatically reduced in a whole genome GEI analysis, which makes MAGEE scalable to hundreds of thousands of individuals. We applied MAGEE to the exome sequencing data of 41,144 related individuals from the UK Biobank, and the analysis of 18,970 protein coding genes finished within 10.4 CPU hours.
Affiliation(s)
- Xinyu Wang
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas
- Elise Lim
  - Department of Biostatistics, Boston University, Boston, Massachusetts
- Ching-Ti Liu
  - Department of Biostatistics, Boston University, Boston, Massachusetts
- Yun Ju Sung
  - Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri
- Dabeeru C Rao
  - Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri
- Alanna C Morrison
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas
- Eric Boerwinkle
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas
- Alisa K Manning
  - Center for Human Genetics Research, Massachusetts General Hospital, Boston, Massachusetts
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts
- Han Chen
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas
  - Center for Precision Health, School of Public Health and School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas
13
|
Abstract
PURPOSE To describe and demonstrate methods for analyzing longitudinal correlated eye data with a continuous outcome measure. METHODS We described fixed effects, mixed effects and generalized estimating equations (GEE) models and applied them to data from the Complications of Age-Related Macular Degeneration Prevention Trial (CAPT) and the Age-Related Eye Disease Study (AREDS). In CAPT (N = 1052), we assessed the effect of eye-specific laser treatment on change in visual acuity (VA). In the AREDS study, we evaluated effects of systemic supplement treatment among 1463 participants with AMD category 3. RESULTS In CAPT, the inter-eye correlations (0.33 to 0.53) and longitudinal correlations (0.31 to 0.88) varied. There was a small treatment effect on VA change (approximately one letter) at 24 months for all three models (p = .009 to .02). Model fit was better with the mixed effects model than the fixed effects model (p < .001). In AREDS, there was no significant treatment effect in any model (p > .55). Current smokers had a significantly greater VA decline than non-current smokers in the fixed effects model (p = .04) and the mixed effects model with random intercept (p = .0003), but the difference was only marginally significant in the mixed effects model with random intercept and slope (p = .08) and in the GEE models (p = .054 to .07). The model fit was better with the fixed effects model than the mixed effects model (p < .0001). CONCLUSION Longitudinal models using the eye as the unit of analysis can be implemented using available statistical software to account for both inter-eye and longitudinal correlations. Goodness-of-fit statistics may guide the selection of the most appropriate model.
Affiliation(s)
- Gui-Shuang Ying
- Center for Preventive Ophthalmology and Biostatistics, Department of Ophthalmology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Maureen G Maguire
- Center for Preventive Ophthalmology and Biostatistics, Department of Ophthalmology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Robert J Glynn
- Division of Preventive Medicine and the Channing Lab, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Bernard Rosner
- Division of Preventive Medicine and the Channing Lab, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
|
14
|
Petterle RR, Bonat WH, Scarpin CT, Jonasson T, Borba VZC. Multivariate quasi-beta regression models for continuous bounded data. Int J Biostat 2020; 17:39-53. [PMID: 32735553 DOI: 10.1515/ijb-2019-0163] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/23/2019] [Accepted: 06/22/2020] [Indexed: 11/15/2022]
Abstract
We propose a multivariate regression model to deal with multiple continuous bounded data. The proposed model is based on second-moment assumptions only. We adopted the quasi-score and Pearson estimating functions for estimation of the regression and dispersion parameters, respectively. Thus, the proposed approach does not require a multivariate probability distribution for the response vector. The multivariate quasi-beta regression model can easily handle multiple continuous bounded outcomes, taking into account the correlation between the response variables. Furthermore, the model allows us to analyze continuous bounded data on the interval [0, 1], including zeros and/or ones. Simulation studies were conducted to investigate the behavior of the NORmal To Anything (NORTA) algorithm and to check the properties of the estimating function estimators to deal with multiple correlated response variables generated from marginal beta distributions. The model was motivated by a data set concerning body fat percentage measured at five regions of the body; these five measurements are the response variables. We analyze each response variable separately and compare the results with the fit of the proposed multivariate model. The multivariate quasi-beta regression model provides a better fit than its univariate counterparts and allows us to measure the correlation between response variables. Finally, we adapted diagnostic tools to the proposed model. In the supplementary material, we provide the data set and R code.
Affiliation(s)
- Ricardo R Petterle
- Department of Integrative Medicine, Federal University of Parana, Curitiba, Brazil
- Wagner H Bonat
- Laboratory of Statistics and Geoinformation, Department of Statistics, Federal University of Parana, Curitiba, Brazil
- Cassius T Scarpin
- Research Group of Technology Applied to Optimization (GTAO), Federal University of Parana, Curitiba, Brazil
- Thaísa Jonasson
- Internal Medicine, Federal University of Parana, Curitiba, Brazil
- Victória Z C Borba
- Endocrine Division, Hospital de Clínicas da Universidade Federal do Paraná (SEMPR), Federal University of Parana, Curitiba, Brazil
|
15
|
Zhu W, Ku JY, Zheng Y, Knox PC, Kolamunnage-Dona R, Czanner G. Spatial Linear Mixed Effects Modelling for OCT Images: SLME Model. J Imaging 2020; 6:44. [PMID: 34460590 DOI: 10.3390/jimaging6060044] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/28/2020] [Revised: 05/27/2020] [Accepted: 05/30/2020] [Indexed: 11/22/2022] Open
Abstract
Much recent research focuses on how to make disease detection more accurate as well as “slimmer”, i.e., allowing analysis with smaller datasets. Explanatory models are a hot research topic because they explain how the data are generated. We propose a spatial explanatory modelling approach that combines Optical Coherence Tomography (OCT) retinal imaging data with clinical information. Our model consists of a spatial linear mixed effects inference framework, which innovatively models the spatial topography of key information via mixed effects and spatial error structures, thus effectively modelling the shape of the thickness map. We show that our spatial linear mixed effects (SLME) model outperforms traditional analysis-of-variance approaches in the analysis of Heidelberg OCT retinal thickness data from a prospective observational study, involving 300 participants with diabetes and 50 age-matched controls. Our SLME model has a higher power for detecting the difference between disease groups, and it shows where the shape of retinal thickness profiles differs between the eyes of participants with diabetes and the eyes of healthy controls. In simulated data, the SLME model demonstrates how incorporating spatial correlations can increase the accuracy of the statistical inferences. This model is crucial in the understanding of the progression of retinal thickness changes in diabetic maculopathy to aid clinicians for early planning of effective treatment. It can be extended to disease monitoring and prognosis in other diseases and with other imaging technologies.
|
16
|
Abstract
Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon's relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon's relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon's counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
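A minimal sketch of the beta-binomial idea described in this abstract, assuming the common mean/overdispersion parameterization in which a taxon's expected relative abundance is `mu` and its overdispersion is `phi`. The abstract does not give the parameterization, so the function names and the mapping to the beta shape parameters are illustrative assumptions, not the authors' code. The point is that `phi` inflates the binomial variance by the factor 1 + (m - 1) * phi, which is what a test for differential variability would target.

```python
from math import exp, lgamma


def beta_binomial_logpmf(w, m, mu, phi):
    """Log-probability of observing w taxon counts out of m total reads.

    mu  : expected relative abundance, 0 < mu < 1
    phi : overdispersion, 0 < phi < 1 (hypothetical parameter names)
    """
    # Map (mu, phi) to the usual beta shape parameters a and b.
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(m + 1) - lgamma(w + 1) - lgamma(m - w + 1)
    log_beta_num = lgamma(w + a) + lgamma(m - w + b) - lgamma(m + a + b)
    log_beta_den = lgamma(a) + lgamma(b) - lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def beta_binomial_pmf(w, m, mu, phi):
    return exp(beta_binomial_logpmf(w, m, mu, phi))


def beta_binomial_variance(m, mu, phi):
    # Binomial variance m*mu*(1-mu), inflated by 1 + (m - 1) * phi.
    return m * mu * (1.0 - mu) * (1.0 + (m - 1) * phi)
```

In a regression setting, `mu` and `phi` would each be linked to covariates (for example through a logit link, which is an assumption here), so that both abundance and variability can differ between groups.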
Affiliation(s)
- Daniela Witten
- Departments of Statistics and Biostatistics, University of Washington
- Amy D Willis
- Department of Biostatistics, University of Washington
|
17
|
Wen CC, Chen YH, Tseng CH. Joint analysis of panel count and interval-censored data using distribution-free frailty analysis. Biom J 2020; 62:1164-1175. [PMID: 32022280 DOI: 10.1002/bimj.201900134] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/02/2019] [Revised: 07/04/2019] [Accepted: 08/01/2019] [Indexed: 11/07/2022]
Abstract
We propose a joint analysis of recurrent and nonrecurrent event data subject to general types of interval censoring. The proposed analysis allows for general semiparametric models, including the Box-Cox transformation and inverse Box-Cox transformation models for the recurrent and nonrecurrent events, respectively. A frailty variable is used to account for the potential dependence between the recurrent and nonrecurrent event processes, while leaving the distribution of the frailty unspecified. We apply the pseudolikelihood for interval-censored recurrent event data, usually termed panel count data, and the sufficient likelihood for interval-censored nonrecurrent event data by conditioning on the sufficient statistic for the frailty and using the working assumption of independence over examination times. Large sample theory and a computation procedure for the proposed analysis are established. We illustrate the proposed methodology with a joint analysis of the numbers of occurrences of basal cell carcinoma over time and time to the first recurrence of squamous cell carcinoma based on a skin cancer dataset, as well as a joint analysis of the numbers of adverse events and time to premature withdrawal from study medication based on a scleroderma lung disease dataset.
Affiliation(s)
- Chi-Chung Wen
- Department of Mathematics, Tamkang University, New Taipei City, Taiwan
- Yi-Hau Chen
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Chi-Hong Tseng
- Department of Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, CA, USA
|
18
|
Tapsoba JDD, Wang CY, Zangeneh S, Chen YQ. Methods for generalized change-point models: with applications to human immunodeficiency virus surveillance and diabetes data. Stat Med 2020; 39:1167-1182. [PMID: 31997385 DOI: 10.1002/sim.8469] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/17/2018] [Revised: 09/20/2019] [Accepted: 12/13/2019] [Indexed: 11/11/2022]
Abstract
In many epidemiological and biomedical studies, the association between a response variable and some covariates of interest may change at one or several thresholds of the covariates. Change-point models are suitable for investigating the relationship between the response and covariates in such situations. We present change-point models, with at least one unknown change-point occurring with respect to some covariates of a generalized linear model for independent or correlated data. We develop methods for the estimation of the model parameters and investigate their finite-sample performances in simulations. We apply the proposed methods to examine the trends in the reported estimates of the annual percentage of new human immunodeficiency virus (HIV) diagnoses linked to HIV-related medical care within 3 months after diagnosis using HIV surveillance data from the HIV prevention trial network 065 study. We also apply our methods to a dataset from the Pima Indian diabetes study to examine the effects of age and body mass index on the risk of being diagnosed with type 2 diabetes.
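To make the notion of a change-point concrete, one common single-threshold form within a generalized linear model is sketched below. This is an illustrative assumption, not the specification used in the paper, which the abstract does not state: g is the link function, x is the covariate with an unknown change-point τ, and z collects the remaining covariates.

```latex
g\{E(Y \mid x, \mathbf{z})\}
  = \beta_0 + \beta_1 x + \beta_2 (x - \tau)_+ + \boldsymbol{\gamma}^{\top}\mathbf{z},
\qquad (x - \tau)_+ = \max(0,\, x - \tau)
```

Under this form the slope in x is β1 below τ and β1 + β2 above it, so testing β2 = 0 amounts to asking whether the association changes at the threshold.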
Affiliation(s)
- Jean de Dieu Tapsoba
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Ching-Yun Wang
- Division of Public Health, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Sahar Zangeneh
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Ying Qing Chen
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
|
19
|
Cai Y, Huang J, Ning J, Lee MLT, Rosner B, Chen Y. Two-sample test for correlated data under outcome-dependent sampling with an application to self-reported weight loss data. Stat Med 2019; 38:4999-5009. [PMID: 31489699 PMCID: PMC6800790 DOI: 10.1002/sim.8346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2017] [Revised: 07/07/2019] [Accepted: 07/17/2019] [Indexed: 11/09/2022]
Abstract
Standard methods for two-sample tests such as the t-test and Wilcoxon rank sum test may lead to incorrect type I error rates when applied to longitudinal or clustered data. Recent alternatives of two-sample tests for clustered data often require certain assumptions on the correlation structure and/or noninformative cluster size. In this paper, based on a novel pseudolikelihood for correlated data, we propose a score test that requires neither knowledge of the correlation structure nor the assumption that data are missing at random. The proposed score test can capture differences in the mean and variance between two groups simultaneously. We use projection theory to derive the limiting distribution of the test statistic, in which the covariance matrix can be empirically estimated. We conduct simulation studies to evaluate the proposed test and compare it with existing methods. To illustrate the usefulness of the proposed test, we use it to compare self-reported weight loss data in a friends' referral group with the data from the Internet self-joining group.
Affiliation(s)
- Yi Cai
- AT&T Services, Inc., Plano, TX 75247, USA
- Jing Huang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jing Ning
- Department of Statistical Science, Cornell University, Ithaca, NY 14853, USA
- Mei-Ling Ting Lee
- Department of Epidemiology and Biostatistics, The University of Maryland School of Public Health, College Park, MD 20742, USA
- Bernard Rosner
- Department of Biostatistics, Harvard Medical School, MA 02115, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
|
20
|
Kazakiewicz D, Claesen J, Górczak K, Plewczynski D, Burzykowski T. A Multivariate Negative-Binomial Model with Random Effects for Differential Gene-Expression Analysis of Correlated mRNA Sequencing Data. J Comput Biol 2019; 26:1339-1348. [PMID: 31314581 DOI: 10.1089/cmb.2019.0168] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] Open
Abstract
Experimental designs such as matched-pair or longitudinal studies yield mRNA sequencing (mRNA-Seq) counts that are correlated across samples. Most of the approaches for the analysis of correlated mRNA-Seq data are restricted to a specific design and/or balanced data only (with the same number of samples in each group). We propose a model that is applicable to the analysis of correlated mRNA-Seq data of different types: paired, clustered, longitudinal, or others. Any combination of explanatory variables, as well as unbalanced data, can be processed within the proposed modeling framework. The model assumes that exon counts of a particular gene of an individual sample jointly follow a multivariate negative-binomial distribution. Additional correlation between exon counts obtained, for example, for individual samples within the same pair or cluster is taken into account by including in the model a cluster-level normally distributed random effect. An interesting feature of the model is that it provides an explicit expression for the marginal correlation between exon counts at different levels. The performance of the model is evaluated by using a simulation study and an analysis of two real-life data sets: a paired mRNA-Seq experiment for 24 patients with clear-cell renal-cell carcinoma and a longitudinal mRNA-Seq experiment for 29 patients with Lyme disease.
Affiliation(s)
- Denis Kazakiewicz
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Center for Innovative Research, Medical University of Białystok, Białystok, Poland
- Jürgen Claesen
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
- Katarzyna Górczak
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Department of Mathematical and Statistical Methods, Poznań University of Life Sciences, Poznań, Poland
- Dariusz Plewczynski
- Center for Innovative Research, Medical University of Białystok, Białystok, Poland; Centre of New Technologies, University of Warsaw, Warsaw, Poland; Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Tomasz Burzykowski
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Center for Innovative Research, Medical University of Białystok, Białystok, Poland
|
21
|
Cao Y, Yoshikawa M, Xiao Y, Xiong L. Quantifying Differential Privacy in Continuous Data Release Under Temporal Correlations. IEEE Trans Knowl Data Eng 2019; 31:1281-1295. [PMID: 31435181 PMCID: PMC6704013 DOI: 10.1109/tkde.2018.2824328] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 06/10/2023]
Abstract
Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call the unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.
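For readers unfamiliar with the primitive named in this abstract, the following is a minimal sketch of the standard Laplace mechanism applied independently at each time point. It illustrates only the baseline event-level release that the paper analyzes; it does not compute temporal privacy leakage or implement the authors' mechanisms. The function names and the use of Python's standard `random` module are assumptions made for illustration.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for an epsilon-DP release."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_series(true_values, sensitivity, epsilon, seed=None):
    """Perturb each time point independently (event-level privacy)."""
    rng = random.Random(seed)
    scale = laplace_scale(sensitivity, epsilon)
    return [value + laplace_noise(scale, rng) for value in true_values]
```

Because each time point is perturbed independently, the per-release guarantee says nothing about what an adversary who knows the temporal correlation can infer across releases, which is the gap the abstract describes.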
Affiliation(s)
- Yang Cao
- Department of Math and Computer Science, Emory University, Atlanta, GA 30322
- Li Xiong
- Department of Math and Computer Science, Emory University, Atlanta, GA 30322
|
22
|
Saha KK, Wang S. Confidence intervals for the difference in the success rates of two treatments in the analysis of correlated binary responses. Biom J 2019; 61:983-1002. [PMID: 30843251 DOI: 10.1002/bimj.201700089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2017] [Revised: 02/28/2018] [Accepted: 03/03/2018] [Indexed: 11/10/2022]
Abstract
In clinical studies, we often compare the success rates of two treatment groups where post-treatment responses of subjects within clusters are usually correlated. To estimate the difference between the success rates, interval estimation procedures that do not account for this intraclass correlation are likely inappropriate. To address this issue, we propose three interval procedures by direct extensions of recently proposed methods for independent binary data based on the concepts of design effect and effective sample size used in sample surveys. Each of them is then evaluated with four competing variance estimates. We also extend three existing methods recommended for complex survey data using different weighting schemes required for those three existing methods. An extensive simulation study is conducted for the purposes of evaluating and comparing the performance of the proposed methods in terms of coverage and expected width. The interval estimation procedures are illustrated using three examples in clinical and social science studies. Our analytic arguments and numerical studies suggest that the methods proposed in this work may be useful in clustered data analyses.
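The design-effect and effective-sample-size concepts mentioned in this abstract can be illustrated with a short sketch. It assumes equal cluster sizes and a common intraclass correlation, and it uses a plain Wald interval as a stand-in; the abstract does not specify the three proposed procedures or the four variance estimates, so nothing below should be read as the authors' method. All function names are hypothetical.

```python
from math import sqrt


def design_effect(cluster_size, icc):
    """Variance inflation 1 + (m - 1) * rho for clusters of equal size m."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Number of independent observations carrying the same information."""
    return n / design_effect(cluster_size, icc)


def wald_interval_diff(p1, n1, p2, n2, cluster_size, icc, z=1.96):
    """Wald interval for p1 - p2 with n1, n2 replaced by effective sizes."""
    n1_eff = effective_sample_size(n1, cluster_size, icc)
    n2_eff = effective_sample_size(n2, cluster_size, icc)
    se = sqrt(p1 * (1.0 - p1) / n1_eff + p2 * (1.0 - p2) / n2_eff)
    diff = p1 - p2
    return diff - z * se, diff + z * se
```

With an intraclass correlation of zero the interval reduces to the usual independent-data Wald interval; any positive correlation shrinks the effective sample size and widens the interval.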
Affiliation(s)
- Krishna K Saha
- Department of Mathematical Sciences, Central Connecticut State University, New Britain, CT, USA
- Suojin Wang
- Department of Statistics, Texas A&M University, College Station, TX, USA
|
23
|
Abstract
Perfusion computed tomography is an emerging functional imaging modality that uses physiological models to quantify characteristics pertaining to the passage of fluid through blood vessels. Perfusion characteristics provide physiological correlates for neovascularization induced by tumor angiogenesis and thus a quantitative basis for cancer detection, prognostication, and treatment monitoring. We consider a liver cancer study where patients underwent a dynamic computed tomography protocol to enable evaluation of multiple perfusion characteristics derived from interrogating the time-attenuation of the concentration of the intravenously administered contrast medium. The objective is to determine the effectiveness of using perfusion characteristics to identify regions of liver that contain malignant tissue and discriminate them from regions of normal tissue. Each patient contributes multiple regions of interest, which are spatially correlated due to the shared vasculature. We propose a multivariate functional data model to disclose the correlation over time and space as well as the correlation among multiple perfusion characteristics. We further propose a simultaneous classification approach that utilizes all the correlation information to predict class assignments for collections of regions. The proposed method outperforms conventional classification approaches in the presence of strong spatial correlation. The method offers maximal relative improvement in the presence of temporal sparsity wherein measurements are obtainable at only a few time points.
Affiliation(s)
- Yuan Wang
- Department of Mathematics and Statistics, Washington State University, Pullman, WA, USA
- Jianhua Hu
- Department of Biostatistics, Columbia University, New York, NY, USA
- Chaan S Ng
- Department of Diagnostic Radiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Brian P Hobbs
- Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA
|
24
|
Gebregziabher M, Eckert MA, Matthews LJ, Teklehaimanot AA, Dubno JR. Joint modeling of multivariate hearing thresholds measured longitudinally at multiple frequencies. COMMUN STAT-THEOR M 2018; 47:5418-5434. [PMID: 30983686 DOI: 10.1080/03610926.2017.1395045] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/18/2022]
Abstract
Pure-tone thresholds are used to estimate hearing acuity and, when measured longitudinally, can characterize age-related changes in hearing. Because thresholds are measured at multiple frequencies and at multiple irregular time points for right and left ears, these longitudinal studies of age-related hearing loss produce data of inherent complexity due to: 1) multivariate outcomes at different frequencies; 2) longitudinal measurements taken at subject-specific time intervals; and 3) inter-ear correlations due to clustering and nesting. To address limitations in existing methods, we propose a multivariate generalized linear mixed model (mGLMM) and assess its performance. We demonstrate its application using a unique dataset from a cohort study of age-related hearing loss.
Affiliation(s)
- Mulugeta Gebregziabher
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, USA
- Mark A Eckert
- Department of Otolaryngology, Medical University of South Carolina, Charleston, USA
- Lois J Matthews
- Department of Otolaryngology, Medical University of South Carolina, Charleston, USA
- Abeba A Teklehaimanot
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, USA
- Judy R Dubno
- Department of Otolaryngology, Medical University of South Carolina, Charleston, USA
|
25
|
Pallmann P, Ritz C, Hothorn LA. Simultaneous small-sample comparisons in longitudinal or multi-endpoint trials using multiple marginal models. Stat Med 2018; 37:1562-1576. [PMID: 29444546 DOI: 10.1002/sim.7610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2017] [Revised: 11/29/2017] [Accepted: 12/20/2017] [Indexed: 11/10/2022]
Abstract
Simultaneous inference in longitudinal, repeated-measures, and multi-endpoint designs can be onerous, especially when trying to find a reasonable joint model from which the interesting effects and covariances are estimated. A novel statistical approach known as multiple marginal models greatly simplifies the modelling process: the core idea is to "marginalise" the problem and fit multiple small models to different portions of the data, and then estimate the overall covariance matrix in a subsequent, separate step. Using these estimates guarantees strong control of the family-wise error rate, but only asymptotically. In this paper, we show how to make the approach also applicable to small-sample data problems. Specifically, we discuss the computation of adjusted P values and simultaneous confidence bounds for comparisons of randomised treatment groups as well as for levels of a nonrandomised factor such as multiple endpoints, repeated measures, or a series of points in time or space. We illustrate the practical use of the method with a data example.
Affiliation(s)
- Philip Pallmann
- Medical and Pharmaceutical Statistics Research Unit, Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK
- Christian Ritz
- Department of Nutrition, Exercise and Sports, University of Copenhagen, 1958, Frederiksberg C, Denmark
- Ludwig A Hothorn
- Institute of Biostatistics, Leibniz University Hannover, 30419, Hannover, Germany
|
26
|
Abstract
Identifying correlation structure is important to achieving estimation efficiency in analyzing longitudinal data, and is also crucial for drawing valid statistical inference for large size clustered data. In this paper, we propose a nonparametric method to estimate the correlation structure, which is applicable for discrete longitudinal data. We utilize eigenvector-based basis matrices to approximate the inverse of the empirical correlation matrix and determine the number of basis matrices via model selection. A penalized objective function based on the difference between the empirical and model approximation of the correlation matrices is adopted to select an informative structure for the correlation matrix. The eigenvector representation of the correlation estimation is capable of reducing the risk of model misspecification, and also provides useful information on the specific within-cluster correlation pattern of the data. We show that the proposed method possesses the oracle property and selects the true correlation structure consistently. The proposed method is illustrated through simulations and two data examples on air pollution and sonar signal studies.
Affiliation(s)
- Jianhua Hu
- University of Texas MD Anderson Cancer Center, Houston, TX 77030
- Peng Wang
- Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH 43403
- Annie Qu
- Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820
|
27
|
Li P, Redden DT. Small sample performance of bias-corrected sandwich estimators for cluster-randomized trials with binary outcomes. Stat Med 2014; 34:281-96. [PMID: 25345738 DOI: 10.1002/sim.6344] [Citation(s) in RCA: 106] [Impact Index Per Article: 10.6] [Received: 01/07/2014] [Accepted: 10/07/2014] [Indexed: 11/08/2022]
Abstract
The sandwich estimator in the generalized estimating equations (GEE) approach underestimates the true variance in small samples and consequently results in inflated type I error rates in hypothesis testing. This fact limits the application of the GEE in cluster-randomized trials (CRTs) with few clusters. Under various CRT scenarios with correlated binary outcomes, we evaluate the small sample properties of the GEE Wald tests using bias-corrected sandwich estimators. Our results suggest that the GEE Wald z-test should be avoided in the analyses of CRTs with few clusters even when bias-corrected sandwich estimators are used. With t-distribution approximation, the Kauermann and Carroll (KC)-correction can keep the test size to nominal levels even when the number of clusters is as low as 10 and is robust to the moderate variation of the cluster sizes. However, in cases with large variations in cluster sizes, the Fay and Graubard (FG)-correction should be used instead. Furthermore, we derive a formula to calculate the power and minimum total number of clusters one needs using the t-test and KC-correction for the CRTs with binary outcomes. The power levels as predicted by the proposed formula agree well with the empirical powers from the simulations. The proposed methods are illustrated using real CRT data. We conclude that, with appropriate control of type I error rates under small sample sizes, the GEE approach is recommended in CRTs with binary outcomes because it makes fewer assumptions and is robust to misspecification of the covariance structure.
Affiliation(s)
- Peng Li
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294, U.S.A
|
28
|
Harun N, Cai B. Bayesian random effects selection in mixed accelerated failure time model for interval-censored data. Stat Med 2014; 33:971-84. [PMID: 24123191 DOI: 10.1002/sim.6002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/08/2013] [Revised: 09/16/2013] [Accepted: 09/17/2013] [Indexed: 11/10/2022]
Abstract
In many medical problems that collect multiple observations per subject, the time to an event is often of interest. Sometimes, the occurrence of the event can be recorded only at regular intervals, leading to interval-censored data. It is further desirable to obtain the most parsimonious model in order to increase predictive power and to ease interpretation. Variable selection, and often random effects selection in the case of clustered data, becomes crucial in such applications. We propose a Bayesian method for random effects selection in mixed effects accelerated failure time (AFT) models. The proposed method relies on the Cholesky decomposition of the random effects covariance matrix and the parameter-expansion method for the selection of random effects. The Dirichlet prior is used to model the uncertainty in the random effects. The error distribution for the accelerated failure time model is specified using a Gaussian mixture to allow a flexible error density and prediction of the survival and hazard functions. We demonstrate the model using extensive simulations and the Signal Tandmobiel® Study.
Affiliation(s)
- Nusrat Harun
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, TX 77030, U.S.A
|
29
|
Long Q, Zhang X, Zhao Y, Johnson BA, Bostick RM. Modeling clinical outcome using multiple correlated functional biomarkers: A Bayesian approach. Stat Methods Med Res 2012; 25:520-37. [PMID: 23070593 DOI: 10.1177/0962280212460444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
In some biomedical studies, biomarkers are measured repeatedly along some spatial structure or over time and are subject to measurement error. In these studies, it is often of interest to evaluate associations between a clinical endpoint and these biomarkers (also known as functional biomarkers). There are potentially two levels of correlation in such data, namely, between repeated measurements of a biomarker from the same subject and between multiple biomarkers from the same subject; none of the existing methods accounts for correlation between multiple functional biomarkers. We propose a Bayesian approach to model a clinical outcome of interest (e.g. risk for colorectal cancer) in the presence of multiple functional biomarkers while accounting for potential correlation. Our simulations show that the proposed approach achieves good performance in finite samples under various settings. In the presence of substantial or moderate correlation, the proposed approach outperforms an existing approach that does not account for correlation. The proposed approach is applied to a study of biomarkers of risk for colorectal neoplasms and our results show that the risk for colorectal cancer is associated with two functional biomarkers, APC and TGF-α, in particular, with their values in the region between the proliferating and differentiating zones of colorectal crypts.
Affiliation(s)
- Qi Long
- Department of Biostatistics and Bioinformatics, Emory University, USA
- Yize Zhao
- Department of Biostatistics and Bioinformatics, Emory University, USA
- Brent A Johnson
- Department of Biostatistics and Bioinformatics, Emory University, USA
|
30
|
Hobbs BP, Sargent DJ, Carlin BP. Commensurate Priors for Incorporating Historical Information in Clinical Trials Using General and Generalized Linear Models. Bayesian Anal 2012; 7:639-674. [PMID: 24795786 PMCID: PMC4007051 DOI: 10.1214/12-ba722] [Citation(s) in RCA: 108] [Impact Index Per Article: 9.0] [Indexed: 05/28/2023]
Abstract
Assessing between-study variability in the context of conventional random-effects meta-analysis is notoriously difficult when incorporating data from only a small number of historical studies. In order to borrow strength, historical and current data are often assumed to be fully homogeneous, but this can have drastic consequences for power and Type I error if the historical information is biased. In this paper, we propose empirical and fully Bayesian modifications of the commensurate prior model (Hobbs et al., 2011) extending Pocock (1976), and evaluate their frequentist and Bayesian properties for incorporating patient-level historical data using general and generalized linear mixed regression models. Our proposed commensurate prior models lead to preposterior admissible estimators that facilitate bias-variance trade-offs different from those offered by pre-existing methodologies for incorporating historical data from a small number of historical studies. We also provide a sample analysis of a colon cancer trial comparing time-to-disease progression using a Weibull regression model.
Affiliation(s)
- Brian P Hobbs
- Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX, 77030, USA
- Daniel J Sargent
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA
- Bradley P Carlin
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
|
31
|
Martella F, Vermunt JK. Model-based approaches to synthesize microarray data: a unifying review using mixture of SEMs. Stat Methods Med Res 2011; 22:567-82. [PMID: 21948997 DOI: 10.1177/0962280211419482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Several statistical methods are now available for the analysis of gene expression data recorded through microarray technology. In this article, we take a closer look at several Gaussian mixture models which have recently been proposed to model gene expression data. It can be shown that these are special cases of a more general model, called the mixture of structural equation models (mixture of SEMs), which has been developed in psychometrics. This model combines mixture modelling and SEMs by assuming that component-specific means and variances are subject to an SEM. The connection with SEM is useful for at least two reasons: (1) it shows the basic assumptions of existing methods more explicitly and (2) it helps in the straightforward development of alternative mixture models for gene expression data with alternative mean/covariance structures. Different specifications of mixture of SEMs for clustering gene expression data are illustrated using two benchmark datasets.
Affiliation(s)
- F Martella
- Dipartimento di Scienze Statistiche, Sapienza University of Rome, P.le Aldo Moro, 5-I00185 Rome, Italy
|
32
|
Berhane K, Molitor NT. A Bayesian approach to functional-based multilevel modeling of longitudinal data: applications to environmental epidemiology. Biostatistics 2008; 9:686-99. [PMID: 18349036 PMCID: PMC2733176 DOI: 10.1093/biostatistics/kxm059] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 01/31/2007] [Revised: 11/07/2007] [Accepted: 12/17/2007] [Indexed: 11/13/2022]
Abstract
Flexible multilevel models are proposed to allow for cluster-specific smooth estimation of growth curves in a mixed-effects modeling format that includes subject-specific random effects on the growth parameters. Attention is then focused on models that examine between-cluster comparisons of the effects of an ecologic covariate of interest (e.g. air pollution) on nonlinear functionals of growth curves (e.g. maximum rate of growth). A Gibbs sampling approach is used to get posterior mean estimates of nonlinear functionals along with their uncertainty estimates. A second-stage ecologic random-effects model is used to examine the association between a covariate of interest (e.g. air pollution) and the nonlinear functionals. A unified estimation procedure is presented along with its computational and theoretical details. The models are motivated by, and illustrated with, lung function and air pollution data from the Southern California Children's Health Study.
Affiliation(s)
- Kiros Berhane
- Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033-9987, USA.
|