1. Testing the missing at random assumption in generalized linear models in the presence of instrumental variables. Scand Stat Theory Appl 2024; 51:334-354. PMID: 38370508; PMCID: PMC10871667; DOI: 10.1111/sjos.12685.
Abstract
Practical problems with missing data are common, and many methods have been developed to ensure the validity and/or efficiency of statistical procedures. A central focus has been the mechanism governing data missingness, since correctly identifying that mechanism is crucial for conducting proper practical investigations. In this paper, we present a new hypothesis testing approach for deciding between the conventional notions of missing at random and missing not at random in generalized linear models in the presence of instrumental variables. The foundational idea is to develop appropriate discrepancy measures between estimators whose properties differ significantly only when missing at random does not hold. We show that our testing approach achieves an objective, data-driven choice between missing at random and missing not at random. We demonstrate the feasibility, validity, and efficacy of the new test through theoretical analysis, simulation studies, and a real data analysis.
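To fix ideas, a Hausman-type version of such a discrepancy measure (a schematic in our own notation, not necessarily the authors' exact statistic) contrasts an estimator $\hat{\theta}_1$ that is consistent only under missing at random with an estimator $\hat{\theta}_2$ that remains consistent under both mechanisms:

$$
T_n = n\,(\hat{\theta}_1 - \hat{\theta}_2)^{\top}\,\widehat{\Sigma}^{-}\,(\hat{\theta}_1 - \hat{\theta}_2) \ \xrightarrow{d}\ \chi^2_r \quad \text{under } H_0:\ \text{missing at random},
$$

where $\widehat{\Sigma}$ estimates the asymptotic variance of the contrast $\hat{\theta}_1 - \hat{\theta}_2$, $\widehat{\Sigma}^{-}$ is a generalized inverse, and $r$ is its rank; large values of $T_n$ indicate a departure from missing at random.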
2. Investigating the Influence of Fluctuating Humidity and Temperature on Creep Deformation in High-Performance Concrete Beams: A Comparative Study between Natural and Laboratorial Environmental Tests. Materials (Basel) 2024; 17:998. PMID: 38473471; DOI: 10.3390/ma17050998.
Abstract
To investigate the influence of temperature and humidity variations on creep in high-performance concrete (HPC) beams, beam tests were conducted in both natural and laboratory settings. The findings indicate that fluctuations in creep stem primarily from temperature changes, whereas humidity changes have little influence on fluctuations in either basic creep or total creep; the influence of humidity is reflected more strongly in the magnitude of creep. Functions describing the influence of temperature and humidity on the creep behavior of HPC under fluctuating conditions are proposed. The findings were then used to examine creep deformation in engineering applications at four locations. This study complements correction methods for the creep of members under fluctuating temperature and humidity, and its application can provide a basis for calculating the long-term deformation of HPC structures in natural environments.
3. M-quantile regression shrinkage and selection via the Lasso and Elastic Net to assess the effect of meteorology and traffic on air quality. Biom J 2023; 65:e2100355. PMID: 37743255; DOI: 10.1002/bimj.202100355.
Abstract
In this work, we intersect data on size-selected particulate matter (PM) with vehicular traffic counts and a comprehensive set of meteorological covariates to study the effect of traffic on air quality. To this end, we develop an M-quantile regression model with Lasso and Elastic Net penalizations. This allows us (i) to identify the best proxy for vehicular traffic via model selection, (ii) to investigate the relationship between fine PM concentration and the covariates at different M-quantiles of the conditional response distribution, and (iii) to be robust to the presence of outliers. Heterogeneity in the data is accounted for by fitting a B-spline to the effect of the day of the year. Analytic and bootstrap-based variance estimates of the regression coefficients are provided, together with a numerical evaluation of the proposed estimation procedure. Empirical results show that atmospheric stability has the most significant effect on fine PM concentration; this effect changes at different levels of the conditional response distribution and is relatively weaker in the tails. Model selection, in turn, identifies the best proxy for vehicular traffic, whose effect remains essentially the same across levels of the conditional response distribution.
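Schematically (our notation; $\rho$ is a Huber-type loss, $s$ a robust scale estimate, and $q$ the M-quantile level), the penalized fit solves

$$
\hat{\beta}_q = \arg\min_{\beta}\ \sum_{i=1}^n \rho_q\!\left(\frac{y_i - x_i^{\top}\beta}{s}\right) + \lambda\left\{\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right\},
\qquad
\rho_q(u) = 2\,\{q\,\mathbb{1}(u>0) + (1-q)\,\mathbb{1}(u\le 0)\}\,\rho(u),
$$

where $\alpha = 1$ gives the Lasso penalty and $0 < \alpha < 1$ the Elastic Net.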
4. Nonparametric inference of general while-alive estimands for recurrent events. Biometrics 2023; 79:1749-1760. PMID: 35731993; PMCID: PMC9772359; DOI: 10.1111/biom.13709.
Abstract
Measuring the treatment effect on recurrent events such as hospitalization in the presence of death has long challenged statisticians and clinicians alike. Traditional inference on the cumulative frequency unjustly penalizes survivorship, as longer survivors also tend to experience more adverse events. Expanding on a recently suggested "while-alive" event rate, we consider a general class of estimands that adjust for the length of survival without losing causal interpretation. Given a user-specified loss function that allows for arbitrary weighting, we define as estimand the average loss experienced per unit time alive within a target period and use the ratio of this loss rate between treatment groups to measure the effect size. Scaling the loss rate by the width of the corresponding time window gives an alternative, and sometimes more interpretable, way of presenting the data. To make inferences, we construct a nonparametric estimator for the loss rate through the cumulative loss and the restricted mean survival time and derive its influence function in closed form for variance estimation and testing. As simulations and analysis of real data from a heart failure trial both show, the while-alive approach corrects for the false attenuation of the treatment effect caused by patients living longer under treatment, with increased statistical power as a result. The proposed methods are implemented in the R package WA, which is publicly available from the Comprehensive R Archive Network (CRAN).
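In schematic form (our notation), with $L(t)$ the user-specified cumulative loss process, $T$ the survival time, $\tau$ the target period, and $A = a$ the treatment arm, the while-alive loss rate and the associated effect size are

$$
\ell^{(a)}(\tau) = \frac{E\!\left[\int_0^{T\wedge\tau} \mathrm{d}L(t) \mid A = a\right]}{E\!\left[T\wedge\tau \mid A = a\right]},
\qquad
\text{effect size} = \frac{\ell^{(1)}(\tau)}{\ell^{(0)}(\tau)},
$$

so the numerator accrues the weighted loss while alive and the denominator is the restricted mean survival time.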
5. Bayesian and influence function-based empirical likelihoods for inference of sensitivity to the early diseased stage in diagnostic tests. Biom J 2023; 65:e2200021. PMID: 36642803; PMCID: PMC10006346; DOI: 10.1002/bimj.202200021.
Abstract
In practice, a disease process might involve three ordinal diagnostic stages: the normal healthy stage, the early stage of the disease, and the stage of full development of the disease. Early detection is critical for some diseases, since it often offers an optimal time window for therapeutic treatment. In this study, we propose a new influence function-based empirical likelihood method and Bayesian empirical likelihood methods to construct confidence/credible intervals for the sensitivity of a test to patients in the early diseased stage, given a specificity and a sensitivity of the test to patients in the fully diseased stage. Numerical studies are performed to compare the finite-sample performance of the proposed approaches with existing methods. The proposed methods are shown to outperform existing methods in terms of coverage probability. A real dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) is used to illustrate the proposed methods.
6. Estimation of separable direct and indirect effects in continuous time. Biometrics 2023; 79:127-139. PMID: 34506039; DOI: 10.1111/biom.13559.
Abstract
Many research questions involve time-to-event outcomes that can be prevented from occurring due to competing events. In these settings, we must be careful about the causal interpretation of classical statistical estimands. In particular, estimands on the hazard scale, such as ratios of cause-specific or subdistribution hazards, are fundamentally hard to interpret causally. Estimands on the risk scale, such as contrasts of cumulative incidence functions, do have a clear causal interpretation, but they only capture the total effect of the treatment on the event of interest; that is, effects both through and outside of the competing event. To disentangle causal treatment effects on the event of interest and competing events, the separable direct and indirect effects were recently introduced. Here we provide new results on the estimation of direct and indirect separable effects in continuous time. In particular, we derive the nonparametric influence function in continuous time and use it to construct an estimator that has certain robustness properties. We also propose a simple estimator based on semiparametric models for the two cause-specific hazard functions. We describe the asymptotic properties of these estimators and present results from simulation studies, suggesting that the estimators behave satisfactorily in finite samples. Finally, we reanalyze the prostate cancer trial from Stensrud et al. (2020).
7. Group sequential methods for interim monitoring of randomized clinical trials with time-lagged outcome. Stat Med 2022; 41:5517-5536. PMID: 36117235; PMCID: PMC9825950; DOI: 10.1002/sim.9580.
Abstract
The primary analysis in two-arm clinical trials usually involves inference on a scalar treatment effect parameter; for example, depending on the outcome, the difference of treatment-specific means, risk difference, risk ratio, or odds ratio. Most clinical trials are monitored for the possibility of early stopping. Because the outcome on any given subject can ordinarily be ascertained only after some time lag, at the time of an interim analysis the outcome is known for only a subset of the subjects already enrolled and is effectively censored for those who have not been enrolled long enough for it to be observed. Typically, the interim analysis is based only on the data from subjects for whom the outcome has been ascertained. A goal of an interim analysis is to stop the trial as soon as the evidence is strong enough to do so, suggesting that the analysis should ideally make the most efficient use of all available data, including information on censoring as well as other baseline and time-dependent covariates, in a principled way. A general group sequential framework is proposed for clinical trials with a time-lagged outcome. Treatment effect estimators that account for censoring and incorporate covariate information at an interim analysis are derived using semiparametric theory and are demonstrated to lead to stronger evidence for early stopping than standard approaches. The associated test statistics are shown to have the independent increments structure, so that standard software can be used to obtain stopping boundaries.
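For background (standard group-sequential theory rather than anything specific to this paper), the independent increments structure means the interim test statistics follow the canonical joint distribution: with information levels $\mathcal{I}_1 < \cdots < \mathcal{I}_K$ and treatment effect $\delta$,

$$
Z_k \sim N\!\left(\delta\sqrt{\mathcal{I}_k},\, 1\right),
\qquad
\operatorname{Cov}(Z_j, Z_k) = \sqrt{\mathcal{I}_j/\mathcal{I}_k} \quad (j \le k),
$$

which is exactly the structure standard software assumes when computing stopping boundaries.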
8. Optimal sampling for design-based estimators of regression models. Stat Med 2022; 41:1482-1497. PMID: 34989429; PMCID: PMC8918008; DOI: 10.1002/sim.9300.
Abstract
Two-phase designs measure variables of interest on a subcohort of a larger cohort in which the outcome and some covariates are readily available or cheap to collect for all individuals. Given limited resources, it is of interest to find an optimal design that includes the most informative individuals in the final sample. We explore optimal designs and efficiencies for analyses by design-based estimators. Generalized raking is an efficient class of design-based estimators that improves on the inverse-probability weighted (IPW) estimator by adjusting the weights using auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients with generalized raking estimators, and we compare it with the optimal design for analysis via the IPW estimator and with other two-phase designs in measurement-error settings. We consider general two-phase designs in which the outcome variable and the variables of interest can be continuous or discrete. Our results show that the optimal designs for analyses by the two classes of design-based estimators can be very different. The optimal design for analysis via the IPW estimator is optimal for IPW estimation and typically gives near-optimal efficiency for generalized raking estimation, though we show there is potential for improvement in some settings.
9. A robust variable screening procedure for ultra-high dimensional data. Stat Methods Med Res 2021; 30:1816-1832. PMID: 34053339; DOI: 10.1177/09622802211017299.
Abstract
Variable selection in ultra-high dimensional regression problems has become an important issue. In such situations, penalized regression models may face computational problems, and some pre-screening of the variables may be necessary. A number of procedures for such pre-screening have been developed; among them, Sure Independence Screening (SIS) enjoys some popularity. However, SIS is vulnerable to outliers in the data, and in small samples in particular this may lead to faulty inference. In this paper, we develop a new robust screening procedure. We build on the density power divergence (DPD) estimation approach and introduce DPD-SIS and its extension, iterative DPD-SIS. We illustrate the behavior of the methods through extensive simulation studies and show that they are superior to both the original SIS and other robust methods when there are outliers in the data. Finally, we illustrate their use in a study on the regulation of lipid metabolism.
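For reference (the standard form, in our notation), the objective minimized by the density power divergence estimator with tuning parameter $\alpha > 0$ and model density $f_\theta$ is

$$
H_n(\theta) = \int f_\theta^{1+\alpha}(y)\,\mathrm{d}y - \left(1 + \frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^n f_\theta^{\alpha}(Y_i),
$$

which recovers the (negative) log-likelihood objective as $\alpha \to 0$ and increasingly downweights outlying observations as $\alpha$ grows; DPD-SIS ranks covariates by marginal fits of this type.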
10. Optimal multiwave sampling for regression modeling in two-phase designs. Stat Med 2020; 39:4912-4921. PMID: 33016376; PMCID: PMC7902311; DOI: 10.1002/sim.8760.
Abstract
Two-phase designs involve measuring extra variables on a subset of a cohort in which some variables are already measured. The goal is to choose the subsample of individuals from the cohort and analyze that subsample efficiently; in particular, it is of interest to obtain an optimal design that gives the most efficient estimates of the regression parameters. In this article, we propose a multiwave sampling design to approximate the optimal design for design-based estimators. Influence functions are used to compute the optimal sampling allocations. We propose using informative priors on the regression parameters to derive the wave-1 sampling probabilities, because prespecified sampling probabilities may be far from optimal and decrease the design efficiency. The posterior distributions of the regression parameters derived from the current wave are then used as priors for the next wave. Generalized raking is used in the final statistical analysis. We show that a two-wave design with reasonable informative priors yields highly efficient estimation of the parameter of interest and comes close to the underlying optimal design.
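Schematically (a standard Neyman-type allocation in our notation, not necessarily the paper's exact formula), if the cohort is stratified and $\sigma_h$ denotes the standard deviation of the influence function for the target parameter within stratum $h$ of size $N_h$, the optimal allocation of $n$ phase-two measurements is

$$
n_h \propto N_h\,\sigma_h, \qquad \sum_h n_h = n,
$$

and the multiwave scheme refines the estimates of $\sigma_h$ wave by wave, using each wave's posterior as the prior for the next.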
11. Bayesian and influence function-based empirical likelihoods for inference of sensitivity in diagnostic tests. Stat Methods Med Res 2020; 29:3457-3491. PMID: 32552342; DOI: 10.1177/0962280220929042.
Abstract
In medical diagnostic studies, a diagnostic test can be evaluated based on its sensitivity at a desired specificity. Existing methods for inference on sensitivity include normal approximation-based approaches and empirical likelihood (EL)-based approaches. These methods generally perform poorly when the specificity is high, and some require choosing smoothing parameters. We propose a new influence function-based empirical likelihood method and Bayesian empirical likelihood methods to overcome these problems. Numerical studies are performed to compare the finite-sample performance of the proposed approaches with existing methods. The proposed methods are shown to perform better in terms of both coverage probability and interval length. A real data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI) is analyzed.
12. Influence function-based empirical likelihood for inference of quantile medical costs with censored data. Stat Methods Med Res 2019; 29:1913-1934. PMID: 31595834; DOI: 10.1177/0962280219880573.
Abstract
In this paper, we propose empirical likelihood methods based on influence function and jackknife techniques to construct confidence intervals for quantile medical costs with censored data. We show that the influence function-based empirical log-likelihood ratio statistic for the quantile medical cost has a standard chi-square distribution as its asymptotic distribution. Simulation studies are conducted to compare coverage probabilities and interval lengths of the proposed empirical likelihood confidence intervals with those of the existing normal approximation-based confidence intervals for quantile medical costs. The proposed methods are observed to have better finite-sample performance than existing methods. The new methods are also illustrated through a real example.
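In outline (our notation), with $\widehat{\psi}_i(q)$ the estimated influence-function-based estimating function for the quantile medical cost $q$, the empirical likelihood ratio is

$$
R(q) = \max\left\{\prod_{i=1}^n n p_i :\ p_i \ge 0,\ \sum_{i=1}^n p_i = 1,\ \sum_{i=1}^n p_i\,\widehat{\psi}_i(q) = 0\right\},
\qquad
-2\log R(q_0) \xrightarrow{d} \chi^2_1,
$$

so a confidence interval collects all $q$ for which $-2\log R(q)$ falls below the $\chi^2_1$ critical value.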
13. Machine learning methods for leveraging baseline covariate information to improve the efficiency of clinical trials. Stat Med 2019; 38:1703-1714. PMID: 30474289; DOI: 10.1002/sim.8054.
Abstract
Clinical trials are widely considered the gold standard for treatment evaluation, and they can be highly expensive in terms of time and money. The efficiency of clinical trials can be improved by incorporating information from baseline covariates that are related to clinical outcomes. This can be done by modifying an unadjusted treatment effect estimator with an augmentation term that involves a function of covariates. The optimal augmentation is well characterized in theory but must be estimated in practice. In this article, we investigate the use of machine learning methods to estimate the optimal augmentation. We consider and compare an indirect approach based on an estimated regression function and a direct approach that aims directly to minimize the asymptotic variance of the treatment effect estimator. Theoretical considerations and simulation results indicate that the direct approach is generally preferable over the indirect approach. The direct approach can be implemented using any existing prediction algorithm that can minimize a weighted sum of squared prediction errors. Many such prediction algorithms are available, and the super learning principle can be used to combine multiple algorithms into a super learner under the direct approach. The resulting direct super learner has a desirable oracle property, is easy to implement, and performs well in realistic settings. The proposed methodology is illustrated with real data from a stroke trial.
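In the two-arm setting the augmentation takes a familiar form (a sketch in our notation, following Zhang-Tsiatis-Davidian-type constructions): with treatment indicator $A_i \in \{0,1\}$, randomization probability $\pi$, and unadjusted estimator $\hat{\Delta}$,

$$
\hat{\Delta}_{\mathrm{aug}} = \hat{\Delta} - \frac{1}{n}\sum_{i=1}^n (A_i - \pi)\,h(X_i),
$$

where the optimal $h$ minimizes the asymptotic variance of $\hat{\Delta}_{\mathrm{aug}}$; the direct approach estimates this $h$ by minimizing a weighted sum of squared prediction errors, which is why any prediction algorithm of that kind, or a super learner combining several, can be plugged in.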
14. Robust Inference after Random Projections via Hellinger Distance for Location-Scale Family. Entropy 2019; 21:e21040348. PMID: 33267062; PMCID: PMC7514831; DOI: 10.3390/e21040348.
Abstract
Big data and streaming data are encountered in a variety of contemporary applications in business and industry. In such cases, it is common to use random projections to reduce the dimension of the data, yielding compressed data. These data, however, possess various anomalies, such as heterogeneity, outliers, and round-off errors, which are hard to detect due to volume and processing challenges. This paper describes a new robust and efficient methodology, based on the Hellinger distance, for analyzing the compressed data. Using large-sample methods and numerical experiments, it is demonstrated that routine use of a robust estimation procedure is feasible. The role of double limits in understanding efficiency and robustness is brought out, which is of independent interest.
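For reference (standard definitions, our notation), the squared Hellinger distance between densities $f$ and $g$, and the minimum Hellinger distance estimator based on a nonparametric density estimate $\hat{g}_n$ of the compressed data, are

$$
\mathrm{HD}^2(f, g) = \frac{1}{2}\int \left(\sqrt{f(y)} - \sqrt{g(y)}\right)^2 \mathrm{d}y,
\qquad
\hat{\theta}_n = \arg\min_{\theta}\ \mathrm{HD}^2\!\left(f_\theta, \hat{g}_n\right),
$$

an estimator that is asymptotically efficient at the model while remaining robust to contamination.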
15. A critical issue of using the variance of the total in the linearization method - in the context of unequal probability sampling. Stat Med 2018; 38:1475-1483. PMID: 30488467; DOI: 10.1002/sim.8053.
Abstract
Publicly available national survey data are useful for evidence-based research to advance our understanding of important questions in the health and biomedical sciences. Appropriate variance estimation is a crucial step in evaluating the strength of evidence in the data analysis. In survey data analysis, the conventional linearization method for estimating the variance of a statistic of interest uses the variance estimator of the total based on linearized variables. We warn that this common practice may have undesirable consequences, such as susceptibility to data shift and severely inflated variance estimates, when unequal weights are incorporated into variance estimation. We propose using the variance estimator of the mean (mean-approach) instead of the variance estimator of the total (total-approach). We show the superiority of the mean-approach through analytical investigations. A real data example (the National Comorbidity Survey Replication) and simulation-based studies strongly support our conclusion.
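To see the data-shift issue concretely (a schematic with-replacement form in our notation, not the paper's exact estimator), let $w_i$ be the survey weights and $z_i$ the linearized variables; the total-approach variance estimator

$$
\hat{v}_{\mathrm{tot}} = \frac{n}{n-1}\sum_{i=1}^n \left(w_i z_i - \frac{1}{n}\sum_{j=1}^n w_j z_j\right)^2
$$

is not invariant to a location shift $z_i \mapsto z_i + c$ when the $w_i$ are unequal, whereas the mean-approach, which replaces $w_i z_i$ by $w_i(z_i - \bar{z}_w)$ with $\bar{z}_w$ the weighted mean, is shift-invariant.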
16. A New Class of Robust Two-Sample Wald-Type Tests. Int J Biostat 2018; 14. PMID: 30024852; DOI: 10.1515/ijb-2017-0023.
Abstract
Parametric hypothesis testing with two independent samples arises frequently in applications in biology, medical sciences, epidemiology, reliability, and many other fields. In this paper, we propose robust Wald-type tests for such two-sample problems using the minimum density power divergence estimators of the underlying parameters. In particular, we consider the simple two-sample hypothesis of full parametric homogeneity as well as general two-sample (composite) hypotheses involving nuisance parameters. The asymptotic and theoretical robustness properties of the proposed Wald-type tests are developed for both the simple and the general composite hypotheses. Some particular cases of testing against one-sided alternatives are discussed, with specific attention to testing the effectiveness of a treatment in clinical trials. The performance of the proposed tests is also illustrated numerically through appropriate real data examples.
17. Robustness Property of Robust-BD Wald-Type Test for Varying-Dimensional General Linear Models. Entropy 2018; 20:e20030168. PMID: 33265259; PMCID: PMC7512684; DOI: 10.3390/e20030168.
Abstract
An important issue for robust inference is to examine the stability of the asymptotic level and power of a test statistic in the presence of contaminated data. Most existing results are derived in finite-dimensional settings with particular choices of loss functions. This paper re-examines the issue by allowing for a diverging number of parameters combined with a broader array of robust error measures, called "robust-BD", for the class of "general linear models". Under regularity conditions, we derive the influence function of the robust-BD parameter estimator and demonstrate that the robust-BD Wald-type test enjoys robustness of validity and efficiency asymptotically. Specifically, the asymptotic level of the test is stable under a small amount of contamination of the null hypothesis, whereas the asymptotic power is large enough under contaminated distributions in a neighborhood of the contiguous alternatives, thus lending support to the utility of the proposed robust-BD Wald-type test.
18.
Abstract
Classification measures play essential roles in the assessment and construction of classifiers, so determining how to prevent these measures from being unduly affected by individual observations has become an important problem. In this paper, we propose several indexes based on the influence function and the concept of local influence to identify influential observations that affect the estimate of the area under the receiver operating characteristic curve (AUC), an important and commonly used measure. Cumulative lift charts are also used to reconcile disagreements among the proposed indexes. Both the AUC-based indexes and the graphical tools rely only on classification scores, and are therefore applicable to any classifier that produces real-valued classification scores. A real data set is used for illustration.
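For reference (the standard Hájek-projection form, in our notation), write $\theta = P(S_1 > S_0)$ for the AUC, $p = P(Y = 1)$, and $F_0$, $F_1$ for the score distributions among controls and cases; the influence function of the empirical AUC at an observation with label $y$ and score $s$ is

$$
\varphi(y, s) = \frac{y}{p}\left\{F_0(s) - \theta\right\} + \frac{1-y}{1-p}\left\{1 - F_1(s) - \theta\right\},
$$

and case-deletion indexes of the kind proposed here can be built from empirical versions of $\varphi$ evaluated at each observation.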
19.
Abstract
We introduce a new method of estimation of parameters in semi-parametric and nonparametric models. The method is based on estimating equations that are U-statistics in the observations. The U-statistics are based on higher order influence functions that extend ordinary linear influence functions of the parameter of interest and represent higher derivatives of this parameter. For parameters for which the representation cannot be perfect, the method leads to a bias-variance trade-off and results in estimators that converge at slower than $\sqrt{n}$-rate. In a number of examples the resulting rate can be shown to be optimal. We are particularly interested in estimating parameters in models with a nuisance parameter of high dimension or low regularity, where the parameter of interest cannot be estimated at $\sqrt{n}$-rate, but we also consider efficient $\sqrt{n}$-estimation using novel nonlinear estimators. The general approach is applied in detail to the example of estimating a mean response when the response is not always observed.
20. Empirical likelihood inference in randomized clinical trials. Stat Methods Med Res 2017; 27:3770-3784. PMID: 28679341; DOI: 10.1177/0962280217711205.
Abstract
In individually randomized controlled trials, in addition to the primary outcome, information is often available on a number of covariates prior to randomization. This information is frequently utilized to adjust for baseline characteristics in order to increase the precision of estimated average treatment effects; such adjustment is usually performed via covariate adjustment in outcome regression models. Although covariate adjustment is widely seen as desirable for making treatment effect estimates more precise and the corresponding hypothesis tests more powerful, there are considerable concerns that objective inference in randomized clinical trials can potentially be compromised. In this paper, we study an empirical likelihood approach to covariate adjustment and propose two unbiased estimating functions that automatically decouple the evaluation of average treatment effects from regression modeling of covariate-outcome relationships. The resulting empirical likelihood estimator of the average treatment effect is as efficient as the existing efficient adjusted estimators when separate treatment-specific working regression models are correctly specified, and it remains at least as efficient for any given treatment-specific working regression models, whether or not they coincide with the true treatment-specific covariate-outcome relationships. We present a simulation study comparing the finite-sample performance of various methods, along with results from the analysis of a data set from an HIV clinical trial. The simulation results indicate that the proposed empirical likelihood approach is more efficient and powerful than its competitors when the working covariate-outcome relationships by treatment status are misspecified.
21. Empirical Likelihood in Nonignorable Covariate-Missing Data Problems. Int J Biostat 2017; 13. PMID: 28441139; DOI: 10.1515/ijb-2016-0053.
Abstract
Missing covariate data occur often in regression analysis and arise frequently in the health and social sciences as well as in survey sampling. We study methods for the analysis of a nonignorable covariate-missing data problem in an assumed conditional mean function when some covariates are completely observed but others are missing for some subjects. We adopt the semiparametric perspective of Bartlett et al. (Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics 2014;15:719-30) on regression analyses with nonignorable missing covariates, in which two working models are introduced: a working probability model of missingness and a working conditional score model. In this paper, we study an empirical likelihood approach to nonignorable covariate-missing data problems with the objective of effectively utilizing the two working models in the analysis of covariate-missing data. We propose a unified approach to constructing a system of unbiased estimating equations in which there are more equations than unknown parameters of interest. One useful feature of these unbiased estimating equations is that they naturally incorporate the incomplete data into the analysis, making it possible to seek efficient estimation of the parameter of interest even when the working regression function is not specified to be the optimal regression function. We apply the general methodology of empirical likelihood to optimally combine these unbiased estimating equations, propose three maximum empirical likelihood estimators of the underlying regression parameters, and compare their efficiencies with those of other existing competitors. We present a simulation study comparing the finite-sample performance of various methods with respect to bias, efficiency, and robustness to model misspecification. The proposed empirical likelihood method is also illustrated by an analysis of a data set from the US National Health and Nutrition Examination Survey (NHANES).
22. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electron J Stat 2015; 9:1583-1607. PMID: 26279737; PMCID: PMC4533123; DOI: 10.1214/15-ejs1035.
Abstract
In binary classification problems, the area under the ROC curve (AUC) is commonly used to evaluate the performance of a prediction model. Often, it is combined with cross-validation to assess how the results will generalize to an independent data set. To evaluate the quality of an estimate of cross-validated AUC, we obtain an estimate of its variance. For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, cross-validating a predictive model on even a relatively small data set can require a large amount of computation time. Thus, in many practical settings, the bootstrap is a computationally intractable approach to variance estimation. As an alternative to the bootstrap, we demonstrate a computationally efficient influence curve-based approach to obtaining a variance estimate for cross-validated AUC.
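A minimal sketch of the influence-curve variance estimate for a single AUC (our own illustrative code, not the authors' implementation; for cross-validated AUC one would average fold-specific estimates of this kind):

```python
import numpy as np

def auc_ic_variance(y, scores):
    """Influence-curve-based variance estimate for the empirical AUC.

    y      : 0/1 array of true labels
    scores : real-valued classifier scores
    Returns (auc_hat, variance_of_auc_hat).
    """
    y = np.asarray(y)
    s = np.asarray(scores, dtype=float)
    cases, controls = s[y == 1], s[y == 0]
    n, p_hat = len(y), np.mean(y)

    # Empirical AUC, counting ties as 1/2.
    diff = cases[:, None] - controls[None, :]
    auc = np.mean((diff > 0) + 0.5 * (diff == 0))

    # Influence curve: cases use the control CDF at their score,
    # controls use the case survival function at their score.
    F0 = np.array([np.mean(controls < v) + 0.5 * np.mean(controls == v) for v in s])
    S1 = np.array([np.mean(cases > v) + 0.5 * np.mean(cases == v) for v in s])
    ic = np.where(y == 1, (F0 - auc) / p_hat, (S1 - auc) / (1.0 - p_hat))

    # Variance of the estimator is the sample mean of IC^2 over n.
    return auc, np.mean(ic ** 2) / n

# Example: standard error of the AUC on simulated scores.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
scores = y + rng.normal(scale=1.5, size=500)
auc, var = auc_ic_variance(y, scores)
print(f"AUC = {auc:.3f}, SE = {np.sqrt(var):.3f}")
```

Because the influence curve is evaluated once per observation, this costs far less than resampling-based alternatives such as the bootstrap.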
23. Semiparametric estimation of treatment effect with time-lagged response in the presence of informative censoring. Lifetime Data Anal 2011; 17:566-593. PMID: 21706378; PMCID: PMC3217309; DOI: 10.1007/s10985-011-9199-8.
Abstract
In many randomized clinical trials, the primary response variable, for example the survival time, is not observed directly after patients enroll in the study but rather after some period of time (lag time). Such a response is often missing for some patients due to censoring, which occurs when the study ends before the patient's response is observed or when the patient drops out of the study. It is often assumed that censoring occurs at random, referred to as noninformative censoring; in many cases, however, such an assumption may not be reasonable. If the missing data are not analyzed properly, the estimator of, or test for, the treatment effect may be biased. In this paper, we use semiparametric theory to derive a class of consistent and asymptotically normal estimators for the treatment effect parameter that are applicable when the response variable is right censored. Baseline auxiliary covariates and post-treatment auxiliary covariates, which may be time-dependent, are also considered in our semiparametric model. These auxiliary covariates are used to derive estimators that both account for informative censoring and are more efficient than estimators that do not use the auxiliary covariates.
24.
Abstract
The pretest-posttest study is commonplace in numerous applications. Typically, subjects are randomized to two treatments, and response is measured at baseline, prior to intervention with the randomized treatment (pretest), and at a prespecified follow-up time (posttest). Interest focuses on the effect of treatments on the change between mean baseline and follow-up response. Missing posttest response for some subjects is routine, and disregarding missing cases can lead to invalid inference. Despite the popularity of this design, a consensus on an appropriate analysis when no data are missing, let alone one taking into account missing follow-up, does not exist. Under a semiparametric perspective on the pretest-posttest model, in which limited distributional assumptions on pretest or posttest response are made, we show how the theory of Robins, Rotnitzky, and Zhao may be used to characterize a class of consistent treatment effect estimators and to identify the efficient estimator in the class. We then describe how the theoretical results translate into practice. The development not only shows how a unified framework for inference in this setting emerges from the Robins, Rotnitzky, and Zhao theory, but also provides a review and demonstration of the key aspects of this theory in a familiar context. The results are also relevant to the problem of comparing two treatment means with adjustment for baseline covariates.