1. Patel A, DiTraglia FJ, Zuber V, Burgess S. Selecting Invalid Instruments to Improve Mendelian Randomization with Two-Sample Summary Data. Ann Appl Stat 2024; 18. PMID: 38737575; PMCID: PMC7615940; DOI: 10.1214/23-aoas1856.
Abstract
Mendelian randomization (MR) is a widely-used method to estimate the causal relationship between a risk factor and disease. A fundamental part of any MR analysis is to choose appropriate genetic variants as instrumental variables. Genome-wide association studies often reveal that hundreds of genetic variants may be robustly associated with a risk factor, but in some situations investigators may have greater confidence in the instrument validity of only a smaller subset of variants. Nevertheless, the use of additional instruments may be optimal from the perspective of mean squared error even if they are slightly invalid; a small bias in estimation may be a price worth paying for a larger reduction in variance. For this purpose, we consider a method for "focused" instrument selection whereby genetic variants are selected to minimise the estimated asymptotic mean squared error of causal effect estimates. In a setting of many weak and locally invalid instruments, we propose a novel strategy to construct confidence intervals for post-selection focused estimators that guards against the worst case loss in asymptotic coverage. In empirical applications to: (i) validate lipid drug targets; and (ii) investigate vitamin D effects on a wide range of outcomes, our findings suggest that the optimal selection of instruments does not involve only a small number of biologically-justified instruments, but also many potentially invalid instruments.
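The tradeoff described above can be made concrete with a toy calculation (all numbers are hypothetical, not from the paper): a slightly biased estimator that pools many instruments can beat an unbiased estimator built from a few well-justified ones on mean squared error.

```python
# Toy illustration of the bias-variance tradeoff behind focused instrument
# selection. All numbers are hypothetical.

def mse(bias, variance):
    """Asymptotic MSE decomposition: squared bias plus variance."""
    return bias ** 2 + variance

# Small set of well-justified instruments: unbiased but high variance.
mse_small = mse(bias=0.00, variance=0.040)

# Larger set including slightly invalid instruments: small bias, low variance.
mse_large = mse(bias=0.05, variance=0.010)

print(f"MSE, few valid instruments:      {mse_small:.4f}")
print(f"MSE, many (mildly invalid) ones: {mse_large:.4f}")
# Here the larger, mildly invalid set wins on MSE: 0.0125 < 0.0400.
```

The "focused" selection in the paper estimates these bias and variance terms from the data and minimizes the resulting MSE over candidate instrument sets.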
Affiliation(s)
- Verena Zuber
- Department of Biostatistics and Epidemiology, Imperial College London
- Stephen Burgess
- MRC Biostatistics Unit, University of Cambridge
- Cardiovascular Epidemiology Unit, University of Cambridge
2. Van Lancker K, Dukes O, Vansteelandt S. Ensuring valid inference for Cox hazard ratios after variable selection. Biometrics 2023; 79:3096-3110. PMID: 37349873; DOI: 10.1111/biom.13889.
Abstract
The problem of how best to select variables for confounding adjustment is one of the key challenges in the evaluation of exposure effects in observational studies, and has been the subject of vigorous recent activity in causal inference. A major drawback of routine procedures is that there is no finite sample size at which they are guaranteed to deliver exposure effect estimators and associated confidence intervals with adequate performance. In this work, we consider this problem when inferring conditional causal hazard ratios from observational studies under the assumption of no unmeasured confounding. The major complication with survival data is that the key confounding variables may not be those that explain the censoring mechanism. We overcome this problem using a novel and simple procedure that can be implemented with off-the-shelf software for penalized Cox regression. In particular, we propose tests of the null hypothesis that the exposure has no effect on the considered survival endpoint, which are uniformly valid under standard sparsity conditions. Simulation results show that the proposed methods yield valid inferences even when covariates are high-dimensional.
Affiliation(s)
- Kelly Van Lancker
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Oliver Dukes
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Stijn Vansteelandt
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
3. Huang TJ, Luedtke A, McKeague IW. Efficient estimation of the maximal association between multiple predictors and a survival outcome. Ann Stat 2023; 51:1965-1988. PMID: 38405375; PMCID: PMC10888526; DOI: 10.1214/23-aos2313.
Abstract
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semiparametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support this asymptotic guarantee at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
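Why the maximal association needs careful calibration can be seen with a simple simulation (an illustration only, using plain correlations on a fully observed outcome rather than the paper's censored-outcome estimators): the maximum of many estimated associations is biased upward, so treating it like a single estimate gives invalid inference.

```python
# With no true association at all, the maximal absolute sample correlation
# across many predictors is still far from zero, so it cannot be calibrated
# like a single correlation estimate. Illustration only; the paper handles
# censored outcomes and semiparametrically efficient estimators.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # outcome unrelated to every predictor

# Sample correlation of each predictor with the outcome.
corrs = (X - X.mean(0)).T @ (y - y.mean()) / (n * X.std(0) * y.std())
max_abs = np.abs(corrs).max()
print(f"Maximal |correlation| with no true association: {max_abs:.2f}")
```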
Affiliation(s)
- Tzu-Jung Huang
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center
- Alex Luedtke
- Department of Statistics, University of Washington
4. Gao LL, Bien J, Witten D. Selective Inference for Hierarchical Clustering. J Am Stat Assoc 2022; 119:332-342. PMID: 38660582; PMCID: PMC11036349; DOI: 10.1080/01621459.2022.2116331.
Abstract
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
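The inflation described above is easy to reproduce: generate data with no group structure, define two "clusters" from the data, then run a classical two-sample t-test. A minimal sketch, with a median split standing in for a clustering algorithm (hierarchical clustering shows the same qualitative effect):

```python
# Testing a data-driven grouping with a classical t-test inflates the type I
# error. A median split stands in here for a clustering algorithm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_reps, n, alpha = 500, 40, 0.05
rejections = 0
for _ in range(n_reps):
    x = rng.normal(size=n)                 # one homogeneous population: null true
    split = np.median(x)
    g1, g2 = x[x <= split], x[x > split]   # "clusters" chosen from the data
    _, p = stats.ttest_ind(g1, g2)
    rejections += p < alpha

print(f"Empirical type I error: {rejections / n_reps:.2f} (nominal {alpha})")
```

The selective test proposed in the paper restores error control by computing the p-value conditional on the clustering outcome.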
Affiliation(s)
- Lucy L. Gao
- Department of Statistics, University of British Columbia
- Jacob Bien
- Department of Data Sciences and Operations, University of Southern California
- Daniela Witten
- Departments of Statistics and Biostatistics, University of Washington
5. Neufeld AC, Gao LL, Witten DM. Tree-Values: Selective Inference for Regression Trees. J Mach Learn Res 2022; 23:305. PMID: 38481523; PMCID: PMC10933572.
Abstract
We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
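The failure of naive inference can be sketched in a few lines: fit a depth-1 stump to pure noise and compare its terminal nodes with a classical t-test (illustrative code with a hand-rolled stump, not the paper's Tree-Values procedure):

```python
# On pure-noise data, a depth-1 CART stump picks the split that best separates
# the response, so a classical t-test comparing its terminal nodes rejects
# far more often than the nominal level.
import numpy as np
from scipy import stats

def stump_split(y_sorted):
    """Depth-1 CART on a single sorted covariate: pick the split that
    maximizes the reduction in squared error."""
    n = len(y_sorted)
    gains = [k * y_sorted[:k].mean() ** 2 + (n - k) * y_sorted[k:].mean() ** 2
             for k in range(1, n)]
    k = 1 + int(np.argmax(gains))
    return y_sorted[:k], y_sorted[k:]

rng = np.random.default_rng(1)
n_reps, n, alpha = 200, 50, 0.05
rejections = 0
for _ in range(n_reps):
    x = rng.normal(size=n)
    y = rng.normal(size=n)                    # response independent of x: null true
    left, right = stump_split(y[np.argsort(x)])
    _, p = stats.ttest_ind(left, right)
    rejections += p < alpha

print(f"Naive type I error after fitting a stump: {rejections / n_reps:.2f}")
```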
Affiliation(s)
- Anna C Neufeld
- Department of Statistics, University of Washington, Seattle, WA 98195, USA
- Lucy L Gao
- Department of Statistics, University of British Columbia, Vancouver, British Columbia, V6T 1Z4, Canada
- Daniela M Witten
- Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA
6.
Abstract
A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.
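A minimal sketch of the two-step procedure, with illustrative data and an arbitrary fixed penalty (the assumptions under which the paper proves validity are not checked here):

```python
# Naive two-step inference: (i) lasso for variable selection, (ii) ordinary
# least squares with textbook confidence intervals on the selected set.
# Data and penalty level are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # three truly active variables
y = X @ beta + rng.normal(size=n)

# Step (i): lasso screening.
selected = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)

# Step (ii): least squares refit on the selected columns, with classical
# standard errors as if the selection had been fixed in advance.
Xs = X[:, selected]
coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
resid = y - Xs @ coef
sigma2 = resid @ resid / (n - len(selected))
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xs.T @ Xs)))

for j, b, s in zip(selected, coef, se):
    print(f"x{j}: {b:+.2f}  naive 95% CI [{b - 1.96 * s:+.2f}, {b + 1.96 * s:+.2f}]")
```

The paper's contribution is showing when these naive intervals are nonetheless asymptotically valid, because the lasso's selected set coincides with a deterministic (noiseless-lasso) set with high probability.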
Affiliation(s)
- Sen Zhao
- 1600 Amphitheatre Parkway, Mountain View, California 94043, USA
- Daniela Witten
- University of Washington, Health Sciences Building, Box 357232, Seattle, Washington 98195, USA
- Ali Shojaie
- University of Washington, Health Sciences Building, Box 357232, Seattle, Washington 98195, USA
7. Yates LA, Richards SA, Brook BW. Parsimonious model selection using information theory: a modified selection rule. Ecology 2021; 102:e03475. PMID: 34272730; DOI: 10.1002/ecy.3475.
Abstract
Information-theoretic approaches to model selection, such as Akaike's information criterion (AIC) and cross validation, provide a rigorous framework to select among candidate hypotheses in ecology, yet the persistent concern of overfitting undermines the interpretation of inferred processes. A common misconception is that overfitting is due to the choice of criterion or model score, despite research demonstrating that selection uncertainty associated with score estimation is the predominant influence. Here we introduce a novel selection rule that identifies a parsimonious model by directly accounting for estimation uncertainty, while still retaining an information-theoretic interpretation. The new rule, which is a modification of the existing one-standard-error rule, mitigates overfitting and reduces the likelihood that spurious effects will be included in the selected model, thereby improving its inferential properties. We present the rule and illustrative examples in the context of maximum-likelihood estimation and Kullback-Leibler discrepancy, although the rule is applicable in a more general setting, including Bayesian model selection and other types of discrepancy.
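For reference, the existing one-standard-error rule that the authors modify can be sketched as follows; their modified rule additionally accounts for the estimation uncertainty in score differences, which this baseline version ignores.

```python
# Baseline one-standard-error rule: among candidate models ordered from
# simplest to most complex, pick the simplest one whose estimated score is
# within one standard error of the best score. (The paper proposes a
# modification of this rule; this is the classical version it starts from.)
def one_se_rule(scores, ses):
    """scores[i], ses[i]: score estimate (e.g., cross-validated loss, lower is
    better) and its standard error for model i, ordered by complexity."""
    best = min(range(len(scores)), key=scores.__getitem__)
    threshold = scores[best] + ses[best]
    return next(i for i, s in enumerate(scores) if s <= threshold)

# Example: model 3 scores best, but model 1 is within one SE of it.
scores = [10.0, 8.2, 8.1, 8.0, 8.3]
ses    = [0.5, 0.4, 0.4, 0.4, 0.5]
print(one_se_rule(scores, ses))  # → 1
```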
Affiliation(s)
- Luke A Yates
- School of Natural Sciences, University of Tasmania, Hobart, Tasmania, 7005, Australia
- Shane A Richards
- School of Natural Sciences, University of Tasmania, Hobart, Tasmania, 7005, Australia
- Barry W Brook
- School of Natural Sciences, University of Tasmania, Hobart, Tasmania, 7005, Australia
8. Hyun S, Lin KZ, G'Sell M, Tibshirani RJ. Post-selection inference for changepoint detection algorithms with application to copy number variation data. Biometrics 2021; 77:1037-1049. PMID: 33434289; DOI: 10.1111/biom.13422.
Abstract
Changepoint detection methods are used in many areas of science and engineering, for example, in the analysis of copy number variation data to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or the presence) of given changepoints post-selection is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered hypothesis tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods toward changepoint detection, focusing on copy number variation data. To accomplish this, we study commonly used changepoint algorithms: binary segmentation, two of its most popular variants (wild and circular), and the fused lasso. We implement some of the latest developments in post-selection inference theory, mainly auxiliary randomization, which improves power but requires Markov chain Monte Carlo algorithms (importance sampling and hit-and-run sampling) to carry out our tests. We also provide recommendations for improving practical usability, detailed simulations, and example analyses on array comparative genomic hybridization as well as sequencing data.
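The selection step that such inference must condition on can be sketched for the simplest case, the first split of binary segmentation (illustration only; the signal and seed are arbitrary):

```python
# First step of binary segmentation: scan all split points and take the one
# with the largest CUSUM statistic. This data-driven choice is exactly the
# selection event that post-selection inference conditions on.
import numpy as np

def cusum(y, k):
    """Standardized difference in means between y[:k] and y[k:]."""
    n = len(y)
    return np.sqrt(k * (n - k) / n) * abs(y[:k].mean() - y[k:].mean())

def first_changepoint(y):
    stats_ = [cusum(y, k) for k in range(1, len(y))]
    return 1 + int(np.argmax(stats_))

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1, 60), rng.normal(2.0, 1, 60)])  # jump at 60
print(first_changepoint(y))
```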
Affiliation(s)
- Sangwon Hyun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, California, USA
- Kevin Z Lin
- Department of Statistics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Max G'Sell
- Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Ryan J Tibshirani
- Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
9. Zhao J, Chen C. A Nuisance-Free Inference Procedure Accounting for the Unknown Missingness with Application to Electronic Health Records. Entropy (Basel) 2020; 22:E1154. PMID: 33286923; PMCID: PMC7597318; DOI: 10.3390/e22101154.
Abstract
We study how to conduct statistical inference in a regression model where the outcome variable is prone to missing values and the missingness mechanism is unknown. The model we consider might be a traditional setting or a modern high-dimensional setting where the sparsity assumption is usually imposed and the regularization technique is popularly used. Motivated by the fact that the missingness mechanism, albeit usually treated as a nuisance, is difficult to specify correctly, we adopt the conditional likelihood approach so that the nuisance can be completely ignored throughout our procedure. We establish the asymptotic theory of the proposed estimator and develop an easy-to-implement algorithm via some data manipulation strategy. In particular, under the high-dimensional setting where regularization is needed, we propose a data perturbation method for the post-selection inference. The proposed methodology is especially appealing when the true missingness mechanism tends to be missing not at random, e.g., patient reported outcomes or real world data such as electronic health records. The performance of the proposed method is evaluated by comprehensive simulation experiments as well as a study of the albumin level in the MIMIC-III database.
Affiliation(s)
- Jiwei Zhao
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
- Chi Chen
- Novartis Institutes for Biomedical Research, Shanghai 201203, China
10.
Abstract
Model selection from a set of candidate models plays an important role in many structural equation modelling applications. However, traditional model selection methods introduce extra randomness that is not accounted for by post-model selection inference. In the current study, we propose a model averaging technique within the frequentist statistical framework. Instead of selecting an optimal model, the contributions of all candidate models are acknowledged. Valid confidence intervals and a [Formula: see text] test statistic are proposed. A simulation study shows that the proposed method is able to produce a robust mean-squared error, a better coverage probability, and a better goodness-of-fit test compared to model selection. It is an interesting compromise between model selection and the full model.
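As a generic illustration of the averaging idea (using textbook Akaike weights, not the paper's SEM-specific weighting or its post-averaging inference):

```python
# Frequentist model averaging with Akaike weights: every candidate model
# contributes to the estimate, with weight decaying in its AIC gap to the
# best model. Numbers are hypothetical.
import math

def akaike_weights(aics):
    deltas = [a - min(aics) for a in aics]
    raw = [math.exp(-d / 2) for d in deltas]
    total = sum(raw)
    return [r / total for r in raw]

aics = [100.0, 101.0, 105.0]          # hypothetical AICs of three candidates
estimates = [0.50, 0.40, 0.10]        # the same parameter estimated per model
w = akaike_weights(aics)
averaged = sum(wi * ei for wi, ei in zip(w, estimates))
print([round(x, 3) for x in w], round(averaged, 3))
```

Rather than betting everything on the best-scoring model, the averaged estimate hedges across all candidates, which is the compromise between model selection and the full model that the abstract describes.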
Affiliation(s)
- Shaobo Jin
- Department of Statistics, Uppsala University, Uppsala, Sweden.