1
Xu N, Solari A, Goeman JJ. Combining Partial True Discovery Guarantee Procedures. Biom J 2024; 66:e202300075. [PMID: 38953670 DOI: 10.1002/bimj.202300075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Revised: 03/31/2024] [Accepted: 05/04/2024] [Indexed: 07/04/2024]
Abstract
Closed testing has recently been shown to be optimal for simultaneous true discovery proportion control. It is, however, challenging to construct true discovery guarantee procedures that focus power on feature sets chosen by users based on their specific interest or expertise. We propose a procedure that allows users to target power on prespecified feature sets, that is, "focus sets." The method also allows inference for feature sets chosen post hoc, that is, "nonfocus sets," for which we deduce a true discovery lower confidence bound by interpolation. Our procedure is built from partial true discovery guarantee procedures combined with Holm's procedure and is a conservative shortcut to the closed testing procedure. A simulation study confirms that the statistical power of our method is relatively high for focus sets, at the cost of power for nonfocus sets, as desired. In addition, we investigate its power properties for sets with specific structures, for example, trees and directed acyclic graphs. We also compare our method with AdaFilter in the context of replicability analysis. The application of our method is illustrated with a gene ontology analysis in gene expression data.
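As a point of reference for the building block named in this abstract, the following is a minimal sketch of the standard Holm step-down procedure only. It is not the authors' combined procedure; the function name `holm_reject` and the example p-values are illustrative assumptions.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's step-down procedure rejects.

    Hypothetical helper for illustration; not taken from the cited paper.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


if __name__ == "__main__":
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Holm's procedure controls the family-wise error rate without assumptions on the dependence between p-values, which is one plausible reason it is convenient as a combining step.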
Affiliation(s)
- Ningning Xu
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Aldo Solari
  - Department of Economics, Management and Statistics, University of Milano-Bicocca, Milan, Italy
  - Department of Economics, Ca' Foscari University of Venice, Venice, Italy
- Jelle J Goeman
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
2
Liehrmann A, Delannoy E, Launay-Avon A, Gilbault E, Loudet O, Castandet B, Rigaill G. DiffSegR: an RNA-seq data driven method for differential expression analysis using changepoint detection. NAR Genom Bioinform 2023; 5:lqad098. [PMID: 37954572 PMCID: PMC10632193 DOI: 10.1093/nargab/lqad098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 09/27/2023] [Accepted: 10/23/2023] [Indexed: 11/14/2023] Open
Abstract
To fully understand gene regulation, it is necessary to have a thorough understanding of both the transcriptome and the enzymatic and RNA-binding activities that shape it. While many RNA-Seq-based tools have been developed to analyze the transcriptome, most only consider the abundance of sequencing reads along annotated patterns (such as genes). These annotations are typically incomplete, leading to errors in the differential expression analysis. To address this issue, we present DiffSegR, an R package that enables the discovery of transcriptome-wide expression differences between two biological conditions using RNA-Seq data. DiffSegR does not require prior annotation and uses a multiple-changepoint detection algorithm to identify the boundaries of differentially expressed regions in the per-base log2 fold change. In a few minutes of computation, DiffSegR correctly predicted the role of chloroplast ribonuclease Mini-III in rRNA maturation and chloroplast ribonuclease PNPase in (3'/5')-degradation of rRNA, mRNA and tRNA precursors as well as intron accumulation. We believe DiffSegR will benefit biologists working on transcriptomics as it allows access to information from a layer of the transcriptome overlooked by the classical differential expression analysis pipelines widely used today. DiffSegR is available at https://aliehrmann.github.io/DiffSegR/index.html.
Affiliation(s)
- Arnaud Liehrmann
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
  - Laboratoire de Mathématiques et de Modélisation d’Evry (LaMME), Université d’Evry-Val-d’Essonne, UMR CNRS 8071, ENSIIE, USC INRAE, Evry, 91037, France
- Etienne Delannoy
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Alexandra Launay-Avon
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Elodie Gilbault
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), 78000, Versailles, France
- Olivier Loudet
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), 78000, Versailles, France
- Benoît Castandet
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Guillem Rigaill
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
  - Laboratoire de Mathématiques et de Modélisation d’Evry (LaMME), Université d’Evry-Val-d’Essonne, UMR CNRS 8071, ENSIIE, USC INRAE, Evry, 91037, France
3
Andreella A, Hemerik J, Finos L, Weeda W, Goeman J. Permutation-based true discovery proportions for functional magnetic resonance imaging cluster analysis. Stat Med 2023; 42:2311-2340. [PMID: 37259808 DOI: 10.1002/sim.9725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2022] [Revised: 11/24/2022] [Accepted: 03/18/2023] [Indexed: 06/02/2023]
Abstract
We propose a permutation-based method for testing a large collection of hypotheses simultaneously. Our method provides lower bounds for the number of true discoveries in any selected subset of hypotheses. These bounds are simultaneously valid with high confidence. The methodology is particularly useful in functional Magnetic Resonance Imaging cluster analysis, where it provides a confidence statement on the percentage of truly activated voxels within clusters of voxels, avoiding the well-known spatial specificity paradox. We offer a user-friendly tool to estimate the percentage of true discoveries for each cluster while controlling the family-wise error rate for multiple testing and taking into account that the cluster was chosen in a data-driven way. The method adapts to the spatial correlation structure that characterizes functional Magnetic Resonance Imaging data, gaining power over parametric approaches.
Affiliation(s)
- Angela Andreella
  - Department of Economics, Ca' Foscari University of Venice, Venice, Italy
- Jesse Hemerik
  - Biometris, Wageningen University and Research, Wageningen, The Netherlands
- Livio Finos
  - Department of Statistics, University of Padova, Padova, Italy
- Wouter Weeda
  - Department of Psychology, Leiden University, Leiden, The Netherlands
- Jelle Goeman
  - Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
4
Bogomolov M. Testing partial conjunction hypotheses under dependency, with applications to meta-analysis. Electron J Stat 2023. [DOI: 10.1214/22-ejs2100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Affiliation(s)
- Marina Bogomolov
  - Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology, Haifa 3200003, Israel
5
Jeng XJ. Estimating the proportion of signal variables under arbitrary covariance dependence. Electron J Stat 2023. [DOI: 10.1214/23-ejs2119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
6
Enjalbert-Courrech N, Neuvial P. Powerful and interpretable control of false discoveries in two-group differential expression studies. Bioinformatics 2022; 38:5214-5221. [PMID: 36264124 DOI: 10.1093/bioinformatics/btac693] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/18/2022] [Revised: 09/07/2022] [Accepted: 10/19/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The standard approach for statistical inference in differential expression (DE) analyses is to control the false discovery rate (FDR). However, controlling the FDR does not in fact imply that the proportion of false discoveries is upper bounded. Moreover, no statistical guarantee can be given on subsets of genes selected by FDR thresholding. These known limitations are overcome by post hoc inference, which provides guarantees on the number or proportion of false discoveries among arbitrary gene selections. However, post hoc inference methods are not yet widely used for DE studies. RESULTS In this article, we demonstrate the relevance and illustrate the performance of adaptive interpolation-based post hoc methods for two-group DE studies. First, we formalize the use of permutation-based methods to obtain sharp confidence bounds that are adaptive to the dependence between genes. Then, we introduce a generic linear time algorithm for computing post hoc bounds, making these bounds applicable to large-scale two-group DE studies. The use of the resulting Adaptive Simes bound is illustrated on an RNA sequencing study. Comprehensive numerical experiments based on real microarray and RNA sequencing data demonstrate the statistical performance of the method. AVAILABILITY AND IMPLEMENTATION A cross-platform open source implementation within the R package sanssouci is available at https://sanssouci-org.github.io/sanssouci/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
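To make the idea of an interpolation-based post hoc bound concrete, here is a minimal sketch of the classical (non-adaptive) Simes-based bound, assuming the Simes inequality holds for the p-values (for example under positive dependence). It is not the permutation-calibrated Adaptive Simes bound of the paper, it is a naive quadratic loop rather than the linear time algorithm the abstract mentions, and it does not use the sanssouci API; the function name and inputs are illustrative assumptions.

```python
def simes_true_discovery_bound(p_values, selected, alpha=0.05):
    """Lower confidence bound on the number of true discoveries in `selected`.

    p_values : p-values of all m hypotheses.
    selected : indices of the post hoc selection S.
    Assumption: the Simes inequality holds; the bound is then simultaneously
    valid over all selections with probability at least 1 - alpha.
    """
    m = len(p_values)
    s = len(selected)
    if s == 0:
        return 0
    # Upper bound on false discoveries: min over k of
    # #{i in S : p_i > alpha * k / m} + (k - 1), capped at |S|.
    false_upper = s
    for k in range(1, s + 1):
        threshold = alpha * k / m
        above = sum(1 for i in selected if p_values[i] > threshold)
        false_upper = min(false_upper, above + k - 1)
    return s - false_upper


if __name__ == "__main__":
    p = [0.001, 0.002, 0.003, 0.5, 0.8]
    print(simes_true_discovery_bound(p, [0, 1, 2, 3, 4]))
```

Because the bound holds simultaneously over all subsets, the selection may be chosen after looking at the data, which is the property the abstract contrasts with FDR thresholding.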
Affiliation(s)
- Nicolas Enjalbert-Courrech
  - Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse, CNRS, UPS, F-31062 Toulouse Cedex 9, France
- Pierre Neuvial
  - Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse, CNRS, UPS, F-31062 Toulouse Cedex 9, France
7
Blain A, Thirion B, Neuvial P. Notip: Non-parametric true discovery proportion control for brain imaging. Neuroimage 2022; 260:119492. [PMID: 35870698 DOI: 10.1016/j.neuroimage.2022.119492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Revised: 07/01/2022] [Accepted: 07/17/2022] [Indexed: 10/17/2022] Open
Abstract
Cluster-level inference procedures are widely used for brain mapping. These methods compare the size of clusters obtained by thresholding brain maps to an upper bound under the global null hypothesis, computed using Random Field Theory or permutations. However, the guarantees obtained by this type of inference (i.e., at least one voxel is truly activated in the cluster) are not informative with regard to the strength of the signal therein. There is thus a need for methods to assess the amount of signal within clusters; yet such methods have to take into account that clusters are defined based on the data, which creates circularity in the inference scheme. This has motivated the use of post hoc estimates that allow statistically valid estimation of the proportion of activated voxels in clusters. In the context of fMRI data, the All-Resolutions Inference framework introduced in Rosenblatt et al. (2018) provides post hoc estimates of the proportion of activated voxels. However, this method relies on parametric threshold families, which results in conservative inference. In this paper, we leverage randomization methods to adapt to data characteristics and obtain tighter false discovery control. We obtain Notip, for Non-parametric True Discovery Proportion control: a powerful, non-parametric method that yields statistically valid guarantees on the proportion of activated voxels in data-derived clusters. Numerical experiments demonstrate substantial gains in number of detections compared with state-of-the-art methods on 36 fMRI datasets. The conditions under which the proposed method brings benefits are also discussed.
Affiliation(s)
- Alexandre Blain
  - Inria, CEA, Université Paris-Saclay, Paris, France; Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse, CNRS, UPS, Toulouse, France
- Pierre Neuvial
  - Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse, CNRS, UPS, Toulouse, France
8
Frossard J, Renaud O. The cluster depth tests: Toward point-wise strong control of the family-wise error rate in massively univariate tests with application to M/EEG. Neuroimage 2021; 247:118824. [PMID: 34921993 DOI: 10.1016/j.neuroimage.2021.118824] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/18/2021] [Revised: 12/10/2021] [Accepted: 12/14/2021] [Indexed: 11/26/2022] Open
Abstract
The cluster mass test has been widely used for massively univariate tests in M/EEG, fMRI and, recently, pupillometry analysis. It is a powerful method for detecting effects while weakly controlling the family-wise error rate (FWER), although its correct interpretation can only be performed at the cluster level without any point-wise conclusion. It implies that the discoveries of a cluster mass test cannot be precisely localized in time or in space. We propose a new multiple comparisons procedure, the cluster depth tests, that controls the FWER while allowing an interpretation at the time point level. We show the conditions for a strong control of the FWER, and a simulation study shows that the cluster depth tests achieve large power and guarantee the FWER even in the presence of physiologically plausible effects. By having an interpretation at the time point/voxel level, the cluster depth tests make it possible to take full advantage of the high temporal resolution of EEG recording and give a precise timing of the start and end of the significant effects.
Affiliation(s)
- Jaromil Frossard
  - Methodology and Data Analysis, Department of Psychology, University of Geneva, Bd Carl-Vogt 101, 1205 Geneva, Switzerland
- Olivier Renaud
  - Methodology and Data Analysis, Department of Psychology, University of Geneva, Bd Carl-Vogt 101, 1205 Geneva, Switzerland
9
Hierarchical correction of p-values via an ultrametric tree running Ornstein-Uhlenbeck process. Comput Stat 2021. [DOI: 10.1007/s00180-021-01148-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Statistical testing is classically used as an exploratory tool to search for association between a phenotype and many possible explanatory variables. This approach often leads to multiple testing under dependence. We assume a hierarchical structure between tests via an Ornstein-Uhlenbeck process on a tree. The process correlation structure is used for smoothing the p-values. We design a penalized estimation of the mean of the Ornstein-Uhlenbeck process for p-value computation. The performance of the algorithm is assessed via simulations. Its ability to discover new associations is demonstrated on a metagenomic dataset. The corresponding R package is available from https://github.com/abichat/zazou.
10
Ebrahimpoor M, Goeman JJ. Inflated false discovery rate due to volcano plots: problem and solutions. Brief Bioinform 2021; 22:bbab053. [PMID: 33758907 PMCID: PMC8425469 DOI: 10.1093/bib/bbab053] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/24/2020] [Revised: 02/01/2021] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Volcano plots are used to select the most interesting discoveries when too many discoveries remain after application of Benjamini-Hochberg's procedure (BH). The volcano plot suggests a double filtering procedure that selects features with both small adjusted $P$-value and large estimated effect size. Despite its popularity, this type of selection overlooks the fact that BH does not guarantee error control over filtered subsets of discoveries. Therefore, the selected subset of features may include an inflated number of false discoveries. RESULTS In this paper, we illustrate the substantially inflated type I error rate of volcano plot selection with simulation experiments and RNA-seq data. In particular, we show that the feature with the largest estimated effect is very likely a false positive result. Next, we investigate two alternative approaches for multiple testing with double filtering that do not inflate the false discovery rate. Our procedure is implemented in an interactive web application and is publicly available.
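The double filtering that this abstract criticizes can be sketched as follows. This is a minimal illustration of the naive volcano-plot selection (BH-adjusted p-value plus effect-size cutoff), not of the two alternative approaches the authors investigate; the function names, the cutoffs `q` and `min_abs_lfc`, and the example data are all illustrative assumptions.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_select(p_values, log_fold_changes, q=0.1, min_abs_lfc=1.0):
    """Naive double filter: small adjusted p-value AND large estimated effect.

    Caution: BH controls the FDR of the full rejection set only, so the
    subset returned here carries no FDR guarantee.
    """
    adjusted = bh_adjust(p_values)
    return [
        i
        for i, (adj, lfc) in enumerate(zip(adjusted, log_fold_changes))
        if adj <= q and abs(lfc) >= min_abs_lfc
    ]


if __name__ == "__main__":
    p = [0.001, 0.004, 0.03, 0.2, 0.9]
    lfc = [2.0, 0.3, -1.5, 3.0, 0.1]
    print(volcano_select(p, lfc))
```

The second filter depends on the same data as the first, which is why the FDR guarantee of BH does not carry over to the filtered subset.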
Affiliation(s)
- Mitra Ebrahimpoor
  - Medical statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
- Jelle J Goeman
  - Medical statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, 2333 ZA, The Netherlands
11
Katsevich E, Sabatti C, Bogomolov M. Filtering the rejection set while preserving false discovery rate control. J Am Stat Assoc 2021; 118:165-176. [PMID: 37346227 PMCID: PMC10281705 DOI: 10.1080/01621459.2021.1920958] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/22/2020] [Revised: 04/14/2021] [Accepted: 04/18/2021] [Indexed: 12/28/2022]
Abstract
Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the International Classification of Diseases (ICD), the directed acyclic graph structure of the Gene Ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any pre-specified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.
Affiliation(s)
- Chiara Sabatti
  - Departments of Statistics and Biomedical Data Science, Stanford University
- Marina Bogomolov
  - Faculty of Industrial Engineering and Management, Technion - Israel Institute of Technology
12
Goeman JJ, Hemerik J, Solari A. Only closed testing procedures are admissible for controlling false discovery proportions. Ann Stat 2021. [DOI: 10.1214/20-aos1999] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Affiliation(s)
- Jelle J. Goeman
  - Department of Biomedical Data Sciences, Leiden University Medical Center
- Jesse Hemerik
  - Oslo Centre for Biostatistics and Epidemiology, University of Oslo, and Biometris, Wageningen University & Research
- Aldo Solari
  - Department of Economics, Management and Statistics, University of Milano-Bicocca
13
Carpentier A, Delattre S, Roquain E, Verzelen N. Estimating minimum effect with outlier selection. Ann Stat 2021. [DOI: 10.1214/20-aos1956] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
14
Katsevich E, Ramdas A. Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 2020. [DOI: 10.1214/19-aos1938] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
15
Döhler S, Roquain E. Controlling the false discovery exceedance for heterogeneous tests. Electron J Stat 2020. [DOI: 10.1214/20-ejs1771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]