1
Yang L, Wang P, Chen J. 2dGBH: Two-dimensional group Benjamini-Hochberg procedure for false discovery rate control in two-way multiple testing of genomic data. Bioinformatics 2024; 40:btae035. [PMID: 38244568 PMCID: PMC10873908 DOI: 10.1093/bioinformatics/btae035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 02/16/2024] [Accepted: 02/16/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION Emerging omics technologies have introduced a two-way grouping structure in multiple testing, as seen in single-cell omics data, where the features can be grouped by either genes or cell types. Traditional multiple testing methods have limited ability to exploit such two-way grouping structure, leading to potential power loss. RESULTS We propose a new 2D Group Benjamini-Hochberg (2dGBH) procedure to harness the two-way grouping structure in omics data, extending the traditional one-way adaptive GBH procedure. Using both simulated and real datasets, we show that 2dGBH effectively controls the false discovery rate across biologically relevant settings, and it is more powerful than the BH or q-value procedure and more robust than the one-way adaptive GBH procedure. AVAILABILITY AND IMPLEMENTATION 2dGBH is available as an R package at: https://github.com/chloelulu/tdGBH. The analysis code and data are available at: https://github.com/chloelulu/tdGBH-paper.
Affiliation(s)
- Lu Yang
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, United States
- Pei Wang
  - Department of Statistics, Miami University, Oxford, OH 45056, United States
- Jun Chen
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  - Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, United States

2
Jeon H, Lim KS, Nguyen Y, Nettleton D. Adjusting for gene-specific covariates to improve RNA-seq analysis. Bioinformatics 2023; 39:btad498. [PMID: 37589589 PMCID: PMC10460482 DOI: 10.1093/bioinformatics/btad498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2023] [Revised: 06/29/2023] [Accepted: 08/16/2023] [Indexed: 08/18/2023] Open
Abstract
SUMMARY This article suggests a novel positive false discovery rate (pFDR) controlling method for testing gene-specific hypotheses using a gene-specific covariate variable, such as gene length. We suppose the null probability depends on the covariate variable. In this context, we propose a rejection rule that accounts for heterogeneity among tests by using two distinct types of null probabilities. We establish a pFDR estimator for a given rejection rule by following Storey's q-value framework. A condition on a type 1 error posterior probability is provided that equivalently characterizes our rejection rule. We also present a suitable procedure for selecting a tuning parameter through cross-validation that maximizes the expected number of hypotheses declared significant. A simulation study demonstrates that our method is comparable to or better than existing methods across realistic scenarios. In data analysis, we find support for our method's premise that the null probability varies with a gene-specific covariate variable. AVAILABILITY AND IMPLEMENTATION The source code repository is publicly available at https://github.com/hsjeon1217/conditional_method.
Affiliation(s)
- Hyeongseon Jeon
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, United States
  - Pelotonia Institute for Immuno-Oncology, The James Comprehensive Cancer Center, The Ohio State University, Columbus, OH 43210, United States
- Kyu-Sang Lim
  - Department of Animal Resources Science, Kongju National University, Yesan-gun, Chungnam 32439, Republic of Korea
- Yet Nguyen
  - Department of Mathematics and Statistics, Old Dominion University, Norfolk, VA 23529, United States
- Dan Nettleton
  - Department of Statistics, Iowa State University, Ames, IA 50011, United States

3
Obry L, Dalmasso C. Weighted multiple testing procedures in genome-wide association studies. PeerJ 2023; 11:e15369. [PMID: 37337586 PMCID: PMC10276986 DOI: 10.7717/peerj.15369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2022] [Accepted: 04/17/2023] [Indexed: 06/21/2023] Open
Abstract
Multiple testing procedures controlling the false discovery rate (FDR) are increasingly used in the context of genome wide association studies (GWAS), and weighted multiple testing procedures that incorporate covariate information are efficient to improve the power to detect associations. In this work, we evaluate some recent weighted multiple testing procedures in the specific context of GWAS through a simulation study. We also present a new efficient procedure called wBHa that prioritizes the detection of genetic variants with low minor allele frequencies while maximizing the overall detection power. The results indicate good performance of our procedure compared to other weighted multiple testing procedures. In particular, in all simulated settings, wBHa tends to outperform other procedures in detecting rare variants while maintaining good overall power. The use of the different procedures is illustrated with a real dataset.
Affiliation(s)
- Ludivine Obry
  - Université Paris-Saclay, CNRS, Univ Evry, Laboratoire de Mathématiques et Modélisation d’Evry, Evry-Courcouronnes, France
- Cyril Dalmasso
  - Université Paris-Saclay, CNRS, Univ Evry, Laboratoire de Mathématiques et Modélisation d’Evry, Evry-Courcouronnes, France

4
Yu X, Xiao J, Cai M, Jiao Y, Wan X, Liu J, Yang C. PALM: a powerful and adaptive latent model for prioritizing risk variants with functional annotations. Bioinformatics 2023; 39:7028484. [PMID: 36744920 PMCID: PMC9950853 DOI: 10.1093/bioinformatics/btad068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Revised: 01/12/2023] [Accepted: 02/03/2023] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION The findings from genome-wide association studies (GWASs) have greatly helped us to understand the genetic basis of human complex traits and diseases. Despite the tremendous progress, much effort is still needed to address several major challenges arising in GWAS. First, most GWAS hits are located in the non-coding region of the human genome, and thus their biological functions largely remain unknown. Second, due to the polygenicity of human complex traits and diseases, many genetic risk variants with weak or moderate effects have not been identified yet. RESULTS To address the above challenges, we propose a powerful and adaptive latent model (PALM) to integrate cell-type/tissue-specific functional annotations with GWAS summary statistics. Unlike existing methods, which are mainly based on linear models, PALM leverages a tree ensemble to adaptively characterize the non-linear relationship between functional annotations and the association status of genetic variants. To make PALM scalable to millions of variants and hundreds of functional annotations, we develop a functional gradient-based expectation-maximization algorithm to fit the tree-based non-linear model in a stable manner. Through comprehensive simulation studies, we show that PALM not only controls the false discovery rate well, but also improves the statistical power of identifying risk variants. We also apply PALM to integrate summary statistics of 30 GWASs with 127 cell type/tissue-specific functional annotations. The results indicate that PALM can identify more risk variants as well as rank the importance of functional annotations, yielding better interpretation of GWAS results. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/YangLabHKUST/PALM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xinyi Yu
  - Shenzhen Research Institute of Big Data, Shenzhen 518172, China
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Jiashun Xiao
  - Shenzhen Research Institute of Big Data, Shenzhen 518172, China
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Mingxuan Cai
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China
  - Department of Biostatistics, City University of Hong Kong, Hong Kong SAR, China
- Yuling Jiao
  - School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
- Xiang Wan
  - Shenzhen Research Institute of Big Data, Shenzhen 518172, China
- Jin Liu
  - Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
  - School of Data Science, The Chinese University of Hong Kong-Shenzhen, Shenzhen 518172, China
- Can Yang
  - Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China

5
Dixit V, Martin R. Revisiting consistency of a recursive estimator of mixing distributions. Electron J Stat 2023. [DOI: 10.1214/23-ejs2121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]

6
Bello N, López-Kleine L. Prog-Plot - a visual method to determine functional relationships for false discovery rate regression methods. J Cell Sci 2023; 136:jcs260312. [PMID: 36482762 DOI: 10.1242/jcs.260312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Accepted: 12/01/2022] [Indexed: 12/14/2022] Open
Abstract
Multiple test corrections are a fundamental step in the analysis of differentially expressed genes, as the number of tests performed would otherwise inflate the false discovery rate (FDR). Recent methods for P-value correction involve a regression model in order to include covariates that are informative of the power of the test. Here, we present Progressive proportions plot (Prog-Plot), a visual tool to identify the functional relationship between the covariate and the proportion of P-values consistent with the null hypothesis. The relationship between the proportion of P-values and the covariate to be included is needed, but there are no available tools to verify it. The approach presented here aims at having an objective way to specify regression models instead of relying on prior knowledge.
Affiliation(s)
- Nicolás Bello
  - Statistics Department, Universidad Nacional de Colombia, Ciudad Universitaria, Cra 30 No 45-03, Bogotá 111321, Colombia
- Liliana López-Kleine
  - Statistics Department, Universidad Nacional de Colombia, Ciudad Universitaria, Cra 30 No 45-03, Bogotá 111321, Colombia

7
Leung D, Sun W. ZAP: Z-value adaptive procedures for false discovery rate control with side information. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Dennis Leung
  - School of Mathematics and Statistics, University of Melbourne, Parkville, Victoria, Australia
- Wenguang Sun
  - Center for Data Science, Zhejiang University, Hangzhou, China

8
Loper JH, Lei L, Fithian W, Tansey W. Smoothed Nested Testing on Directed Acyclic Graphs. Biometrika 2022; 109:457-471. [PMID: 38694183 PMCID: PMC11061840 DOI: 10.1093/biomet/asab041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2024] Open
Abstract
We consider the problem of multiple hypothesis testing when there is a logical nested structure to the hypotheses. When one hypothesis is nested inside another, the outer hypothesis must be false if the inner hypothesis is false. We model the nested structure as a directed acyclic graph, including chain and tree graphs as special cases. Each node in the graph is a hypothesis and rejecting a node requires also rejecting all of its ancestors. We propose a general framework for adjusting node-level test statistics using the known logical constraints. Within this framework, we study a smoothing procedure that combines each node with all of its descendants to form a more powerful statistic. We prove a broad class of smoothing strategies can be used with existing selection procedures to control the familywise error rate, false discovery exceedance rate, or false discovery rate, so long as the original test statistics are independent under the null. When the null statistics are not independent but are derived from positively-correlated normal observations, we prove control for all three error rates when the smoothing method is arithmetic averaging of the observations. Simulations and an application to a real biology dataset demonstrate that smoothing leads to substantial power gains.
Affiliation(s)
- J. H. Loper
  - Department of Neuroscience, Columbia University, 716 Jerome L. Greene Building, New York, New York 10025, U.S.A
- L. Lei
  - Department of Statistics, Stanford University, Sequoia Hall, Palo Alto, California 94305, U.S.A
- W. Fithian
  - Department of Statistics, University of California, Berkeley, 367 Evans Hall, Berkeley, California 94720, U.S.A
- W. Tansey
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 321 E 61st St., New York, New York 10065, U.S.A

9
Cao H, Chen J, Zhang X. Optimal false discovery rate control for large scale multiple testing with auxiliary information. Ann Stat 2022; 50:807-857. [PMID: 37138896 PMCID: PMC10153594 DOI: 10.1214/21-aos2128] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Abstract
Large-scale multiple testing is a fundamental problem in high dimensional statistical inference. It is increasingly common that various types of auxiliary information, reflecting the structural relationship among the hypotheses, are available. Exploiting such auxiliary information can boost statistical power. To this end, we propose a framework based on a two-group mixture model with varying probabilities of being null for different hypotheses a priori, where a shape-constrained relationship is imposed between the auxiliary information and the prior probabilities of being null. An optimal rejection rule is designed to maximize the expected number of true positives when average false discovery rate is controlled. Focusing on the ordered structure, we develop a robust EM algorithm to estimate the prior probabilities of being null and the distribution of p-values under the alternative hypothesis simultaneously. We show that the proposed method has better power than state-of-the-art competitors while controlling the false discovery rate, both empirically and theoretically. Extensive simulations demonstrate the advantage of the proposed method. Datasets from genome-wide association studies are used to illustrate the new methodology.
Affiliation(s)
- Hongyuan Cao
  - Department of Statistics, Florida State University
- Jun Chen
  - Department of Quantitative Health Sciences, Mayo Clinic

10
Zhang X, Chen J. Covariate Adaptive False Discovery Rate Control With Applications to Omics-Wide Multiple Testing. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2020.1783273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Affiliation(s)
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, TX
- Jun Chen
  - Division of Biomedical Statistics and Informatics, and Center for Individualized Medicine, Mayo Clinic, Rochester, MN

11
Sarkar SK, Zhao Z. Local false discovery rate based methods for multiple testing of one-way classified hypotheses. Electron J Stat 2022. [DOI: 10.1214/22-ejs2080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Affiliation(s)
- Sanat K. Sarkar
  - Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA, 19122, USA
- Zhigen Zhao
  - Department of Statistics, Operations, and Data Science, Temple University, Philadelphia, PA, 19122, USA

12
Zhou H, Zhang X, Chen J. Covariate adaptive familywise error rate control for genome-wide association studies. Biometrika 2021; 108:915-931. [PMID: 34803516 DOI: 10.1093/biomet/asaa098] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/17/2020] [Indexed: 11/12/2022] Open
Abstract
The familywise error rate has been widely used in genome-wide association studies. With the increasing availability of functional genomics data, it is possible to increase detection power by leveraging these genomic functional annotations. Previous efforts to accommodate covariates in multiple testing focused on false discovery rate control, while covariate-adaptive procedures controlling the familywise error rate remain underdeveloped. Here, we propose a novel covariate-adaptive procedure to control the familywise error rate that incorporates external covariates which are potentially informative of either the statistical power or the prior null probability. An efficient algorithm is developed to implement the proposed method. We prove its asymptotic validity and obtain the rate of convergence through a perturbation-type argument. Our numerical studies show that the new procedure is more powerful than competing methods and maintains robustness across different settings. We apply the proposed approach to the UK Biobank data and analyse 27 traits with 9 million single-nucleotide polymorphisms tested for associations. Seventy-five genomic annotations are used as covariates. Our approach detects more genome-wide significant loci than other methods in 21 out of the 27 traits.
Affiliation(s)
- Huijuan Zhou
  - Institute of Statistics and Big Data, Renmin University of China, Beijing 100872, China
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First St. SW, Rochester, Minnesota 55905, U.S.A

13
Zhu Z, Fan Y, Kong Y, Lv J, Sun F. DeepLINK: Deep learning inference using knockoffs with applications to genomics. Proc Natl Acad Sci U S A 2021; 118:e2104683118. [PMID: 34480002 PMCID: PMC8433583 DOI: 10.1073/pnas.2104683118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 07/16/2021] [Indexed: 11/18/2022] Open
Abstract
We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.
Affiliation(s)
- Zifan Zhu
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089
- Yingying Fan
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Yinfei Kong
  - Department of Information Systems and Decision Sciences, California State University, Fullerton, CA 92831
- Jinchi Lv
  - Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA 90089

14
Ignatiadis N, Huber W. Covariate powered cross-weighted multiple testing. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12411] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/30/2022]
Affiliation(s)
- Wolfgang Huber
  - European Molecular Biology Laboratory, Heidelberg, Germany

15
Menyhart O, Weltz B, Győrffy B. MultipleTesting.com: A tool for life science researchers for multiple hypothesis testing correction. PLoS One 2021; 16:e0245824. [PMID: 34106935 PMCID: PMC8189492 DOI: 10.1371/journal.pone.0245824] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 01/07/2021] [Accepted: 05/14/2021] [Indexed: 11/18/2022] Open
Abstract
Scientists from nearly all disciplines face the problem of simultaneously evaluating many hypotheses. Conducting multiple comparisons increases the likelihood that a non-negligible proportion of associations will be false positives, clouding real discoveries. Drawing valid conclusions requires taking into account the number of performed statistical tests and adjusting the statistical confidence measures. Several strategies exist to overcome the problem of multiple hypothesis testing. We aim to summarize critical statistical concepts and widely used correction approaches while also drawing attention to frequently misinterpreted notions of statistical inference. We provide a step-by-step description of each multiple-testing correction method with clear examples and present an easy-to-follow guide for selecting the most suitable correction technique. To facilitate multiple-testing corrections, we developed a fully automated solution not requiring programming skills or the use of a command line. Our registration-free online tool is available at www.multipletesting.com and compiles the five most frequently used adjustment tools, including the Bonferroni, the Holm (step-down), and the Hochberg (step-up) corrections, and allows the calculation of False Discovery Rates (FDR) and q-values. The current summary provides a much-needed practical synthesis of basic statistical concepts regarding multiple hypothesis testing in a comprehensible language with well-illustrated examples. The web tool will fill the gap for life science researchers by providing a user-friendly substitute for command-line alternatives.
Affiliation(s)
- Otília Menyhart
  - Department of Bioinformatics, Semmelweis University, Budapest, Hungary
  - Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
- Boglárka Weltz
  - Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
  - A5 Genetics Ltd, Und, Hungary
- Balázs Győrffy
  - Department of Bioinformatics, Semmelweis University, Budapest, Hungary
  - Research Centre for Natural Sciences, Cancer Biomarker Research Group, Institute of Enzymology, Budapest, Hungary
  - 2nd Department of Pediatrics, Semmelweis University, Budapest, Hungary

16
Mussap M, Noto A, Piras C, Atzori L, Fanos V. Slotting metabolomics into routine precision medicine. Expert Review of Precision Medicine and Drug Development 2021. [DOI: 10.1080/23808993.2021.1911639] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Affiliation(s)
- Michele Mussap
  - Department of Surgical Science, University of Cagliari, Monserrato, Italy
- Antonio Noto
  - Department of Medical Sciences and Public Health, University of Cagliari, Monserrato, Italy
- Cristina Piras
  - Department of Surgical Science, University of Cagliari, Monserrato, Italy
  - Department of Biomedical Sciences, University of Cagliari, Monserrato, Italy
- Luigi Atzori
  - Department of Biomedical Sciences, University of Cagliari, Monserrato, Italy
- Vassilios Fanos
  - Department of Surgical Science, University of Cagliari, Monserrato, Italy

17
Deb N, Saha S, Guntuboyina A, Sen B. Two-Component Mixture Model in the Presence of Covariates. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1888739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Nabarun Deb
  - Department of Statistics, Columbia University, New York, NY

18
Liley J, Wallace C. Accurate error control in high-dimensional association testing using conditional false discovery rates. Biom J 2021; 63:1096-1130. [PMID: 33682201 PMCID: PMC7612315 DOI: 10.1002/bimj.201900254] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/24/2019] [Revised: 12/05/2020] [Accepted: 12/30/2020] [Indexed: 01/13/2023]
Abstract
High-dimensional hypothesis testing is ubiquitous in the biomedical sciences, and informative covariates may be employed to improve power. The conditional false discovery rate (cFDR) is a widely used approach suited to the setting where the covariate is a set of p-values for the equivalent hypotheses for a second trait. Although related to the Benjamini–Hochberg procedure, it does not permit any easy control of type-1 error rate and existing methods are over-conservative. We propose a new method for type-1 error rate control based on identifying mappings from the unit square to the unit interval defined by the estimated cFDR and splitting observations so that each map is independent of the observations it is used to test. We also propose an adjustment to the existing cFDR estimator which further improves power. We show by simulation that the new method more than doubles potential improvement in power over unconditional analyses compared to existing methods. We demonstrate our method on transcriptome-wide association studies and show that the method can be used in an iterative way, enabling the use of multiple covariates successively. Our methods substantially improve the power and applicability of cFDR analysis.
Affiliation(s)
- James Liley
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
  - Department of Medicine, Addenbrookes Hospital, University of Cambridge, Cambridge, UK
- Chris Wallace
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
  - Department of Medicine, Addenbrookes Hospital, University of Cambridge, Cambridge, UK
  - Cambridge Institute of Therapeutic Immunology and Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, Cambridge, UK

19
Woody S, Padilla OHM, Scott JG. Optimal post-selection inference for sparse signals: a nonparametric empirical Bayes approach. Biometrika 2021. [DOI: 10.1093/biomet/asab014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
Summary
Many recently developed Bayesian methods focus on sparse signal detection. However, much less work has been done on the natural follow-up question: how does one make valid inferences for the magnitude of those signals after selection? Ordinary Bayesian credible intervals suffer from selection bias, as do ordinary frequentist confidence intervals. Existing Bayesian methods for correcting this bias produce credible intervals with poor frequentist properties. Further, existing frequentist approaches require sacrificing the benefits of shrinkage typical in Bayesian methods, resulting in confidence intervals that are needlessly wide. We address this gap by proposing a nonparametric empirical Bayes approach to constructing optimal selection-adjusted confidence sets. Our method produces confidence sets that are as short as possible on average, while both adjusting for selection and maintaining exact frequentist coverage uniformly over the parameter space. We demonstrate an important consistency property of our procedure: under mild conditions, it asymptotically converges to the results of an oracle-Bayes analysis in which the prior distribution of signal sizes is known exactly. Across a series of examples, the method is found to outperform existing frequentist techniques for post-selection inference, producing confidence sets that are notably shorter, but with the same coverage guarantee.

20
Cai TT, Sun W, Xia Y. LAWS: A Locally Adaptive Weighting and Screening Approach to Spatial Multiple Testing. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2020.1859379] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
Affiliation(s)
- T. Tony Cai
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
- Wenguang Sun
  - Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai, China

21
Chen X, Robinson DG, Storey JD. The functional false discovery rate with applications to genomics. Biostatistics 2021; 22:68-81. [PMID: 31135886 PMCID: PMC7846131 DOI: 10.1093/biostatistics/kxz010] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 02/15/2018] [Revised: 03/08/2019] [Accepted: 03/24/2019] [Indexed: 12/15/2022] Open
Abstract
The false discovery rate (FDR) measures the proportion of false discoveries among a set of hypothesis tests called significant. This quantity is typically estimated based on p-values or test statistics. In some scenarios, there is additional information available that may be used to more accurately estimate the FDR. We develop a new framework for formulating and estimating FDRs and q-values when an additional piece of information, which we call an "informative variable", is available. For a given test, the informative variable provides information about the prior probability a null hypothesis is true or the power of that particular test. The FDR is then treated as a function of this informative variable. We consider two applications in genomics. Our first application is a genetics of gene expression (eQTL) experiment in yeast where every genetic marker and gene expression trait pair are tested for associations. The informative variable in this case is the distance between each genetic marker and gene. Our second application is to detect differentially expressed genes in an RNA-seq study carried out in mice. The informative variable in this study is the per-gene read depth. The framework we develop is quite general, and it should be useful in a broad range of scientific applications.
Affiliation(s)
- Xiongzhi Chen
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - David G Robinson
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - John D Storey
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| |
|
22
|
Fu L, Gang B, James GM, Sun W. Heteroscedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1840992] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Luella Fu
- Department of Mathematics, San Francisco State University, San Francisco, CA
| | - Bowen Gang
- Department of Statistics, Fudan University, Shanghai, China
| | - Gareth M. James
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| | - Wenguang Sun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| |
|
23
|
Affiliation(s)
- Wesley Tansey
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York, USA
| | - Yixin Wang
- Department of Statistics, Columbia University, New York, New York, USA
| | - Raul Rabadan
- Department of Systems Biology, Columbia University Medical Center, New York, New York, USA
| | - David Blei
- Department of Statistics, Columbia University, New York, New York, USA
- Department of Computer Science, Columbia University, New York, New York, USA
| |
|
24
|
Rao S, Lau A, So HC. Exploring Diseases/Traits and Blood Proteins Causally Related to Expression of ACE2, the Putative Receptor of SARS-CoV-2: A Mendelian Randomization Analysis Highlights Tentative Relevance of Diabetes-Related Traits. Diabetes Care 2020; 43:1416-1426. [PMID: 32430459 DOI: 10.2337/dc20-0643] [Citation(s) in RCA: 156] [Impact Index Per Article: 39.0] [Received: 03/26/2020] [Accepted: 05/10/2020] [Indexed: 02/03/2023]
Abstract
OBJECTIVE COVID-19 has become a major public health problem. There is good evidence that ACE2 is a receptor for SARS-CoV-2, and high expression of ACE2 may increase susceptibility to infection. We aimed to explore risk factors affecting susceptibility to infection and prioritize drug repositioning candidates, based on Mendelian randomization (MR) studies on ACE2 lung expression. RESEARCH DESIGN AND METHODS We conducted a phenome-wide MR study to prioritize diseases/traits and blood proteins causally linked to ACE2 lung expression in GTEx. We also explored drug candidates whose targets overlapped with the top-ranked proteins in MR, as these drugs may alter ACE2 expression and may be clinically relevant. RESULTS The most consistent finding was tentative evidence of an association between diabetes-related traits and increased ACE2 expression. Based on one of the largest genome-wide association studies on type 2 diabetes mellitus (T2DM) to date (N = 898,130), T2DM was causally linked to raised ACE2 expression (P = 2.91E-03; MR-IVW). Significant associations (at nominal level; P < 0.05) with ACE2 expression were observed across multiple diabetes data sets and analytic methods for T1DM, T2DM, and related traits including early start of insulin. Other diseases/traits having nominal significant associations with increased expression included inflammatory bowel disease, (estrogen receptor-positive) breast cancer, lung cancer, asthma, smoking, and elevated alanine aminotransferase. We also identified drugs that may target the top-ranked proteins in MR, such as fostamatinib and zinc. CONCLUSIONS Our analysis suggested that diabetes and related traits may increase ACE2 expression, which may influence susceptibility to infection (or more severe infection). However, none of these findings withstood rigorous multiple testing corrections (at false discovery rate <0.05). Proteome-wide MR analyses might help uncover mechanisms underlying ACE2 expression and guide drug repositioning. Further studies are required to verify our findings.
Affiliation(s)
- Shitao Rao
- School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong
| | - Alexandria Lau
- School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong
| | - Hon-Cheong So
- School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong
- Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
- KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Shatin, Hong Kong
- Department of Psychiatry, The Chinese University of Hong Kong, Shatin, Hong Kong
- Margaret K.L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Shatin, Hong Kong
- Brain and Mind Institute, The Chinese University of Hong Kong, Shatin, Hong Kong
| |
|
25
|
A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proc Natl Acad Sci U S A 2020; 117:15028-15035. [PMID: 32522875 PMCID: PMC7334489 DOI: 10.1073/pnas.1918862117] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/17/2022] Open
Abstract
Variation is rampant throughout human genomes: some of it affects disease risk, and most does not; to separate the two requires a plethora of hypothesis tests. This challenge of multiple testing (limiting false positives while maximizing power) arises in many “omics” studies and sciences. One approach is to control the false discovery rate (FDR), and a recent selective inference method for controlling FDR, adaptive P-value thresholding (AdaPT), facilitates incorporation of auxiliary information (covariates) related to each hypothesis test. How AdaPT performs on data is an open question. We apply AdaPT to results from genomic association studies and include many covariates. This adaptive search discovers a more complex and interpretable model with far greater power than classic multiple testing procedures. To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive P-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association P values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allows us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene–gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.
|
26
|
Huang J, Bai L, Cui B, Wu L, Wang L, An Z, Ruan S, Yu Y, Zhang X, Chen J. Leveraging biological and statistical covariates improves the detection power in epigenome-wide association testing. Genome Biol 2020; 21:88. [PMID: 32252795 PMCID: PMC7132874 DOI: 10.1186/s13059-020-02001-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/08/2019] [Accepted: 03/17/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Epigenome-wide association studies (EWAS), which seek the association between epigenetic marks and an outcome or exposure, involve multiple hypothesis testing. False discovery rate (FDR) control has been widely used for multiple testing correction. However, traditional FDR control methods do not use auxiliary covariates, and they could be less powerful if the covariates could inform the likelihood of the null hypothesis. Recently, many covariate-adaptive FDR control methods have been developed, but application of these methods to EWAS data has not yet been explored. It is not clear whether these methods can significantly improve detection power, and if so, which covariates are more relevant for EWAS data. RESULTS In this study, we evaluate the performance of five covariate-adaptive FDR control methods with EWAS-related covariates using simulated as well as real EWAS datasets. We develop an omnibus test to assess the informativeness of the covariates. We find that statistical covariates are generally more informative than biological covariates, and the covariates of methylation mean and variance are almost universally informative. In contrast, the informativeness of biological covariates depends on specific datasets. We show that the independent hypothesis weighting (IHW) and covariate adaptive multiple testing (CAMT) method are overall more powerful, especially for sparse signals, and could improve the detection power by a median of 25% and 68% on real datasets, compared to the ST procedure. We further validate the findings in various biological contexts. CONCLUSIONS Covariate-adaptive FDR control methods with informative covariates can significantly increase the detection power for EWAS. For sparse signals, IHW and CAMT are recommended.
Affiliation(s)
- Jinyan Huang
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China.
| | - Ling Bai
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Bowen Cui
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Liang Wu
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Liwen Wang
- Department of General Surgery, Rui-Jin Hospital, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Zhiyin An
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Shulin Ruan
- State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, National Research Center for Translational Medicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Jiao Tong University, 197 Ruijin Er Road, Shanghai, 200025, China
| | - Yue Yu
- Division of Digital Health Sciences, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
| | - Xianyang Zhang
- Department of Statistics, Texas A&M University, Blocker 449D, College Station, TX, 77843, USA.
| | - Jun Chen
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research and Center for Individualized Medicine, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA.
| |
|
27
|
Influence of multiple hypothesis testing on reproducibility in neuroimaging research: A simulation study and Python-based software. J Neurosci Methods 2020; 337:108654. [PMID: 32114144 DOI: 10.1016/j.jneumeth.2020.108654] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 09/13/2019] [Revised: 02/26/2020] [Accepted: 02/26/2020] [Indexed: 11/24/2022]
Abstract
BACKGROUND Reproducibility of research findings has been recently questioned in many fields of science, including psychology and neurosciences. One factor influencing reproducibility is the simultaneous testing of multiple hypotheses, which entails false positive findings unless the analyzed p-values are carefully corrected. While this multiple testing problem is well known and studied, it continues to be both a theoretical and practical problem. NEW METHOD Here we assess reproducibility in simulated experiments in the context of multiple testing. We consider methods that control either the family-wise error rate (FWER) or false discovery rate (FDR), including techniques based on random field theory (RFT), cluster-mass based permutation testing, and adaptive FDR. Several classical methods are also considered. The performance of these methods is investigated under two different models. RESULTS We found that permutation testing is the most powerful method among the considered approaches to multiple testing, and that grouping hypotheses based on prior knowledge can improve power. We also found that emphasizing primary and follow-up studies equally produced most reproducible outcomes. COMPARISON WITH EXISTING METHOD(S) We have extended the use of two-group and separate-classes models for analyzing reproducibility and provide a new open-source software "MultiPy" for multiple hypothesis testing. CONCLUSIONS Our simulations suggest that performing strict corrections for multiple testing is not sufficient to improve reproducibility of neuroimaging experiments. The methods are freely available as a Python toolkit "MultiPy" and we aim this study to help in improving statistical data analysis practices and to assist in conducting power and reproducibility analyses for new experiments.
|
28
|
Liang K. Empirical Bayes analysis of RNA sequencing experiments with auxiliary information. Ann Appl Stat 2019. [DOI: 10.1214/19-aoas1270] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
|
29
|
Arroyo Relión JD, Kessler D, Levina E, Taylor SF. Network classification with applications to brain connectomics. Ann Appl Stat 2019; 13:1648-1677. [PMID: 33408802 DOI: 10.1214/19-aoas1252] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 12/25/2022]
Abstract
While statistical analysis of a single network has received a lot of attention in recent years, with a focus on social networks, analysis of a sample of networks presents its own challenges which require a different set of analytic tools. Here we study the problem of classification of networks with labeled nodes, motivated by applications in neuroimaging. Brain networks are constructed from imaging data to represent functional connectivity between regions of the brain, and previous work has shown the potential of such networks to distinguish between various brain disorders, giving rise to a network classification problem. Existing approaches tend to either treat all edge weights as a long vector, ignoring the network structure, or focus on graph topology as represented by summary measures while ignoring the edge weights. Our goal is to design a classification method that uses both the individual edge information and the network structure of the data in a computationally efficient way, and that can produce a parsimonious and interpretable representation of differences in brain connectivity patterns between classes. We propose a graph classification method that uses edge weights as predictors but incorporates the network nature of the data via penalties that promote sparsity in the number of nodes, in addition to the usual sparsity penalties that encourage selection of edges. We implement the method via efficient convex optimization and provide a detailed analysis of data from two fMRI studies of schizophrenia.
|
30
|
Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing. Nat Commun 2019; 10:3433. [PMID: 31366926 PMCID: PMC6668431 DOI: 10.1038/s41467-019-11247-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 12/12/2018] [Accepted: 07/03/2019] [Indexed: 12/31/2022] Open
Abstract
Multiple hypothesis testing is an essential component of modern data science. In many settings, in addition to the p-value, additional covariates for each hypothesis are available, e.g., functional annotation of variants in genome-wide association studies. Such information is ignored by popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH). Here we introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% more associations than BH at the same false discovery rate. We prove that AdaFDR controls false discovery proportion and show that it makes substantially more discoveries while controlling false discovery rate (FDR) in extensive experiments. AdaFDR is computationally efficient and allows multi-dimensional covariates with both numeric and categorical values, making it broadly useful across many applications.
|
31
|
Xia Y, Cai TT, Sun W. GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1611585] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
Affiliation(s)
- Yin Xia
- Department of Statistics, School of Management, Fudan University, Shanghai, China
| | - T. Tony Cai
- Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
| | - Wenguang Sun
- Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA
| |
|
32
|
Korthauer K, Kimes PK, Duvallet C, Reyes A, Subramanian A, Teng M, Shukla C, Alm EJ, Hicks SC. A practical guide to methods controlling false discoveries in computational biology. Genome Biol 2019; 20:118. [PMID: 31164141 PMCID: PMC6547503 DOI: 10.1186/s13059-019-1716-1] [Citation(s) in RCA: 182] [Impact Index Per Article: 36.4] [Received: 11/02/2018] [Accepted: 05/10/2019] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. RESULTS Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. CONCLUSIONS Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
Affiliation(s)
- Keegan Korthauer
- Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
| | - Patrick K. Kimes
- Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
| | - Claire Duvallet
- Department of Biological Engineering, MIT, 77 Massachusetts Avenue, Cambridge, USA
- Center for Microbiome Informatics and Therapeutics, MIT, 77 Massachusetts Avenue, Cambridge, USA
| | - Alejandro Reyes
- Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
| | | | - Mingxiang Teng
- Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, 33612 USA
| | - Chinmay Shukla
- Biological and Biomedical Sciences Program, Harvard University, Boston, USA
| | - Eric J. Alm
- Department of Biological Engineering, MIT, 77 Massachusetts Avenue, Cambridge, USA
- Center for Microbiome Informatics and Therapeutics, MIT, 77 Massachusetts Avenue, Cambridge, USA
- Broad Institute, 415 Main Street, Cambridge, USA
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street, Baltimore, 21205 USA
| |
|
33
|
Cai TT, Sun W, Wang W. Covariate-assisted ranking and screening for large-scale two-sample inference. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12304] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 01/17/2023]
Affiliation(s)
| | - Wenguang Sun
- University of Southern California Los Angeles USA
| | - Weinan Wang
- University of Southern California Los Angeles USA
| |
|
34
|
Boca SM, Leek JT. A direct approach to estimating false discovery rates conditional on covariates. PeerJ 2018; 6:e6035. [PMID: 30581661 PMCID: PMC6292380 DOI: 10.7717/peerj.6035] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Received: 06/26/2018] [Accepted: 10/29/2018] [Indexed: 11/20/2022] Open
Abstract
Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here, we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini-Hochberg adjusted p-values, leading to a plug-in FDR estimator. We apply our method to a genome-wide association meta-analysis for body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios. We provide an implementation of this novel method for estimating the proportion of null hypotheses in a regression framework as part of the Bioconductor package swfdr.
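The plug-in step described in this abstract (Benjamini-Hochberg adjusted p-values multiplied by a per-test estimate of the null proportion) can be illustrated with a minimal Python sketch. This is not the swfdr implementation: the function names `bh_adjust` and `plug_in_fdr` are hypothetical, and the per-test null proportions `pi0_hat` are assumed to be supplied by some separate covariate regression that is not shown here.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p-value
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity by taking running minima from the largest p-value down
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)
    return adj

def plug_in_fdr(pvals, pi0_hat):
    """Multiply BH-adjusted p-values by per-test null proportion estimates.

    pi0_hat is an assumed input: one estimate in [0, 1] per hypothesis,
    e.g. from a regression on covariates fitted elsewhere.
    """
    pi0 = np.clip(np.asarray(pi0_hat, dtype=float), 0.0, 1.0)
    return bh_adjust(pvals) * pi0
```

With `pi0_hat` equal to 1 for every test, the result reduces to the ordinary BH-adjusted p-values; smaller estimates shrink the reported FDR for tests whose covariates suggest the null is less likely.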
Affiliation(s)
- Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, D.C., USA
- Department of Oncology, Georgetown University Medical Center, Washington, D.C., USA
- Department of Biostatistics, Bioinformatics & Biomathematics, Georgetown University Medical Center, Washington, D.C., USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| |
|
35
|
Parks MM, Raphael BJ, Lawrence CE. Using controls to limit false discovery in the era of big data. BMC Bioinformatics 2018; 19:323. [PMID: 30217148 PMCID: PMC6137876 DOI: 10.1186/s12859-018-2356-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2018] [Accepted: 09/03/2018] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Procedures for controlling the false discovery rate (FDR) are widely applied as a solution to the multiple comparisons problem of high-dimensional statistics. Current FDR-controlling procedures require accurately calculated p-values and rely on extrapolation into the unknown and unobserved tails of the null distribution. Both of these intermediate steps are challenging and can compromise the reliability of the results. RESULTS We present a general method for controlling the FDR that capitalizes on the large amount of control data often found in big data studies to avoid these frequently problematic intermediate steps. The method utilizes control data to empirically construct the distribution of the test statistic under the null hypothesis and directly compares this distribution to the empirical distribution of the test data. By not relying on p-values, our control data-based empirical FDR procedure more closely follows the foundational principles of the scientific method: that inference is drawn by comparing test data to control data. The method is demonstrated through application to a problem in structural genomics. CONCLUSIONS The method described here provides a general statistical framework for controlling the FDR that is specifically tailored for the big data setting. By relying on empirically constructed distributions and control data, it forgoes potentially problematic modeling steps and extrapolation into the unknown tails of the null distribution. This procedure is broadly applicable insofar as controlled experiments or internal negative controls are available, as is increasingly common in the big data setting.
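The idea of comparing an empirical null built from control data with the empirical distribution of the test data can be sketched generically in Python. The estimator below is an assumption made for illustration, not the procedure from this paper: it takes the ratio of the control exceedance fraction to the test exceedance fraction at a fixed threshold, and the function name `empirical_fdr` is hypothetical.

```python
import numpy as np

def empirical_fdr(test_stats, control_stats, threshold):
    """Generic control-based FDR estimate at a fixed threshold (illustrative only).

    Uses the fraction of control statistics at or above the threshold as an
    empirical null tail probability and divides it by the corresponding
    fraction among the test statistics.
    """
    test = np.asarray(test_stats, dtype=float)
    ctrl = np.asarray(control_stats, dtype=float)
    frac_test = np.mean(test >= threshold)
    if frac_test == 0:
        # No discoveries at this threshold; report 0 by convention
        return 0.0
    frac_ctrl = np.mean(ctrl >= threshold)
    return float(min(1.0, frac_ctrl / frac_test))
```

No p-values are computed at any point; the only inputs are the two sets of statistics and a threshold, which is the feature of control-based approaches that the abstract emphasizes.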
Affiliation(s)
- Matthew M Parks
- Department of Physiology and Biophysics, Weill Cornell Medicine, 1300 York Ave, New York, NY, 10065, USA
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ, 08540, USA
| | - Charles E Lawrence
- Center for Computational Molecular Biology, Brown University, 115 Waterman Street, Providence, RI, 02912, USA
- Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI, 02912, USA
| |
|
36
|
Adjusted regularization of cortical covariance. J Comput Neurosci 2018; 45:83-101. [PMID: 30191352 DOI: 10.1007/s10827-018-0692-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/18/2017] [Revised: 07/13/2018] [Accepted: 07/30/2018] [Indexed: 12/31/2022]
Abstract
It is now common to record dozens to hundreds or more neurons simultaneously, and to ask how the network activity changes across experimental conditions. A natural framework for addressing questions of functional connectivity is to apply Gaussian graphical modeling to neural data, where each edge in the graph corresponds to a non-zero partial correlation between neurons. Because the number of possible edges is large, one strategy for estimating the graph has been to apply methods that aim to identify large sparse effects using an ℓ1 penalty. However, the partial correlations found in neural spike count data are neither large nor sparse, so techniques that perform well in sparse settings will typically perform poorly in the context of neural spike count data. Fortunately, the correlated firing for any pair of cortical neurons depends strongly on both their distance apart and the features for which they are tuned. We introduce a method that takes advantage of these known, strong effects by allowing the penalty to depend on them: thus, for example, the connection between pairs of neurons that are close together will be penalized less than pairs that are far apart. We show through simulations that this physiologically-motivated procedure performs substantially better than off-the-shelf generic tools, and we illustrate by applying the methodology to populations of neurons recorded with multielectrode arrays implanted in macaque visual cortex areas V1 and V4.
|
37
|
Affiliation(s)
- Wesley Tansey
- Department of Computer Science, University of Texas at Austin, Austin, TX
| | - Oluwasanmi Koyejo
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL
| | | | - James G. Scott
- Department of Information, Risk, and Operations Management; Department of Statistics and Data Sciences; University of Texas at Austin, Austin, TX
| |
|
38
|
Vinci G, Ventura V, Smith MA, Kass RE. Adjusted regularization in latent graphical models: Application to multiple-neuron spike count data. Ann Appl Stat 2018; 12:1068-1095. [PMID: 31772696 PMCID: PMC6879176 DOI: 10.1214/18-aoas1190] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Abstract
A major challenge in contemporary neuroscience is to analyze data from large numbers of neurons recorded simultaneously across many experimental replications (trials), where the data are counts of neural firing events, and one of the basic problems is to characterize the dependence structure among such multivariate counts. Methods of estimating high-dimensional covariation based on ℓ1-regularization are most appropriate when there are a small number of relatively large partial correlations, but in neural data there are often large numbers of relatively small partial correlations. Furthermore, the variation across trials is often confounded by Poisson-like variation within trials. To overcome these problems we introduce a comprehensive methodology that imbeds a Gaussian graphical model into a hierarchical structure: the counts are assumed Poisson, conditionally on latent variables that follow a Gaussian graphical model, and the graphical model parameters, in turn, are assumed to depend on physiologically-motivated covariates, which can greatly improve correct detection of interactions (non-zero partial correlations). We develop a Bayesian approach to fitting this covariate-adjusted generalized graphical model and we demonstrate its success in simulation studies. We then apply it to data from an experiment on visual attention, where we assess functional interactions between neurons recorded from two brain areas.
Affiliation(s)
- Giuseppe Vinci
- Rice University, Department of Statistics, Duncan Hall, 6100 Main St, Houston, TX 77005, USA
| | - Valérie Ventura
- Carnegie Mellon University, Department of Statistics, Baker Hall 132, 5000 Forbes Avenue, Pittsburgh, PA 15203, USA
- Center for the Neural Basis of Cognition
| | - Matthew A. Smith
- University of Pittsburgh, Department of Ophthalmology, Eye and Ear Institute, Room 914, 203 Lothrop St., Pittsburgh, PA 15213, USA
- Center for the Neural Basis of Cognition
| | - Robert E. Kass
- Carnegie Mellon University, Department of Statistics, Baker Hall 132, 5000 Forbes Avenue, Pittsburgh, PA 15203, USA
- Center for the Neural Basis of Cognition
| |
|
39
|
Donner C, Opper M. Inverse Ising problem in continuous time: A latent variable approach. Phys Rev E 2017; 96:062104. [PMID: 29347355 DOI: 10.1103/physreve.96.062104] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 09/01/2017] [Indexed: 06/07/2023]
Abstract
We consider the inverse Ising problem: the inference of network couplings from observed spin trajectories for a model with continuous time Glauber dynamics. By introducing two sets of auxiliary latent random variables we render the likelihood into a form which allows for simple iterative inference algorithms with analytical updates. The variables are (1) Poisson variables to linearize an exponential term which is typical for point process likelihoods and (2) Pólya-Gamma variables, which make the likelihood quadratic in the coupling parameters. Using the augmented likelihood, we derive an expectation-maximization (EM) algorithm to obtain the maximum likelihood estimate of network parameters. Using a third set of latent variables we extend the EM algorithm to sparse couplings via L1 regularization. Finally, we develop an efficient approximate Bayesian inference algorithm using a variational approach. We demonstrate the performance of our algorithms on data simulated from an Ising model. For data which are simulated from a more biologically plausible network with spiking neurons, we show that the Ising model captures well the low order statistics of the data and how the Ising couplings are related to the underlying synaptic structure of the simulated network.
Affiliation(s)
- Christian Donner
- Artificial Intelligence Group, Technische Universität, Marchstr. 23, 10587 Berlin, Germany
| | - Manfred Opper
- Artificial Intelligence Group, Technische Universität, Marchstr. 23, 10587 Berlin, Germany
| |
|
40
|
Zablocki RW, Levine RA, Schork AJ, Xu S, Wang Y, Fan CC, Thompson WK. Semiparametric covariate-modulated local false discovery rate for genome-wide association studies. Ann Appl Stat 2017. [DOI: 10.1214/17-aoas1077] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
41
|
Ignatiadis N, Klaus B, Zaugg J, Huber W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat Methods 2016; 13:577-80. [PMID: 27240256 PMCID: PMC4930141 DOI: 10.1038/nmeth.3885] [Citation(s) in RCA: 332] [Impact Index Per Article: 41.5] [Received: 12/02/2015] [Accepted: 05/03/2016] [Indexed: 11/25/2022]
Abstract
Hypothesis weighting improves the power of large-scale multiple testing. We describe independent hypothesis weighting (IHW), a method that assigns weights using covariates independent of the P-values under the null hypothesis but informative of each test's power or prior probability of the null hypothesis (http://www.bioconductor.org/packages/IHW). IHW increases power while controlling the false discovery rate and is a practical approach to discovering associations in genomics, high-throughput biology and other large data sets.
Affiliation(s)
| | - Bernd Klaus
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Judith Zaugg
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Wolfgang Huber
- European Molecular Biology Laboratory, Heidelberg, Germany
| |
|