1
|
Chen X, Zhang H, Liu M, Deng HW, Wu Z. Simultaneous detection of novel genes and SNPs by adaptive p-value combination. Front Genet 2022; 13:1009428. [DOI: 10.3389/fgene.2022.1009428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Accepted: 11/03/2022] [Indexed: 11/18/2022]
Abstract
Combining SNP p-values from GWAS summary data is a promising strategy for detecting novel genetic factors. Existing statistical methods for the p-value-based SNP-set testing confront two challenges. First, the statistical power of different methods depends on unknown patterns of genetic effects that could drastically vary over different SNP sets. Second, they do not identify which SNPs primarily contribute to the global association of the whole set. We propose a new signal-adaptive analysis pipeline to address these challenges using the omnibus thresholding Fisher’s method (oTFisher). The oTFisher remains robustly powerful over various patterns of genetic effects. Its adaptive thresholding can be applied to estimate important SNPs contributing to the overall significance of the given SNP set. We develop efficient calculation algorithms to control the type I error rate, which accounts for the linkage disequilibrium among SNPs. Extensive simulations show that the oTFisher has robustly high power and provides a higher balanced accuracy in screening SNPs than the traditional Bonferroni and FDR procedures. We applied the oTFisher to study the genetic association of genes and haplotype blocks of the bone density-related traits using the summary data of the Genetic Factors for Osteoporosis Consortium. The oTFisher identified more novel and literature-reported genetic factors than existing p-value combination methods. Relevant computation has been implemented into the R package TFisher to support similar data analysis.
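As a rough illustration of the thresholding Fisher idea described in this abstract, the Python sketch below computes a statistic of the commonly written form W = Σ −2·log(p_i/τ2)·1{p_i ≤ τ1} and a Monte Carlo p-value for the soft-thresholding case τ1 = τ2 = τ. It assumes independent p-values, so it does not account for linkage disequilibrium the way the paper's algorithms do. The function names `tfisher_stat` and `mc_pvalue` are hypothetical and are not the API of the R package TFisher.

```python
import math
import random


def tfisher_stat(pvalues, tau1, tau2):
    """Thresholding Fisher statistic: sum of -2*log(p/tau2) over p <= tau1.

    With tau1 = tau2 = 1 this reduces to Fisher's combination statistic.
    """
    if not (0.0 < tau1 <= 1.0) or tau2 <= 0.0:
        raise ValueError("need 0 < tau1 <= 1 and tau2 > 0")
    return sum(-2.0 * math.log(p / tau2) for p in pvalues if 0.0 < p <= tau1)


def mc_pvalue(pvalues, tau, n_sim=2000, seed=0):
    """Monte Carlo p-value of the soft-thresholding statistic.

    ASSUMPTION: the input p-values are independent under the null,
    so null p-values are simulated as i.i.d. uniform draws.
    """
    rng = random.Random(seed)
    observed = tfisher_stat(pvalues, tau, tau)
    n = len(pvalues)
    exceed = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null_p = [1.0 - rng.random() for _ in range(n)]
        if tfisher_stat(null_p, tau, tau) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)
```

An omnibus version in the spirit of the abstract would evaluate several values of τ and calibrate the smallest resulting p-value against the same null simulations; that step is omitted here for brevity.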
|
2
|
Stoepker IV, Castro RM, Arias-Castro E, van den Heuvel E. Anomaly Detection for a Large Number of Streams: A Permutation-Based Higher Criticism Approach. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2126361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
|
3
|
Hébert F, Causeur D, Emily M. Omnibus testing approach for gene-based gene-gene interaction. Stat Med 2022; 41:2854-2878. [PMID: 35338506 DOI: 10.1002/sim.9389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2020] [Revised: 03/03/2022] [Accepted: 03/04/2022] [Indexed: 11/07/2022]
Abstract
Genetic interaction is considered one of the main heritable components of complex traits. With the emergence of genome-wide association studies (GWAS), a collection of statistical methods dedicated to the identification of interaction at the SNP level has been proposed. More recently, gene-based gene-gene interaction testing has emerged as an attractive alternative, as it confers advantages in both statistical power and biological interpretation. Most gene-based interaction methods rely on a multidimensional modeling of the interaction and thus lack robustness against the huge space of interaction patterns. In this paper, we study global testing approaches to address the issue of gene-based gene-gene interaction. Based on a logistic regression modeling framework, all SNP-SNP interaction tests are combined to produce a gene-level test for interaction. We propose an omnibus test that takes advantage of (1) the heterogeneity between existing global tests and (2) the complementarity between allele-based and genotype-based coding of SNPs. Through an extensive simulation study, it is demonstrated that the proposed omnibus test detects with high power the most common genetic interaction models with one causal pair, as well as more complex genetic models where more than one causal pair is involved. In addition, the proposed approach is shown to be robust and to improve power compared to single global tests in replication studies. Furthermore, the application of our procedure to real datasets confirms the adaptability of our approach to replicate various gene-gene interactions.
Affiliation(s)
- Florian Hébert
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- David Causeur
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- Mathieu Emily
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
|
4
|
Abstract
Feature screening is crucial in the analysis of ultrahigh dimensional data, where the number of variables (features) is in an exponential order of the number of observations. In various ultrahigh dimensional data, variables are naturally grouped, giving us a good rationale to develop a screening method using joint effect of multiple variables. In this article, we propose a group screening procedure via the F-test statistic. The proposed method is a direct extension of the original sure independence screening procedure, when the group information is known, for example, from prior knowledge. Under certain regularity conditions, we prove that the proposed group screening procedure possesses the sure screening property that selects all effective groups with a probability approaching one at an exponential rate. We use simulations to demonstrate the advantages of the proposed method and show its application in a genome-wide association study. We conclude that the grouping method is very useful in the analysis of ultrahigh dimensional data, as the optimal F-test can detect true signals with desired properties.
Affiliation(s)
- Won Chul Song
  - Milwaukee School of Engineering, 500 E. Kilbourn Avenue, Milwaukee, WI 53202
- Jun Xie
  - Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, IN 47907
|
5
|
Novel directions in data pre-processing and genome-wide association study (GWAS) methodologies to overcome ongoing challenges. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
6
|
Hébert F, Causeur D, Emily M. An adaptive decorrelation procedure for signal detection. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
7
|
Zhao Y, Sun L. On set-based association tests: Insights from a regression using summary statistics. Can J Stat 2020. [DOI: 10.1002/cjs.11584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/09/2022]
Affiliation(s)
- Yanyan Zhao
  - Department of Statistical Sciences, University of Toronto, Toronto M5S 3G3, Ontario, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto M5S 3G3, Ontario, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto M5T 3M7, Ontario, Canada
|
8
|
Zhang H, Tong T, Landers J, Wu Z. TFisher: A powerful truncation and weighting procedure for combining $p$-values. Ann Appl Stat 2020. [DOI: 10.1214/19-aoas1302] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
|
9
|
Mukherjee R, Mukherjee S, Yuan M. Global testing against sparse alternatives under Ising models. Ann Stat 2018. [DOI: 10.1214/17-aos1612] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
|
10
|
Liu Y, Xie J. Accurate and Efficient P-value Calculation via Gaussian Approximation: a Novel Monte-Carlo Method. J Am Stat Assoc 2018; 114:384-392. [PMID: 31130762 DOI: 10.1080/01621459.2017.1407776] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022]
Abstract
It is of fundamental interest in statistics to test the significance of a set of covariates. For example, in genome-wide association studies, a joint null hypothesis of no genetic effect is tested for a set of multiple genetic variants. The minimum p-value method, higher criticism, and Berk-Jones tests are particularly effective when the covariates with nonzero effects are sparse. However, the correlations among covariates and the non-Gaussian distribution of the response pose a great challenge towards the p-value calculation of the three tests. In practice, permutation is commonly used to obtain accurate p-values, but it is computationally very intensive, especially when we need to conduct a large amount of hypothesis testing. In this paper, we propose a Gaussian approximation method based on a Monte Carlo scheme, which is computationally more efficient than permutation while still achieving similar accuracy. We derive non-asymptotic approximation error bounds that could vanish in the limit even if the number of covariates is much larger than the sample size. Through real-genotype-based simulations and data analysis of a genome-wide association study of Crohn's disease, we compare the accuracy and computation cost of our proposed method, of permutation, and of the method based on asymptotic distribution.
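The general idea of replacing permutation with Gaussian Monte Carlo draws can be sketched as follows. This is only a simplified illustration for the minimum p-value (maximum |z|) test, assuming the marginal z-scores are jointly Gaussian under the null with a known, positive-definite correlation matrix. It is not the authors' algorithm, and the function names `cholesky` and `mc_max_abs_z_pvalue` are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def mc_max_abs_z_pvalue(z_obs, corr, n_sim=5000, seed=0):
    """Monte Carlo p-value of max|z| under a correlated Gaussian null.

    ASSUMPTION: under the null, the z-scores follow N(0, corr), where
    corr is a known correlation matrix (e.g. estimated from genotypes).
    """
    rng = random.Random(seed)
    lower = cholesky(corr)
    n = len(z_obs)
    observed = max(abs(z) for z in z_obs)
    exceed = 0
    for _ in range(n_sim):
        # Draw independent normals and correlate them via the factor.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(lower[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if max(abs(v) for v in z) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

Each simulated replicate costs one matrix-vector product rather than a full refit on permuted data, which is the source of the computational saving in schemes of this kind.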
Affiliation(s)
- Yaowu Liu
  - Department of Biostatistics, Harvard School of Public Health
- Jun Xie
  - Department of Statistics, Purdue University
|
11
|
Abstract
This paper considers testing procedures for screening large genome-wide data, where we examine hundreds of thousands of genetic variants, e.g., single nucleotide polymorphisms (SNP), on a quantitative phenotype. We screen the whole genome by SNP sets and propose a new test that is based on conditional effects from multiple SNPs. The test statistic is developed for weak genetic effects and incorporates correlations among genetic variables, which may be very high due to linkage disequilibrium. The limiting null distribution of the test statistic and the power of the test are derived. Under appropriate conditions, the test is shown to be more powerful than the minimum p-value method, which is based on marginal SNP effects and is the most commonly used method in genome-wide screening. The proposed test is also compared with other existing methods, including the Higher Criticism (HC) test and the sequence kernel association test (SKAT), through simulations and analysis of a real genome data set. For typical genome-wide data, where effects of individual SNPs are weak and correlations among SNPs are high, the proposed test is more advantageous and clearly outperforms the other methods in the literature.
Affiliation(s)
- Yaowu Liu
  - Department of Biostatistics, Harvard School of Public Health, 665 Huntington Avenue, Boston, Massachusetts 02115, USA
- Jun Xie
  - Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, Indiana 47907, USA
|
12
|
Barnett I, Mukherjee R, Lin X. The Generalized Higher Criticism for Testing SNP-Set Effects in Genetic Association Studies. J Am Stat Assoc 2017; 112:64-76. [PMID: 28736464 DOI: 10.1080/01621459.2016.1192039] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 01/31/2023]
Abstract
It is of substantial interest to study the effects of genes, genetic pathways, and networks on the risk of complex diseases. These genetic constructs each contain multiple SNPs, which are often correlated and function jointly, and might be large in number. However, only a sparse subset of SNPs in a genetic construct is generally associated with the disease of interest. In this article, we propose the generalized higher criticism (GHC) to test for the association between an SNP set and a disease outcome. The higher criticism is a test traditionally used in high-dimensional signal detection settings when marginal test statistics are independent and the number of parameters is very large. However, these assumptions do not always hold in genetic association studies, due to linkage disequilibrium among SNPs and the finite number of SNPs in an SNP set in each genetic construct. The proposed GHC overcomes the limitations of the higher criticism by allowing for arbitrary correlation structures among the SNPs in an SNP set, while performing accurate analytic p-value calculations for any finite number of SNPs in the SNP set. We obtain the detection boundary of the GHC test. Using simulations, we empirically compare the power of the GHC method with that of existing SNP-set tests over a range of genetic regions with varied correlation structures and signal sparsity. We apply the proposed methods to analyze the CGEM breast cancer genome-wide association study. Supplementary materials for this article are available online.
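For reference, the classical higher criticism statistic that the GHC generalizes can be sketched in a few lines of Python. The version below is the standard independence-based form, HC = max over i ≤ α0·n of √n·(i/n − p_(i)) / √(p_(i)(1 − p_(i))), where p_(i) are the sorted p-values. It does not include the correlation adjustment that defines the GHC, and the function name `higher_criticism` and the default `alpha0=0.5` are illustrative assumptions.

```python
import math


def higher_criticism(pvalues, alpha0=0.5):
    """Classical higher criticism statistic for independent p-values.

    Only the smallest alpha0 fraction of the sorted p-values is scanned.
    NOTE: this is the independence-based HC, not the generalized HC.
    """
    n = len(pvalues)
    sorted_p = sorted(pvalues)
    k_max = max(1, int(alpha0 * n))
    best = -math.inf
    for i, p in enumerate(sorted_p[:k_max], start=1):
        if p <= 0.0 or p >= 1.0:
            continue  # the standardization is undefined at 0 and 1
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best
```

A few very small p-values among otherwise uniform ones push the statistic up sharply, which is why the test is suited to sparse signals.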
Affiliation(s)
- Ian Barnett
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA
- Xihong Lin
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA
|
13
|
Donoho D, Jin J. Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects. Stat Sci 2015. [DOI: 10.1214/14-sts506] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Indexed: 11/19/2022]
|