1
Lin B, Paterson AD, Sun L. Better together against genetic heterogeneity: A sex-combined joint main and interaction analysis of 290 quantitative traits in the UK Biobank. PLoS Genet 2024; 20:e1011221. [PMID: 38656964 PMCID: PMC11073786 DOI: 10.1371/journal.pgen.1011221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 05/06/2024] [Accepted: 03/11/2024] [Indexed: 04/26/2024] Open
Abstract
Genetic effects can be sex-specific, particularly for traits such as testosterone, a sex hormone. While sex-stratified analysis provides easily interpretable sex-specific effect size estimates, the presence of sex-differences in SNP effect implies a SNP×sex interaction. This suggests the usage of the often overlooked joint test, testing for an SNP's main and SNP×sex interaction effects simultaneously. Notably, even without individual-level data, the joint test statistic can be derived from sex-stratified summary statistics through an omnibus meta-analysis. Utilizing the available sex-stratified summary statistics of the UK Biobank, we performed such omnibus meta-analyses for 290 quantitative traits. Results revealed that this approach is robust to genetic effect heterogeneity and can outperform the traditional sex-stratified or sex-combined main effect-only tests. Therefore, we advocate using the omnibus meta-analysis that captures both the main and interaction effects. Subsequent sex-stratified analysis should be conducted for sex-specific effect size estimation and interpretation.
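As a rough illustration of the omnibus meta-analysis mentioned in the abstract, the sketch below assumes that the joint statistic is the sum of the squared female and male Z-scores, referred to a chi-square distribution with 2 degrees of freedom, and contrasts it with an inverse-variance-weighted main-effect-only meta-analysis. The function names and the example effect sizes are hypothetical and are not taken from the paper.

```python
import math


def omnibus_joint_test(beta_f, se_f, beta_m, se_m):
    """Joint (main + SNPxsex interaction) test from sex-stratified summary statistics.

    Assumption: the two strata are independent, so Z_f^2 + Z_m^2 follows a
    chi-square distribution with 2 degrees of freedom under the null.
    """
    z_f = beta_f / se_f
    z_m = beta_m / se_m
    stat = z_f ** 2 + z_m ** 2
    # Survival function of a chi-square with 2 df has the closed form exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


def main_effect_meta_test(beta_f, se_f, beta_m, se_m):
    """Inverse-variance-weighted meta-analysis testing the main effect only."""
    w_f = 1.0 / se_f ** 2
    w_m = 1.0 / se_m ** 2
    beta = (w_f * beta_f + w_m * beta_m) / (w_f + w_m)
    se = math.sqrt(1.0 / (w_f + w_m))
    z = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical SNP with opposite effects in females and males.
joint_stat, joint_p = omnibus_joint_test(0.03, 0.01, -0.03, 0.01)
main_z, main_p = main_effect_meta_test(0.03, 0.01, -0.03, 0.01)
print(f"joint: stat={joint_stat:.2f}, p={joint_p:.3g}")
print(f"main-effect-only: z={main_z:.2f}, p={main_p:.3g}")
```

In this made-up case the opposite-direction effects cancel in the main-effect-only meta-analysis, while the joint statistic still reflects both strata, which is the kind of heterogeneity the abstract refers to.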
Affiliation(s)
- Boxi Lin
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Andrew D. Paterson
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Lei Sun
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
2
Chen Y, Paramo MI, Zhang Y, Yao L, Shah SR, Jin Y, Zhang J, Pan X, Yu H. Finding Needles in the Haystack: Strategies for Uncovering Noncoding Regulatory Variants. Annu Rev Genet 2023; 57:201-222. [PMID: 37562413 DOI: 10.1146/annurev-genet-030723-120717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2023]
Abstract
Despite accumulating evidence implicating noncoding variants in human diseases, unraveling their functionality remains a significant challenge. Systematic annotations of the regulatory landscape and the growth of sequence variant data sets have fueled the development of tools and methods to identify causal noncoding variants and evaluate their regulatory effects. Here, we review the latest advances in the field and discuss potential future research avenues to gain a more in-depth understanding of noncoding regulatory variants.
Affiliation(s)
- You Chen
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Mauricio I Paramo
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yingying Zhang
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Li Yao
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Sagar R Shah
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yiyang Jin
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Junke Zhang
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Xiuqi Pan
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Haiyuan Yu
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
3
Sugolov A, Emmenegger E, Paterson AD, Sun L. Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data. Statistics in Biosciences 2023; 16:250-264. [PMID: 38495080 PMCID: PMC10940486 DOI: 10.1007/s12561-023-09375-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 04/07/2023] [Accepted: 05/22/2023] [Indexed: 03/19/2024]
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain ∼1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
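To make the large-scale multiple hypothesis testing step concrete, here is a minimal sketch of two standard corrections, Bonferroni and Benjamini-Hochberg, applied to a list of p-values. It is not taken from the workshop scripts; the function names and the toy p-values are hypothetical, and only the figure of roughly 1.4 million tests comes from the abstract.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank k such that p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])


# Threshold for roughly 1.4 million tests, as in the workshop GWAS.
print(bonferroni_threshold(0.05, 1_400_000))

# Toy example with four hypothetical p-values.
toy = [0.001, 0.02, 0.03, 0.5]
bonf_hits = [i for i, p in enumerate(toy) if p <= bonferroni_threshold(0.05, len(toy))]
bh_hits = benjamini_hochberg(toy, q=0.05)
print(bonf_hits, bh_hits)
```

With about 1.4 million tests the Bonferroni threshold is on the order of 10^-8, which is why GWAS results are usually judged against very small p-value cut-offs.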
Affiliation(s)
- Anton Sugolov
  - Department of Mathematics, Faculty of Arts and Sciences, University of Toronto, Toronto, Canada
- Eric Emmenegger
  - Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Canada
- Andrew D. Paterson
  - Program in Genetics & Genome Biology, The Hospital for Sick Children, University of Toronto, Toronto, ON, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Lei Sun
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Department of Statistical Sciences, Faculty of Arts and Sciences, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
4
Sun L, Wang Z, Lu T, Manolio TA, Paterson AD. eXclusionarY: 10 years later, where are the sex chromosomes in GWASs? Am J Hum Genet 2023; 110:903-912. [PMID: 37267899 DOI: 10.1016/j.ajhg.2023.04.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 06/04/2023] Open
Abstract
10 years ago, a detailed analysis showed that only 33% of genome-wide association study (GWAS) results included the X chromosome. Multiple recommendations were made to combat such exclusion. Here, we re-surveyed the research landscape to determine whether these earlier recommendations had been translated. Unfortunately, among the genome-wide summary statistics reported in 2021 in the NHGRI-EBI GWAS Catalog, only 25% provided results for the X chromosome and 3% for the Y chromosome, suggesting that the exclusion phenomenon not only persists but has also expanded into an exclusionary problem. Normalizing by physical length of the chromosome, the average number of studies published through November 2022 with genome-wide-significant findings on the X chromosome is ∼1 study/Mb. By contrast, it ranges from ∼6 to ∼16 studies/Mb for chromosomes 4 and 19, respectively. Compared with the autosomal growth rate of ∼0.086 studies/Mb/year over the last decade, studies of the X chromosome grew at less than one-seventh that rate, only ∼0.012 studies/Mb/year. Among the studies that reported significant associations on the X chromosome, we noted extreme heterogeneities in data analysis and reporting of results, suggesting the need for clear guidelines. Unsurprisingly, among the 430 scores sampled from the PolyGenic Score Catalog, 0% contained weights for sex chromosomal SNPs. To overcome the dearth of sex chromosome analyses, we provide five sets of recommendations and future directions. Finally, until the sex chromosomes are included in a whole-genome study, instead of GWASs, we propose such studies would more properly be referred to as "AWASs," meaning "autosome-wide scans."
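The growth-rate comparison in the abstract can be checked with a couple of lines of arithmetic. The sketch below only re-uses the two rounded rates quoted above; the function name is hypothetical.

```python
def growth_ratio(x_rate, autosomal_rate):
    """Ratio of the X-chromosome growth rate to the autosomal growth rate."""
    return x_rate / autosomal_rate


# Rates quoted in the abstract, in studies/Mb/year.
ratio = growth_ratio(0.012, 0.086)
print(f"X / autosomal growth ratio: {ratio:.3f}")
print(f"one-seventh: {1 / 7:.3f}")
print("below one-seventh:", ratio < 1 / 7)
```

The ratio of the rounded rates is about 0.14, just under one-seventh, consistent with the statement in the abstract.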
Affiliation(s)
- Lei Sun
  - Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto, Toronto, ON, Canada; Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Zhong Wang
  - Department of Statistics and Data Science, Faculty of Science, National University of Singapore, Singapore
- Tianyuan Lu
  - Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto, Toronto, ON, Canada; Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Teri A Manolio
  - Division of Genomic Medicine, National Human Genome Research Institute, NIH, Bethesda, MD, USA
- Andrew D Paterson
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
5
Yoo S, Garg E, Elliott LT, Hung RJ, Halevy AR, Brooks JD, Bull SB, Gagnon F, Greenwood C, Lawless JF, Paterson AD, Sun L, Zawati MH, Lerner-Ellis J, Abraham R, Birol I, Bourque G, Garant JM, Gosselin C, Li J, Whitney J, Thiruvahindrapuram B, Herbrick JA, Lorenti M, Reuter MS, Adeoye OO, Liu S, Allen U, Bernier FP, Biggs CM, Cheung AM, Cowan J, Herridge M, Maslove DM, Modi BP, Mooser V, Morris SK, Ostrowski M, Parekh RS, Pfeffer G, Suchowersky O, Taher J, Upton J, Warren RL, Yeung R, Aziz N, Turvey SE, Knoppers BM, Lathrop M, Jones S, Scherer SW, Strug LJ. HostSeq: a Canadian whole genome sequencing and clinical data resource. BMC Genom Data 2023; 24:26. [PMID: 37131148 PMCID: PMC10152008 DOI: 10.1186/s12863-023-01128-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 02/22/2023] [Indexed: 05/04/2023] Open
Abstract
HostSeq was launched in April 2020 as a national initiative to integrate whole genome sequencing data from 10,000 Canadians infected with SARS-CoV-2 with clinical information related to their disease experience. The mandate of HostSeq is to support the Canadian and international research communities in their efforts to understand the risk factors for disease and associated health outcomes and support the development of interventions such as vaccines and therapeutics. HostSeq is a collaboration among 13 independent epidemiological studies of SARS-CoV-2 across five provinces in Canada. Aggregated data collected by HostSeq are made available to the public through two data portals: a phenotype portal showing summaries of major variables and their distributions, and a variant search portal enabling queries in a genomic region. Individual-level data is available to the global research community for health research through a Data Access Agreement and Data Access Compliance Office approval. Here we provide an overview of the collective project design along with summary level information for HostSeq. We highlight several statistical considerations for researchers using the HostSeq platform regarding data aggregation, sampling mechanism, covariate adjustment, and X chromosome analysis. In addition to serving as a rich data source, the diversity of study designs, sample sizes, and research objectives among the participating studies provides unique opportunities for the research community.
Affiliation(s)
- S Yoo
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Ottawa, Ottawa, ON, Canada
- E Garg
  - Simon Fraser University, Burnaby, BC, Canada
- L T Elliott
  - Simon Fraser University, Burnaby, BC, Canada
- R J Hung
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- A R Halevy
  - The Hospital for Sick Children, Toronto, ON, Canada
- J D Brooks
  - University of Toronto, Toronto, ON, Canada
- S B Bull
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- F Gagnon
  - University of Toronto, Toronto, ON, Canada
- C M T Greenwood
  - McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- J F Lawless
  - University of Waterloo, Waterloo, ON, Canada
- A D Paterson
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L Sun
  - University of Toronto, Toronto, ON, Canada
- J Lerner-Ellis
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- R J S Abraham
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- I Birol
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- G Bourque
  - McGill University, Montreal, QC, Canada
- J-M Garant
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- C Gosselin
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Li
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Whitney
  - The Hospital for Sick Children, Toronto, ON, Canada
- J-A Herbrick
  - The Hospital for Sick Children, Toronto, ON, Canada
- M Lorenti
  - The Hospital for Sick Children, Toronto, ON, Canada
- M S Reuter
  - The Hospital for Sick Children, Toronto, ON, Canada
- O O Adeoye
  - The Hospital for Sick Children, Toronto, ON, Canada
- S Liu
  - The Hospital for Sick Children, Toronto, ON, Canada
- U Allen
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- F P Bernier
  - University of Calgary, Calgary, AB, Canada
  - Alberta Children's Hospital, Calgary, AB, Canada
- C M Biggs
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
  - St. Paul's Hospital, Vancouver, BC, Canada
- A M Cheung
  - University Health Network, Toronto, ON, Canada
- J Cowan
  - University of Ottawa, Ottawa, ON, Canada
  - The Ottawa Hospital Research Institute, Ottawa, ON, Canada
- M Herridge
  - University Health Network, Toronto, ON, Canada
- B P Modi
  - BC Children's Hospital, Vancouver, BC, Canada
- V Mooser
  - McGill University, Montreal, QC, Canada
- S K Morris
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- M Ostrowski
  - University of Toronto, Toronto, ON, Canada
  - St. Michael's Hospital, Unity Health, Toronto, ON, Canada
- R S Parekh
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
  - Women's College Hospital, Toronto, ON, Canada
- G Pfeffer
  - University of Calgary, Calgary, AB, Canada
- J Taher
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- J Upton
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- R L Warren
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- R S M Yeung
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- N Aziz
  - The Hospital for Sick Children, Toronto, ON, Canada
- S E Turvey
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
- M Lathrop
  - McGill University, Montreal, QC, Canada
- S J M Jones
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- S W Scherer
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L J Strug
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
6
Zhao Y, Sun L. A stable and adaptive polygenic signal detection method based on repeated sample splitting. Can J Stat 2023. [DOI: 10.1002/cjs.11768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
7
Integrating variant functional annotation scores have varied abilities to improve power of genome-wide association studies. Sci Rep 2022; 12:10720. [PMID: 35750789 PMCID: PMC9232605 DOI: 10.1038/s41598-022-14924-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2022] [Accepted: 06/15/2022] [Indexed: 11/12/2022] Open
Abstract
Functional annotations have the potential to increase power of genome-wide association studies (GWAS) by prioritizing variants according to their biological function, but this potential has not been well studied. We comprehensively evaluated all 1132 traits in the UK Biobank whose SNP-heritability estimates were given “medium” or “high” labels by Neale’s lab. For each trait, we integrated GWAS summary statistics of close to 8 million common variants (minor allele frequency >1%) with either their 75 individual functional scores or their meta-scores, using three different data-integration methods. Overall, the number of new genome-wide significant findings after data-integration increases as a trait SNP-heritability estimate increases. However, there is a trade-off between new findings and loss of baseline GWAS findings, resulting in similar total numbers of significant findings between using GWAS alone and integrating GWAS with functional scores, across all 1132 traits analyzed and all three data-integration methods considered. Our findings suggest that, even with the current biobank-level sample size, more informative functional scores and/or new data-integration methods are needed to further improve the power of GWAS of common variants. For example, studying variants in coding sequence and obtaining cell-type-specific scores are potential future directions.
8
Testing the equality of multivariate means when $p>n$ by combining the Hotelling and Simes tests. TEST-SPAIN 2022. [DOI: 10.1007/s11749-021-00781-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
9
Hébert F, Causeur D, Emily M. Omnibus testing approach for gene-based gene-gene interaction. Stat Med 2022; 41:2854-2878. [PMID: 35338506 DOI: 10.1002/sim.9389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2020] [Revised: 03/03/2022] [Accepted: 03/04/2022] [Indexed: 11/07/2022]
Abstract
Genetic interaction is considered one of the main heritable components of complex traits. With the emergence of genome-wide association studies (GWAS), a collection of statistical methods dedicated to the identification of interaction at the SNP level has been proposed. More recently, gene-based gene-gene interaction testing has emerged as an attractive alternative, as it confers advantages in both statistical power and biological interpretation. Most of the gene-based interaction methods rely on a multidimensional modeling of the interaction, thus facing a lack of robustness against the huge space of interaction patterns. In this paper, we study a global testing approach to address the issue of gene-based gene-gene interaction. Based on a logistic regression modeling framework, all SNP-SNP interaction tests are combined to produce a gene-level test for interaction. We propose an omnibus test that takes advantage of (1) the heterogeneity between existing global tests and (2) the complementarity between allele-based and genotype-based coding of SNPs. Through an extensive simulation study, it is demonstrated that the proposed omnibus test has the ability to detect with high power the most common interaction genetic models with one causal pair as well as more complex genetic models where more than one causal pair is involved. On the other hand, the proposed approach is shown to be robust and to improve power compared to single global tests in replication studies. Furthermore, the application of our procedure to real datasets confirms the adaptability of our approach to replicate various gene-gene interactions.
Affiliation(s)
- Florian Hébert
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- David Causeur
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
- Mathieu Emily
  - Department of Statistics and Computer Science, Institut Agro, CNRS, IRMAR, Univ Rennes, F-35000, Rennes, France
10
Wang F, Panjwani N, Wang C, Sun L, Strug LJ. A flexible summary statistics-based colocalization method with application to the mucin cystic fibrosis lung disease modifier locus. Am J Hum Genet 2022; 109:253-269. [PMID: 35065708 PMCID: PMC8874229 DOI: 10.1016/j.ajhg.2021.12.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/01/2021] [Accepted: 12/15/2021] [Indexed: 12/18/2022] Open
Abstract
Mucus obstruction is a central feature in the cystic fibrosis (CF) airways. A genome-wide association study (GWAS) of lung disease by the CF Gene Modifier Consortium (CFGMC) identified a significant locus containing two mucin genes, MUC20 and MUC4. Expression quantitative trait locus (eQTL) analysis using human nasal epithelia (HNE) from 94 CF-affected Canadians in the CFGMC demonstrated MUC4 eQTLs that mirrored the lung association pattern in the region, suggesting that MUC4 expression may mediate CF lung disease. Complications arose, however, with colocalization testing using existing methods: the locus is complex and the associated SNPs span a 0.2 Mb region with high linkage disequilibrium (LD) and evidence of allelic heterogeneity. We previously developed the Simple Sum (SS), a powerful colocalization test in regions with allelic heterogeneity, but SS assumed eQTLs to be present to achieve type I error control. Here we propose a two-stage SS (SS2) colocalization test that avoids a priori eQTL assumptions, accounts for multiple hypothesis testing and the composite null hypothesis, and enables meta-analysis. We compare SS2 to published approaches through simulation and demonstrate type I error control for all settings with the greatest power in the presence of high LD and allelic heterogeneity. Applying SS2 to the MUC20/MUC4 CF lung disease locus with eQTLs from CF HNE revealed significant colocalization with MUC4 (p = 1.31 × 10⁻⁵) rather than with MUC20. The SS2 is a powerful method to inform the responsible gene(s) at a locus and guide future functional studies. SS2 has been implemented in the application LocusFocus.
Affiliation(s)
- Fan Wang
  - Department of Statistical Sciences, University of Toronto, Toronto, ON M5G 1Z5, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Naim Panjwani
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Cheng Wang
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, ON M5G 1Z5, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Lisa J Strug
  - Department of Statistical Sciences, University of Toronto, Toronto, ON M5G 1Z5, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada; Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada; The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
11
Chen B, Craiu RV, Strug LJ, Sun L. The X factor: A robust and powerful approach to X-chromosome-inclusive whole-genome association studies. Genet Epidemiol 2021; 45:694-709. [PMID: 34224641 PMCID: PMC9292551 DOI: 10.1002/gepi.22422] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 03/03/2021] [Revised: 05/14/2021] [Accepted: 05/28/2021] [Indexed: 12/17/2022]
Abstract
The X‐chromosome is often excluded from genome‐wide association studies because of analytical challenges. Some of the problems, such as the random, skewed, or no X‐inactivation model uncertainty, have been investigated. Other considerations have received little to no attention, such as the value in considering nonadditive and gene–sex interaction effects, and the inferential consequence of choosing different baseline alleles (i.e., the reference vs. the alternative allele). Here we propose a unified and flexible regression‐based association test for X‐chromosomal variants. We provide theoretical justifications for its robustness in the presence of various model uncertainties, as well as for its improved power when compared with the existing approaches under certain scenarios. For completeness, we also revisit the autosomes and show that the proposed framework leads to a more robust approach than the standard method. Finally, we provide supporting evidence by revisiting several published association studies. Supporting Information for this article are available online.
Affiliation(s)
- Bo Chen
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Radu V Craiu
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Lisa J Strug
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada; Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada; Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
12
Extension of SKAT to multi-category phenotypes through a geometrical interpretation. Eur J Hum Genet 2021; 29:736-744. [PMID: 33446828 PMCID: PMC8110546 DOI: 10.1038/s41431-020-00792-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2020] [Revised: 10/26/2020] [Accepted: 11/25/2020] [Indexed: 01/29/2023] Open
Abstract
Rare genetic variants are expected to play an important role in disease and several statistical methods have been developed to test for disease association with rare variants, including variance-component tests. These tests however deal only with binary or continuous phenotypes and it is not possible to take advantage of a suspected heterogeneity between subgroups of patients. To address this issue, we extended the popular rare-variant association test SKAT to compare more than two groups of individuals. Simulations under different scenarios were performed that showed gain in power in presence of genetic heterogeneity and minor lack of power in absence of heterogeneity. An application on whole-exome sequencing data from patients with early- or late-onset moyamoya disease also illustrated the advantage of our SKAT extension. Genetic simulations and SKAT extension are implemented in the R package Ravages available on GitHub ( https://github.com/genostats/Ravages ).
13
Soave D, Lawless JF, Awadalla P. Score tests for scale effects, with application to genomic analysis. Stat Med 2021; 40:3808-3822. [PMID: 33908071 DOI: 10.1002/sim.9000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2020] [Revised: 04/01/2021] [Accepted: 04/07/2021] [Indexed: 11/07/2022]
Abstract
Tests for variance or scale effects due to covariates are used in many areas and recently, in genomic and genetic association studies. We study score tests based on location-scale models with arbitrary error distributions that allow incorporation of additional adjustment covariates. Tests based on Gaussian and Laplacian double generalized linear models are examined in some detail. Numerical properties of the tests under Gaussian and other error distributions are examined. Our results show that the use of model-based asymptotic distributions with score tests for scale effects does not control type 1 error well in many settings of practical relevance. We consider simple statistics based on permutation distribution approximations, which correspond to well-known statistics derived by another approach. They are shown to give good type 1 error control under different error distributions and under covariate distribution imbalance. The methods are illustrated through a differential gene expression analysis involving breast cancer tumor samples.
Affiliation(s)
- David Soave
  - Department of Mathematics, Wilfrid Laurier University, Waterloo, Ontario, Canada
  - Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
- Jerald F Lawless
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Philip Awadalla
  - Computational Biology Program, Ontario Institute for Cancer Research, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
14
Zhang L, Sun L. A generalized robust allele-based genetic association test. Biometrics 2021; 78:487-498. [PMID: 33729547 PMCID: PMC9544499 DOI: 10.1111/biom.13456] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/12/2020] [Revised: 12/13/2020] [Accepted: 03/04/2021] [Indexed: 12/30/2022]
Abstract
The allele-based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy-Weinberg equilibrium (HWE) assumption, not applicable to continuous traits, and not easy to account for covariate effect or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual-level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously and multiallelic genetic markers, while accounting for covariate effect, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.
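For orientation, the classical allelic test that this abstract treats as a special case can be sketched as a chi-square test on a 2x2 table of allele counts. This is only the textbook test, not the authors' regression-based score test. The function name `allelic_chi_square` and the genotype-count inputs are hypothetical, and the 1-df p-value relies on the HWE assumption the abstract cautions about.

```python
import math

def allelic_chi_square(case_counts, control_counts):
    """Classical allelic test on a 2x2 table of allele counts.

    Each argument is a hypothetical (n0, n1, n2) tuple: the number of
    individuals carrying 0, 1, or 2 copies of the minor allele.
    Returns (chi-square statistic, 1-df p-value).
    """
    def allele_counts(genotype_counts):
        n0, n1, n2 = genotype_counts
        minor = n1 + 2 * n2   # heterozygotes add one minor allele, homozygotes two
        major = 2 * n0 + n1   # remaining alleles are major
        return minor, major

    a, b = allele_counts(case_counts)      # cases: minor, major
    c, d = allele_counts(control_counts)   # controls: minor, major
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the allele table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Illustrative counts only (50 cases, 50 controls); not data from the paper.
chi2, p = allelic_chi_square(case_counts=(25, 20, 5), control_counts=(32, 16, 2))
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Because the table counts alleles rather than individuals, each person contributes two observations. That is why the test depends on HWE and why the abstract's correction factor is needed when HWE fails.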
Affiliation(s)
- Lin Zhang
  - Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto, Toronto, Canada
- Lei Sun
  - Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto, Toronto, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
15
Chen B, Craiu RV, Sun L. Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study. Biostatistics 2020; 21:319-335. [PMID: 30247537 DOI: 10.1093/biostatistics/kxy049] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/24/2017] [Accepted: 06/30/2018] [Indexed: 01/17/2023] Open
Abstract
The X-chromosome is often excluded from so-called "whole-genome" association studies because of the differences it exhibits between males and females. One particular analytical challenge is the unknown status of X-inactivation, where one of the two X-chromosome variants in females may be randomly selected to be silenced. In the absence of biological evidence in favor of one specific model, we consider a Bayesian model averaging framework that offers a principled way to account for the inherent model uncertainty, providing model averaging-based posterior density intervals and Bayes factors. We examine the inferential properties of the proposed methods via extensive simulation studies, and we apply the methods to a genetic association study of an intestinal disease occurring in about 20% of cystic fibrosis patients. Compared with the results previously reported assuming the presence of inactivation, we show that the proposed Bayesian methods provide more feature-rich quantities that are useful in practice.
Affiliation(s)
- Bo Chen
  - Department of Statistical Sciences, University of Toronto, Sidney Smith Hall, 100 St. George Street, Toronto, ON, Canada
- Radu V Craiu
  - Department of Statistical Sciences, University of Toronto, Sidney Smith Hall, 100 St. George Street, Toronto, ON, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Sidney Smith Hall, 100 St. George Street, Toronto, ON, Canada
16
Zhao Y, Sun L. On set‐based association tests: Insights from a regression using summary statistics. Can J Stat 2020. [DOI: 10.1002/cjs.11584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Affiliation(s)
- Yanyan Zhao
  - Department of Statistical Sciences, University of Toronto, Toronto, M5S 3G3, Ontario, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, M5S 3G3, Ontario, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, M5T 3M7, Ontario, Canada
17
Fore R, Boehme J, Li K, Westra J, Tintle N. Multi-Set Testing Strategies Show Good Behavior When Applied to Very Large Sets of Rare Variants. Front Genet 2020; 11:591606. [PMID: 33240333 PMCID: PMC7680887 DOI: 10.3389/fgene.2020.591606] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/05/2020] [Accepted: 10/05/2020] [Indexed: 12/22/2022] Open
Abstract
Gene-based tests of association (e.g., variance components and burden tests) are now common practice for analyses attempting to elucidate the contribution of rare genetic variants to common disease. As sequencing datasets continue to grow in size, the number of variants within each set (e.g., gene) being tested is also continuing to grow. Pathway-based methods have been used to allow for the initial aggregation of gene-based statistical evidence and then the subsequent aggregation of evidence across the pathway. This “multi-set” approach (first gene-based test, followed by pathway-based) lacks thorough exploration in regard to evaluating genotype–phenotype associations in the age of large, sequenced datasets. In particular, we ask whether there are statistical and biological characteristics that make the multi-set approach optimal compared with simply doing all gene-based tests. In this paper, we provide an intuitive framework for evaluating these questions and use simulated data to affirm this intuition. A real data application is provided demonstrating how our insights manifest themselves in practice. Ultimately, we find that when initial subsets are biologically informative (e.g., tending to aggregate causal genetic variants within one or more subsets, often genes), multi-set strategies can improve statistical power, with particular gains in cases where causal variants are aggregated in subsets with fewer variants overall (high proportion of causal variants in the subset). However, we find that there is little advantage when the sets are non-informative (similar proportion of causal variants in the subsets). Our application to real data further demonstrates this intuition. In practice, we recommend wider use of pathway-based methods and further exploration of optimal ways of aggregating variants into subsets based on emerging biological evidence of the genetic architecture of complex disease.
Affiliation(s)
- Ruby Fore
  - Department of Biostatistics, Brown University, Providence, RI, United States
- Jaden Boehme
  - Department of Mathematics, Oregon State University, Corvallis, OR, United States
- Kevin Li
  - Department of Mathematics, School of Arts and Sciences, Columbia University, New York, NY, United States
- Jason Westra
  - Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
- Nathan Tintle
  - Department of Mathematics and Statistics, Dordt University, Sioux Center, IA, United States
18
Derkach A, Moore SC, Boca SM, Sampson JN. Group testing in mediation analysis. Stat Med 2020; 39:2423-2436. [DOI: 10.1002/sim.8546] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/13/2019] [Revised: 11/01/2019] [Accepted: 03/05/2020] [Indexed: 11/09/2022]
Affiliation(s)
- Andriy Derkach
  - Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
- Steven C. Moore
  - Metabolomics Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
- Simina M. Boca
  - Innovation Center for Biomedical Informatics, Department of Oncology and Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- Joshua N. Sampson
  - Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
19
Che M, Lawless JF, Han P. Empirical and conditional likelihoods for two‐phase studies. Can J Stat 2020. [DOI: 10.1002/cjs.11566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Menglu Che
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Jerald F. Lawless
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Peisong Han
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, U.S.A
20
Bocher O, Génin E. Rare variant association testing in the non-coding genome. Hum Genet 2020; 139:1345-1362. [PMID: 32500240 DOI: 10.1007/s00439-020-02190-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 04/22/2020] [Accepted: 05/29/2020] [Indexed: 12/25/2022]
Abstract
The development of next-generation sequencing technologies has opened up new possibilities to explore the contribution of genetic variants to human diseases, in particular that of rare variants. Statistical methods have been developed to test for association with rare variants; they require the definition of testing units and, within these testing units, the selection of qualifying variants to include in the test. In the coding regions of the genome, testing units are usually the different genes and qualifying variants are selected based on their functional effects on the encoded proteins. Extending these tests to the non-coding regions of the genome is challenging. Testing units are difficult to define as the organisation of the non-coding genome is still largely unknown. Qualifying variants are difficult to select as the functional impact of non-coding variants on gene expression is hard to predict. These difficulties could explain why very few investigators so far have analysed the non-coding parts of their whole genome sequencing data. Yet these non-coding parts represent the vast majority of the genome, and some studies suggest that they could play a major role in disease susceptibility. In this review, we discuss recent experimental and statistical developments to gain knowledge on the non-coding genome and how this knowledge could be used to include rare non-coding variants in association tests. We describe the few studies that have considered variants from the non-coding genome in association tests and how they managed to define testing units and select qualifying variants.
Affiliation(s)
- Ozvan Bocher
  - Génétique, Génomique Fonctionnelle Et Biotechnologies, Faculté de Médecine, Univ Brest, Inserm, Inserm UMR1078, Bâtiment E-IBRBS 2ieme étage, 22 avenue Camille Desmoulins, 29238, Brest Cedex 3, France.
- Emmanuelle Génin
  - Génétique, Génomique Fonctionnelle Et Biotechnologies, Faculté de Médecine, Univ Brest, Inserm, Inserm UMR1078, Bâtiment E-IBRBS 2ieme étage, 22 avenue Camille Desmoulins, 29238, Brest Cedex 3, France.
  - CHU Brest, Brest, France.
21
Romanescu RG, Green J, Andrulis IL, Bull SB. Gene-based and pathway-based testing for rare-variant association in affected sib pairs. Genet Epidemiol 2020; 44:368-381. [PMID: 32237178 PMCID: PMC7318298 DOI: 10.1002/gepi.22291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/06/2019] [Revised: 02/28/2020] [Accepted: 03/06/2020] [Indexed: 12/04/2022]
Abstract
Next generation sequencing technologies have made it possible to investigate the role of rare variants (RVs) in disease etiology. Because RVs associated with disease susceptibility tend to be enriched in families with affected individuals, study designs based on affected sib pairs (ASP) can be more powerful than case-control studies. We construct tests of RV-set association in ASPs for single genomic regions as well as for multiple regions. Single-region tests can efficiently detect a gene region harboring susceptibility variants, while multiple-region extensions are meant to capture signals dispersed across a biological pathway, potentially as a result of locus heterogeneity. Within ascertained ASPs, the test statistics contrast the frequencies of duplicate rare alleles (usually appearing on a shared haplotype) against frequencies of a single rare allele copy (appearing on a nonshared haplotype); we call these allelic parity tests. Incorporation of minor allele frequency estimates from reference populations can markedly improve test efficiency. Under various genetic penetrance models, application of the tests in simulated ASP data sets demonstrates good type I error properties as well as power gains over approaches that regress ASP rare allele counts on sharing state, especially in small samples. We discuss robustness of the allelic parity methods to the presence of genetic linkage, misspecification of reference population allele frequencies, sequencing error and de novo mutations, and population stratification. As proof of principle, we apply single- and multiple-region tests in a motivating study data set consisting of whole exome sequencing of sisters ascertained with early onset breast cancer.
Affiliation(s)
- Razvan G. Romanescu
  - Lunenfeld‐Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
  - Centre for Healthcare Innovation, Rady Faculty of Health Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Jessica Green
  - Lunenfeld‐Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
- Irene L. Andrulis
  - Lunenfeld‐Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Shelley B. Bull
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
22
Heller R, Meir A, Chatterjee N. Post-selection estimation and testing following aggregate association tests. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12318] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/29/2022]
Affiliation(s)
- Amit Meir
  - University of Washington, Seattle, USA
23
Derkach A, Pfeiffer RM. Subset testing and analysis of multiple phenotypes. Genet Epidemiol 2019; 43:492-505. [PMID: 30920058 DOI: 10.1002/gepi.22199] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/18/2018] [Revised: 02/08/2019] [Accepted: 02/19/2019] [Indexed: 11/08/2022]
Abstract
Meta-analysis of multiple genome-wide association studies (GWAS) is effective for detecting single- or multimarker associations with complex traits. We develop a flexible procedure (subset testing and analysis of multiple phenotypes [STAMP]) based on mixture models to perform a region-based meta-analysis of different phenotypes using data from different GWAS and identify subsets of associated phenotypes. Our model framework helps distinguish true associations from between-study heterogeneity. As a measure of association, we compute for each phenotype the posterior probability that the genetic region under investigation is truly associated. Extensive simulations show that STAMP is more powerful than standard approaches for meta-analyses when the proportion of truly associated outcomes is between 25% and 50%. For other settings, the power of STAMP is similar to that of existing methods. We illustrate our method on two examples, the association of a region on chromosome 9p21 with the risk of 14 cancers, and the associations of expression of quantitative trait loci from two genetic regions with their cis-single-nucleotide polymorphisms measured in 17 tissue types using data from The Cancer Genome Atlas.
Affiliation(s)
- Andriy Derkach
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland
- Ruth M Pfeiffer
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland
24
Zhang T, Sun L. Beyond the traditional simulation design for evaluating type 1 error control: From the "theoretical" null to "empirical" null. Genet Epidemiol 2018; 43:166-179. [PMID: 30478944 PMCID: PMC6518945 DOI: 10.1002/gepi.22172] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/04/2018] [Revised: 09/10/2018] [Accepted: 09/21/2018] [Indexed: 01/25/2023]
Abstract
When evaluating a newly developed statistical test, an important step is to check its type 1 error (T1E) control using simulations. This is often achieved by the standard simulation design S0 under the so-called "theoretical" null of no association. In practice, whole-genome association analyses scan through a large number of genetic markers (Gs) for the ones associated with an outcome of interest (Y), where Y comes from an alternative while the majority of Gs are not associated with Y; the Y-G relationships are under the "empirical" null. This reality can be better represented by two other simulation designs, where design S1.1 simulates Y from an alternative model based on G, then evaluates its association with independently generated G_new; while design S1.2 evaluates the association between permuted Y and G. More than a decade ago, Efron (2004) noted the important distinction between the "theoretical" and "empirical" null in false discovery rate control. Using scale tests for variance heterogeneity, direct univariate, and multivariate interaction tests as examples, here we show that not all null simulation designs are equal. In examining the accuracy of a likelihood ratio test, while simulation design S0 suggested that the method was accurate, designs S1.1 and S1.2 revealed its increased empirical T1E rate if applied in a real data setting. The inflation becomes more severe at the tail and does not diminish as sample size increases. This is an important observation that calls for new practices for methods evaluation and T1E control interpretation.
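The three null designs named in this abstract can be sketched as data generators. This is a minimal illustration under assumptions that are not from the paper: a hypothetical additive mean-shift model Y = beta * G + noise stands in for "an alternative model based on G", genotypes are drawn under Hardy-Weinberg equilibrium, and all function names are invented for the sketch.

```python
import random

def simulate_genotypes(n, maf, rng):
    """Additive genotypes (0, 1, or 2 minor alleles) under Hardy-Weinberg equilibrium."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

def simulate_trait(genotypes, beta, rng):
    """Hypothetical alternative model: Y = beta * G + standard normal noise."""
    return [beta * g + rng.gauss(0.0, 1.0) for g in genotypes]

def design_s0(n, maf, rng):
    """S0 ("theoretical" null): Y carries no genetic effect at all."""
    g = simulate_genotypes(n, maf, rng)
    y = simulate_trait(g, beta=0.0, rng=rng)
    return y, g

def design_s1_1(n, maf, beta, rng):
    """S1.1: Y is generated from G, then paired with an independent G_new."""
    g = simulate_genotypes(n, maf, rng)
    y = simulate_trait(g, beta, rng)
    g_new = simulate_genotypes(n, maf, rng)
    return y, g_new

def design_s1_2(n, maf, beta, rng):
    """S1.2: Y is generated from G, then permuted to break the Y-G pairing."""
    g = simulate_genotypes(n, maf, rng)
    y = simulate_trait(g, beta, rng)
    y_permuted = rng.sample(y, k=len(y))  # random permutation of Y
    return y_permuted, g

rng = random.Random(2018)
null_data = {
    "S0": design_s0(1000, 0.3, rng),
    "S1.1": design_s1_1(1000, 0.3, 0.5, rng),
    "S1.2": design_s1_2(1000, 0.3, 0.5, rng),
}
```

In all three designs the tested marker is unassociated with Y. Under S1.1 and S1.2, however, Y is marginally a mixture over genotype groups rather than a single normal, which is the kind of departure a model-based test may not tolerate.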
Affiliation(s)
- Ting Zhang
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Lei Sun
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
  - Department of Statistical Sciences, Faculty of Arts and Science, University of Toronto, Toronto, Ontario, Canada
25
Zhu B, Mirabello L, Chatterjee N. A subregion-based burden test for simultaneous identification of susceptibility loci and subregions within. Genet Epidemiol 2018; 42:673-683. [PMID: 29931698 PMCID: PMC6185783 DOI: 10.1002/gepi.22134] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 01/11/2018] [Revised: 04/14/2018] [Accepted: 05/04/2018] [Indexed: 01/08/2023]
Abstract
In rare variant association studies, aggregating rare and/or low frequency variants may increase statistical power for detection of the underlying susceptibility gene or region. However, it is unclear which variants, or class of them, in a gene contribute most to the association. We proposed a subregion-based burden test (REBET) to simultaneously select susceptibility genes and identify important underlying subregions. The subregions are predefined by shared common biologic characteristics, such as the protein domain or functional impact. Based on a subset-based approach considering local correlations between combinations of test statistics of subregions, REBET is able to properly control the type I error rate while adjusting for multiple comparisons in a computationally efficient manner. Simulation studies show that REBET can achieve power competitive with alternative methods when rare variants cluster within subregions. In two case studies, REBET is able to identify known disease susceptibility genes and, more importantly, pinpoint the previously unreported most susceptible subregions, which represent protein domains essential for gene function. R package REBET is available at https://dceg.cancer.gov/tools/analysis/rebet.
Affiliation(s)
- Bin Zhu
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD 20892, USA
- Lisa Mirabello
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Bethesda, MD 20892, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
  - Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
26
Derkach A, Zhang H, Chatterjee N. Power Analysis for Genetic Association Test (PAGEANT) provides insights to challenges for rare variant association studies. Bioinformatics 2018; 34:1506-1513. [PMID: 29194474 PMCID: PMC5925788 DOI: 10.1093/bioinformatics/btx770] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/07/2017] [Revised: 10/02/2017] [Accepted: 11/27/2017] [Indexed: 12/18/2022] Open
Abstract
Motivation: Genome-wide association studies are now shifting focus from analysis of common to rare variants. As power for association testing for individual rare variants may often be low, various aggregate level association tests have been proposed to detect genetic loci. Typically, power calculations for such tests require specification of a large number of parameters, including effect sizes and allele frequencies of individual variants, making them difficult to use in practice. We propose to approximate power to a varying degree of accuracy using a smaller number of key parameters, including the total genetic variance explained by multiple variants within a locus. Results: We perform extensive simulation studies to assess the accuracy of the proposed approximations in realistic settings. Using these simplified power calculations, we develop an analytic framework to obtain bounds on genetic architecture of an underlying trait given results from genome-wide association studies with rare variants. Finally, we provide insights into the required quality of annotation/functional information for identification of likely causal variants to make meaningful improvement in power. Availability and implementation: A shiny application in R that allows a variety of Power Analysis of GEnetic AssociatioN Tests (PAGEANT) is made publicly available at https://andrewhaoyu.shinyapps.io/PAGEANT/. Contact: nilanjan@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andriy Derkach
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
- Haoyu Zhang
  - Department of Biostatistics, Bloomberg School of Public Health, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Bloomberg School of Public Health, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
  - Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
27
Wu C, Pan W. Integrating eQTL data with GWAS summary statistics in pathway-based analysis with application to schizophrenia. Genet Epidemiol 2018; 42:303-316. [PMID: 29411426 PMCID: PMC5851843 DOI: 10.1002/gepi.22110] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 09/19/2017] [Revised: 01/04/2018] [Accepted: 01/04/2018] [Indexed: 12/11/2022]
Abstract
Many genetic variants affect complex traits through gene expression, which can be exploited to boost statistical power and enhance interpretation in genome-wide association studies (GWASs) as demonstrated by the transcriptome-wide association study (TWAS) approach. Furthermore, due to polygenic inheritance, a complex trait is often affected by multiple genes with similar functions as annotated in gene pathways. Here, we extend TWAS from gene-based analysis to pathway-based analysis: we integrate public pathway collections, expression quantitative trait locus (eQTL) data and GWAS summary association statistics (or GWAS individual-level data) to identify gene pathways associated with complex traits. The basic idea is to weight the SNPs of the genes in a pathway based on their estimated cis-effects on gene expression, then adaptively test for association of the pathway with a GWAS trait by effectively aggregating possibly weak association signals across the genes in the pathway. The P values can be calculated analytically and thus fast. We applied our proposed test with the KEGG and GO pathways to two schizophrenia (SCZ) GWAS summary association data sets, denoted by SCZ1 and SCZ2 with about 20,000 and 150,000 subjects, respectively. Most of the significant pathways identified by analyzing the SCZ1 data were reproduced by the SCZ2 data. Importantly, we identified 15 novel pathways associated with SCZ, such as GABA receptor complex (GO:1902710), which could not be uncovered by the standard single SNP-based analysis or gene-based TWAS. The newly identified pathways may help us gain insights into the biological mechanism underlying SCZ. Our results showcase the power of incorporating gene expression information and gene functional annotations into pathway-based association testing for GWAS.
Affiliation(s)
- Chong Wu
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
- Wei Pan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America
28
Kim SA, Cho CS, Kim SR, Bull SB, Yoo YJ. A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs. Bioinformatics 2018; 34:388-397. [PMID: 29028986 PMCID: PMC5860363 DOI: 10.1093/bioinformatics/btx609] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 12/15/2016] [Revised: 09/11/2017] [Accepted: 09/28/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation: Linkage disequilibrium (LD) block construction is required for research in population genetics and genetic epidemiology, including specification of sets of single nucleotide polymorphisms (SNPs) for analysis of multi-SNP based association and identification of haplotype blocks in high density sequencing data. Existing methods based on a narrow sense definition do not allow intermediate regions of low LD between strongly associated SNP pairs and tend to split high density SNP data into small blocks having high between-block correlation. Results: We present Big-LD, a block partition method based on interval graph modeling of LD bins, which are clusters of strong pairwise LD SNPs, not necessarily physically consecutive. Big-LD uses an agglomerative approach that starts by identifying small communities of SNPs, i.e. the SNPs in each LD bin region, and proceeds by merging these communities. We determine the number of blocks using a method to find a maximum-weight independent set. Big-LD produces larger LD blocks compared to existing methods such as MATILDE, Haploview, MIG++, or S-MIG++, and the LD blocks better agree with recombination hotspot locations determined by sperm-typing experiments. The observed average runtime of Big-LD for 13 288 240 non-monomorphic SNPs from 1000 Genomes Project autosome data (286 East Asians) is about 5.83 h, which is a significant improvement over the existing methods. Availability and implementation: Source code and documentation are available for download at http://github.com/sunnyeesl/BigLD. Contact: yyoo@snu.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sun Ah Kim
  - Department of Statistics, Seoul National University, Seoul, South Korea
- Chang-Sung Cho
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Suh-Ryung Kim
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Shelley B Bull
  - Prosserman Centre for Health Research, The Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Yun Joo Yoo
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
29
Association detection between ordinal trait and rare variants based on adaptive combination of P values. J Hum Genet 2017; 63:37-45. [PMID: 29215083 DOI: 10.1038/s10038-017-0354-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 04/27/2017] [Revised: 08/19/2017] [Accepted: 09/06/2017] [Indexed: 12/31/2022]
Abstract
Next-generation sequencing technology not only offers a new way to detect human genomic structural variation but also provides a large amount of genetic data on rare variants. How to detect association between human complex diseases and rare variants using these genetic data has attracted extensive attention. In medicine, many health and disease conditions are measured by ordinal response variables, where the trait value reflects the development stage or severity of a condition. However, most existing methods for testing association between rare variants and complex diseases are designed for dichotomous or quantitative traits. Association analysis methods for ordinal traits are relatively few, and treating ordinal traits as dichotomous or quantitative inevitably loses some valuable information in the original data. Therefore, in this paper, we extend an existing method, the adaptive combination of P values (ADA), and propose a new association analysis method for ordinal traits (called OR-ADA) to test for possible association between an ordinal trait and rare variants. In our method, we establish a cumulative logistic regression model, in which the regression coefficients are estimated by the Newton-Raphson algorithm and the likelihood ratio test is used to test the association. Through a large number of simulation studies and an example, we demonstrate the performance of the new method and compare it with several other methods. The results show that the OR-ADA strategy is robust to the signs of the effects of causal variants and is more powerful under many scenarios.
|
30
|
He Z, Xu B, Lee S, Ionita-Laza I. Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data. Am J Hum Genet 2017; 101:340-352. [PMID: 28844485 PMCID: PMC5590864 DOI: 10.1016/j.ajhg.2017.07.011] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 03/31/2017] [Accepted: 07/18/2017] [Indexed: 12/14/2022] Open
Abstract
Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests.
Affiliation(s)
- Zihuai He
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Bin Xu
  - Department of Psychiatry, Columbia University, New York, NY 10032, USA
- Seunggeun Lee
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
|
31
|
Abstract
Despite thousands of genetic loci identified to date, a large proportion of genetic variation predisposing to complex disease and traits remains unaccounted for. Advances in sequencing technology enable focused explorations on the contribution of low-frequency and rare variants to human traits. Here we review experimental approaches and current knowledge on the contribution of these genetic variants in complex disease and discuss challenges and opportunities for personalised medicine.
Affiliation(s)
- Lorenzo Bomba
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Klaudia Walter
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Nicole Soranzo
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
  - Department of Haematology, University of Cambridge, Hills Rd, Cambridge, CB2 0AH, UK
  - The National Institute for Health Research Blood and Transplant Unit (NIHR BTRU) in Donor Health and Genomics at the University of Cambridge, University of Cambridge, Strangeways Research Laboratory, Wort's Causeway, Cambridge, CB1 8RN, UK
|
32
|
Soave D, Sun L. A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty. Biometrics 2017; 73:960-971. [DOI: 10.1111/biom.12651] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 06/01/2016] [Revised: 12/01/2016] [Accepted: 12/01/2016] [Indexed: 10/20/2022]
Affiliation(s)
- David Soave
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto; Toronto, Ontario M5T 3M7 Canada
  - Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children; Toronto, Ontario M5G 0A4 Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto; Toronto, Ontario M5S 3G3 Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto; Toronto, Ontario M5T 3M7 Canada
|
33
|
Wang Z, Xu K, Zhang X, Wu X, Wang Z. Longitudinal SNP-set association analysis of quantitative phenotypes. Genet Epidemiol 2017; 41:81-93. [PMID: 27859628 PMCID: PMC5154867 DOI: 10.1002/gepi.22016] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 03/15/2016] [Revised: 08/10/2016] [Accepted: 09/19/2016] [Indexed: 02/06/2023]
Abstract
Many genetic epidemiological studies collect repeated measurements over time. This design not only provides a more accurate assessment of disease condition, but allows us to explore the genetic influence on disease development and progression. Thus, it is of great interest to study the longitudinal contribution of genes to disease susceptibility. Most association testing methods for longitudinal phenotypes are developed for single variant, and may have limited power to detect association, especially for variants with low minor allele frequency. We propose Longitudinal SNP-set/sequence kernel association test (LSKAT), a robust, mixed-effects method for association testing of rare and common variants with longitudinal quantitative phenotypes. LSKAT uses several random effects to account for the within-subject correlation in longitudinal data, and allows for adjustment for both static and time-varying covariates. We also present a longitudinal trait burden test (LBT), where we test association between the trait and the burden score in linear mixed models. In simulation studies, we demonstrate that LBT achieves high power when variants are almost all deleterious or all protective, while LSKAT performs well in a wide range of genetic models. By making full use of trait values from repeated measures, LSKAT is more powerful than several tests applied to a single measurement or average over all time points. Moreover, LSKAT is robust to misspecification of the covariance structure. We apply the LSKAT and LBT methods to detect association with longitudinally measured body mass index in the Framingham Heart Study, where we are able to replicate association with a circadian gene NR1D2.
Affiliation(s)
- Zhong Wang
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, United States of America
  - Baker Institute for Animal Health, Cornell University, Ithaca, New York, United States of America
  - Center for Computational Biology, Beijing Forestry University, Beijing, China
- Ke Xu
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, United States of America
  - VA Connecticut Healthcare System, West Haven, Connecticut, United States of America
- Xinyu Zhang
  - Department of Psychiatry, Yale School of Medicine, New Haven, Connecticut, United States of America
  - VA Connecticut Healthcare System, West Haven, Connecticut, United States of America
- Xiaowei Wu
  - Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, United States of America
- Zuoheng Wang
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, United States of America
|
34
|
He Z, Zhang M, Lee S, Smith JA, Kardia SLR, Diez Roux AV, Mukherjee B. Set-Based Tests for the Gene-Environment Interaction in Longitudinal Studies. J Am Stat Assoc 2016; 112:966-978. [PMID: 29780190 PMCID: PMC5954413 DOI: 10.1080/01621459.2016.1252266] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 02/01/2016] [Accepted: 10/01/2016] [Indexed: 01/09/2023]
Abstract
We propose a generalized score type test for set-based inference for gene-environment interaction with longitudinally measured quantitative traits. The test is robust to misspecification of within subject correlation structure and has enhanced power compared to existing alternatives. Unlike tests for marginal genetic association, set-based tests for gene-environment interaction face the challenges of a potentially misspecified and high-dimensional main effect model under the null hypothesis. We show that our proposed test is robust to main effect misspecification of environmental exposure and genetic factors under the gene-environment independence condition. When genetic and environmental factors are dependent, the method of sieves is further proposed to eliminate potential bias due to a misspecified main effect of a continuous environmental exposure. A weighted principal component analysis approach is developed to perform dimension reduction when the number of genetic variants in the set is large relative to the sample size. The methods are motivated by an example from the Multi-Ethnic Study of Atherosclerosis (MESA), investigating interaction between measures of neighborhood environment and genetic regions on longitudinal measures of blood pressure over a study period of about seven years with 4 exams.
Affiliation(s)
- Zihuai He
  - Department of Biostatistics, University of Michigan
- Min Zhang
  - Department of Biostatistics, University of Michigan
|
35
|
Yoo YJ, Sun L, Poirier JG, Paterson AD, Bull SB. Multiple linear combination (MLC) regression tests for common variants adapted to linkage disequilibrium structure. Genet Epidemiol 2016; 41:108-121. [PMID: 27885705 PMCID: PMC5245123 DOI: 10.1002/gepi.22024] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/31/2016] [Revised: 05/25/2016] [Accepted: 09/27/2016] [Indexed: 11/21/2022]
Abstract
By jointly analyzing multiple variants within a gene, instead of one at a time, gene‐based multiple regression can improve power, robustness, and interpretation in genetic association analysis. We investigate multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes. MLC is a directional test that exploits LD structure in a gene to construct clusters of closely correlated variants recoded such that the majority of pairwise correlations are positive. It combines variant effects within the same cluster linearly, and aggregates cluster‐specific effects in a quadratic sum of squares and cross‐products, producing a test statistic with reduced degrees of freedom (df) equal to the number of clusters. By simulation studies of 1000 genes from across the genome, we demonstrate that MLC is a well‐powered and robust choice among existing methods across a broad range of gene structures. Compared to minimum P‐value, variance‐component, and principal‐component methods, the mean power of MLC is never much lower than that of other methods, and can be higher, particularly with multiple causal variants. Moreover, the variation in gene‐specific MLC test size and power across 1000 genes is less than that of other methods, suggesting it is a complementary approach for discovery in genome‐wide analysis. The cluster construction of the MLC test statistics helps reveal within‐gene LD structure, allowing interpretation of clustered variants as haplotypic effects, while multiple regression helps to distinguish direct and indirect associations.
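The "linear within clusters, quadratic across clusters" construction can be written compactly. The following is a sketch in assumed notation: the symbols $\hat{\beta}$, $\hat{\Sigma}$, $J$, $W$ and $T_{\mathrm{MLC}}$ are introduced here for illustration and are not taken from the abstract, so the published definition may differ in detail.

```latex
% Assumed notation (not from the abstract):
%   \hat{\beta}  : K x 1 vector of multiple-regression coefficients for the K
%                  variants, recoded so that most pairwise correlations are positive
%   \hat{\Sigma} : K x K estimated covariance matrix of \hat{\beta}
%   J            : K x L 0/1 matrix; J_{kl} = 1 if variant k belongs to cluster l
\[
  W = \bigl(J^{\top}\hat{\Sigma}^{-1}J\bigr)^{-1} J^{\top}\hat{\Sigma}^{-1}\hat{\beta},
  \qquad
  T_{\mathrm{MLC}} = W^{\top}\bigl(J^{\top}\hat{\Sigma}^{-1}J\bigr)\,W .
\]
```

Under the null hypothesis of no association, a statistic of this form is asymptotically $\chi^2$ with $L$ degrees of freedom, which matches the reduced df described above. With $L = 1$ it reduces to a single linear combination test, and with $L = K$ and $J = I$ it becomes the usual $K$-df quadratic (Wald) statistic $\hat{\beta}^{\top}\hat{\Sigma}^{-1}\hat{\beta}$.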
Affiliation(s)
- Yun Joo Yoo
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Julia G Poirier
  - Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
- Andrew D Paterson
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Program in Genetics and Genome Biology, Hospital for Sick Children Research Institute, Toronto, Canada
- Shelley B Bull
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
  - Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
|
36
|
Sun J, Bhatnagar SR, Oualkacha K, Ciampi A, Greenwood CMT. Joint analysis of multiple blood pressure phenotypes in GAW19 data by using a multivariate rare-variant association test. BMC Proc 2016; 10:309-313. [PMID: 27980654 PMCID: PMC5133485 DOI: 10.1186/s12919-016-0048-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/23/2022] Open
Abstract
INTRODUCTION: Large-scale sequencing studies often measure many related phenotypes in addition to the genetic variants. Joint analysis of multiple phenotypes in genetic association studies may increase power to detect disease-associated loci. METHODS: We apply a recently developed multivariate rare-variant association test to the Genetic Analysis Workshop 19 data in order to test associations between genetic variants and multiple blood pressure phenotypes simultaneously. We also compare this multivariate test with a widely used univariate test that analyzes phenotypes separately. RESULTS: The multivariate test identified 2 genetic variants that have been previously reported as associated with hypertension or coronary artery disease. In addition, our region-based analyses also show that the multivariate test tends to give smaller p values than the univariate test. CONCLUSIONS: Hence, the multivariate test has potential to improve test power, especially when multiple phenotypes are correlated.
Affiliation(s)
- Jianping Sun
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC H3T 1E2 Canada
- Sahir R. Bhatnagar
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, Montréal, QC H2X 3Y7 Canada
- Antonio Ciampi
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
- Celia M. T. Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC H3A 1A2 Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC H3T 1E2 Canada
  - Department of Oncology, McGill University, Montreal, QC H2W 1S6 Canada
  - Department of Human Genetics, McGill University, Montreal, QC H3A 1B1 Canada
|
37
|
Abstract
Over the past few years, interest in the identification of rare variants that influence human phenotype has led to the development of many statistical methods for testing for association between sets of rare variants and binary or quantitative traits. Here, I review some of the most important ideas that underlie these methods and the most relevant issues when choosing a method for analysis. In addition to the tests for association, I review crucial issues in performing a rare variant study, from experimental design to interpretation and validation. I also discuss the many challenges of these studies, some of their limitations, and future research directions.
Affiliation(s)
- Dan L Nicolae
  - Departments of Medicine and Statistics, University of Chicago, Chicago, Illinois 60637
|
38
|
Abstract
Empirical studies and evolutionary theory support a role for rare variants in the etiology of complex traits. Given this motivation and increasing affordability of whole-exome and whole-genome sequencing, methods for rare variant association have been an active area of research for the past decade. Here, we provide a survey of the current literature and developments from the Genetic Analysis Workshop 19 (GAW19) Collapsing Rare Variants working group. In particular, we present the generalized linear regression framework and associated score statistic for the 2 major types of methods: burden and variance components methods. We further show that by simply modifying weights within these frameworks we arrive at many of the popular existing methods, for example, the cohort allelic sums test and sequence kernel association test. Meta-analysis techniques are also described. Next, we describe the 6 contributions from the GAW19 Collapsing Rare Variants working group. These included development of new methods, such as a retrospective likelihood for family data, a method using genomic structure to compare cases and controls, a haplotype-based meta-analysis, and a permutation-based method for combining different statistical tests. In addition, one contribution compared a mega-analysis of family-based and population-based data to meta-analysis. Finally, the power of existing family-based methods for binary traits was compared. We conclude with suggestions for open research questions.
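For orientation, the two families of score-based statistics named above are commonly written as follows. This is a sketch in standard textbook notation, not reproduced from the article: the symbols $g_{ij}$, $y_i$, $\hat{\mu}_i$, $w_j$, $U_j$ and the two $Q$ statistics are assumptions introduced here.

```latex
% Assumed notation (not from the abstract):
%   g_{ij}      : genotype of individual i at variant j  (i = 1..n, j = 1..m)
%   y_i         : trait value of individual i
%   \hat{\mu}_i : fitted trait mean for individual i under the null model
%                 (covariates only, no genetic effects)
%   w_j         : prespecified weight for variant j
\[
  U_j = \sum_{i=1}^{n} g_{ij}\,\bigl(y_i - \hat{\mu}_i\bigr),
  \qquad
  Q_{\mathrm{burden}} = \Bigl(\sum_{j=1}^{m} w_j\,U_j\Bigr)^{2},
  \qquad
  Q_{\mathrm{VC}} = \sum_{j=1}^{m} w_j^{2}\,U_j^{2}.
\]
```

The burden form sums the per-variant scores before squaring, so effects of opposite sign can cancel. The variance-component form squares each score first, which makes it insensitive to the direction of effect. Different choices of $w_j$ recover different named tests within each family.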
Affiliation(s)
- Stephanie A Santorico
  - Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, 80217-3364, USA
- Audrey E Hendricks
  - Department of Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, 80217-3364, USA
|
39
|
Yazdani A, Yazdani A, Boerwinkle E. Rare variants analysis using penalization methods for whole genome sequence data. BMC Bioinformatics 2015; 16:405. [PMID: 26637205 PMCID: PMC4670502 DOI: 10.1186/s12859-015-0825-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/27/2015] [Accepted: 11/11/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Availability of affordable and accessible whole genome sequencing for biomedical applications poses a number of statistical challenges and opportunities, particularly related to the analysis of rare variants and sparseness of the data. Although efforts have been devoted to addressing these challenges, the performance of statistical methods for rare variant analysis still needs further consideration. RESULTS: We introduce a new approach that applies restricted principal component analysis with convex penalization and then selects the best predictors of a phenotype by a concave penalized regression model, while estimating the impact of each genomic region on the phenotype. Using simulated data, we show that the proposed method maintains good power for association testing while keeping the false discovery rate low under a variety of genetic architectures. Illustrative data analyses reveal encouraging results for this method in comparison with other commonly applied methods for rare variant analysis. CONCLUSION: By taking into account linkage disequilibrium and sparseness of the data, the proposed method improves power and controls the false discovery rate compared to other commonly applied methods for rare variant analyses.
Affiliation(s)
- Akram Yazdani
  - Human Genetics Center, University of Texas Health Science Center at Houston, TX, USA
- Azam Yazdani
  - Human Genetics Center, University of Texas Health Science Center at Houston, TX, USA
- Eric Boerwinkle
  - Human Genetics Center, University of Texas Health Science Center at Houston, TX, USA
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
|
40
|
Hussain S. A new conceptual framework for investigating complex genetic disease. Front Genet 2015; 6:327. [PMID: 26583033 PMCID: PMC4631989 DOI: 10.3389/fgene.2015.00327] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/02/2015] [Accepted: 10/21/2015] [Indexed: 01/17/2023] Open
Abstract
Some common diseases are known to have an inherited component; however, their population and familial incidence patterns do not conform to any known monogenic Mendelian pattern of inheritance, and they are currently much better explained if an underlying polygenic architecture is posited. Studies that have attempted to identify the causative genetic factors have been designed on this polygenic framework, but so far the yield has been largely unsatisfactory. Based on accumulating recent observations concerning the roles of somatic mosaicism in disease, this article suggests a second framework, which posits a single-gene, two-hit model that can be modulated by a mutator/anti-mutator genetic background. I discuss whether such a model can be considered a viable alternative based on current knowledge, outline its advantages over the current polygenic framework, and describe practical routes by which the new framework can be investigated.
Affiliation(s)
- Shobbir Hussain
  - Department of Biology and Biochemistry, University of Bath, Bath, UK
|
41
|
Huh I, Kwon MS, Park T. An Efficient Stepwise Statistical Test to Identify Multiple Linked Human Genetic Variants Associated with Specific Phenotypic Traits. PLoS One 2015; 10:e0138700. [PMID: 26406920 PMCID: PMC4583484 DOI: 10.1371/journal.pone.0138700] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/04/2015] [Accepted: 09/02/2015] [Indexed: 11/19/2022] Open
Abstract
Recent advances in genotyping methodologies have allowed genome-wide association studies (GWAS) to accurately identify genetic variants that associate with common or pathological complex traits. Although most GWAS have focused on associations with single genetic variants, joint identification of multiple genetic variants, and how they interact, is essential for understanding the genetic architecture of complex phenotypic traits. Here, we propose an efficient stepwise method based on the Cochran-Mantel-Haenszel test (for stratified categorical data) to identify causal joint multiple genetic variants in GWAS. This method combines the CMH statistic with a stepwise procedure to detect multiple genetic variants associated with specific categorical traits, using a series of associated I × J contingency tables and a null hypothesis of no phenotype association. Through a new stratification scheme based on the sum of minor allele count criteria, we make the method more feasible for GWAS data having sample sizes of several thousands. We also examine the properties of the proposed stepwise method via simulation studies, and show that the stepwise CMH test performs better than other existing methods (e.g., logistic regression and detection of associations by Markov blanket) for identifying multiple genetic variants. Finally, we apply the proposed approach to two genomic sequencing datasets to detect linked genetic variants associated with bipolar disorder and obesity, respectively.
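As background for the CMH statistic that the stepwise method builds on, here is a minimal sketch of the classical Cochran-Mantel-Haenszel test for a series of 2 × 2 tables. It is a deliberately simplified special case: the abstract describes I × J tables, a stepwise procedure, and a stratification scheme based on minor allele counts, none of which is reproduced here. The function names and the example counts are hypothetical, and no continuity correction is applied.

```python
import math


def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel chi-square (1 df) for stratified 2x2 tables.

    tables: iterable of ((a, b), (c, d)) counts, one table per stratum.
    Strata with fewer than 2 observations are skipped because their
    hypergeometric variance is undefined.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n                       # observed minus expected
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var_sum == 0:
        raise ValueError("no informative strata")
    return diff_sum ** 2 / var_sum


def chi2_1df_pvalue(x):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical strata: rows = carrier / non-carrier, columns = case / control.
strata = [((12, 8), (6, 14)), ((9, 11), (5, 15))]
stat = cmh_statistic(strata)
print(stat, chi2_1df_pvalue(stat))
```

Pooling observed-minus-expected counts across strata before squaring is what lets the test detect a consistent association while adjusting for the stratifying variable.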
Affiliation(s)
- Iksoo Huh
  - Department of Statistics, Seoul National University, Gwanak-gu, Seoul, Korea
- Min-Seok Kwon
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, Korea
- Taesung Park
  - Department of Statistics, Seoul National University, Gwanak-gu, Seoul, Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, Korea
|
42
|
Derkach A, Lawless JF, Sun L. Score tests for association under response-dependent sampling designs for expensive covariates. Biometrika 2015. [DOI: 10.1093/biomet/asv038] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022] Open
|
43
|
Wang C, Kao WH, Hsiao CK. Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies. PLoS One 2015; 10:e0135918. [PMID: 26302001 PMCID: PMC4547758 DOI: 10.1371/journal.pone.0135918] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 10/28/2014] [Accepted: 07/28/2015] [Indexed: 11/27/2022] Open
Abstract
The availability of high-throughput genomic data has led to several challenges in recent genetic association studies, including the large number of genetic variants that must be considered and the computational complexity of statistical analyses. Tackling these problems with a marker-set study such as SNP-set analysis can be an efficient solution. To construct SNP-sets, we first propose a clustering algorithm, which employs Hamming distance to measure the similarity between strings of SNP genotypes and evaluates whether the given SNPs or SNP-sets should be clustered. A dendrogram can then be constructed based on this distance measure, and the number of clusters can be determined. With the resulting SNP-sets, we next develop an association test, HDAT, to examine susceptibility to the disease of interest. This proposed test assesses, based on Hamming distance, whether the similarity between a diseased and a normal individual differs from the similarity between two individuals of the same disease status. In our proposed methodology, only genotype information is needed. No inference of haplotypes is required, and the SNPs under consideration do not need to be located in nearby regions. The proposed clustering algorithm and association test are illustrated with applications and simulation studies. Compared with other existing methods, the clustering algorithm is faster and better at identifying sets containing SNPs exerting a similar effect. In addition, the simulation studies demonstrated that the proposed test works well for SNP-sets containing a large proportion of neutral SNPs. Furthermore, applying the clustering algorithm before testing a large data set helps to narrow down the genetic regions that harbor susceptibility markers.
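The distance underlying both the clustering step and the HDAT test is easy to state in code. The sketch below shows only the Hamming distance itself, under the assumption that each genotype is coded as one symbol per position (for example the minor-allele count 0, 1 or 2). The clustering criterion and the HDAT statistic are not described in enough detail in the abstract to reproduce, and the function name and example data are hypothetical.

```python
def hamming_distance(g1, g2):
    """Number of positions at which two equal-length genotype sequences differ."""
    if len(g1) != len(g2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(g1, g2))


# Hypothetical genotype strings, one character per individual (0/1/2 coding assumed).
snp_a = "0120012"
snp_b = "0120112"
snp_c = "2001200"

print(hamming_distance(snp_a, snp_b))  # differ at one position
print(hamming_distance(snp_a, snp_c))  # differ at every position
```

The same function works whether the sequence runs over individuals for a fixed SNP (useful for grouping SNPs) or over the SNPs in a set for a fixed individual (useful for comparing individuals).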
Affiliation(s)
- Charlotte Wang
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Wen-Hsin Kao
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
- Chuhsing Kate Hsiao
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, 100, Taiwan
  - Bioinformatics and Biostatistics Core, Division of Genomic Medicine, Research Center for Medical Excellence, National Taiwan University, Taipei, 100, Taiwan
  - Department of Public Health, National Taiwan University, Taipei, 100, Taiwan
|
44
|
Clique-Based Clustering of Correlated SNPs in a Gene Can Improve Performance of Gene-Based Multi-Bin Linear Combination Test. Biomed Res Int 2015; 2015:852341. [PMID: 26346579 PMCID: PMC4539439 DOI: 10.1155/2015/852341] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/14/2014] [Revised: 02/03/2015] [Accepted: 02/14/2015] [Indexed: 11/18/2022]
Abstract
Gene-based analysis of multiple single nucleotide polymorphisms (SNPs) in a gene region is an alternative to single SNP analysis. The multi-bin linear combination test (MLC) proposed in previous studies utilizes the correlation among SNPs within a gene to construct a gene-based global test. SNPs are partitioned into clusters of highly correlated SNPs, and the MLC test statistic quadratically combines linear combination statistics constructed for each cluster. The test has degrees of freedom equal to the number of clusters and can be more powerful than a fully quadratic or fully linear test statistic. In this study, we develop a new SNP clustering algorithm designed to find cliques, which are complete subnetworks of SNPs with all pairwise correlations above a threshold. We evaluate the performance of the MLC test using the clique-based CLQ algorithm versus using the tag-SNP-based LDSelect algorithm. In our numerical power calculations, we observed that the two clustering algorithms produce identical clusters about 40% to 60% of the time, yielding similar power on average. However, because the CLQ algorithm tends to produce smaller clusters with stronger positive correlation, the MLC test is less likely to be affected by the occurrence of opposing signs in the individual SNP effect coefficients.
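To make the clique idea concrete, the sketch below greedily partitions SNPs so that every pair inside a cluster has correlation above a threshold. It is not the CLQ algorithm, whose details the abstract does not give; it is a simple first-fit heuristic, and the function name, correlation matrix and threshold are all hypothetical.

```python
def greedy_clique_partition(corr, threshold):
    """Partition SNP indices into cliques of the thresholded correlation graph.

    corr: symmetric matrix (list of lists) of pairwise SNP correlations.
    A SNP joins the current cluster only if its correlation with every
    current member exceeds `threshold`, so each cluster is a clique.
    """
    unassigned = list(range(len(corr)))
    clusters = []
    while unassigned:
        clique = [unassigned[0]]          # seed with the first unassigned SNP
        for j in unassigned[1:]:
            if all(corr[i][j] > threshold for i in clique):
                clique.append(j)
        clusters.append(clique)
        unassigned = [k for k in unassigned if k not in clique]
    return clusters


# Hypothetical correlations among four SNPs.
corr = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.30, 0.10],
    [0.85, 0.30, 1.00, 0.90],
    [0.10, 0.10, 0.90, 1.00],
]
print(greedy_clique_partition(corr, threshold=0.8))
```

In the example, SNP 2 is strongly correlated with SNP 0 but not with SNP 1, so it cannot join their cluster. Requiring all pairwise correlations to pass the threshold is what distinguishes a clique from a cluster built around a single tag SNP.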
|
45
|
Fouladi R, Bessonov K, Van Lishout F, Van Steen K. Model-Based Multifactor Dimensionality Reduction for Rare Variant Association Analysis. Hum Hered 2015. [PMID: 26201701 DOI: 10.1159/000381286] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022] Open
Abstract
Genome-wide association studies have revealed a vast amount of common loci associated with human complex diseases. Still, a large proportion of heritability remains unexplained. The extent to which rare genetic variants (RVs) are able to explain a relevant portion of the genetic heritability for complex traits leaves room for several debates and paves the way to the collection of RV databases and the development of novel analytic tools to analyze these. To date, several statistical methods have been proposed to uncover the association of RVs with complex diseases, but none of them is the clear winner in all possible scenarios of study design and assumed underlying disease model. The latter may involve differences in the distributions of effect sizes, proportions of causal variants, and ratios of protective to deleterious variants at distinct regions throughout the genome. Therefore, there is a need for robust scalable methods with acceptable overall performance in terms of power and type I error under various realistic scenarios. In this paper, we propose a novel RV association analysis strategy, which satisfies several of the desired properties that a RV analysis tool should exhibit.
Affiliation(s)
- Ramouna Fouladi
- Systems and Modeling Unit, Montefiore Institute, and Bioinformatics and Modeling, GIGA-R, University of Liège, Liège, Belgium
|
46
|
Soave D, Corvol H, Panjwani N, Gong J, Li W, Boëlle PY, Durie PR, Paterson AD, Rommens JM, Strug LJ, Sun L. A Joint Location-Scale Test Improves Power to Detect Associated SNPs, Gene Sets, and Pathways. Am J Hum Genet 2015; 97:125-38. [PMID: 26140448 PMCID: PMC4572492 DOI: 10.1016/j.ajhg.2015.05.015] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 02/28/2015] [Accepted: 05/26/2015] [Indexed: 11/28/2022] Open
Abstract
Gene-based, pathway, and other multivariate association methods are motivated by the possibility of GxG and GxE interactions; however, accounting for such interactions is limited by the challenges associated with adequate modeling information. Here we propose an easy-to-implement joint location-scale (JLS) association testing framework for single-variant and multivariate analysis that accounts for interactions without explicitly modeling them. We apply the JLS method to a gene-set analysis of cystic fibrosis (CF) lung disease, which is influenced by multiple environmental and genetic factors. We identify and replicate an association between the constituents of the apical plasma membrane and CF lung disease (p = 0.0099 and p = 0.0180, respectively) and highlight a role for the SLC9A3-SLC9A3R1/2-EZR complex in contributing to CF lung disease. Many association studies could benefit from re-analysis with the JLS method that leverages complex genetic architecture for SNP, gene, and pathway identification. Analytical verification, simulation, and additional proof-of-principle applications support our approach.
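As a rough illustration of how a joint location-scale test can be assembled, the sketch below combines a location (mean) p-value and a scale (variance) p-value with Fisher's method. The abstract does not state the combination rule, so the use of Fisher's method here is an assumption, as are the function name `fisher_combine_two` and the example p-values; the chi-square reference distribution with 4 degrees of freedom further assumes the two tests are independent under the null.

```python
import math


def fisher_combine_two(p_location, p_scale):
    """Combine two p-values with Fisher's method (assumed combination rule).

    The statistic W = -2 * (ln p_L + ln p_S) is referred to a chi-square
    distribution with 4 degrees of freedom, whose survival function has
    the closed form exp(-w / 2) * (1 + w / 2). This is only valid if the
    two component tests are independent under the null hypothesis.
    """
    w = -2.0 * (math.log(p_location) + math.log(p_scale))
    p_joint = math.exp(-w / 2.0) * (1.0 + w / 2.0)
    return w, p_joint


# Made-up p-values for a single SNP: a location test (e.g. regression of
# the trait on genotype) and a scale test (e.g. a Levene-type test).
w, p_joint = fisher_combine_two(0.01, 0.05)
print(w, p_joint)
```

A SNP with modest evidence from each component can reach a smaller joint p-value than either component alone, which is the intuition behind testing location and scale together.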
Affiliation(s)
- David Soave
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada; Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Harriet Corvol
- Assistance Publique-Hôpitaux de Paris (AP-HP), Trousseau Hospital, Pediatric Pulmonology Department; Institut National de la Santé et la Recherche Médicale (INSERM), UMR_S 938, CDR Saint-Antoine, 75012 Paris, France; Sorbonne Universités, Université Pierre et Marie Curie (UPMC) Paris 06, 75005 Paris, France
- Naim Panjwani
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Jiafen Gong
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Weili Li
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada; Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Pierre-Yves Boëlle
- Sorbonne Universités, Université Pierre et Marie Curie (UPMC) Paris 06, 75005 Paris, France; AP-HP, Saint-Antoine Hospital, Biostatistics Department, INSERM, UMR_S 1136, 75012 Paris, France
- Peter R Durie
- Program in Physiology and Experimental Medicine, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Pediatrics, University of Toronto, Toronto, ON M5S 1A1, Canada
- Andrew D Paterson
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Johanna M Rommens
- Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Lisa J Strug
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada; Program in Genetics and Genome Biology, Research Institute, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Lei Sun
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada; Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
|
47
|
Zeng P, Zhao Y, Li H, Wang T, Chen F. Permutation-based variance component test in generalized linear mixed model with application to multilocus genetic association study. BMC Med Res Methodol 2015; 15:37. [PMID: 25897803 PMCID: PMC4410500 DOI: 10.1186/s12874-015-0030-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 12/04/2014] [Accepted: 04/07/2015] [Indexed: 11/29/2022] Open
Abstract
Background In many medical studies the likelihood ratio test (LRT) has been widely applied to examine whether the random effects variance component is zero within the mixed effects models framework; whereas little work about likelihood-ratio based variance component test has been done in the generalized linear mixed models (GLMM), where the response is discrete and the log-likelihood cannot be computed exactly. Before applying the LRT for variance component in GLMM, several difficulties need to be overcome, including the computation of the log-likelihood, the parameter estimation and the derivation of the null distribution for the LRT statistic.
Methods To overcome these problems, in this paper we make use of the penalized quasi-likelihood algorithm and calculate the LRT statistic based on the resulting working response and the quasi-likelihood. The permutation procedure is used to obtain the null distribution of the LRT statistic. We evaluate the permutation-based LRT via simulations and compare it with the score-based variance component test and the tests based on the mixture of chi-square distributions. Finally we apply the permutation-based LRT to multilocus association analysis in the case–control study, where the problem can be investigated under the framework of logistic mixed effects model.
Results The simulations show that the permutation-based LRT can effectively control the type I error rate, while the score test is sometimes slightly conservative and the tests based on mixtures cannot maintain the type I error rate. Our studies also show that the permutation-based LRT has higher power than these existing tests and still maintains a reasonably high power even when the random effects do not follow a normal distribution. The application to GAW17 data also demonstrates that the proposed LRT has a higher probability to identify the association signals than the score test and the tests based on mixtures.
Conclusions In the present paper the permutation-based LRT was developed for variance component in GLMM. The LRT outperforms existing tests and has a reasonably higher power under various scenarios; additionally, it is conceptually simple and easy to implement.
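The permutation step described in this abstract (shuffle the outcome labels, recompute the statistic, and compare with the observed value) can be sketched generically in Python. The statistic used below is a simple absolute difference in mean genetic score between cases and controls, swapped in for illustration; it is not the penalized quasi-likelihood LRT of the paper. The names `permutation_pvalue` and `abs_mean_diff`, the toy data, and the choice of 999 permutations are assumptions.

```python
import random


def abs_mean_diff(values, labels):
    """Stand-in statistic: |mean(values in cases) - mean(values in controls)|."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_pvalue(values, labels, statistic, n_perm=999, seed=0):
    """Estimate a p-value by permuting case/control labels.

    The null distribution of the statistic is approximated by recomputing
    it on shuffled labels; the +1 in numerator and denominator counts the
    observed data as one permutation, so the estimate is never zero.
    """
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # small tolerance guards against floating-point ties
        if statistic(values, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (made up): a per-individual genetic score and case status.
scores = [5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5]
status = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_value = permutation_pvalue(scores, status, abs_mean_diff)
print(p_value)
```

Any statistic with the same `(values, labels)` signature could be passed in, which is what makes the permutation approach attractive when the null distribution of a statistic is hard to derive analytically.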
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, 211166, Jiangsu, People's Republic of China; Department of Epidemiology and Biostatistics, Center of Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical College, Xuzhou, 221004, Jiangsu, People's Republic of China.
- Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, 211166, Jiangsu, People's Republic of China.
- Hongliang Li
- Center for Disease Control and Prevention of Pudong New Area, Pudong New Area, Shanghai, 200136, People's Republic of China.
- Ting Wang
- Department of Epidemiology and Biostatistics, Center of Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical College, Xuzhou, 221004, Jiangsu, People's Republic of China.
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, 211166, Jiangsu, People's Republic of China.
|
48
|
Lee W, Lee D, Pawitan Y. Likelihood ratio and score burden tests for detecting disease-associated rare variants. Stat Appl Genet Mol Biol 2015; 14:481-95. [DOI: 10.1515/sagmb-2014-0089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
This paper presents two simple rare variant (RV) burden tests based on the likelihood ratio test (LRT) and score statistics. LRT is one of the commonly used tests in practical data analysis, and we show here that there is no reason to ignore it in testing RV associations. With the Bartlett correction, we have numerically shown that the LRT-based test can have a reliable distribution. Our simulation study indicates that if the non-null variants are as common as the null variants, then the LRT and score statistics have comparable performance to the C-alpha test, and if the former is rarer than the null variants, then they outperform the C-alpha test.
|
49
|
Dering C, König IR, Ramsey LB, Relling MV, Yang W, Ziegler A. A comprehensive evaluation of collapsing methods using simulated and real data: excellent annotation of functionality and large sample sizes required. Front Genet 2014; 5:323. [PMID: 25309579 PMCID: PMC4164031 DOI: 10.3389/fgene.2014.00323] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/01/2014] [Accepted: 08/28/2014] [Indexed: 01/23/2023] Open
Abstract
The advent of next generation sequencing (NGS) technologies enabled the investigation of the rare variant-common disease hypothesis in unrelated individuals, even on the genome-wide level. Analysis of this hypothesis requires tailored statistical methods as single marker tests fail on rare variants. An entire class of statistical methods collapses rare variants from a genomic region of interest (ROI), thereby aggregating rare variants. In an extensive simulation study using data from the Genetic Analysis Workshop 17 we compared the performance of 15 collapsing methods by means of a variety of pre-defined ROIs regarding minor allele frequency thresholds and functionality. Findings of the simulation study were additionally confirmed by a real data set investigating the association between methotrexate clearance and the SLCO1B1 gene in patients with acute lymphoblastic leukemia. Our analyses showed substantially inflated type I error levels for many of the proposed collapsing methods. Only four approaches yielded valid type I errors in all considered scenarios. None of the statistical tests was able to detect true associations over a substantial proportion of replicates in the simulated data. Detailed annotation of functionality of variants is crucial to detect true associations. These findings were confirmed in the analysis of the real data. Recent theoretical work showed that large power is achieved in gene-based analyses only if large sample sizes are available and a substantial proportion of causing rare variants is present in the gene-based analysis. Many of the investigated statistical approaches use permutation requiring high computational cost. There is a clear need for valid, powerful and fast to calculate test statistics for studies investigating rare variants.
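The collapsing step that this class of methods shares (aggregate the rare variants in a region of interest into one score per individual) can be sketched as follows. This is a generic burden count under an assumed minor allele frequency threshold, not any specific one of the 15 methods compared in the paper; the function name `collapse_rare_variants`, the threshold of 0.15, and the toy genotype matrix are assumptions.

```python
def collapse_rare_variants(genotypes, maf_threshold=0.15):
    """Collapse rare variants in a region into a per-individual burden score.

    genotypes: list of rows (one per individual), each row holding allele
    counts (0, 1 or 2) for every variant in the region of interest.
    Returns (indices of variants treated as rare, burden per individual).
    Generic illustration only; real collapsing methods differ in weights,
    thresholds, and how functional annotation is used.
    """
    n_individuals = len(genotypes)
    n_variants = len(genotypes[0])
    rare = []
    for j in range(n_variants):
        freq = sum(row[j] for row in genotypes) / (2 * n_individuals)
        maf = min(freq, 1.0 - freq)  # frequency of the less common allele
        if maf < maf_threshold:
            rare.append(j)
    # burden = number of rare alleles carried across the selected variants
    burden = [sum(row[j] for j in rare) for row in genotypes]
    return rare, burden


# Toy region with three variants in ten individuals (made-up data):
# variants 0 and 2 are rare, variant 1 is common.
genotypes = [
    [1, 1, 0],
    [0, 2, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
rare, burden = collapse_rare_variants(genotypes, maf_threshold=0.15)
print(rare, burden)
```

The resulting burden vector would then be tested for association with the trait in place of the individual variants, which is where the choice of frequency threshold and functional annotation highlighted in the abstract comes into play.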
Affiliation(s)
- Carmen Dering
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein Lübeck, Germany
- Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Lübeck, Germany
- Laura B Ramsey
- Pharmaceutical Department, St. Jude Children's Research Hospital, Memphis, TN, USA
- Mary V Relling
- Pharmaceutical Department, St. Jude Children's Research Hospital, Memphis, TN, USA
- Wenjian Yang
- Pharmaceutical Department, St. Jude Children's Research Hospital, Memphis, TN, USA
- Andreas Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Lübeck, Germany; Zentrum für Klinische Studien, Universität zu Lübeck, Lübeck, Germany; School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, South Africa
|