1651
|
Beaudoin M, Lo KS, N'Diaye A, Rivas MA, Dubé MP, Laplante N, Phillips MS, Rioux JD, Tardif JC, Lettre G. Pooled DNA resequencing of 68 myocardial infarction candidate genes in French Canadians. Circ Cardiovasc Genet 2012; 5:547-54. [PMID: 22923420 DOI: 10.1161/circgenetics.112.963165] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
Abstract
BACKGROUND Familial history is a strong risk factor for coronary artery disease (CAD), especially for early-onset myocardial infarction (MI). Several genes and chromosomal regions have been implicated in the genetic cause of coronary artery disease/MI, mostly through the discovery of familial mutations implicated in hyper-/hypocholesterolemia by linkage studies and single nucleotide polymorphisms by genome-wide association studies. Except for a few examples (eg, PCSK9), the role of low-frequency genetic variation (minor allele frequency [MAF] ≈0.1%-5%) on MI/coronary artery disease predisposition has not been extensively investigated. METHODS AND RESULTS We selected 68 candidate genes and sequenced their exons (394 kb) in 500 early-onset MI cases and 500 matched controls, all of French-Canadian ancestry, using solution-based capture in pools of nonindexed DNA samples. In these regions, we identified 1852 single nucleotide variants (695 novel) and captured 85% of the variants with MAF≥1% found by the 1000 Genomes Project in European-ancestry individuals. Using gene-based association testing, we prioritized for follow-up 29 low-frequency variants in 8 genes and attempted to genotype them for replication in 1594 MI cases and 2988 controls from 2 French-Canadian panels. Our pilot association analysis of low-frequency variants in 68 candidate genes did not identify genes with large effect on MI risk in French Canadians. CONCLUSIONS We have optimized a strategy, applicable to all complex diseases and traits, to discover DNA sequence variants efficiently and cost-effectively in large populations. Resequencing endeavors to find low-frequency variants implicated in common human diseases are likely to require very large sample sizes.
Affiliation(s)
- Mélissa Beaudoin
- Montreal Heart Institute, 5000 Rue Bélanger, Montreal, Québec, Canada
|
1652
|
Maity A, Sullivan PF, Tzeng JY. Multivariate phenotype association analysis by marker-set kernel machine regression. Genet Epidemiol 2012; 36:686-95. [PMID: 22899176 DOI: 10.1002/gepi.21663] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 04/01/2012] [Revised: 05/23/2012] [Accepted: 06/18/2012] [Indexed: 11/06/2022]
Abstract
Genetic studies of complex diseases often collect multiple phenotypes relevant to the disorders. As these phenotypes can be correlated and share common genetic mechanisms, jointly analyzing these traits may bring more power to detect genes influencing individual or multiple phenotypes. Given the advancement brought by the multivariate phenotype approaches and the multimarker kernel machine regression, we construct a multivariate regression based on kernel machine to facilitate the joint evaluation of multimarker effects on multiple phenotypes. The kernel machine serves as a powerful dimension-reduction tool to capture complex effects among markers. The multivariate framework incorporates the potentially correlated multidimensional phenotypic information and accommodates common or different environmental covariates for each trait. We derive the multivariate kernel machine test based on a score-like statistic, and conduct simulations to evaluate the validity and efficacy of the method. We also study the performance of the commonly adopted strategies for kernel machine analysis on multiple phenotypes, including the multiple univariate kernel machine tests with original phenotypes or with their principal components. Our results suggest that none of these approaches has uniformly best power, and the optimal test depends on the magnitude of the phenotype correlation and the effect patterns. However, the multivariate test remains a reasonable approach when the multiple phenotypes have no or mild correlations, and gives the best power once the correlation becomes stronger or when there exist genes that affect more than one phenotype. We illustrate the utility of the multivariate kernel machine method through the Clinical Antipsychotic Trials of Intervention Effectiveness antibody study.
Affiliation(s)
- Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, USA
|
1653
|
Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani D, Wurfel M, Lin X, Lin X. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet 2012; 91:224-37. [PMID: 22863193 DOI: 10.1016/j.ajhg.2012.06.007] [Citation(s) in RCA: 712] [Impact Index Per Article: 59.3] [Received: 03/28/2012] [Revised: 05/22/2012] [Accepted: 06/12/2012] [Indexed: 12/23/2022] Open
Abstract
We propose in this paper a unified approach for testing the association between rare variants and phenotypes in sequencing association studies. This approach maximizes power by adaptively using the data to optimally combine the burden test and the nonburden sequence kernel association test (SKAT). Burden tests are more powerful when most variants in a region are causal and the effects are in the same direction, whereas SKAT is more powerful when a large fraction of the variants in a region are noncausal or the effects of causal variants are in different directions. The proposed unified test maintains the power in both scenarios. We show that the unified test corresponds to the optimal test in an extended family of SKAT tests, which we refer to as SKAT-O. The second goal of this paper is to develop a small-sample adjustment procedure for the proposed methods for the correction of conservative type I error rates of SKAT family tests when the trait of interest is dichotomous and the sample size is small. Both small-sample-adjusted SKAT and the optimal unified test (SKAT-O) are computationally efficient and can easily be applied to genome-wide sequencing association studies. We evaluate the finite sample performance of the proposed methods using extensive simulation studies and illustrate their application using the acute-lung-injury exome-sequencing data of the National Heart, Lung, and Blood Institute Exome Sequencing Project.
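As an illustrative sketch of the contrast drawn in this abstract, the burden and SKAT score statistics can be written for the simplest case (no covariates, a centered phenotype) and mixed with a parameter rho as Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden. The function names, the flat weights, and the toy data below are assumptions made for illustration; the sketch computes only the statistics, not the p-values or the small-sample adjustment described in the paper.

```python
import numpy as np

def score_vector(genotypes, phenotype):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)); no covariates (assumption)."""
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    return g.T @ (y - y.mean())

def q_rho(genotypes, phenotype, weights, rho):
    """Mix of SKAT and burden statistics: rho=0 gives SKAT, rho=1 gives burden."""
    s = np.asarray(weights, dtype=float) * score_vector(genotypes, phenotype)
    q_skat = float(np.sum(s ** 2))    # sum of squared weighted scores
    q_burden = float(np.sum(s) ** 2)  # square of the summed weighted scores
    return (1.0 - rho) * q_skat + rho * q_burden

# Toy data (hypothetical): two rare variants with effects in opposite directions.
G = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
y = [2, 2, -2, -2, 0, 0]
print(q_rho(G, y, [1, 1], 0.0))  # SKAT statistic: 32.0
print(q_rho(G, y, [1, 1], 1.0))  # burden statistic: 0.0, the opposite effects cancel
```

In this toy case the burden statistic vanishes because the two scores cancel, while the SKAT statistic does not, which is the scenario in which the abstract says SKAT is more powerful.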
|
1654
|
Epstein M, Duncan R, Jiang Y, Conneely K, Allen A, Satten G. A permutation procedure to correct for confounders in case-control studies, including tests of rare variation. Am J Hum Genet 2012; 91:215-23. [PMID: 22818855 DOI: 10.1016/j.ajhg.2012.06.004] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Received: 02/25/2012] [Revised: 05/03/2012] [Accepted: 06/05/2012] [Indexed: 01/30/2023] Open
Abstract
Many case-control tests of rare variation are implemented in statistical frameworks that make correction for confounders like population stratification difficult. Simple permutation of disease status is unacceptable for resolving this issue because the replicate data sets do not have the same confounding as the original data set. These limitations make it difficult to apply rare-variant tests to samples in which confounding most likely exists, e.g., samples collected from admixed populations. To enable the use of such rare-variant methods in structured samples, as well as to facilitate permutation tests for any situation in which case-control tests require adjustment for confounding covariates, we propose to establish the significance of a rare-variant test via a modified permutation procedure. Our procedure uses Fisher's noncentral hypergeometric distribution to generate permuted data sets with the same structure present in the actual data set such that inference is valid in the presence of confounding factors. We use simulated sequence data based on coalescent models to show that our permutation strategy corrects for confounding due to population stratification that, if ignored, would otherwise inflate the size of a rare-variant test. We further illustrate the approach by using sequence data from the Dallas Heart Study of energy metabolism traits. Researchers can implement our permutation approach by using the R package BiasedUrn.
|
1655
|
Xu C, Ladouceur M, Dastani Z, Richards JB, Ciampi A, Greenwood CMT. Multiple regression methods show great potential for rare variant association tests. PLoS One 2012; 7:e41694. [PMID: 22916111 PMCID: PMC3420665 DOI: 10.1371/journal.pone.0041694] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 03/16/2012] [Accepted: 06/25/2012] [Indexed: 01/08/2023] Open
Abstract
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals: first, identifying which genes or genomic regions are associated, and second, discriminating associated variants from background noise within each region. Over the last few years, many new methods have been developed which associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data, namely ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO, are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.
Affiliation(s)
- ChangJiang Xu
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
|
1656
|
Cheung YH, Wang G, Leal SM, Wang S. A fast and noise-resilient approach to detect rare-variant associations with deep sequencing data for complex disorders. Genet Epidemiol 2012; 36:675-85. [PMID: 22865616 DOI: 10.1002/gepi.21662] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 05/18/2012] [Accepted: 06/14/2012] [Indexed: 11/11/2022]
Abstract
Next-generation sequencing technology has enabled a paradigm shift in genetic association studies from the common disease/common variant to the common disease/rare-variant hypothesis. Analyzing individual rare variants is known to be underpowered; therefore association methods have been developed that aggregate variants across a genetic region, which for exome sequencing is usually a gene. The foreseeable widespread use of whole genome sequencing poses new challenges in statistical analysis. It calls for new rare-variant association methods that are statistically powerful, robust against high levels of noise due to inclusion of noncausal variants, and yet computationally efficient. We propose a simple and powerful statistic that combines the disease-associated P-values of individual variants using a weight that is the inverse of the expected standard deviation of the allele frequencies under the null. This approach, dubbed the Sigma-P method, is extremely robust to the inclusion of a high proportion of noncausal variants and is also powerful when both detrimental and protective variants are present within a genetic region. The performance of the Sigma-P method was tested using simulated data based on realistic population demographic and disease models and its power was compared to several previously published methods. The results demonstrate that this method generally outperforms other rare-variant association methods over a wide range of models. Additionally, sequence data on the ANGPTL family of genes from the Dallas Heart Study were tested for associations with nine metabolic traits and both known and novel putative associations were uncovered using the Sigma-P method.
Affiliation(s)
- Yee Him Cheung
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York 10032, USA
|
1657
|
Kuk AY, Li X, Xu J. A fast collapsed data method for estimating haplotype frequencies from pooled genotype data with applications to the study of rare variants. Stat Med 2012; 32:1343-60. [DOI: 10.1002/sim.5540] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 02/07/2012] [Accepted: 06/11/2012] [Indexed: 12/31/2022]
Affiliation(s)
- Anthony Y.C. Kuk
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Xiang Li
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Jinfeng Xu
- Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
|
1658
|
Chen G, Yuan A, Zhou Y, Bentley AR, Zhou J, Chen W, Shriner D, Adeyemo A, Rotimi CN. Simultaneous Analysis of Common and Rare Variants in Complex Traits: Application to SNPs (SCARVAsnp). Bioinform Biol Insights 2012; 6:177-85. [PMID: 22904618 PMCID: PMC3418150 DOI: 10.4137/bbi.s9966] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022] Open
Abstract
Advances in technology and reduced costs are facilitating large-scale sequencing of genes and exomes as well as entire genomes. Recently, we described an approach based on haplotypes called SCARVA [1] that enables the simultaneous analysis of the association between rare and common variants in disease etiology. Here, we describe an extension of SCARVA that evaluates individual markers instead of haplotypes. This modified method (SCARVAsnp) is implemented in four stages. First, all common variants in a pre-specified region (eg, gene) are evaluated individually. Second, a union procedure is used to combine all rare variants (RVs) in the index region, and the ratio of the log likelihood with one RV excluded to the log likelihood of a model with all the collapsed RVs is calculated. On the basis of previously reported simulation studies [1], a likelihood ratio ≥1.3 is considered statistically significant. Third, the direction of the association of the removed RV is determined by evaluating the change in λ values with the inclusion and exclusion of that RV. Lastly, significant common and rare variants, along with covariates, are included in a final regression model to evaluate the association between the trait and variants in that region. We apply simulated and real data sets to show that the method is simple to use, computationally efficient, and that it can accurately identify both common and rare risk variants. This method overcomes several limitations of existing methods. For example, SCARVAsnp limits loss of statistical power by not including variants that are not associated with the trait of interest in the final model. Also, SCARVAsnp takes into consideration the direction of association by effectively modelling positively and negatively associated variants.
Affiliation(s)
- Guanjie Chen
- Center for Research on Genomics and Global Health, NHGRI, NIH, Bethesda, Maryland, USA
|
1659
|
Smoothed functional principal component analysis for testing association of the entire allelic spectrum of genetic variation. Eur J Hum Genet 2012; 21:217-24. [PMID: 22781089 DOI: 10.1038/ejhg.2012.141] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/08/2022] Open
Abstract
Faster and cheaper next-generation sequencing technologies will generate unprecedentedly massive and high-dimensional genetic variation data that allow nearly complete evaluation of genetic variation including both common and rare variants. There are two types of association tests: the variant-by-variant test and the group test. The variant-by-variant test is designed to test the association of common variants, while the group test is suitable for collectively testing the association of multiple rare variants. We propose here a smoothed functional principal component analysis (SFPCA) statistic as a general approach for testing association of the entire allelic spectrum of genetic variation (both common and rare variants), which utilizes the merits of both variant-by-variant analysis and group tests. By intensive simulations, we demonstrate that the SFPCA statistic has the correct type 1 error rates and much higher power than the existing methods to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants and (4) variants with opposite directions of effects. To further evaluate its performance, the SFPCA statistic is applied to ANGPTL4 sequence data and six continuous phenotypes from the Dallas Heart Study as an example of testing association of rare variants, and to schizophrenia GWAS data as an example of testing association of common variants. The results show that the SFPCA statistic has much smaller P-values than many existing statistics in both real data analysis examples.
|
1660
|
Sha Q, Wang S, Zhang S. Adaptive clustering and adaptive weighting methods to detect disease associated rare variants. Eur J Hum Genet 2012; 21:332-7. [PMID: 22781093 DOI: 10.1038/ejhg.2012.143] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 02/04/2023] Open
Abstract
Current statistical methods to test association between rare variants and phenotypes are essentially group-wise methods that collapse or aggregate all variants in a predefined group into a single variant. Compared with variant-by-variant methods, the group-wise methods have their advantages. However, two factors may affect the power of these methods. One is that some of the causal variants may be protective. When both risk and protective variants are present, collapsing or aggregating all variants loses power because the effects of risk and protective variants counteract each other. The other is that not all variants in the group are causal; rather, a large proportion is believed to be neutral. When a large proportion of variants are neutral, collapsing or aggregating all variants may not be an optimal solution. We propose two alternative methods, the adaptive clustering (AC) method and the adaptive weighting (AW) method, aiming to test rare variant association in the presence of neutral and/or protective variants. Both AC and AW are applicable to quantitative traits as well as qualitative traits. Results of extensive simulation studies show that AC and AW have similar power and that both have clear advantages, in power and in computational efficiency, compared with existing group-wise methods and existing data-driven methods that allow neutral and protective variants. We recommend the AW method because it is computationally more efficient than the AC method.
Affiliation(s)
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
|
1661
|
Chang D, Keinan A. Predicting signatures of "synthetic associations" and "natural associations" from empirical patterns of human genetic variation. PLoS Comput Biol 2012; 8:e1002600. [PMID: 22792059 PMCID: PMC3390358 DOI: 10.1371/journal.pcbi.1002600] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 11/15/2011] [Accepted: 05/23/2012] [Indexed: 11/18/2022] Open
Abstract
Genome-wide association studies (GWAS) have in recent years discovered thousands of associated markers for hundreds of phenotypes. However, associated loci often only explain a relatively small fraction of heritability and the link between association and causality has yet to be uncovered for most loci. Rare causal variants have been suggested as one scenario that may partially explain these shortcomings. Specifically, Dickson et al. recently reported simulations of rare causal variants that lead to association signals of common, tag single nucleotide polymorphisms, dubbed "synthetic associations". However, an open question is what practical implications synthetic associations have for GWAS. Here, we explore the signatures exhibited by such "synthetic associations" and their implications based on patterns of genetic variation observed in human populations, thus accounting for human evolutionary history, a force disregarded in previous simulation studies. This is made possible by human population genetic data from HapMap 3 consisting of both resequencing and array-based genotyping data for the same set of individuals from multiple populations. We report that synthetic associations tend to be further away from the underlying risk alleles compared to "natural associations" (i.e. associations due to underlying common causal variants), but to a much lesser extent than previously predicted, with both the age and the effect size of the risk allele playing a part in this phenomenon. We find that while a synthetic association has a lower probability of capturing causal variants within its linkage disequilibrium block, sequencing around the associated variant need not extend substantially to have a high probability of capturing at least one causal variant. We also show that the minor allele frequency of synthetic associations is lower than that of natural associations for most, but not all, loci that we explored. Finally, we find the variance in associated allele frequency to be a potential indicator of synthetic associations.
Affiliation(s)
- Diana Chang
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Program in Computational Biology and Medicine, Cornell University, Ithaca, New York, United States of America
- Alon Keinan
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
|
1662
|
Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Ann Hum Genet 2012; 76:246-60. [PMID: 22497480 DOI: 10.1111/j.1469-1809.2012.00706.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average R² (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes >0.4). We also applied our two tests to an adiposity data set to show their utility.
Affiliation(s)
- Wan-Yu Lin
- Department of Biostatistics, University of Alabama at Birmingham, USA
|
1663
|
Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data. Statistics in Biosciences 2012; 5:3-25. [PMID: 24489615 DOI: 10.1007/s12561-012-9067-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 10/28/2022]
Abstract
Massively parallel sequencing (MPS), since its debut in 2005, has transformed the field of genomic studies. These new sequencing technologies have resulted in the successful identification of causal variants for several rare Mendelian disorders. They have also begun to deliver on their promise to explain some of the missing heritability from genome-wide association studies (GWAS) of complex traits. We anticipate a rapidly growing number of MPS-based studies for a diverse range of applications in the near future. One crucial and nearly inevitable step is to detect SNPs and call genotypes at the detected polymorphic sites from the sequencing data. Here, we review statistical methods that have been proposed in the past five years for this purpose. In addition, we discuss emerging issues and future directions related to SNP detection and genotype calling from MPS data.
|
1664
|
Witte JS. Rare genetic variants and treatment response: sample size and analysis issues. Stat Med 2012; 31:3041-50. [PMID: 22736504 DOI: 10.1002/sim.5428] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/16/2011] [Accepted: 03/15/2012] [Indexed: 11/06/2022]
Abstract
Incorporating information about common genetic variants may help improve the design and analysis of clinical trials. For example, if genes impact response to treatment, one can pregenotype potential participants to screen out genetically determined nonresponders and substantially reduce the sample size and duration of a trial. Genetic associations with response to treatment are generally much larger than those observed for development of common diseases, as highlighted here by findings from genome-wide association studies. With the development and decreasing cost of next generation sequencing, more extensive genetic information - including rare variants - is becoming available on individuals treated with drugs and other therapies. We can use this information to evaluate whether rare variants impact treatment response. The sparseness of rare variants, however, raises issues of how the resulting data should be best analyzed. As shown here, simply evaluating the association between each rare variant and treatment response one-at-a-time will require enormous sample sizes. Combining the rare variants together can substantially reduce the required sample sizes, but requires a number of assumptions about the similarity among the rare variants' effects on treatment response. We have developed an empirical approach for aggregating and analyzing rare variants that limits such assumptions and works well under a range of scenarios. Such analyses provide a valuable opportunity to more fully decipher the genomic basis of response to treatment.
Affiliation(s)
- John S Witte
- Department of Epidemiology and Biostatistics, Institute for Human Genetics, University of California, San Francisco, CA 94143, U.S.A.
|
1665
|
Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol 2012; 36:561-71. [PMID: 22714994 DOI: 10.1002/gepi.21649] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 02/11/2012] [Revised: 04/13/2012] [Accepted: 05/09/2012] [Indexed: 11/07/2022]
Abstract
Next-generation sequencing technology will soon allow sequencing the whole genome of large groups of individuals, and thus will make directly testing rare variants possible. Currently, most existing methods for rare variant association studies essentially test the effect of a weighted combination of variants with different weighting schemes. Performance of these methods depends on the weights being used and no optimal weights are available. By putting large weights on rare variants and small weights on common variants, these methods target rare variants only, although increasing evidence shows that complex diseases are caused by both common and rare variants. In this paper, we analytically derive optimal weights under a certain criterion. Based on the optimal weights, we propose a Variable Weight Test for testing the effect of an Optimally Weighted combination of variants (VW-TOW). VW-TOW aims to test the effects of both rare and common variants. VW-TOW is applicable to both quantitative and qualitative traits, allows covariates, can control for population stratification, and is robust to directions of effects of causal variants. Extensive simulation studies and application to the Genetic Analysis Workshop 17 (GAW17) data show that VW-TOW is more powerful than existing methods either for testing effects of both rare and common variants or for testing effects of rare variants only.
Affiliation(s)
- Qiuying Sha
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan 49931, USA
|
1666
|
Ionita-Laza I, Makarov V, Buxbaum JD. Scan-statistic approach identifies clusters of rare disease variants in LRP2, a gene linked and associated with autism spectrum disorders, in three datasets. Am J Hum Genet 2012; 90:1002-13. [PMID: 22578327 DOI: 10.1016/j.ajhg.2012.04.010] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 01/25/2012] [Revised: 02/27/2012] [Accepted: 04/19/2012] [Indexed: 01/20/2023] Open
Abstract
Cluster-detection approaches, commonly used in epidemiology and astronomy, can be applied in the context of genetic sequence data for the identification of genetic regions significantly enriched with rare disease-risk variants (DRVs). Unlike existing association tests for sequence data, the goal of cluster-detection methods is to localize significant disease mutation clusters within a gene or region of interest. Here, we focus on a chromosome 2q replicated linkage region that is associated with autism spectrum disorder (ASD) and that has been sequenced in three independent datasets. We found that variants in one gene, LRP2, residing on 2q are associated with ASD in two datasets (the combined variable-threshold-test p value is 1.2 × 10⁻⁵). Using a cluster-detection method, we show that in the discovery and replication datasets, variants associated with ASD cluster preponderantly in 25 kb windows (adjusted p values are p₁ = 0.003 and p₂ = 0.002), and the two windows are highly overlapping. Furthermore, for the third dataset, a 25 kb region similar to those in the other two datasets shows significant evidence of enrichment of rare DRVs. The region implicated by all three studies is involved in ligand binding, suggesting that subtle alterations in either LRP2 expression or LRP2 primary sequence modulate the uptake of LRP2 ligands. BMP4 is a ligand of particular interest given its role in forebrain development, and modest changes in the binding of BMP4, which binds to LRP2 near the mutation cluster, might subtly affect development and could lead to autism-associated phenotypes.
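The windowed search described in this abstract can be sketched, under strong simplifying assumptions, as a fixed-width window slid across variant positions, scoring each window by how many more case alleles it holds than expected from the overall case fraction. The function name, the excess-count score, and the toy data are hypothetical; the paper's scan statistic and its adjusted p values are not reproduced here.

```python
def best_window(positions, case_counts, control_counts, width):
    """Return (start, end, excess) for the fixed-width window with the largest case excess.

    excess = observed case alleles in the window minus the number expected if the
    window had the same case fraction as the whole region (a simplified score).
    """
    total_case = sum(case_counts)
    total_all = total_case + sum(control_counts)
    case_fraction = total_case / total_all
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    best = None
    for i in order:
        start = positions[i]          # each window begins at a variant position
        end = start + width           # half-open interval [start, end)
        in_window = [j for j in order if start <= positions[j] < end]
        cases = sum(case_counts[j] for j in in_window)
        n = cases + sum(control_counts[j] for j in in_window)
        excess = cases - n * case_fraction
        if best is None or excess > best[2]:
            best = (start, end, excess)
    return best

# Toy data (hypothetical): rare-allele counts per variant in cases and controls.
positions = [1000, 5000, 12000, 40000, 90000]
cases = [3, 4, 2, 1, 1]
controls = [0, 1, 0, 5, 5]
print(best_window(positions, cases, controls, 25000))  # (1000, 26000, 4.0)
```

In practice the significance of the best window would have to be assessed against the null distribution of the maximum over all windows, for example by permuting case-control labels; that step is omitted from this sketch.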
|
1667
|
Kang G, Lin D, Hakonarson H, Chen J. Two-stage extreme phenotype sequencing design for discovering and testing common and rare genetic variants: efficiency and power. Hum Hered 2012; 73:139-47. [PMID: 22678112 DOI: 10.1159/000337300] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 07/05/2011] [Accepted: 02/10/2012] [Indexed: 01/10/2023] Open
Abstract
Next-generation sequencing technology provides an unprecedented opportunity to identify rare susceptibility variants. It is not yet financially feasible to perform whole-genome sequencing on a large number of subjects, and a two-stage design has been advocated as a practical option. In stage I, variants are discovered by sequencing the whole genomes of a small number of carefully selected individuals. In stage II, the discovered variants are genotyped in a large number of individuals to assess associations. Individuals with extreme phenotypes are typically selected in stage I. Using simulated data for unrelated individuals, we explore two important aspects of this two-stage design: the efficiency of discovering common and rare single-nucleotide polymorphisms (SNPs) in stage I and the impact of incomplete SNP discovery in stage I on the power of testing associations in stage II. We applied a sum test and a sum of squared score test for gene-based association analyses evaluating the power of the two-stage design. We obtained the following results from extensive simulation studies and analysis of the GAW17 dataset. When individuals with trait values more extreme than the 99.7-99th quantile were included in stage I, the two-stage design could achieve power equal to or even higher than that of the one-stage design if the rare causal variants had large effect sizes. In such a design, fewer than half of the total SNPs, including more than half of the causal SNPs, were discovered; these included nearly all SNPs with minor allele frequencies (MAFs) ≥5%, more than half of the SNPs with MAFs between 1% and 5%, and fewer than half of the SNPs with MAFs <1%. Although a one-stage design may be preferable to identify multiple rare variants having small to moderate effect sizes, our observations support using the two-stage design as a cost-effective option for next-generation sequencing studies.
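As a rough reference point for the stage I discovery rates quoted in this abstract, the sketch below computes the chance of seeing a variant at least once when individuals are sequenced at random. This is not the authors' simulation: it assumes random sampling and Hardy-Weinberg proportions, whereas extreme-phenotype selection is expected to enrich causal variants beyond this baseline. The function name and the sample size of 50 are hypothetical.

```python
def discovery_probability(maf: float, n_sequenced: int) -> float:
    """Probability that the minor allele appears at least once among the
    2 * n_sequenced chromosomes of randomly chosen diploid individuals.

    Assumption (not from the abstract): random sampling, no enrichment
    from extreme-phenotype selection.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must lie in [0, 1]")
    if n_sequenced < 0:
        raise ValueError("n_sequenced must be non-negative")
    return 1.0 - (1.0 - maf) ** (2 * n_sequenced)


# Hypothetical stage I size of 50 individuals, across three MAF classes.
for maf in (0.001, 0.01, 0.05):
    print(f"MAF {maf:.3f}: P(discovered) = {discovery_probability(maf, 50):.3f}")
```

The steep drop in discovery probability for the rarest class is consistent in direction with the abstract's observation that fewer than half of SNPs with MAF <1% were found in stage I.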
Affiliation(s)
- Guolian Kang
- Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104, USA
|
1668
|
Fang S, Sha Q, Zhang S. Two adaptive weighting methods to test for rare variant associations in family-based designs. Genet Epidemiol 2012; 36:499-507. [PMID: 22674630 DOI: 10.1002/gepi.21646] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 11/17/2011] [Revised: 04/26/2012] [Accepted: 04/26/2012] [Indexed: 11/06/2022]
Abstract
Although next-generation DNA sequencing technologies have made rare variant association studies feasible and affordable, the development of powerful statistical methods for rare variant association studies is still under way. Most of the existing methods for rare variant association studies compare the number of rare mutations in a group of rare variants (in a gene or a pathway) between cases and controls. However, these methods assume that all causal variants are risk variants for disease. Recently, several methods that are robust to the direction and magnitude of effects of causal variants have been proposed. However, they are applicable to unrelated individuals only, whereas family data have been shown to improve power to detect rare variants. In this article, we propose two adaptive weighting methods for rare variant association studies based on family data for quantitative traits. Using extensive simulation studies, we evaluate and compare our proposed methods with two methods based on the weights proposed by Madsen and Browning. Our results show that both proposed methods are robust to population stratification, robust to the direction and magnitude of the effects of causal variants, and more powerful than the methods using weights suggested by Madsen and Browning, especially when both risk and protective variants are present.
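For readers unfamiliar with the comparison baseline, the following is a minimal sketch of the Madsen and Browning weighting scheme for unrelated individuals, in which each variant is down-weighted by the estimated standard deviation of its allele count so that rarer variants contribute more. It does not implement the adaptive, family-based methods proposed in this abstract. The function names and the `reference_rows` argument (the individuals used to estimate allele frequency, unaffected individuals in the original case-control setting) are illustrative assumptions.

```python
import math


def madsen_browning_weights(genotypes, reference_rows):
    """Per-variant weights w_v = sqrt(n * q_v * (1 - q_v)).

    genotypes: one list per individual of minor-allele counts (0, 1 or 2).
    reference_rows: indices of the individuals used to estimate q_v, with
    the add-one adjustment q_v = (m_v + 1) / (2 * n_ref + 2).
    """
    n_total = len(genotypes)
    n_ref = len(reference_rows)
    weights = []
    for v in range(len(genotypes[0])):
        minor_count = sum(genotypes[i][v] for i in reference_rows)
        q = (minor_count + 1) / (2 * n_ref + 2)
        weights.append(math.sqrt(n_total * q * (1 - q)))
    return weights


def weighted_scores(genotypes, weights):
    """Genetic score per individual: sum of allele counts divided by weights."""
    return [sum(g / w for g, w in zip(row, weights)) for row in genotypes]


# Toy data: four individuals, two variants; rows 0 and 3 act as the reference.
toy = [[0, 0], [1, 0], [0, 2], [0, 0]]
w = madsen_browning_weights(toy, reference_rows=[0, 3])
print(weighted_scores(toy, w))
```

Because every allele count enters the score with a positive weight, a score of this kind cannot distinguish risk from protective variants, which is the limitation the abstract's adaptive weights are meant to address.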
Affiliation(s)
- Shurong Fang
- Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan 49931, USA
|
1669
|
Köttgen A, Yang Q, Shimmin LC, Tin A, Schaeffer C, Coresh J, Liu X, Rampoldi L, Hwang SJ, Boerwinkle E, Hixson JE, Kao WHL, Fox CS. Association of estimated glomerular filtration rate and urinary uromodulin concentrations with rare variants identified by UMOD gene region sequencing. PLoS One 2012; 7:e38311. [PMID: 22693617 PMCID: PMC3365030 DOI: 10.1371/journal.pone.0038311] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 02/06/2012] [Accepted: 05/08/2012] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Recent genome-wide association studies (GWAS) have identified common variants in the UMOD region associated with kidney function and disease in the general population. To identify novel rare variants as well as common variants that may account for this GWAS signal, the exons and 4 kb upstream region of UMOD were sequenced. METHODOLOGY/PRINCIPAL FINDINGS Individuals (n = 485) were selected based on presence of the GWAS risk haplotype and chronic kidney disease (CKD) in the ARIC Study and on the extremes of the UMOD gene product, uromodulin, in urine (Tamm Horsfall protein, THP) in the Framingham Heart Study (FHS). Targeted sequencing was conducted using capillary-based Sanger sequencing (3730 DNA Analyzer). Variants were tested for association with THP concentrations and estimated glomerular filtration rate (eGFR), and identified non-synonymous coding variants were genotyped in up to 22,546 follow-up samples. Twenty-four and 63 variants were identified in the 285 ARIC and 200 FHS participants, respectively. In both studies combined, there were 33 common and 54 rare (MAF<0.05) variants. Five non-synonymous rare variants were identified in FHS; borderline enrichment of rare variants was found in the extremes of THP (SKAT p-value = 0.08). Only V458L was associated with THP in the FHS general-population validation sample (p = 9 × 10⁻³, n = 2,522), but did not show direction-consistent and significant association with eGFR in both the ARIC (n = 14,635) and FHS (n = 7,520) validation samples. Pooling all non-synonymous rare variants except V458L together showed non-significant associations with THP and eGFR in the FHS validation sample. Functional studies of V458L revealed no alterations in protein trafficking. CONCLUSIONS/SIGNIFICANCE Multiple novel rare variants in the UMOD region were identified, but none were consistently associated with eGFR in two independent study samples. Only V458L had modest association with THP levels in the general population and thus could not account for the observed GWAS signal.
Affiliation(s)
- Anna Köttgen
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America
- Renal Division, Freiburg University Clinic, Freiburg, Germany
| | - Qiong Yang
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
| | - Lawrence C. Shimmin
- Human Genetics Center, Division of Epidemiology and Disease Control, UT-Houston School of Public Health, Houston, Texas, United States of America
| | - Adrienne Tin
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America
| | - Céline Schaeffer
- Dulbecco Telethon Institute and Division of Genetics and Cell Biology, San Raffaele Scientific Institute, Milan, Italy
| | - Josef Coresh
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America
- Welch Center for Prevention, Epidemiology and Clinical Research, Johns Hopkins Medical Institutions, Baltimore, Maryland, United States of America
| | - Xuan Liu
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
| | - Luca Rampoldi
- Dulbecco Telethon Institute and Division of Genetics and Cell Biology, San Raffaele Scientific Institute, Milan, Italy
| | - Shih-Jen Hwang
- NHLBI's Framingham Heart Study and the Center for Population Studies, Framingham, Massachusetts, United States of America
| | - Eric Boerwinkle
- Human Genetics Center, Division of Epidemiology and Disease Control, UT-Houston School of Public Health, Houston, Texas, United States of America
| | - James E. Hixson
- Human Genetics Center, Division of Epidemiology and Disease Control, UT-Houston School of Public Health, Houston, Texas, United States of America
| | - W. H. Linda Kao
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America
- Welch Center for Prevention, Epidemiology and Clinical Research, Johns Hopkins Medical Institutions, Baltimore, Maryland, United States of America
| | - Caroline S. Fox
- NHLBI's Framingham Heart Study and the Center for Population Studies, Framingham, Massachusetts, United States of America
- Division of Endocrinology, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
|
1670
|
|
1671
|
Abstract
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants explain only very small percentages of the heritabilities. To account for locus and allelic heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single-marker analysis, the advantage afforded by U-statistics-based methods is large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and rare variant association studies are discussed.
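The idea of linking genomic similarity to phenotype similarity can be sketched as a pairwise average. The example below is one generic formulation chosen for illustration, not necessarily any of the specific statistics reviewed in the article: it assumes an identity-by-state (IBS) genotype kernel and a centered cross-product as the phenotype similarity, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def ibs_similarity(g1, g2):
    """Average identity-by-state sharing between two genotype vectors
    coded as minor-allele counts (0, 1, 2); the result lies in [0, 1]."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2 * len(g1))


def u_statistic(genotypes, phenotypes):
    """Average over all pairs i < j of
    ibs_similarity(g_i, g_j) * (y_i - ybar) * (y_j - ybar)."""
    y_bar = mean(phenotypes)
    pairs = list(combinations(range(len(phenotypes)), 2))
    total = sum(
        ibs_similarity(genotypes[i], genotypes[j])
        * (phenotypes[i] - y_bar)
        * (phenotypes[j] - y_bar)
        for i, j in pairs
    )
    return total / len(pairs)


# Toy data: genotype groups line up perfectly with the trait values.
toy_genotypes = [[0, 0], [0, 0], [2, 2], [2, 2]]
toy_phenotypes = [1.0, 1.0, -1.0, -1.0]
print(u_statistic(toy_genotypes, toy_phenotypes))
```

In practice the statistic would be compared against a null distribution, for example by permuting phenotypes across individuals; that step is omitted here to keep the sketch short.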
Affiliation(s)
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
1672
|
Zhang F, Chen Y, Liu C, Lu T, Yan H, Ruan Y, Yue W, Wang L, Zhang D. Systematic association analysis of microRNA machinery genes with schizophrenia informs further study. Neurosci Lett 2012; 520:47-50. [PMID: 22595464 DOI: 10.1016/j.neulet.2012.05.028] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/07/2012] [Revised: 04/23/2012] [Accepted: 05/05/2012] [Indexed: 10/28/2022]
Abstract
microRNAs (miRNAs) play a vital role in development via the post-transcriptional regulation of most genes. Variation in the miRNA machinery pathway proteins which mediate the biogenesis, maturation, transportation, and functioning of miRNAs might be relevant to human traits. In this work, we explored the role of 59 miRNA machinery genes in schizophrenia (SZ). Association analysis of 967 single nucleotide polymorphisms within these genes detected that an intronic polymorphism of EIF4ENIF1, rs7289941, was significantly associated with SZ (P=4.10E-5). We failed to replicate this result in a validation sample comprising 1027 healthy controls and 1012 SZ cases, and the combined data yielded nominal significance (P=0.013). We conducted a gene-based association analysis using VEGAS and SKAT, and found seven associated genes in total, including EIF4ENIF1, PIWIL2, and DGCR8, but none survived correction for multiple testing. Taken together, our data do not provide strong support for the association of common variants within miRNA machinery genes with SZ in the Han Chinese population, but implicate several promising candidate genes for further research.
Affiliation(s)
- Fuquan Zhang
- Institute of Mental Health, Peking University, PR China.
|
1673
|
Statistical Challenges in Sequence-Based Association Studies with Population- and Family-Based Designs. Statistics in Biosciences 2012. [DOI: 10.1007/s12561-012-9062-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022]
|
1674
|
Liu DJ, Leal SM. SEQCHIP: a powerful method to integrate sequence and genotype data for the detection of rare variant associations. Bioinformatics 2012; 28:1745-51. [PMID: 22556370 DOI: 10.1093/bioinformatics/bts263] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/26/2022]
Abstract
MOTIVATION Next-generation sequencing greatly increases the capacity to detect rare-variant complex-trait associations. However, it is still expensive to sequence a large number of samples and therefore often small datasets are used. Given cost constraints, a potentially more powerful two-step strategy is to sequence a subset of the sample to discover variants, and genotype the identified variants in the remaining sample. If only cases are sequenced, directly combining sequence and genotype data will lead to inflated type-I errors in rare-variant association analysis. Although several methods have been developed to correct for the bias, they are either underpowered or theoretically invalid. We proposed a new method SEQCHIP to integrate genotype and sequence data, which can be used with most existing rare-variant tests. RESULTS It is demonstrated using both simulated and real datasets that the SEQCHIP method has controlled type-I errors, and is substantially more powerful than all other currently available methods. AVAILABILITY SEQCHIP is implemented in an R-Package and is available at http://linkage.rockefeller.edu/suzanne/seqchip/Seqchip.html.
Affiliation(s)
- Dajiang J Liu
- Department of Biostatistics, Center of Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA.
|
1675
|
Current World Literature. Curr Opin Cardiol 2012; 27:318-26. [DOI: 10.1097/hco.0b013e328352dfaf] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
1676
|
Chen GK, Chen G, Wei P, DeStefano AL. Incorporating biological information into association studies of sequencing data. Genet Epidemiol 2012; 35 Suppl 1:S29-34. [PMID: 22128055 DOI: 10.1002/gepi.20646] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
Affiliation(s)
- Gary K Chen
- Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
|
1677
|
Liu DJ, Leal SM. A unified framework for detecting rare variant quantitative trait associations in pedigree and unrelated individuals via sequence data. Hum Hered 2012; 73:105-22. [PMID: 22555759 DOI: 10.1159/000336293] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 10/17/2011] [Accepted: 01/07/2012] [Indexed: 11/19/2022] Open
Abstract
OBJECTIVES There is great interest in sequencing unrelated or pedigree samples to detect rare variant quantitative trait associations. In order to reduce the cost of sequencing and improve power, many studies sequence selected samples with extreme traits. Existing methods for detecting rare variant associations were developed for unrelated samples. Methods are needed to analyze (selected or randomly ascertained) pedigree samples. METHODS We propose a unified framework of modeling extreme trait genetic associations (MEGA) with rare variants. Using MEGA and appropriate permutation algorithms, many rare variant tests can be extended to family data. As an application, we compared study designs using both sib-pairs and unrelated individuals. Extensive simulations were carried out using realistic population genetic and complex trait models. RESULTS It is demonstrated that when extreme sampling is implemented within equal-sized cohorts of unrelated individuals or sib-pairs, analyzing unrelated individuals is consistently more powerful than studying sib-pairs. A higher proportion of rare variants can be identified through sequencing unrelated samples compared to sibs. Alternatively, if samples are ascertained using fixed thresholds from an infinite-sized population, sequencing one sib with the most extreme trait from each extreme concordant sib-pair is consistently the most powerful design. CONCLUSIONS MEGA will play an important role in the analysis of sequence-based genetic association studies.
Affiliation(s)
- Dajiang J Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
|
1678
|
Bacanu SA. On optimal gene-based analysis of genome scans. Genet Epidemiol 2012; 36:333-9. [PMID: 22508187 DOI: 10.1002/gepi.21625] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 09/21/2011] [Revised: 12/22/2011] [Accepted: 01/30/2012] [Indexed: 11/06/2022]
Abstract
Univariate analysis of markers has modest power when there are multiple causal variants within a gene. Under this scenario, combining the effects of all variants from a gene in a gene-wide statistic is thought to increase power. However, it is not really clear (1) what is the performance of most commonly used gene-wide methods for whole genome scans and (2) how scalable these methods are for more computationally intensive analyses, e.g. analysis of genome-wide sequence data. We attempt to answer these questions by using realistic simulations to assess the performance of a range of gene-based methods: (1) commonly used, e.g. VEGAS and GATES; (2) less commonly used, e.g. Simes, adaptive sum (aSUM), and kernel methods; and (3) a combination of univariate and multivariate tests we proposed for the analysis of markers in linkage disequilibrium. Simes is the fastest method and has good power for single causal variant models. aSUM method has good power for multiple causal variant models, especially at lower gene lengths. Our proposed statistic yields good power for all causal models. Given the extreme data volumes coming from sequencing studies, we recommend a two step analysis of genome scans. The initial step uses the very fast Simes procedure to flag possibly interesting genes. The second step refines interesting signals by using more computationally intensive methods, e.g. (1) aSUM for shorter and (2) VEGAS for larger gene lengths. Alternatively, genome scans can be analyzed using only our proposed method while sacrificing only a modest amount of power.
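The fast first step recommended in this abstract can be illustrated with the standard Simes combination of per-marker p-values, min over k of m * p_(k) / k for the sorted p-values p_(1) <= ... <= p_(m). The sketch below uses the plain Simes formula with no adjustment for linkage disequilibrium; the gene names, p-values and screening threshold are made up for illustration, and `flag_genes` is a hypothetical helper rather than part of any published tool.

```python
def simes_p_value(p_values):
    """Gene-level Simes p-value from the per-marker p-values of one gene."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    ordered = sorted(p_values)
    m = len(ordered)
    # rank is the 1-based position of each p-value in ascending order
    return min(1.0, min(m * p / rank for rank, p in enumerate(ordered, start=1)))


def flag_genes(marker_p_values_by_gene, threshold):
    """Step one of a two-step scan: keep genes whose Simes p-value falls
    below a (hypothetical) screening threshold for slower follow-up tests."""
    return sorted(
        gene
        for gene, p_values in marker_p_values_by_gene.items()
        if simes_p_value(p_values) < threshold
    )


# Made-up marker p-values for two made-up genes.
example = {
    "GENE_A": [0.01, 0.04, 0.5],
    "GENE_B": [0.3, 0.6, 0.9],
}
print(flag_genes(example, threshold=0.05))
```

Sorting dominates the cost, so the screen runs in O(m log m) per gene, which is in line with the abstract's description of Simes as the fastest of the methods compared.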
|
1679
|
Neale BM, Kou Y, Liu L, Ma'ayan A, Samocha KE, Sabo A, Lin CF, Stevens C, Wang LS, Makarov V, Polak P, Yoon S, Maguire J, Crawford EL, Campbell NG, Geller ET, Valladares O, Shafer C, Liu H, Zhao T, Cai G, Lihm J, Dannenfelser R, Jabado O, Peralta Z, Nagaswamy U, Muzny D, Reid JG, Newsham I, Wu Y, Lewis L, Han Y, Voight BF, Lim E, Rossin E, Kirby A, Flannick J, Fromer M, Shakir K, Fennell T, Garimella K, Banks E, Poplin R, Gabriel S, DePristo M, Wimbish JR, Boone BE, Levy SE, Betancur C, Sunyaev S, Boerwinkle E, Buxbaum JD, Cook EH, Devlin B, Gibbs RA, Roeder K, Schellenberg GD, Sutcliffe JS, Daly MJ. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 2012; 485:242-5. [PMID: 22495311 PMCID: PMC3613847 DOI: 10.1038/nature11011] [Citation(s) in RCA: 1278] [Impact Index Per Article: 106.5] [Received: 09/13/2011] [Accepted: 03/06/2012] [Indexed: 01/21/2023]
Abstract
Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, is consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.
Affiliation(s)
- Benjamin M. Neale
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Yan Kou
- Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York, New York, 10029
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
| | - Li Liu
- Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15232
| | - Avi Ma'ayan
- Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York, New York, 10029
| | - Kaitlin E. Samocha
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Aniko Sabo
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Chiao-Feng Lin
- Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, 19104
| | - Christine Stevens
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Li-San Wang
- Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, 19104
| | - Vladimir Makarov
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
- Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, 10029
| | - Paz Polak
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
- Division of Genetics, Department of Medicine Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts, 02115
| | - Seungtai Yoon
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
- Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, 10029
| | - Jared Maguire
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Emily L. Crawford
- Vanderbilt Brain Institute, Departments of Molecular Physiology & Biophysics and Psychiatry, Vanderbilt University, Nashville, Tennessee, 37232
| | - Nicholas G. Campbell
- Vanderbilt Brain Institute, Departments of Molecular Physiology & Biophysics and Psychiatry, Vanderbilt University, Nashville, Tennessee, 37232
| | - Evan T. Geller
- Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, 19104
| | - Otto Valladares
- Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, 19104
| | - Chad Shafer
- Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15232
| | - Han Liu
- Biostatistics Department and Computer Science Department, Johns Hopkins University, Baltimore, Maryland, 21205
| | - Tuo Zhao
- Biostatistics Department and Computer Science Department, Johns Hopkins University, Baltimore, Maryland, 21205
| | - Guiqing Cai
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
- Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, 10029
| | - Jayon Lihm
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
- Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, 10029
| | - Ruth Dannenfelser
- Pharmacology and Systems Therapeutics, Mount Sinai School of Medicine, New York, New York, 10029
| | - Omar Jabado
- Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, 10029
| | - Zuleyma Peralta
- Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, 10029
| | - Uma Nagaswamy
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Donna Muzny
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Jeffrey G. Reid
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Irene Newsham
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Yuanqing Wu
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Lora Lewis
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Yi Han
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Benjamin F. Voight
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
- Department of Pharmacology, University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania 19104
| | - Elaine Lim
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Elizabeth Rossin
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Andrew Kirby
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Jason Flannick
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Menachem Fromer
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Khalid Shakir
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Tim Fennell
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Kiran Garimella
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Eric Banks
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Ryan Poplin
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Stacey Gabriel
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Mark DePristo
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
| | - Jack R. Wimbish
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, 35806
| | - Braden E. Boone
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, 35806
| | - Shawn E. Levy
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, 35806
| | - Catalina Betancur
- INSERM U952 and CNRS UMR 7224 and UPMC Univ Paris 06, 75005 Paris, France
| | - Shamil Sunyaev
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
- Division of Genetics, Department of Medicine, Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts, 02115
| | - Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
- Human Genetics Center, University of Texas Health Science Center at Houston, Houston, Texas, 77030
| | - Joseph D. Buxbaum
- Seaver Autism Center for Research and Treatment, Mount Sinai School of Medicine, New York, New York, 10029
- Department of Psychiatry, Mount Sinai School of Medicine, New York, New York, 10029
- Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, New York, 10029
- Friedman Brain Institute, Mount Sinai School of Medicine, New York, New York, 10029
| | - Edwin H. Cook
- Department of Psychiatry, University of Illinois at Chicago, Chicago, Illinois, 60608
| | - Bernie Devlin
- Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, 15213
| | - Richard A. Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, 77030
| | - Kathryn Roeder
- Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania, 15232
| | - Gerard D. Schellenberg
- Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, 19104
| | - James S. Sutcliffe
- Vanderbilt Brain Institute, Departments of Molecular Physiology & Biophysics and Psychiatry, Vanderbilt University, Nashville, Tennessee, 37232
| | - Mark J. Daly
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, 02114
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, Massachusetts, 02142
|
1680
|
Joint rare variant association test of the average and individual effects for sequencing studies. PLoS One 2012; 7:e32485. [PMID: 22468164 PMCID: PMC3309869 DOI: 10.1371/journal.pone.0032485] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 11/08/2011] [Accepted: 01/30/2012] [Indexed: 11/19/2022] Open
Abstract
For many complex traits, single nucleotide polymorphisms (SNPs) identified from genome-wide association studies (GWAS) only explain a small percentage of heritability. Next generation sequencing technology makes it possible to explore unexplained heritability by identifying rare variants (RVs). Existing tests designed for RVs look for optimal strategies to combine information across multiple variants. Many of the tests have good power when the true underlying associations are either in the same direction or in opposite directions. We propose three tests for examining the association between a phenotype and RVs, where two of them jointly consider the common association across RVs and the individual deviations from the common effect. On one hand, similar to some of the best existing methods, the individual deviations are modeled as random effects to borrow information across multiple RVs. On the other hand, unlike the existing methods which pool individual effects towards zero, we pool them towards a possibly non-zero common effect by adding a pooled variant into the model. The common effect and the individual effects are jointly tested. We show through extensive simulations that at least one of the three tests proposed here is the most powerful or very close to being the most powerful in various settings of true models. This is appealing in practice because the direction and size of the true effects of the associated RVs are unknown. Researchers can apply the developed tests to improve power under a wide range of true models.
|
1681
|
Kinnamon DD, Hershberger RE, Martin ER. Reconsidering association testing methods using single-variant test statistics as alternatives to pooling tests for sequence data with rare variants. PLoS One 2012; 7:e30238. [PMID: 22363423 PMCID: PMC3281828 DOI: 10.1371/journal.pone.0030238] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 06/03/2011] [Accepted: 12/16/2011] [Indexed: 12/14/2022] Open
Abstract
Association tests that pool minor alleles into a measure of burden at a locus have been proposed for case-control studies using sequence data containing rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants. Early studies proposing pooling tests dismissed methods for locus-wide inference using nonnegative single-variant test statistics based on unrealistic comparisons. However, such methods are robust to the inclusion of neutral and protective variants and therefore may be more useful than previously appreciated. In fact, some recently proposed methods derived within different frameworks are equivalent to performing inference on weighted sums of squared single-variant score statistics. In this study, we compared two existing methods for locus-wide inference using nonnegative single-variant test statistics to two widely cited pooling tests under more realistic conditions. We established analytic results for a simple model with one rare risk and one rare neutral variant, which demonstrated that pooling tests were less powerful than even Bonferroni-corrected single-variant tests in most realistic situations. We also performed simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of missing genotypes. In all scenarios considered, existing methods using nonnegative single-variant test statistics had power comparable to or greater than two widely cited pooling tests. Moreover, in disease models with only rare risk variants, an existing method based on the maximum single-variant Cochran-Armitage trend chi-square statistic in the locus had power comparable to or greater than another existing method closely related to some recently proposed methods. We conclude that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for devising powerful association tests in sequence data with rare variants.
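The masking effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' method: it assumes hypothetical carrier counts, assumes that no subject carries both variants, and uses a plain carrier-versus-noncarrier Pearson chi-square in place of the Cochran-Armitage trend statistic named in the abstract. It compares a collapsed ("pooled") carrier test with the maximum single-variant statistic when one risk variant and one protective variant sit in the same locus.

```python
def chi_square_2x2(case_carriers, case_total, control_carriers, control_total):
    """Pearson chi-square (1 df) for carrier status versus case/control status."""
    a = case_carriers
    b = case_total - case_carriers
    c = control_carriers
    d = control_total - control_carriers
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical study size and carrier counts (illustrative numbers only).
N_CASES = N_CONTROLS = 1000
variants = {
    "risk": (30, 10),        # (carriers among cases, carriers among controls)
    "protective": (10, 30),
}

# Single-variant statistics, one per variant.
single = {
    name: chi_square_2x2(ca, N_CASES, co, N_CONTROLS)
    for name, (ca, co) in variants.items()
}
max_single = max(single.values())

# Pooled test: collapse all rare-allele carriers into one indicator.
# Assumption: no subject carries both variants, so carrier counts simply add.
pooled_cases = sum(ca for ca, _ in variants.values())
pooled_controls = sum(co for _, co in variants.values())
pooled = chi_square_2x2(pooled_cases, N_CASES, pooled_controls, N_CONTROLS)

print("single-variant statistics:", single)
print("maximum single-variant statistic:", max_single)
print("pooled (collapsed) statistic:", pooled)
```

With these made-up counts the opposing effects cancel in the collapsed carrier count, so the pooled statistic carries no signal while each single-variant statistic does. A real analysis would also need a multiple-testing correction for the maximum statistic.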
Affiliation(s)
- Daniel D. Kinnamon
- Dr. John T. Macdonald Foundation Department of Human Genetics, Miller School of Medicine, University of Miami, Miami, Florida, United States of America
| | - Ray E. Hershberger
- Cardiovascular Division, Miller School of Medicine, University of Miami, Miami, Florida, United States of America
| | - Eden R. Martin
- Dr. John T. Macdonald Foundation Department of Human Genetics, Miller School of Medicine, University of Miami, Miami, Florida, United States of America
|
1682
|
Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. The empirical power of rare variant association methods: results from Sanger sequencing in 1,998 individuals. PLoS Genet 2012; 8:e1002496. [PMID: 22319458 PMCID: PMC3271058 DOI: 10.1371/journal.pgen.1002496] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Received: 05/26/2011] [Accepted: 12/08/2011] [Indexed: 01/09/2023] Open
Abstract
The role of rare genetic variation in the etiology of complex disease remains unclear. However, the development of next-generation sequencing technologies offers the experimental opportunity to address this question. Several novel statistical methodologies have been recently proposed to assess the contribution of rare variation to complex disease etiology. Nevertheless, no empirical estimates comparing their relative power are available. We therefore assessed the parameters that influence their statistical power in 1,998 individuals Sanger-sequenced at seven genes by modeling different distributions of effect, proportions of causal variants, and direction of the associations (deleterious, protective, or both) in simulated continuous trait and case/control phenotypes. Our results demonstrate that the power of recently proposed statistical methods depends strongly on the underlying hypotheses concerning the relationship of phenotypes with each of these three factors. No method demonstrates consistently acceptable power despite this large sample size, and the performance of each method depends upon the underlying assumption of the relationship between rare variants and complex traits. Sensitivity analyses are therefore recommended to compare the stability of the results arising from different methods, and promising results should be replicated using the same method in an independent sample. These findings provide guidance in the analysis and interpretation of the role of rare base-pair variation in the etiology of complex traits and diseases.
Affiliation(s)
- Martin Ladouceur
- Department of Human Genetics, McGill University, Montreal, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada
| | - Zari Dastani
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Canada
| | - Yurii S. Aulchenko
- Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands
- Institute of Cytology and Genetics SD RAS, Novosibirsk, Russia
| | - Celia M. T. Greenwood
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Canada
- Department of Oncology, McGill University, Montreal, Canada
| | - J. Brent Richards
- Department of Human Genetics, McGill University, Montreal, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Canada
- Department of Medicine, Jewish General Hospital, McGill University, Montreal, Canada
- Twin Research and Genetic Epidemiology, King's College London, London, United Kingdom
|
1683
|
Tomlinson I. Colorectal cancer genetics: from candidate genes to GWAS and back again. Mutagenesis 2012; 27:141-2. [DOI: 10.1093/mutage/ger072] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/14/2023] Open
|
1684
|
Abstract
Identity-by-descent (IBD) mapping tests whether cases share more segments of IBD around a putative causal variant than do controls. These segments of IBD can be accurately detected from genome-wide SNP data. We investigate the power of IBD mapping relative to that of SNP association testing for genome-wide case-control SNP data. Our focus is particularly on rare variants, as these tend to be more recent and hence more likely to have recent shared ancestry. We simulate data from both large and small populations and find that the relative performance of IBD mapping and SNP association testing depends on population demographic history and the strength of selection against causal variants. We also present an IBD mapping analysis of a type 1 diabetes data set. In those data we find that we can detect association only with the HLA region using IBD mapping. Overall, our results suggest that IBD mapping may have higher power than association analysis of SNP data when multiple rare causal variants are clustered within a gene. However, for outbred populations, very large sample sizes may be required for genome-wide significance unless the causal variants have strong effects.
|
1685
|
Daye ZJ, Li H, Wei Z. A powerful test for multiple rare variants association studies that incorporates sequencing qualities. Nucleic Acids Res 2012; 40:e60. [PMID: 22262732 PMCID: PMC3340416 DOI: 10.1093/nar/gks024] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 11/24/2022] Open
Abstract
Next-generation sequencing data will soon become routinely available for association studies between complex traits and rare variants. Sequencing data, however, are characterized by the presence of sequencing errors at each individual genotype. This makes it especially challenging to perform association studies of rare variants, which, due to their low minor allele frequencies, can be easily perturbed by genotype errors. In this article, we develop the quality-weighted multivariate score association test (qMSAT), a new procedure that allows powerful association tests between complex traits and multiple rare variants in the presence of sequencing errors. Simulation results based on quality scores from real data show that the qMSAT often dominates current methods that do not utilize quality information. In particular, the qMSAT can dramatically increase power over existing methods under moderate sample sizes and relatively low coverage. Moreover, in an obesity data study, we used the qMSAT to identify two functional regions (MGLL promoter and MGLL 3′-untranslated region) where rare variants are associated with extreme obesity. Due to the high cost of sequencing data, the qMSAT is especially valuable for large-scale studies involving rare variants, as it can potentially increase power without additional experimental cost. qMSAT is freely available at http://qmsat.sourceforge.net/.
Affiliation(s)
- Z John Daye
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
|
1686
|
Xing G, Lin CY, Wooding SP, Xing C. Blindly using Wald's test can miss rare disease-causal variants in case-control association studies. Ann Hum Genet 2012; 76:168-77. [PMID: 22256951 DOI: 10.1111/j.1469-1809.2011.00700.x] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 01/15/2023]
Abstract
Four tests (the likelihood ratio (LR) test, Wald's test, the score test and the exact test) are commonly employed in genetic association studies. On comparison of the four tests, we found that Wald's test, popular in genome-wide screens due to its low computational demands, exhibited a paradoxical behaviour in that the test statistic decreased as the effect size of the variant increased, resulting in a loss of power. The LR test always achieved the most significant P-values, followed by the exact test. We further examined the results in a real data set composed of high- and low-cholesterol subjects from the Dallas Heart Study (DHS). We also compared the single-variant LR test with two multi-variant analysis approaches (the burden test and the C-alpha test) in analysing the sequencing data by simulation. Our results call for caution in using Wald's test in genome-wide case-control association studies and suggest that the LR test is a better alternative in spite of its computational demands.
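The paradoxical behaviour of Wald's test can be reproduced on a small 2x2 table. The sketch below is an illustration under stated assumptions, not the authors' analysis: the carrier counts are hypothetical, the Wald statistic is computed from the log odds ratio and its usual large-sample variance (which matches logistic regression with a single binary carrier covariate), and the LR statistic is the standard likelihood-ratio (G) statistic for independence in a 2x2 table.

```python
import math


def wald_statistic(a, b, c, d):
    """Wald chi-square for the log odds ratio of a 2x2 table (all cells > 0)."""
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or ** 2 / variance


def lr_statistic(a, b, c, d):
    """Likelihood-ratio (G) statistic for independence in a 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))


# Hypothetical data: 1000 cases with 40 carriers, 1000 controls with
# progressively fewer carriers, so the estimated odds ratio keeps growing.
N_CASES = N_CONTROLS = 1000
CASE_CARRIERS = 40

results = []
for control_carriers in (4, 2, 1):
    a, b = CASE_CARRIERS, N_CASES - CASE_CARRIERS
    c, d = control_carriers, N_CONTROLS - control_carriers
    odds_ratio = (a * d) / (b * c)
    results.append((odds_ratio, wald_statistic(a, b, c, d), lr_statistic(a, b, c, d)))

for odds_ratio, wald, lr in results:
    print(f"OR={odds_ratio:7.2f}  Wald={wald:6.2f}  LR={lr:6.2f}")
```

In this toy setting the variance of the log odds ratio is dominated by the reciprocal of the smallest cell, so the Wald statistic shrinks as the control carrier count approaches one even though the estimated odds ratio grows, while the LR statistic keeps increasing.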
Affiliation(s)
- Guan Xing
- Bristol-Myers Squibb Company, Pennington, NJ, USA
|
1687
|
Pongpanich M, Neely ML, Tzeng JY. On the Aggregation of Multimarker Information for Marker-Set and Sequencing Data Analysis: Genotype Collapsing vs. Similarity Collapsing. Front Genet 2012; 2:110. [PMID: 22303404 PMCID: PMC3266618 DOI: 10.3389/fgene.2011.00110] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 09/15/2011] [Accepted: 12/25/2011] [Indexed: 12/12/2022] Open
Abstract
Methods that collapse information across genetic markers when searching for association signals are gaining momentum in the literature. Although originally developed to achieve a better balance between retaining information and controlling degrees of freedom when performing multimarker association analysis, these methods have recently been proven to be a powerful tool for identifying rare variants that contribute to complex phenotypes. The information among markers can be collapsed at the genotype level, which focuses on the mean of genetic information, or the similarity level, which focuses on the variance of genetic information. The aim of this work is to understand the strengths and weaknesses of these two collapsing strategies. Our results show that neither collapsing strategy outperforms the other across all simulated scenarios. Two factors that dominate the performance of these strategies are the signal-to-noise ratio and the underlying genetic architecture of the causal variants. Genotype collapsing is more sensitive to the marker set being contaminated by noise loci than similarity collapsing. In addition, genotype collapsing performs best when the genetic architecture of the causal variants is not complex (e.g., causal loci with similar effects and similar frequencies). Similarity collapsing is more robust as the complexity of the genetic architecture increases and outperforms genotype collapsing when the genetic architecture of the marker set becomes more sophisticated (e.g., causal loci with various effect sizes or frequencies and potential non-linear or interactive effects). Because the underlying genetic architecture is not known a priori, we also considered a two-stage analysis that combines the two top-performing methods from different collapsing strategies. We find that it is reasonably robust across all simulated scenarios.
Affiliation(s)
- Monnat Pongpanich
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
|
1688
|
Ionita-Laza I, Makarov V, Yoon S, Raby B, Buxbaum J, Nicolae DL, Lin X. Finding disease variants in Mendelian disorders by using sequence data: methods and applications. Am J Hum Genet 2011; 89:701-12. [PMID: 22137099 DOI: 10.1016/j.ajhg.2011.11.003] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Received: 07/14/2011] [Revised: 09/19/2011] [Accepted: 11/03/2011] [Indexed: 12/11/2022] Open
Abstract
Many sequencing studies are now underway to identify the genetic causes for both Mendelian and complex traits. Via exome-sequencing, genes harboring variants implicated in several Mendelian traits have already been identified. The underlying methodology in these studies is a multistep algorithm based on filtering variants identified in a small number of affected individuals and depends on whether they are novel (not yet seen in public resources such as dbSNP), shared among affected individuals, and other external functional information on the variants. Although intuitive, these filter-based methods are nonoptimal and do not provide any measure of statistical uncertainty. We describe here a formal statistical approach that has several distinct advantages: (1) it provides fast computation of approximate p values for individual genes, (2) it adjusts for the background variation in each gene, (3) it allows for incorporation of functional or linkage-based information, and (4) it accommodates designs based on both affected relative pairs and unrelated affected individuals. We show via simulations that the proposed approach can be used in conjunction with the existing filter-based methods to achieve a substantially better ranking of a gene relevant for disease when compared to currently used filter-based approaches; this is especially so in the presence of disease locus heterogeneity. We revisit recent studies on three Mendelian diseases and show that the proposed approach results in the implicated gene being ranked first in all studies, and approximate p values of 10⁻⁶ for the Miller Syndrome gene, 1.0 × 10⁻⁴ for the Freeman-Sheldon Syndrome gene, and 3.5 × 10⁻⁵ for the Kabuki Syndrome gene.
|
1689
|
Khetarpal SA, Edmondson AC, Raghavan A, Neeli H, Jin W, Badellino KO, Demissie S, Manning AK, DerOhannessian SL, Wolfe ML, Cupples LA, Li M, Kathiresan S, Rader DJ. Mining the LIPG allelic spectrum reveals the contribution of rare and common regulatory variants to HDL cholesterol. PLoS Genet 2011; 7:e1002393. [PMID: 22174694 PMCID: PMC3234219 DOI: 10.1371/journal.pgen.1002393] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Received: 06/21/2011] [Accepted: 10/07/2011] [Indexed: 11/18/2022] Open
Abstract
Genome-wide association studies (GWAS) have successfully identified loci associated with quantitative traits, such as blood lipids. Deep resequencing studies are being utilized to catalogue the allelic spectrum at GWAS loci. The goal of these studies is to identify causative variants and missing heritability, including heritability due to low frequency and rare alleles with large phenotypic impact. Whereas rare variant efforts have primarily focused on nonsynonymous coding variants, we hypothesized that noncoding variants in these loci are also functionally important. Using the HDL-C gene LIPG as an example, we explored the effect of regulatory variants identified through resequencing of subjects at HDL-C extremes on gene expression, protein levels, and phenotype. Resequencing a portion of the LIPG promoter and 5' UTR in human subjects with extreme HDL-C, we identified several rare variants in individuals from both extremes. Luciferase reporter assays were used to measure the effect of these rare variants on LIPG expression. Variants conferring opposing effects on gene expression were enriched in opposite extremes of the phenotypic distribution. Minor alleles of a common regulatory haplotype and noncoding GWAS SNPs were associated with reduced plasma levels of the LIPG gene product endothelial lipase (EL), consistent with its role in HDL-C catabolism. Additionally, we found that a common nonfunctional coding variant associated with HDL-C (rs2000813) is in linkage disequilibrium with a 5' UTR variant (rs34474737) that decreases LIPG promoter activity. We attribute the gene regulatory role of rs34474737 to the observed association of the coding variant with plasma EL levels and HDL-C. Taken together, the findings show that both rare and common noncoding regulatory variants are important contributors to the allelic spectrum in complex trait loci.
Affiliation(s)
- Sumeet A. Khetarpal
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - Andrew C. Edmondson
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - Avanthi Raghavan
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - Hemanth Neeli
- Section of Hospital Medicine, Temple University Hospital, Philadelphia, Pennsylvania, United States of America
| | - Weijun Jin
- Department of Cell Biology, State University of New York Downstate Medical Center, Brooklyn, New York, United States of America
| | - Karen O. Badellino
- University of Pennsylvania School of Nursing, Philadelphia, Pennsylvania, United States of America
| | - Serkalem Demissie
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
- Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, Massachusetts, United States of America
| | - Alisa K. Manning
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
| | - Stephanie L. DerOhannessian
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - Megan L. Wolfe
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - L. Adrienne Cupples
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
- Framingham Heart Study, National Heart, Lung, and Blood Institute, Framingham, Massachusetts, United States of America
| | - Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
| | - Sekar Kathiresan
- Cardiovascular Research Center and Center for Human Genetic Research, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Daniel J. Rader
- Institute for Translational Medicine and Therapeutics, Institute for Diabetes, Obesity, and Metabolism, and Cardiovascular Institute, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, United States of America
|
1690
|
Udpa N, Zhou D, Haddad GG, Bafna V. Tests of selection in pooled case-control data: an empirical study. Front Genet 2011; 2:83. [PMID: 22303377 PMCID: PMC3268381 DOI: 10.3389/fgene.2011.00083] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/21/2011] [Accepted: 11/01/2011] [Indexed: 11/13/2022] Open
Abstract
For smaller organisms with faster breeding cycles, artificial selection can be used to create sub-populations with different phenotypic traits. Genetic tests can be employed to identify the causal markers for the phenotypes, as a precursor to engineering strains with a combination of traits. Traditional approaches involve analyzing crosses of inbred strains to test for co-segregation with genetic markers. Here we take advantage of cheaper next generation sequencing techniques to identify genetic signatures of adaptation to the selection constraints. Obtaining individual sequencing data is often unrealistic due to cost and sample issues, so we focus on pooled genomic data. We explore a series of statistical tests for selection using pooled case (under selection) and control populations. The tests generally capture skews in the scaled frequency spectrum of alleles in a region, which are indicative of a selective sweep. Extensive simulations are used to show that these approaches work well for a wide range of population divergence times and strong selective pressures. Control vs control simulations are used to determine an empirical False Positive Rate, and regions under selection are determined using a 1% FPR level. We show that pooling does not have a significant impact on statistical power. The tests are also robust to reasonable variations in several different parameters, including window size, base-calling error rate, and sequencing coverage. We then demonstrate the viability (and the challenges) of one of these methods in two independent Drosophila populations (Drosophila melanogaster) bred under selection for hypoxia and accelerated development, respectively. Testing for extreme hypoxia tolerance showed clear signals of selection, pointing to loci that are important for hypoxia adaptation. Overall, we outline a strategy for finding regions under selection using pooled sequences, then devise optimal tests for that strategy. The approaches show promise for detecting selection, even several generations after fixation of the beneficial allele has occurred.
Affiliation(s)
- Nitin Udpa
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
|
1691
|
Powers S, Gopalakrishnan S, Tintle N. Assessing the impact of non-differential genotyping errors on rare variant tests of association. Hum Hered 2011; 72:153-60. [PMID: 22004945 DOI: 10.1159/000332222] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/16/2011] [Accepted: 08/24/2011] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND/AIMS We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful. METHODS We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates. RESULTS Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power. CONCLUSION Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.
Affiliation(s)
- Scott Powers
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, USA
|
1692
|
A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 2011; 89:354-67. [PMID: 21885029 DOI: 10.1016/j.ajhg.2011.07.015] [Citation(s) in RCA: 209] [Impact Index Per Article: 16.1] [Received: 04/28/2011] [Revised: 07/21/2011] [Accepted: 07/26/2011] [Indexed: 12/19/2022] Open
Abstract
Biological and empirical evidence suggests that rare variants account for a large proportion of the genetic contributions to complex human diseases. Recent technological advances in high-throughput sequencing platforms have made it possible for researchers to generate comprehensive information on rare variants in large samples. We provide a general framework for association testing with rare variants by combining mutation information across multiple variant sites within a gene and relating the enriched genetic information to disease phenotypes through appropriate regression models. Our framework covers all major study designs (i.e., case-control, cross-sectional, cohort and family studies) and all common phenotypes (e.g., binary, quantitative, and age at onset), and it allows arbitrary covariates (e.g., environmental factors and ancestry variables). We derive theoretically optimal procedures for combining rare mutations and construct suitable test statistics for various biological scenarios. The allele-frequency threshold can be fixed or variable. The effects of the combined rare mutations on the phenotype can be in the same direction or different directions. The proposed methods are statistically more powerful and computationally more efficient than existing ones. An application to a deep-resequencing study of drug targets led to a discovery of rare variants associated with total cholesterol. The relevant software is freely available.
|