401
Greenawalt DM, Sieberts SK, Cornelis MC, Girman CJ, Zhong H, Yang X, Guinney J, Qi L, Hu FB. Integrating genetic association, genetics of gene expression, and single nucleotide polymorphism set analysis to identify susceptibility loci for type 2 diabetes mellitus. Am J Epidemiol 2012; 176:423-30. [PMID: 22865700 PMCID: PMC3499116 DOI: 10.1093/aje/kws123] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 07/06/2011] [Accepted: 01/30/2012] [Indexed: 12/14/2022] Open
Abstract
Large-scale genome-wide association studies (GWAS) have identified over 40 genomic regions significantly associated with type 2 diabetes mellitus. However, GWAS results are not always straightforward to interpret, and linking these loci to meaningful disease etiology is often difficult without extensive follow-up studies. The authors expanded on previously reported type 2 diabetes mellitus GWAS from the nested case-control studies of 2 prospective US cohorts by incorporating expression single nucleotide polymorphism (SNP) information and applying SNP set enrichment analysis to identify sets of SNPs associated with genes that could provide further biologic insight to traditional genome-wide analysis. Using data collected between 1989 and 1994 in these previous studies to form a nested case-control study, the authors found that 3 of the most significantly associated SNPs to type 2 diabetes mellitus in their study are expression SNPs to the lymphocyte antigen 75 gene (LY75), the ubiquitin-specific peptidase 36 gene (USP36), and the phosphatidylinositol transfer protein, cytoplasmic 1 gene (PITPNC1). SNP set enrichment analysis of the GWAS results identified enrichment for expression SNPs to the macrophage-enriched module and the Gene Ontology (GO) biologic process fat cell differentiation human, which includes the transcription factor 7-like 2 gene (TCF7L2), as well as other type 2 diabetes mellitus-associated genes. Integrating genome-wide association, gene expression, and gene set analysis may provide valuable biologic support for potential type 2 diabetes mellitus susceptibility loci and may be useful in identifying new targets or pathways of interest for the treatment and prevention of type 2 diabetes mellitus.
Affiliation(s)
- Danielle M Greenawalt
- Department of Genetics, Merck Research Laboratories, Pasteur, Boston, MA 02115, USA.
402
Maity A, Sullivan PF, Tzeng JY. Multivariate phenotype association analysis by marker-set kernel machine regression. Genet Epidemiol 2012; 36:686-95. [PMID: 22899176 DOI: 10.1002/gepi.21663] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.2] [Received: 04/01/2012] [Revised: 05/23/2012] [Accepted: 06/18/2012] [Indexed: 11/06/2022]
Abstract
Genetic studies of complex diseases often collect multiple phenotypes relevant to the disorders. As these phenotypes can be correlated and share common genetic mechanisms, jointly analyzing these traits may bring more power to detect genes influencing individual or multiple phenotypes. Given the advancement brought by the multivariate phenotype approaches and the multimarker kernel machine regression, we construct a multivariate regression based on kernel machine to facilitate the joint evaluation of multimarker effects on multiple phenotypes. The kernel machine serves as a powerful dimension-reduction tool to capture complex effects among markers. The multivariate framework incorporates the potentially correlated multidimensional phenotypic information and accommodates common or different environmental covariates for each trait. We derive the multivariate kernel machine test based on a score-like statistic, and conduct simulations to evaluate the validity and efficacy of the method. We also study the performance of the commonly adopted strategies for kernel machine analysis on multiple phenotypes, including the multiple univariate kernel machine tests with original phenotypes or with their principal components. Our results suggest that none of these approaches has the uniformly best power, and the optimal test depends on the magnitude of the phenotype correlation and the effect patterns. However, the multivariate test remains a reasonable approach when the multiple phenotypes have no or mild correlations, and gives the best power once the correlation becomes stronger or when there exist genes that affect more than one phenotype. We illustrate the utility of the multivariate kernel machine method through the Clinical Antipsychotic Trials of Intervention Effectiveness antibody study.
Affiliation(s)
- Arnab Maity
- Department of Statistics, North Carolina State University, Raleigh, USA
403
Lin WY, Tiwari HK, Gao G, Zhang K, Arcaroli JJ, Abraham E, Liu N. Similarity-based multimarker association tests for continuous traits. Ann Hum Genet 2012; 76:246-60. [PMID: 22497480 DOI: 10.1111/j.1469-1809.2012.00706.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average R² (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes >0.4). We also applied our two tests to an adiposity data set to show their utility.
Affiliation(s)
- Wan-Yu Lin
- Department of Biostatistics, University of Alabama at Birmingham, USA
404
Meyer NJ, Daye ZJ, Rushefski M, Aplenc R, Lanken PN, Shashaty MGS, Christie JD, Feng R. SNP-set analysis replicates acute lung injury genetic risk factors. BMC Medical Genetics 2012; 13:52. [PMID: 22742663 PMCID: PMC3512475 DOI: 10.1186/1471-2350-13-52] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 01/31/2012] [Accepted: 06/18/2012] [Indexed: 12/19/2022]
Abstract
BACKGROUND We used a gene-based replication strategy to test the reproducibility of prior acute lung injury (ALI) candidate gene associations. METHODS We phenotyped 474 patients from a prospective severe trauma cohort study for ALI. Genomic DNA from subjects' blood was genotyped using the IBC chip, a multiplex single nucleotide polymorphism (SNP) array. Results were filtered for 25 candidate genes selected using prespecified literature search criteria and present on the IBC platform. For each gene, we grouped SNPs according to haplotype blocks and tested the joint effect of all SNPs on susceptibility to ALI using the SNP-set kernel association test. Results were compared to single SNP analysis of the candidate SNPs. Analyses were separate for genetically determined ancestry (African or European). RESULTS We identified 4 genes in African ancestry and 2 in European ancestry trauma subjects which replicated their associations with ALI. Ours is the first replication of IL6, IL10, IRAK3, and VEGFA associations in non-European populations with ALI. Only one gene (VEGFA) demonstrated association with ALI in both ancestries, with distinct haplotype blocks in each ancestry driving the association. We also report the association between trauma-associated ALI and NFKBIA in European ancestry subjects. CONCLUSIONS Prior ALI genetic associations are reproducible and replicate in a trauma cohort. Kernel-based SNP-set analysis is a more powerful method to detect ALI association than single SNP analysis, and thus may be more useful for replication testing. Further, gene-based replication can extend candidate gene associations to diverse ethnicities.
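The abstract above names a SNP-set kernel association test without giving formulas. As a rough illustration only, the Python sketch below computes an unweighted linear-kernel score statistic, Q = r'GG'r with r = y - mean(y), and approximates its p-value by permuting the phenotype. Everything here is an assumption made for illustration: there are no covariates, no variant weights, no haplotype-block grouping, and permutation stands in for the analytic null distribution. The function names are hypothetical, and this is not the authors' implementation.

```python
import random


def linear_kernel_score(y, G):
    """Unweighted linear-kernel score statistic Q = r' G G' r.

    y: list of phenotype values (e.g. 0/1 case status), one per subject.
    G: list of rows, one per subject, each row holding allele counts per SNP.
    """
    n = len(y)
    y_bar = sum(y) / n
    r = [yi - y_bar for yi in y]  # residuals under an intercept-only null
    m = len(G[0])
    # Per-SNP score: sum over subjects of genotype * residual
    scores = [sum(G[i][j] * r[i] for i in range(n)) for j in range(m)]
    return sum(s * s for s in scores)


def permutation_p(y, G, n_perm=999, seed=0):
    """Permutation p-value for Q (an illustrative stand-in for an analytic null)."""
    rng = random.Random(seed)
    q_obs = linear_kernel_score(y, G)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if linear_kernel_score(y_perm, G) >= q_obs:
            hits += 1
    # Add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)
```

Squaring each per-SNP score before summing means that protective and risk alleles within the same set do not cancel each other out.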
Affiliation(s)
- Nuala J Meyer
- Department of Medicine: Pulmonary, Allergy, and Critical Care Division, Perelman School of Medicine University of Pennsylvania, 3600 Spruce Street, 874 Maloney, Philadelphia, PA 19104, USA.
405
Cai T, Lin X, Carroll RJ. Identifying genetic marker sets associated with phenotypes via an efficient adaptive score test. Biostatistics 2012; 13:776-90. [PMID: 22734045 DOI: 10.1093/biostatistics/kxs015] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022] Open
Abstract
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers, assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429-435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079-1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC bioinformatics 9, 292-2; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44, 50-57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. 
In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Affiliation(s)
- Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA.
406
Wang K, Fingert JH. Statistical tests for detecting rare variants using variance-stabilising transformations. Ann Hum Genet 2012; 76:402-9. [PMID: 22724536 DOI: 10.1111/j.1469-1809.2012.00718.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
Next generation sequencing holds great promise for detecting rare variants underlying complex human traits. Due to their extremely low allele frequencies, the normality approximation for a proportion no longer works well. The Fisher's exact method appears to be suitable but it is conservative. We investigate the utility of various variance-stabilising transformations in single marker association analysis on rare variants. Unlike a proportion itself, the variance of the transformed proportions no longer depends on the proportion, making application of such transformations to rare variant association analysis extremely appealing. Simulation studies demonstrate that tests based on such transformations are more powerful than the Fisher's exact test while controlling for type I error rate. Based on theoretical considerations and results from simulation studies, we recommend the test based on the Anscombe transformation over tests with other transformations.
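The abstract above recommends a test based on the Anscombe transformation but gives no formulas. The Python sketch below uses the standard Anscombe arcsine transform for a binomial count, arcsin(sqrt((x + 3/8) / (n + 3/4))), whose variance is approximately 1/(4n + 2) regardless of the underlying proportion. Building a two-sample z-test from it, and treating n as the number of chromosomes sampled in each group, are assumptions made here for illustration; the function names are hypothetical and this is not necessarily the authors' exact test.

```python
import math


def anscombe(x, n):
    """Anscombe arcsine transform of a binomial count x out of n trials."""
    return math.asin(math.sqrt((x + 0.375) / (n + 0.75)))


def anscombe_z(x_case, n_case, x_ctrl, n_ctrl):
    """z-statistic comparing transformed allele proportions in cases and controls.

    The variance of each transformed proportion is approximated by 1 / (4n + 2),
    which does not depend on the allele frequency itself.
    """
    diff = anscombe(x_case, n_case) - anscombe(x_ctrl, n_ctrl)
    var = 1.0 / (4 * n_case + 2) + 1.0 / (4 * n_ctrl + 2)
    return diff / math.sqrt(var)


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 12 rare alleles among 2000 case chromosomes
# versus 2 among 2000 control chromosomes.
z = anscombe_z(12, 2000, 2, 2000)
p = two_sided_p(z)
```

Because the variance term involves only the sample sizes, the test does not need to plug in an estimated allele frequency that may be zero or close to zero.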
Affiliation(s)
- Kai Wang
- Department of Biostatistics, College of Public Health, The University of Iowa, Iowa City, IA 52242, USA.
407
Abstract
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants explain only very small percentages of the heritabilities. To account for locus and allelic heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single marker analysis, the advantages afforded by the U-statistics-based methods are large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and rare variant association studies are discussed.
Affiliation(s)
- Hongzhe Li
- Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
408
Zhang F, Chen Y, Liu C, Lu T, Yan H, Ruan Y, Yue W, Wang L, Zhang D. Systematic association analysis of microRNA machinery genes with schizophrenia informs further study. Neurosci Lett 2012; 520:47-50. [PMID: 22595464 DOI: 10.1016/j.neulet.2012.05.028] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 04/07/2012] [Revised: 04/23/2012] [Accepted: 05/05/2012] [Indexed: 10/28/2022]
Abstract
microRNAs (miRNAs) play a vital role in development via the post-transcriptional regulation of most genes. Variation in the miRNA machinery pathway proteins which mediate the biogenesis, maturation, transportation, and functioning of miRNAs might be relevant to human traits. In this work, we explored the role of 59 miRNA machinery genes in schizophrenia (SZ). Association analysis of 967 single nucleotide polymorphisms within these genes detected that an intronic polymorphism of EIF4ENIF1, rs7289941, was significantly associated with SZ (P=4.10E-5). We failed to replicate this result in a validation sample comprising 1027 healthy controls and 1012 SZ cases, and the combined data yielded nominal significance (P=0.013). We conducted a gene-based association analysis using VEGAS and SKAT, and found seven associated genes in total, including EIF4ENIF1, PIWIL2, and DGCR8, but none survived correction for multiple testing. Taken together, our data do not provide strong support for the association of common variants within miRNA machinery genes with SZ in the Han Chinese population, but implicate several promising candidate genes for further research.
Affiliation(s)
- Fuquan Zhang
- Institute of Mental Health, Peking University, PR China.
409
Statistical Challenges in Sequence-Based Association Studies with Population- and Family-Based Designs. Statistics in Biosciences 2012. [DOI: 10.1007/s12561-012-9062-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/26/2022]
410
Kent JW. Rare variants, common markers: synthetic association and beyond. Genet Epidemiol 2012; 35 Suppl 1:S80-4. [PMID: 22128064 DOI: 10.1002/gepi.20655] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 01/25/2023]
Abstract
The phenomenon of synthetic association raises the possibility that common variant genetic markers may be coupled with functional rare variants sufficiently often to allow the rare variants to be tagged by the common ones. Using human exome sequence data from the 1000 Genomes Project, two investigative teams in Group 12 of Genetic Analysis Workshop 17 found that stochastic coupling between rare and common variants does occur, although perhaps not sufficiently often that we can expect common variant signals to reflect synthetic association; other teams considered methods for detecting association using both rare and common variants. Common themes were that synthetic association is more apparent in population strata (ancestral or familial) and that careful selection of the unit of analysis (gene, gene network, or other genomic subset) is likely to be crucial to the discovery of rare variants that contribute to risk of disease.
Affiliation(s)
- Jack W Kent
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78245-0549, USA.
411
Sun YV, Sung YJ, Tintle N, Ziegler A. Identification of genetic association of multiple rare variants using collapsing methods. Genet Epidemiol 2012; 35 Suppl 1:S101-6. [PMID: 22128049 DOI: 10.1002/gepi.20658] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
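To make the idea of collapsing concrete, the Python sketch below builds a simple binary collapsing indicator: each subject is flagged as a carrier if they hold at least one rare allele anywhere in the gene, and carriers are then tabulated by case status. It is an illustration under stated assumptions, not any GAW17 contributor's method: genotypes are assumed to be coded as minor-allele counts (0, 1, 2), the frequency threshold is an arbitrary parameter, and the function names are hypothetical.

```python
def collapse_rare(genotypes, maf_threshold=0.01):
    """Collapse rare variants in one gene into a per-subject carrier indicator.

    genotypes: list of rows, one per subject, each row holding minor-allele
               counts (0, 1 or 2) for every variant in the gene.
    Returns a list of 0/1 flags: 1 if the subject carries any rare variant.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency of each variant (2n chromosomes in total)
    freq = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    # Keep variants that are observed at least once but are at or below the threshold
    rare = [j for j in range(m) if 0 < freq[j] <= maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def carrier_table(carrier, status):
    """2x2 counts: ((case carriers, case non-carriers),
                    (control carriers, control non-carriers))."""
    case_c = sum(1 for c, s in zip(carrier, status) if s == 1 and c == 1)
    case_n = sum(1 for c, s in zip(carrier, status) if s == 1 and c == 0)
    ctrl_c = sum(1 for c, s in zip(carrier, status) if s == 0 and c == 1)
    ctrl_n = sum(1 for c, s in zip(carrier, status) if s == 0 and c == 0)
    return (case_c, case_n), (ctrl_c, ctrl_n)
```

The resulting 2x2 table can be passed to any standard test of association; the gain comes from testing one aggregated carrier status per gene instead of many individually sparse variants.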
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
412
Namkung J, Raska P, Kang J, Liu Y, Lu Q, Zhu X. Analysis of exome sequences with and without incorporating prior biological knowledge. Genet Epidemiol 2012; 35 Suppl 1:S48-55. [PMID: 22128058 DOI: 10.1002/gepi.20649] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
Next-generation sequencing technology provides new opportunities and challenges in the search for genetic variants that underlie complex traits. It will also presumably uncover many new rare variants, but exactly how these variants should be incorporated into the data analysis remains a question. Several papers in our group from Genetic Analysis Workshop 17 evaluated different methods of rare variant analysis, including single-variant, gene-based, and pathway-based analyses and analyses that incorporated biological information. Although the performance of some of these methods strongly depends on the underlying disease model, integration of known biological information is helpful in detecting causal genes. Two work groups demonstrated that use of a Bayesian network and a collapsing receiver operating characteristic curve approach improves risk prediction when a disease is caused by many rare variants. Another work group suggested that modeling local rather than global ancestry may be beneficial when controlling the effect of population structure in rare variant association analysis.
Affiliation(s)
- Junghyun Namkung
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA
413
Abstract
Advances in sequencing technology allow assessing the impact of rare variation on common disorders. For this purpose, methods combine rare variants across a gene and compare an aggregate statistic between cases and controls. However, sequencing many individuals is costly. Hence, it is necessary to identify case samples that are most likely to result in powerful tests under realistic model assumptions. Power can be increased by selecting cases that are highly likely to carry risk variants. As rare variants that contribute to the heritability of a disease co-segregate among affected family members, selecting cases that have affected family members may increase the power of rare variant tests considerably. Here I compare sequencing random cases to cases ascertained to have affected family members. I quantify the power of the different approaches and provide criteria for sample selection under different models of inheritance. Under a model of multiplicative gene-gene interaction, a sample of random cases has to be 2-16-fold larger to achieve the same power as a sample of cases ascertained to have affected family members. However, in traits with high heritability this power gain can be reduced or even reversed under models of additive gene-gene interaction. Hence study designs should depend on the studied disease's heritability and on the available sample size. I also show that selecting cases that share both chromosomes identical by descent with an affected sibling at candidate regions can result in a further power gain.
414
Shui IM, Mucci LA, Kraft P, Tamimi RM, Lindstrom S, Penney KL, Nimptsch K, Hollis BW, Dupre N, Platz EA, Stampfer MJ, Giovannucci E. Vitamin D-related genetic variation, plasma vitamin D, and risk of lethal prostate cancer: a prospective nested case-control study. J Natl Cancer Inst 2012; 104:690-9. [PMID: 22499501 DOI: 10.1093/jnci/djs189] [Citation(s) in RCA: 137] [Impact Index Per Article: 10.5] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND The association of vitamin D status with prostate cancer is controversial; no association has been observed for overall incidence, but there is a potential link with lethal disease. METHODS We assessed prediagnostic 25-hydroxyvitamin D [25(OH)D] levels in plasma, variation in vitamin D-related genes, and risk of lethal prostate cancer using a prospective case-control study nested within the Health Professionals Follow-up Study. We included 1260 men who were diagnosed with prostate cancer after providing a blood sample in 1993-1995 and 1331 control subjects. Men with prostate cancer were followed through March 2011 for lethal outcomes (n = 114). We selected 97 single-nucleotide polymorphisms (SNPs) in genomic regions with high linkage disequilibrium (tagSNPs) to represent common genetic variation among seven vitamin D-related genes (CYP27A1, CYP2R1, CYP27B1, GC, CYP24A1, RXRA, and VDR). We used a logistic kernel machine test to assess whether multimarker SNP sets in seven vitamin D pathway-related genes were collectively associated with prostate cancer. Tests for statistical significance were two-sided. RESULTS Higher 25(OH)D levels were associated with a 57% reduction in the risk of lethal prostate cancer (highest vs lowest quartile: odds ratio = 0.43, 95% confidence interval = 0.24 to 0.76). This finding did not vary by time from blood collection to diagnosis. We found no statistically significant association of plasma 25(OH)D levels with overall prostate cancer. Pathway analyses found that the set of SNPs that included all seven genes (P = .008) as well as sets of SNPs that included VDR (P = .01) and CYP27A1 (P = .02) were associated with risk of lethal prostate cancer. CONCLUSION In this prospective study, plasma 25(OH)D levels and common variation among several vitamin D-related genes were associated with lethal prostate cancer risk, suggesting that vitamin D is relevant for lethal prostate cancer.
Affiliation(s)
- Irene M Shui
- Department of Epidemiology, Harvard School of Public Health, Boston, MA 02215, USA.
415
Joint rare variant association test of the average and individual effects for sequencing studies. PLoS One 2012; 7:e32485. [PMID: 22468164 PMCID: PMC3309869 DOI: 10.1371/journal.pone.0032485] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 11/08/2011] [Accepted: 01/30/2012] [Indexed: 11/19/2022] Open
Abstract
For many complex traits, single nucleotide polymorphisms (SNPs) identified from genome-wide association studies (GWAS) only explain a small percentage of heritability. Next generation sequencing technology makes it possible to explore unexplained heritability by identifying rare variants (RVs). Existing tests designed for RVs look for optimal strategies to combine information across multiple variants. Many of the tests have good power when the true underlying associations are either in the same direction or in opposite directions. We propose three tests for examining the association between a phenotype and RVs, where two of them jointly consider the common association across RVs and the individual deviations from the common effect. On one hand, similar to some of the best existing methods, the individual deviations are modeled as random effects to borrow information across multiple RVs. On the other hand, unlike the existing methods which pool individual effects towards zero, we pool them towards a possibly non-zero common effect by adding a pooled variant into the model. The common effect and the individual effects are jointly tested. We show through extensive simulations that at least one of the three tests proposed here is the most powerful or very close to being the most powerful in various settings of true models. This is appealing in practice because the direction and size of the true effects of the associated RVs are unknown. Researchers can apply the developed tests to improve power under a wide range of true models.
416
Kazma R, Babron MC, Gaborieau V, Génin E, Brennan P, Hung RJ, McLaughlin JR, Krokan HE, Elvestad MB, Skorpen F, Anderssen E, Vooder T, Välk K, Metspalu A, Field JK, Lathrop M, Sarasin A, Benhamou S. Lung cancer and DNA repair genes: multilevel association analysis from the International Lung Cancer Consortium. Carcinogenesis 2012; 33:1059-64. [PMID: 22382497 DOI: 10.1093/carcin/bgs116] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 12/30/2022] Open
Abstract
Lung cancer (LC) is the leading cause of cancer-related death worldwide and tobacco smoking is the major associated risk factor. DNA repair is an important process for maintaining genome integrity, and polymorphisms in DNA repair genes may contribute to susceptibility to LC. To explore the role of DNA repair genes in LC, we conducted a multilevel association study with 1655 single nucleotide polymorphisms (SNPs) in 211 DNA repair genes using 6911 individuals pooled from four genome-wide case-control studies. Single SNP association corroborates previous reports of association with rs3131379, located on the gene MSH5 (P = 3.57 × 10⁻⁵) and returns a similar risk estimate. The effect of this SNP is modulated by histological subtype. On the log-additive scale, the odds ratio per allele is 1.04 (0.84-1.30) for adenocarcinomas, 1.52 (1.28-1.80) for squamous cell carcinomas and 1.31 (1.09-1.57) for other histologies (heterogeneity test: P = 9.1 × 10⁻³). Gene-based association analysis identifies three repair genes associated with LC (P < 0.01): UBE2N, structural maintenance of chromosomes 1L2 and POLB. Two additional genes (RAD52 and POLN) are borderline significant. Pathway-based association analysis identifies five repair pathways associated with LC (P < 0.01): chromatin structure, DNA polymerases, homologous recombination, genes involved in human diseases with sensitivity to DNA-damaging agents and Rad6 pathway and ubiquitination. This first international pooled analysis of a large dataset unravels the role of specific DNA repair pathways in LC and highlights the importance of accounting for gene and pathway effects when studying LC.
417
Xue F, Li S, Luan J, Yuan Z, Luben RN, Khaw KT, Wareham NJ, Loos RJF, Zhao JH. A latent variable partial least squares path modeling approach to regional association and polygenic effect with applications to a human obesity study. PLoS One 2012; 7:e31927. [PMID: 22384102 PMCID: PMC3288051 DOI: 10.1371/journal.pone.0031927] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 08/23/2011] [Accepted: 01/18/2012] [Indexed: 01/10/2023] Open
Abstract
Genetic association studies are now routinely used to identify single nucleotide polymorphisms (SNPs) linked with human diseases or traits through single SNP-single trait tests. Here we introduced partial least squares path modeling (PLSPM) for association between single or multiple SNPs and a latent trait that can involve single or multiple correlated measurement(s). Furthermore, the framework naturally provides estimators of polygenic effect by appropriately weighting trait-attributing alleles. We conducted computer simulations to assess the performance via multiple SNPs and human obesity-related traits as measured by body mass index (BMI), waist and hip circumferences. Our results showed that the association statistics had type I error rates close to the nominal level and were powerful for a range of effect and sample sizes. When applied to 12 candidate regions in data (N = 2,417) from the European Prospective Investigation of Cancer (EPIC)-Norfolk study, a region in FTO was found to have stronger association (rs7204609∼rs9939881 at the first intron, P = 4.29 × 10⁻⁷) than single SNP analysis (all with P > 10⁻⁴) and a latent quantitative phenotype was obtained using a subset sample of EPIC-Norfolk (N = 12,559). We believe our method is appropriate for assessment of regional association and polygenic effect on a single or multiple traits.
Affiliation(s)
- Fuzhong Xue
- Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan, China
- MRC Epidemiology Unit and Institute of Metabolic Science, Cambridge, United Kingdom
- Shengxu Li
- Department of Epidemiology, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, United States of America
- Jian'an Luan
- MRC Epidemiology Unit and Institute of Metabolic Science, Cambridge, United Kingdom
- Zhongshang Yuan
- Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan, China
- Robert N. Luben
- Strangeways Research Laboratory, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- Kay-Tee Khaw
- Clinical Gerontology Unit, School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom
- Nicholas J. Wareham
- MRC Epidemiology Unit and Institute of Metabolic Science, Cambridge, United Kingdom
- Ruth J. F. Loos
- MRC Epidemiology Unit and Institute of Metabolic Science, Cambridge, United Kingdom
- Jing Hua Zhao
- MRC Epidemiology Unit and Institute of Metabolic Science, Cambridge, United Kingdom
|
418
|
Dai Y, Jiang R, Dong J. Weighted selective collapsing strategy for detecting rare and common variants in genetic association study. BMC Genet 2012; 13:7. [PMID: 22309429 PMCID: PMC3296579 DOI: 10.1186/1471-2156-13-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/31/2011] [Accepted: 02/06/2012] [Indexed: 01/12/2023] Open
Abstract
Background Genome-wide association studies (GWAS) have been used successfully in detecting associations between common genetic variants and complex diseases. However, common SNPs detected by current GWAS only explain a small proportion of heritable variability. With the development of next-generation sequencing technologies, researchers find more and more evidence to support the role played by rare variants in heritable variability. However, rare and common variants are often studied separately. The objective of this paper is to develop a robust strategy to analyze association between complex traits and genetic regions using both common and rare variants. Results We propose a weighted selective collapsing strategy for both candidate gene studies and genome-wide association scans. The strategy considers genetic information from both common and rare variants, selectively collapses all variants in a given region by a forward selection procedure, and uses an adaptive weight to favor more likely causal rare variants. Under this strategy, two tests are proposed. One test denoted by BwSC is sensitive to the directions of genetic effects, and it separates the deleterious and protective effects into two components. Another denoted by BwSCd is robust to the directions of genetic effects, and it considers the difference of the two components. In our simulation studies, BwSC achieves a higher power when the causal variants have the same genetic effect, while BwSCd is as powerful as several existing tests when a mixed genetic effect exists. Both of the proposed tests work well with and without the existence of genetic effects from common variants. Conclusions Two tests using a weighted selective collapsing strategy provide potentially powerful methods for association studies of sequencing data. The tests have a higher power when both common and rare variants contribute to the heritable variability and the effect of common variants is not strong enough to be detected by traditional methods. Our simulation studies have demonstrated a substantially higher power for both tests in all scenarios regardless of whether the common SNPs are associated with the trait or not.
Affiliation(s)
- Yilin Dai
- Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA.
|
419
|
Wang H, Nie F, Huang H, Kim S, Nho K, Risacher SL, Saykin AJ, Shen L. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort. Bioinformatics 2012; 28:229-37. [PMID: 22155867 PMCID: PMC3259438 DOI: 10.1093/bioinformatics/btr649] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Received: 09/12/2011] [Revised: 11/01/2011] [Accepted: 11/17/2011] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Recent advances in high-throughput genotyping and brain imaging techniques enable new approaches to study the influence of genetic variation on brain structures and functions. Traditional association studies typically employ independent and pairwise univariate analysis, which treats single nucleotide polymorphisms (SNPs) and quantitative traits (QTs) as isolated units and ignores important underlying interacting relationships between the units. New methods are proposed here to overcome this limitation. RESULTS Taking into account the interlinked structure within and between SNPs and imaging QTs, we propose a novel Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) method to identify quantitative trait loci for multiple disease-relevant QTs and apply it to a study in mild cognitive impairment and Alzheimer's disease. Built upon regression analysis, our model uses a new form of regularization, group ℓ(2,1)-norm (G(2,1)-norm), to incorporate the biological group structures among SNPs induced from their genetic arrangement. The new G(2,1)-norm considers the regression coefficients of all the SNPs in each group with respect to all the QTs together and enforces sparsity at the group level. In addition, an ℓ(2,1)-norm regularization is utilized to couple feature selection across multiple tasks to make use of the shared underlying mechanism among different brain regions. The effectiveness of the proposed method is demonstrated by both clearly improved prediction performance in empirical evaluations and a compact set of selected SNP predictors relevant to the imaging QTs. AVAILABILITY Software is publicly available at: http://ranger.uta.edu/%7eheng/imaging-genetics/.
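As a rough illustration of the two penalties named in this abstract, the sketch below computes an ℓ(2,1)-norm (usually written ℓ2,1: the sum of row-wise Euclidean norms of a coefficient matrix) and a group variant that pools all rows of a SNP group before taking the norm. The function names, the example matrix and the grouping are hypothetical and are not taken from the G-SMuRFS software; the group form follows the abstract's description (all SNPs in a group, across all QTs, penalized together) and should be read as an assumption about the exact definition.

```python
import math


def l21_norm(W):
    """Sum over rows (SNPs) of the Euclidean norm across columns (QTs)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def group_l21_norm(W, groups):
    """Sum over SNP groups of the Euclidean norm of every coefficient in the group.

    `groups` is a list of lists of row indices (a hypothetical encoding of
    which SNPs belong together, e.g. by gene or LD block).
    """
    total = 0.0
    for rows in groups:
        total += math.sqrt(sum(w * w for i in rows for w in W[i]))
    return total


# Hypothetical 3-SNP x 2-QT coefficient matrix.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]
row_penalty = l21_norm(W)
group_penalty = group_l21_norm(W, [[0, 1], [2]])
```

Because each row or group contributes an unsquared norm, a penalty of this kind tends to drive whole rows or whole groups to zero, which is the group-level sparsity the abstract refers to.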
Affiliation(s)
- Hua Wang
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
|
420
|
Pongpanich M, Neely ML, Tzeng JY. On the Aggregation of Multimarker Information for Marker-Set and Sequencing Data Analysis: Genotype Collapsing vs. Similarity Collapsing. Front Genet 2012; 2:110. [PMID: 22303404 PMCID: PMC3266618 DOI: 10.3389/fgene.2011.00110] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 09/15/2011] [Accepted: 12/25/2011] [Indexed: 12/12/2022] Open
Abstract
Methods that collapse information across genetic markers when searching for association signals are gaining momentum in the literature. Although originally developed to achieve a better balance between retaining information and controlling degrees of freedom when performing multimarker association analysis, these methods have recently been proven to be a powerful tool for identifying rare variants that contribute to complex phenotypes. The information among markers can be collapsed at the genotype level, which focuses on the mean of genetic information, or the similarity level, which focuses on the variance of genetic information. The aim of this work is to understand the strengths and weaknesses of these two collapsing strategies. Our results show that neither collapsing strategy outperforms the other across all simulated scenarios. Two factors that dominate the performance of these strategies are the signal-to-noise ratio and the underlying genetic architecture of the causal variants. Genotype collapsing is more sensitive to the marker set being contaminated by noise loci than similarity collapsing. In addition, genotype collapsing performs best when the genetic architecture of the causal variants is not complex (e.g., causal loci with similar effects and similar frequencies). Similarity collapsing is more robust as the complexity of the genetic architecture increases and outperforms genotype collapsing when the genetic architecture of the marker set becomes more sophisticated (e.g., causal loci with various effect sizes or frequencies and potential non-linear or interactive effects). Because the underlying genetic architecture is not known a priori, we also considered a two-stage analysis that combines the two top-performing methods from different collapsing strategies. We find that it is reasonably robust across all simulated scenarios.
Affiliation(s)
- Monnat Pongpanich
- Bioinformatics Research Center, North Carolina State University Raleigh, NC, USA
|
421
|
Liu J, Peissig P, Zhang C, Burnside E, McCarty C, Page D. Graphical-model Based Multiple Testing under Dependence, with Applications to Genome-wide Association Studies. UNCERTAINTY IN ARTIFICIAL INTELLIGENCE : PROCEEDINGS OF THE ... CONFERENCE. CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE 2012; 2012:511-522. [PMID: 25285046 PMCID: PMC4184466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests remains a challenging and important problem in statistics. With recent advances in graphical models, it is feasible to use them to perform multiple testing under dependence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled mixture model. The ground truth of hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters in our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that the numerical performance of multiple testing can be improved substantially by using our procedure. We apply the procedure to a real-world genome-wide association study on breast cancer, and we identify several SNPs with strong association evidence.
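The abstract does not spell out how the false discovery rate is controlled once the posterior null probabilities are available. One widely used rule for such quantities, shown here only as a hedged sketch, sorts the hypotheses by posterior null probability and rejects the largest leading set whose average stays at or below the target level. The function name and the example values are hypothetical, and this is not claimed to be the authors' exact procedure.

```python
def lis_reject(posterior_null, alpha=0.05):
    """Return indices of rejected hypotheses.

    posterior_null[i] is an estimate of P(hypothesis i is null | data).
    Hypotheses are ranked from most to least likely non-null, and the
    largest k is kept such that the mean of the k smallest posterior
    null probabilities does not exceed alpha.
    """
    order = sorted(range(len(posterior_null)), key=lambda i: posterior_null[i])
    running_sum = 0.0
    k = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += posterior_null[idx]
        if running_sum / rank <= alpha:
            k = rank
    return sorted(order[:k])


# Hypothetical posterior null probabilities for five SNPs.
rejected = lis_reject([0.01, 0.9, 0.02, 0.5, 0.03], alpha=0.05)
```

The running average of the rejected posterior null probabilities is an estimate of the proportion of false discoveries among the rejections, which is why thresholding it at `alpha` targets the false discovery rate.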
Affiliation(s)
- Jie Liu
- Computer Sciences, UW-Madison
|
422
|
Liu J, Peissig P, Zhang C, Burnside E, McCarty C, Page D. High-Dimensional Structured Feature Screening Using Binary Markov Random Fields. JMLR WORKSHOP AND CONFERENCE PROCEEDINGS 2012; 22:712-721. [PMID: 23606924 PMCID: PMC3630518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Feature screening is a useful feature selection approach for high-dimensional data when the goal is to identify all the features relevant to the response variable. However, common feature screening methods do not take into account the correlation structure of the covariate space. We propose the concept of a feature relevance network, a binary Markov random field to represent the relevance of each individual feature by potentials on the nodes, and represent the correlation structure by potentials on the edges. By performing inference on the feature relevance network, we can accordingly select relevant features. Our algorithm does not yield sparsity, which is different from the particular popular family of feature selection approaches based on penalized least squares or penalized pseudo-likelihood. We give one concrete algorithm under this framework and show its superior performance over common feature selection methods in terms of prediction error and recovery of the truly relevant features on real-world data and synthetic data.
Affiliation(s)
- Jie Liu
- Department of Computer Sciences Univ. of Wisconsin-Madison
|
423
|
Design and Statistical Analysis of Pooled Next Generation Sequencing for Rare Variants. JOURNAL OF PROBABILITY AND STATISTICS 2012. [DOI: 10.1155/2012/524724] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022] Open
Abstract
Next generation sequencing (NGS) is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experiment runs, which, if not well considered in the experimental design and analysis, could lead to either inflated false positive rates or loss in statistical power. Another important issue is how to test association of a group of rare variants. To address the first issue, we proposed a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on the same NGS functional units. To address the second issue, we proposed a testing procedure that does not require individual genotypes but instead takes advantage of multiple DNA pools. Through a simulation study, we demonstrated that our approach provides good control of the type I error rate and yields satisfactory power compared to the test based on individual genotypes. Our results also provide guidelines for designing an efficient pooled.
|
424
|
Gao G, Kang G, Wang J, Chen W, Qin H, Jiang B, Li Q, Sun C, Liu N, Archer KJ, Allison DB. A generalized sequential Bonferroni procedure using smoothed weights for genome-wide association studies incorporating information on Hardy-Weinberg disequilibrium among cases. Hum Hered 2011; 73:1-13. [PMID: 22212195 DOI: 10.1159/000332916] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/15/2011] [Accepted: 09/07/2011] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND/OBJECTIVES For genome-wide association studies (GWAS) with case-control designs, one of the most widely used association tests is the Cochran-Armitage (CA) trend test assuming an additive mode of inheritance. The CA trend test often has higher power than other association tests under additive and multiplicative disease models. However, it can have very low power under a recessive disease model in GWAS. Although tests (such as MAX3) robust to different genetic models have been developed, they often have relatively lower power than the CA trend test under additive and multiplicative models. The goal of this study is to propose an efficient method that not only has higher power than the CA trend test under dominant and recessive models but also maintains the power of the CA trend test under additive and multiplicative models. METHODS We employed the generalized sequential Bonferroni (GSB) procedure of Holm to incorporate information from a Hardy-Weinberg disequilibrium (HWD) test into the CA trend test based on estimating weights from the p values of the HWD test. We proposed to smooth the weights to reduce possible noise. RESULTS AND CONCLUSIONS Results from extensive simulation studies showed that the proposed GSB procedure can achieve the goal described above.
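For readers unfamiliar with the baseline test discussed in this abstract, the following is a minimal sketch of the Cochran-Armitage trend test for a 2 × 3 genotype table under additive scores (0, 1, 2). It covers only the CA trend test, not the weighted GSB procedure the authors propose; the function name and the example counts are hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0.0, 1.0, 2.0)):
    """Return (z, two-sided p) for the Cochran-Armitage trend test.

    cases[i] and controls[i] are genotype counts for i copies of the
    risk allele; weights are the genotype scores (additive by default).
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")
    n_cases = sum(cases)
    n_controls = sum(controls)
    n_total = n_cases + n_controls
    col_totals = [r + s for r, s in zip(cases, controls)]
    sum_w_cases = sum(w * r for w, r in zip(weights, cases))
    sum_w_cols = sum(w * n for w, n in zip(weights, col_totals))
    sum_w2_cols = sum(w * w * n for w, n in zip(weights, col_totals))
    # Trend statistic and its variance under the null of no association.
    t = n_total * sum_w_cases - n_cases * sum_w_cols
    var = n_cases * n_controls * (n_total * sum_w2_cols - sum_w_cols ** 2) / n_total
    if var <= 0:
        raise ValueError("variance is not positive; table is degenerate")
    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p


# Hypothetical counts: genotypes with 0, 1, 2 risk alleles.
z, p = cochran_armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
```

Changing `weights` to (0, 1, 1) or (0, 0, 1) gives dominant or recessive scoring, which is one way to see why a test fixed at additive scores can lose power under a recessive model.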
Affiliation(s)
- Guimin Gao
- Department of Biostatistics, Virginia Commonwealth University, Richmond, Va. 23298-0032, USA.
|
425
|
Feng R, Wu Y, Jang GH, Ordovas JM, Arnett D. A powerful test of parent-of-origin effects for quantitative traits using haplotypes. PLoS One 2011; 6:e28909. [PMID: 22174922 PMCID: PMC3236760 DOI: 10.1371/journal.pone.0028909] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Received: 05/12/2011] [Accepted: 11/17/2011] [Indexed: 01/08/2023] Open
Abstract
Imprinting is an epigenetic phenomenon where the same alleles have unequal transcriptions and thus contribute differently to a trait depending on their parent of origin. This mechanism has been found to affect a variety of human disorders. Although various methods for testing parent-of-origin effects have been proposed in linkage analysis settings, only a few are available for association analysis and they are usually restricted to small families and particular study designs. In this study, we develop a powerful maximum likelihood test to evaluate the parent-of-origin effects of SNPs on quantitative phenotypes in general family studies. Our method incorporates haplotype distribution to take advantage of inter-marker LD information in genome-wide association studies (GWAS). Our method also accommodates missing genotypes that often occur in genetic studies. Our simulation studies with various minor allele frequencies, LD structures, family sizes, and missing schemes have uniformly shown that using the new method significantly improves the power of detecting imprinted genes compared with the method using the SNP at the testing locus only. Our simulations suggest that the most efficient strategy to investigate parent-of-origin effects is to recruit one parent and as many offspring as possible under practical constraints. As a demonstration, we applied our method to a dataset from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) to test the parent-of-origin effects of the SNPs within the PPARGC1A, MTP and FABP2 genes on diabetes-related phenotypes, and found that several SNPs in the MTP gene show parent-of-origin effects on insulin and glucose levels.
Affiliation(s)
- Rui Feng
- Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
|
426
|
Schaid DJ, Sinnwell JP, Jenkins GD, McDonnell SK, Ingle JN, Kubo M, Goss PE, Costantino JP, Wickerham DL, Weinshilboum RM. Using the gene ontology to scan multilevel gene sets for associations in genome wide association studies. Genet Epidemiol 2011; 36:3-16. [PMID: 22161999 DOI: 10.1002/gepi.20632] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 05/17/2011] [Revised: 07/22/2011] [Accepted: 08/02/2011] [Indexed: 11/07/2022]
Abstract
Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc "fixes." To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but assures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted P-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses.
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota 55905, USA.
|
427
|
Wang X, Liu X, Sim X, Xu H, Khor CC, Ong RTH, Tay WT, Suo C, Poh WT, Ng DPK, Liu J, Aung T, Chia KS, Wong TY, Tai ES, Teo YY. A statistical method for region-based meta-analysis of genome-wide association studies in genetically diverse populations. Eur J Hum Genet 2011; 20:469-75. [PMID: 22126751 PMCID: PMC3306862 DOI: 10.1038/ejhg.2011.219] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 12/27/2022] Open
Abstract
Genome-wide association studies (GWAS) have become the preferred experimental design in exploring the genetic etiology of complex human traits and diseases. Standard SNP-based meta-analytic approaches have been utilized to integrate the results from multiple experiments. This fundamentally assumes that the patterns of linkage disequilibrium (LD) between the underlying causal variants and the directly genotyped SNPs are similar across the populations for the same SNPs to emerge with surrogate evidence of disease association. We introduce a novel strategy for assessing regional evidence of phenotypic association that explicitly incorporates the extent of LD in the region. This provides a natural framework for combining evidence from multi-ethnic studies of both dichotomous and quantitative traits that (i) accommodates different patterns of LD, (ii) integrates different genotyping platforms and (iii) allows for the presence of allelic heterogeneity between the populations. Our method can also be generalized to perform gene-based or pathway-based analyses. Applying this method to real GWAS data in type 2 diabetes (T2D) boosted the association evidence in regions well-established for T2D etiology in three diverse South-East Asian populations, and identified two novel gene regions and a biologically convincing pathway that were subsequently validated with data from the Wellcome Trust Case Control Consortium.
Affiliation(s)
- Xu Wang
- Department of Epidemiology and Public Health, National University of Singapore, Singapore
|
428
|
Petersen A, Sitarik A, Luedtke A, Powers S, Bekmetjev A, Tintle NL. Evaluating methods for combining rare variant data in pathway-based tests of genetic association. BMC Proc 2011; 5 Suppl 9:S48. [PMID: 22373429 PMCID: PMC3287885 DOI: 10.1186/1753-6561-5-s9-s48] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022] Open
Abstract
Analyzing sets of genes in genome-wide association studies is a relatively new approach that aims to capitalize on biological knowledge about the interactions of genes in biological pathways. This approach, called pathway analysis or gene set analysis, has not yet been applied to the analysis of rare variants. Applying pathway analysis to rare variants offers two competing approaches. In the first approach rare variant statistics are used to generate p-values for each gene (e.g., combined multivariate collapsing [CMC] or weighted-sum [WS]) and the gene-level p-values are combined using standard pathway analysis methods (e.g., gene set enrichment analysis or Fisher’s combined probability method). In the second approach, rare variant methods (e.g., CMC and WS) are applied directly to sets of single-nucleotide polymorphisms (SNPs) representing all SNPs within genes in a pathway. In this paper we use simulated phenotype and real next-generation sequencing data from Genetic Analysis Workshop 17 to analyze sets of rare variants using these two competing approaches. The initial results suggest substantial differences in the methods, with Fisher’s combined probability method and the direct application of the WS method yielding the best power. Evidence suggests that the WS method works well in most situations, although Fisher’s method was more likely to be optimal when the number of causal SNPs in the set was low but the risk of the causal SNPs was high.
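The first approach described in this abstract combines gene-level p-values into one pathway-level p-value. A minimal sketch of Fisher's combined probability method is given below, under the assumption that the gene-level p-values are independent (which LD between neighbouring genes can violate). The function name and the example values are hypothetical.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of
    freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return math.exp(-half_stat) * series


# Hypothetical gene-level p-values for one pathway.
pathway_p = fisher_combined_pvalue([0.05, 0.05])
```

Because the statistic sums log p-values, a single very small gene-level p-value can dominate it, which is consistent with the abstract's remark that Fisher's method did best when few causal SNPs carried high risk.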
Affiliation(s)
- Ashley Petersen
- Departments of Mathematics, Computer Science, and Statistics, St. Olaf College, 1520 St. Olaf Avenue, Northfield, MN 55057, USA
- Alexandra Sitarik
- Department of Mathematics, Wittenberg University, 200 West Ward Street, Springfield, OH 45501, USA
- Alexander Luedtke
- Division of Applied Mathematics, Brown University, 151 Thayer Street, Providence, RI 02912, USA
- Scott Powers
- Department of Statistics and Operations Research, University of North Carolina, 318 Hanes Hall, CB 3260, Chapel Hill, NC 27599-3260, USA
- Airat Bekmetjev
- Department of Mathematics, Statistics and Computer Science, Dordt College, 498 4th Ave. NE, Sioux Center, IA 51250, USA
- Nathan L Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College, 498 4th Ave. NE, Sioux Center, IA 51250, USA
|
429
|
Kang J, Zheng W, Li L, Lee JS, Yan X, Zhao H. Use of Bayesian networks to dissect the complexity of genetic disease: application to the Genetic Analysis Workshop 17 simulated data. BMC Proc 2011; 5 Suppl 9:S37. [PMID: 22373110 PMCID: PMC3287873 DOI: 10.1186/1753-6561-5-s9-s37] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/27/2022] Open
Abstract
Complex diseases are often the downstream event of a number of risk factors, including both environmental and genetic variables. To better understand the mechanism of disease onset, it is of great interest to systematically investigate the crosstalk among various risk factors. Bayesian networks provide an intuitive graphical interface that captures not only the association but also the conditional independence and dependence structures among the variables, resulting in sparser relationships between risk factors and the disease phenotype than traditional correlation-based methods. In this paper, we apply a Bayesian network to dissect the complex regulatory relationships among disease traits and various risk factors for the Genetic Analysis Workshop 17 simulated data. We use the Bayesian network as a tool for the risk prediction of disease outcome.
Affiliation(s)
- Jia Kang
- Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, PO Box 208009, New Haven, CT 06520-8114, USA.
|
430
|
Li L, Zheng W, Lee JS, Zhang X, Ferguson J, Yan X, Zhao H. Collapsing-based and kernel-based single-gene analyses applied to Genetic Analysis Workshop 17 mini-exome data. BMC Proc 2011; 5 Suppl 9:S117. [PMID: 22373309 PMCID: PMC3287841 DOI: 10.1186/1753-6561-5-s9-s117] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Affiliation(s)
- Lun Li
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA; Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Wei Zheng
- Keck Biotechnology Resource Laboratory, Yale University, 300 George St., New Haven, CT 06511, USA
- Joon Sang Lee
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Xianghua Zhang
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA; Department of Electronic Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
- John Ferguson
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Xiting Yan
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
- Hongyu Zhao
- Division of Biostatistics, Yale School of Public Health, Yale University, 60 College St., PO Box 208034, New Haven, CT 06520-8034, USA
|
431
|
Abstract
Gene-based and single-nucleotide polymorphism (SNP) set association studies provide an important complement to SNP analysis. Kernel-based nonparametric regression has recently emerged as a powerful and flexible tool for this purpose. Our goal is to explore whether this approach can be extended to incorporate and test for interaction effects, especially for genes containing rare variant SNPs. Here, we construct nonparametric regression models that can be used to include a gene-environment interaction effect under the framework of the least-squares kernel machine and examine the performance of the proposed method on the Genetic Analysis Workshop 17 unrelated individuals data set. Two hundred simulated replicates were used to explore the power for detecting interaction. We demonstrate through a genome scan of the quantitative phenotype Q1 that the simulated gene-environment interaction effect in the data can be detected with reasonable power by using the least-squares kernel machine method.
|
432
|
Abstract
We found from our analysis of the Genetic Analysis Workshop 17 data that the population structure of the 697 unrelated individuals was an important confounding factor for association studies, even if it was not explicitly considered when simulating the phenotypes. We uncovered structures beyond the reported ethnicities and found ample evidence of phenotype–population structure associations. The first 10 principal components of the genotype data of the 697 individuals demonstrated much stronger associations with Q1, Q2, and the disease than did the individuals’ ethnicities. In addition, we observed that population structure was a confounding factor for the Q1-gene association when identifying the significant genes both with and without adjusting for the causal single-nucleotide polymorphisms, the ethnicities, and the principal components. Many false discoveries remained after adjusting for the causal single-nucleotide polymorphisms. Adjusting for the principal components appeared more effective than did adjusting for ethnicity in terms of preventing false discoveries. This analysis was performed with knowledge of the causal loci.
Affiliation(s)
- Huaizhen Qin
- Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA.
|
433
|
Abstract
Rare variants are believed to play an important role in disease etiology. Recent advances in high-throughput sequencing technology enable investigators to systematically characterize the genetic effects of both common and rare variants. We introduce several approaches that simultaneously test the effects of common and rare variants within a single-nucleotide polymorphism (SNP) set based on logistic regression models and logistic kernel machine models. Gene-environment interactions and SNP-SNP interactions are also considered in some of these models. We illustrate the performance of these methods using the unrelated individuals data from Genetic Analysis Workshop 17. Three true disease genes (FLT1, PIK3C3, and KDR) were consistently selected using the proposed methods. In addition, compared to logistic regression models, the logistic kernel machine models were more powerful, presumably because they reduced the effective number of parameters through regularization. Our results also suggest that a screening step is effective in decreasing the number of false-positive findings, which is often a big concern for association studies.
Affiliation(s)
- Ru Wang
- Department of Statistics, University of California, Davis, CA 95616, USA
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, PO Box 19024, Seattle, WA 98109, USA
- Jie Peng
- Department of Statistics, University of California, Davis, CA 95616, USA
- Pei Wang
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, PO Box 19024, Seattle, WA 98109, USA
|
434
|
Basu S, Pan W, Shen X, Oetting WS. Multilocus association testing with penalized regression. Genet Epidemiol 2011; 35:755-65. [PMID: 21922539 DOI: 10.1002/gepi.20625] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 04/20/2011] [Revised: 06/09/2011] [Accepted: 07/04/2011] [Indexed: 12/26/2022]
Abstract
In multilocus association analysis, since some markers may not be associated with a trait, it seems attractive to use penalized regression with the capability of automatic variable selection. On the other hand, in spite of a rapidly growing body of literature on penalized regression, most focus on variable selection and outcome prediction, for which penalized methods are generally more effective than their nonpenalized counterparts. However, for statistical inference, i.e. hypothesis testing and interval estimation, it is less clear how penalized methods would perform, or even how to best apply them, largely due to lack of studies on this topic. In our motivating data for a cohort of kidney transplant recipients, it is of primary interest to assess whether a group of genetic variants are associated with a binary clinical outcome, acute rejection at 6 months. In this article, we study some technical issues and alternative implementations of hypothesis testing in Lasso penalized logistic regression, and compare their performance with each other and with several existing global tests, some of which are specifically designed as variance component tests for high-dimensional data. The most interesting, and perhaps surprising, conclusion of this study is that, for low to moderately high-dimensional data, statistical tests based on Lasso penalized regression are not necessarily more powerful than some existing global tests. In addition, in penalized regression, rather than building a test based on a single selected "best" model, combining multiple tests, each of which is built on a candidate model, might be more promising.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
435
|
Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol 2011; 12:227. [PMID: 21920052 PMCID: PMC3308043 DOI: 10.1186/gb-2011-12-9-227] [Citation(s) in RCA: 99] [Impact Index Per Article: 7.1] [Indexed: 01/30/2023] Open
Abstract
New sequencing technology has enabled the identification of thousands of single nucleotide polymorphisms in the exome, and many computational and statistical approaches to identify disease-association signals have emerged.
Affiliation(s)
- Nathan O Stitziel
- Division of Cardiovascular Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, MA 02115, USA
|
436
|
Gao Q, He Y, Yuan Z, Zhao J, Zhang B, Xue F. Gene- or region-based association study via kernel principal component analysis. BMC Genet 2011; 12:75. [PMID: 21871061 PMCID: PMC3176196 DOI: 10.1186/1471-2156-12-75] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 05/08/2011] [Accepted: 08/26/2011] [Indexed: 11/12/2022] Open
Abstract
Background: In genetic association studies, especially GWAS, gene- or region-based methods have become more popular for detecting the association between multiple SNPs and diseases (or traits). Kernel principal component analysis combined with logistic regression test (KPCA-LRT) has been successfully used in classifying gene expression data. Nevertheless, the purpose of an association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and genomic data are categorical rather than numerical. Although a kernel-based logistic regression model for association studies has recently been proposed, which projects the nonlinear original SNP data into a linear feature space, it is still affected by multicollinearity between the projections, which may lead to loss of power. We therefore proposed a KPCA-LRT model to avoid the multicollinearity. Results: Simulation results showed that KPCA-LRT was always more powerful than principal component analysis combined with logistic regression test (PCA-LRT) at different sample sizes, significance levels, and relative risks, especially at the gene-wide level (1E-5) and lower relative risks (RR = 1.2, 1.3). Application to the four gene regions of rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCA-LRT performed better than the single-locus test and PCA-LRT. Conclusions: KPCA-LRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.
Affiliation(s)
- Qingsong Gao
- Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan 250012, China
|
437
|
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 2011; 89:277-88. [PMID: 21835306 DOI: 10.1016/j.ajhg.2011.07.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 12/07/2010] [Revised: 06/16/2011] [Accepted: 07/13/2011] [Indexed: 11/15/2022] Open
Abstract
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
|
438
|
Biason P, Hattinger CM, Innocenti F, Talamini R, Alberghini M, Scotlandi K, Zanusso C, Serra M, Toffoli G. Nucleotide excision repair gene variants and association with survival in osteosarcoma patients treated with neoadjuvant chemotherapy. Pharmacogenomics J 2011; 12:476-83. [PMID: 21826087 DOI: 10.1038/tpj.2011.33] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Indexed: 12/23/2022]
Abstract
The aim of this study was to investigate the role of common polymorphisms in the nucleotide excision repair pathway genes in the tumorigenesis of osteosarcoma and in the response to DNA-damaging therapies, such as cisplatin-based neoadjuvant therapy. Excision repair cross-complementing (ERCC) group 2 (XPD; rs13181 and rs1799793), group 5 (XPG; rs17655) and group 1 (XPA; rs3212986 and rs11615) polymorphisms were analyzed in a group of 130 homogeneously treated patients with high-grade osteosarcoma, for association with event-free survival (EFS), using Kaplan-Meier plots and the log-rank test. A positive association was observed between both XPD single-nucleotide polymorphisms and an increased EFS (hazard ratio (HR) = 0.34, 95% confidence interval (CI) 0.12-0.98 and HR = 0.19, 95% CI 0.05-0.77, respectively). We also performed a case-control study of the relative risk of developing osteosarcoma. Patients carrying at least one variant allele of XPD rs1799793 had a reduced risk of developing osteosarcoma, compared with wild-type patients (odds ratio = 0.55, 95% CI 0.36-0.84). This study suggests that XPD rs1799793 could be a marker of osteosarcoma associated with features conferring either a better prognosis or a better outcome after platinum therapy, or both.
Affiliation(s)
- P Biason
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico, National Cancer Institute, Aviano, Italy
|
439
|
Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC, Lin X. Kernel machine SNP-set analysis for censored survival outcomes in genome-wide association studies. Genet Epidemiol 2011; 35:620-31. [PMID: 21818772 DOI: 10.1002/gepi.20610] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 09/10/2010] [Revised: 05/06/2011] [Accepted: 06/03/2011] [Indexed: 02/01/2023]
Abstract
In this article, we develop a powerful test for identifying single nucleotide polymorphism (SNP)-sets that are predictive of survival with data from genome-wide association studies. We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially nonlinear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum P-value-based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
Affiliation(s)
- Xinyi Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA
|
440
|
Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 2011; 35:606-19. [PMID: 21769936 DOI: 10.1002/gepi.20609] [Citation(s) in RCA: 188] [Impact Index Per Article: 13.4] [Received: 11/30/2010] [Revised: 03/23/2011] [Accepted: 06/03/2011] [Indexed: 01/31/2023]
Abstract
In anticipation of the availability of next-generation sequencing data, there is increasing interest in investigating association between complex traits and rare variants (RVs). In contrast to association studies for common variants (CVs), due to the low frequencies of RVs, common wisdom suggests that existing statistical tests for CVs might not work, motivating the recent development of several new tests for analyzing RVs, most of which are based on the idea of pooling/collapsing RVs. However, there is a lack of evaluations of, and thus guidance on the use of, existing tests. Here we provide a comprehensive comparison of various statistical tests using simulated data. We consider both independent and correlated rare mutations, and representative tests for both CVs and RVs. As expected, if there are no or few non-causal (i.e. neutral or non-associated) RVs in a locus of interest while the effects of causal RVs on the trait are all (or mostly) in the same direction (i.e. either protective or deleterious, but not both), then the simple pooled association tests (without selecting RVs and their association directions) and a new test called kernel-based adaptive clustering (KBAC) perform similarly and are most powerful; KBAC is more robust than simple pooled association tests in the presence of non-causal RVs; however, as the number of non-causal CVs increases and/or in the presence of opposite association directions, the winners are two methods originally proposed for CVs and a new test called C-alpha test proposed for RVs, each of which can be regarded as testing on a variance component in a random-effects model. Interestingly, several methods based on sequential model selection (i.e. selecting causal RVs and their association directions), including two new methods proposed here, perform robustly and often have statistical power between those of the above two classes.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455-0392, USA
|
441
|
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011; 89:82-93. [PMID: 21737059 DOI: 10.1016/j.ajhg.2011.05.029] [Citation(s) in RCA: 1736] [Impact Index Per Article: 124.0] [Received: 03/16/2011] [Revised: 05/27/2011] [Accepted: 05/30/2011] [Indexed: 01/18/2023] Open
Abstract
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
Affiliation(s)
- Michael C Wu
- Department of Biostatistics, The University of North Carolina at Chapel Hill, 27599, USA
|
442
|
Basu S, Pan W, Oetting WS. A dimension reduction approach for modeling multi-locus interaction in case-control studies. Hum Hered 2011; 71:234-45. [PMID: 21734407 DOI: 10.1159/000328842] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/05/2010] [Accepted: 04/12/2011] [Indexed: 01/01/2023] Open
Abstract
Studying one locus or one single nucleotide polymorphism (SNP) at a time may not be sufficient to understand complex diseases because they are unlikely to result from the effect of only one SNP. Each SNP alone may have little or no effect on the risk of the disease, but together they may increase the risk substantially. Analyses focusing on individual SNPs ignore the possibility of interaction among SNPs. In this paper, we propose a parsimonious model to assess the joint effect of a group of SNPs in a case-control study. The model implements a data reduction strategy within a likelihood framework and uses a test to assess the statistical significance of the effect of the group of SNPs on the binary trait. The primary advantage of the proposed approach is that the dimension reduction technique produces a test statistic with degrees of freedom significantly lower than a multiple logistic regression with only main effects of the SNPs, and our parsimonious model can incorporate the possibility of interaction among the SNPs. Moreover, the proposed approach estimates the direction of association of each SNP with the disease and provides an estimate of the average effect of the group of SNPs positively and negatively associated with the disease in the given SNP set. We illustrate the proposed model on simulated and real data, and compare its performance with a few other existing approaches. Our proposed approach appeared to outperform the other approaches for independent SNPs in our simulation studies.
Affiliation(s)
- Saonli Basu
- Division of Biostatistics, University of Minnesota, Minneapolis, USA. saonli@umn.edu
|
443
|
Wang W, Zhang X. Network-based group variable selection for detecting expression quantitative trait loci (eQTL). BMC Bioinformatics 2011; 12:269. [PMID: 21718480 PMCID: PMC3152919 DOI: 10.1186/1471-2105-12-269] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/08/2010] [Accepted: 06/30/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Analysis of expression quantitative trait loci (eQTL) aims to identify the genetic loci associated with the expression level of genes. Penalized regression with a proper penalty is suitable for high-dimensional biological data. Its performance should be enhanced when we incorporate biological knowledge of the gene expression network and the linkage disequilibrium (LD) structure between loci in a high-noise background. Results: We propose a network-based group variable selection (NGVS) method for QTL detection. Our method simultaneously maps highly correlated expression traits sharing the same biological function to marker sets formed by LD. By grouping markers, complex joint activity of multiple SNPs can be considered and the dimensionality of the eQTL problem is reduced dramatically. In order to demonstrate the power and flexibility of our method, we used it to analyze two simulations and a mouse obesity and diabetes dataset. We considered the gene co-expression network, grouped markers into marker sets and treated the additive and dominant effect of each locus as a group; as a consequence, we were able to replicate results previously obtained on the mouse linkage dataset. Furthermore, we observed several possible sex-dependent loci and interactions of multiple SNPs. Conclusions: The proposed NGVS method is appropriate for problems with high-dimensional data and a high-noise background. On the eQTL problem, it outperforms the classical Lasso method, which does not consider biological knowledge. Introduction of proper gene expression and loci correlation information makes detecting causal markers more accurate. With reasonable model settings, NGVS can lead to novel biological findings.
Affiliation(s)
- Weichen Wang
- Mathematics and Physics, School of Sciences, Tsinghua University, Beijing 100084, China.
|
444
|
Feng T, Elston RC, Zhu X. Detecting rare and common variants for complex traits: sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet Epidemiol 2011; 35:398-409. [PMID: 21594893 DOI: 10.1002/gepi.20588] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 01/28/2011] [Revised: 03/25/2011] [Accepted: 03/30/2011] [Indexed: 01/04/2023]
Abstract
It is generally known that risk variants segregate together with a disease within families, but this information has not been used in the existing statistical methods for detecting rare variants. Here we introduce two weighted sum statistics that can apply to either genome-wide association data or resequencing data for identifying rare disease variants: weights calculated based on sibpairs and odds ratios, respectively. We evaluated the two methods via extensive simulations under different disease models. We compared the proposed methods with the weighted sum statistic (WSS) proposed by Madsen and Browning, keeping the same genotyping or resequencing cost. Our methods clearly demonstrate more statistical power than the WSS. In addition, we found that using sibpair information can increase power over using only unrelated samples by more than 40%. We applied our methods to the Framingham Heart Study (FHS) and Wellcome Trust Case Control Consortium (WTCCC) hypertension datasets. Although we did not identify any genes as reaching a genome-wide significance level, we found variants in the candidate gene angiotensinogen significantly associated with hypertension at P = 6.9 × 10⁻⁴, whereas the most significant single SNP association evidence is P = 0.063. We further applied the odds ratio weighted method to the IFIH1 gene for type-1 diabetes in the WTCCC data. Our method yielded a P-value of 4.82 × 10⁻⁴, much more significant than that obtained by haplotype-based methods. We demonstrated that family data are extremely informative in searching for rare variants underlying complex traits, and the odds ratio weighted sum statistic is more efficient than currently existing methods.
Affiliation(s)
- Tao Feng
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA
|
445
|
Butterbach K, Beckmann L, de Sanjosé S, Benavente Y, Becker N, Foretova L, Maynadie M, Cocco P, Staines A, Boffetta P, Brennan P, Nieters A. Association of JAK-STAT pathway related genes with lymphoma risk: results of a European case-control study (EpiLymph). Br J Haematol 2011; 153:318-33. [PMID: 21418178 DOI: 10.1111/j.1365-2141.2011.08632.x] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Indexed: 01/05/2023]
Abstract
Previous studies have suggested an important role for the Janus kinase-signal transducer and activator of transcription (JAK-STAT) signalling pathway in tumour development. Therefore, we explored genetic variants in JAK-STAT pathway associated genes with lymphoma risk. In samples of the EpiLymph case-control study we genotyped 1536 single nucleotide polymorphisms (SNPs) using GoldenGate BeadArray™ Technology (Illumina, San Diego, CA, USA). Here, we report the associations between selected SNPs and haplotypes of the JAK-STAT pathway and risk of Hodgkin lymphoma (HL), B-cell non-Hodgkin lymphoma (B-NHL) and most frequent B-NHL subtypes. Among 210 relevant JAK-STAT pathway-related SNPs, polymorphisms in nine genes (BMF, IFNG, IL12A, SOCS1, STAT1, STAT3, STAT5A, STAT6, TP63) were significantly associated with lymphoma risk. At a study-wise significance level, we obtained a risk reduction of 28% among carriers of the heterozygous genotype of the STAT3 variant (rs1053023) for B-NHL. For six other variants within the STAT3 gene we observed an inverse association with different lymphoma subtypes. A reduced risk for HL was observed for the heterozygous genotype of the STAT6 SNP (rs324011). This is an explorative investigation to examine associations between JAK-STAT signalling related genes and lymphoma risk. The results implicate a relevant role of certain pathway-related genes in lymphomagenesis, but still need to be approved by independent studies.
Affiliation(s)
- Katja Butterbach
- Division of Cancer Epidemiology, German Cancer Research Center, Heidelberg, Germany
|
446
|
Shriner D, Vaughan LK. A unified framework for multi-locus association analysis of both common and rare variants. BMC Genomics 2011; 12:89. [PMID: 21281506 PMCID: PMC3040731 DOI: 10.1186/1471-2164-12-89] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 04/30/2010] [Accepted: 01/31/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Common, complex diseases are hypothesized to result from a combination of common and rare genetic variants. We developed a unified framework for the joint association testing of both types of variants. Within the framework, we developed a union-intersection test suitable for genome-wide analysis of single nucleotide polymorphisms (SNPs), candidate gene data, as well as medical sequencing data. The union-intersection test is a composite test of association of genotype frequencies and differential correlation among markers. Results: We demonstrated by computer simulation that the false positive error rate was controlled at the expected level. We also demonstrated scenarios in which the multi-locus test was more powerful than traditional single marker analysis. To illustrate use of the union-intersection test with real data, we analyzed a publicly available data set of 319,813 autosomal SNPs genotyped for 938 cases of Parkinson disease and 863 neurologically normal controls for which no genome-wide significant results were found by traditional single marker analysis. We also analyzed an independent follow-up sample of 183 cases and 248 controls for replication. Conclusions: We identified a single risk haplotype with a directionally consistent effect in both samples in the gene GAK, which is involved in clathrin-mediated membrane trafficking. We also found suggestive evidence that directionally inconsistent marginal effects from single marker analysis appeared to result from risk being driven by different haplotypes in the two samples for the genes SYN3 and NGLY1, which are involved in neurotransmitter release and proteasomal degradation, respectively. These results illustrate the utility of our unified framework for genome-wide association analysis of common, complex diseases.
Affiliation(s)
- Daniel Shriner
- Center for Research on Genomics and Global Health, National Human Genome Research Institute, Bethesda, MD 20892, USA.
|
447
|
Wang L, Jia P, Wolfinger RD, Chen X, Grayson BL, Aune TM, Zhao Z. An efficient hierarchical generalized linear mixed model for pathway analysis of genome-wide association studies. Bioinformatics 2011; 27:686-92. [PMID: 21266443 DOI: 10.1093/bioinformatics/btq728] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
MOTIVATION: In genome-wide association studies (GWAS) of complex diseases, genetic variants having real but weak associations often fail to be detected at the stringent genome-wide significance level. Pathway analysis, which tests disease association with combined association signals from a group of variants in the same pathway, has become increasingly popular. However, because of the complexities in genetic data and the large sample sizes in typical GWAS, pathway analysis remains challenging. We propose a new statistical model for pathway analysis of GWAS. This model includes a fixed effects component that models mean disease association for a group of genes, and a random effects component that models how each gene's association with disease varies about the gene group mean, and thus belongs to the class of mixed effects models. RESULTS: The proposed model is computationally efficient and uses only summary statistics. In addition, it corrects for the presence of overlapping genes and linkage disequilibrium (LD). Via simulated and real GWAS data, we showed that our model improved power over currently available pathway analysis methods while preserving the type I error rate. Furthermore, using the WTCCC Type 1 Diabetes (T1D) dataset, we demonstrated that mixed model analysis identified meaningful biological processes that agreed well with previous reports on T1D. Therefore, the proposed methodology provides an efficient statistical modeling framework for systems analysis of GWAS. AVAILABILITY: The software code for mixed models analysis is freely available at http://biostat.mc.vanderbilt.edu/LilyWang.
Affiliation(s)
- Lily Wang
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37232, USA.
|
448
|
Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 2010; 11:843-54. [PMID: 21085203 DOI: 10.1038/nrg2884] [Citation(s) in RCA: 581] [Impact Index Per Article: 38.7] [Indexed: 12/14/2022]
Abstract
Genome-wide association (GWA) studies have typically focused on the analysis of single markers, which often lacks the power to uncover the relatively small effect sizes conferred by most genetic variants. Recently, pathway-based approaches have been developed, which use prior biological knowledge on gene function to facilitate more powerful analysis of GWA study data sets. These approaches typically examine whether a group of related genes in the same functional pathway are jointly associated with a trait of interest. Here we review the development of pathway-based approaches for GWA studies, discuss their practical use and caveats, and suggest that pathway-based approaches may also be useful for future GWA studies with sequencing data.
Affiliation(s)
- Kai Wang
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Pennsylvania 19104, USA.
|
449
|
Amos W, Driscoll E, Hoffman JI. Candidate genes versus genome-wide associations: which are better for detecting genetic susceptibility to infectious disease? Proc Biol Sci 2010; 278:1183-8. [PMID: 20926441 DOI: 10.1098/rspb.2010.1920] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Indexed: 01/20/2023] Open
Abstract
Technological developments allow increasing numbers of markers to be deployed in case-control studies searching for genetic factors that influence disease susceptibility. However, with vast numbers of markers, true 'hits' may become lost in a sea of false positives. This problem may be particularly acute for infectious diseases, where the control group may contain unexposed individuals with susceptible genotypes. To explore this effect, we used a series of stochastic simulations to model a scenario based loosely on bovine tuberculosis. We find that a candidate gene approach tends to have greater statistical power than studies that use large numbers of single nucleotide polymorphisms (SNPs) in genome-wide association tests, almost regardless of the number of SNPs deployed. Both approaches struggle to detect genetic effects when modest sample sizes (250 each of cases and controls) are used and either the effects are weak or an appreciable proportion of individuals are unexposed to the disease, but these issues are largely mitigated if sample sizes can be increased to 2000 or more of each class. We conclude that the power of any genotype-phenotype association test will be improved if the sampling strategy takes account of exposure heterogeneity, though this is not necessarily easy to do.
Affiliation(s)
- W Amos
- Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK.
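The effect described in the abstract above can be illustrated with a small stochastic simulation. The sketch below is not the authors' model: the carrier frequency, infection risks, exposure levels, number of tests and the use of a Bonferroni-corrected 1-df chi-square test are all assumptions chosen for illustration. It estimates the power to detect one susceptibility variant when controls may include unexposed individuals, and when the significance threshold is divided by the number of markers tested.

```python
import math
import random


def carrier_freq(p_carrier, risk_carrier, risk_noncarrier, exposure, case):
    """Carrier frequency among cases or controls (Bayes' rule).

    Only exposed individuals can become cases, so controls are a mix of
    exposed-but-uninfected and unexposed individuals. Requires exposure > 0.
    """
    if case:
        w_c = p_carrier * exposure * risk_carrier
        w_n = (1 - p_carrier) * exposure * risk_noncarrier
    else:
        w_c = p_carrier * (1 - exposure * risk_carrier)
        w_n = (1 - p_carrier) * (1 - exposure * risk_noncarrier)
    return w_c / (w_c + w_n)


def chi2_pvalue(a, b, n):
    """P-value of the 1-df chi-square test for a carriers among n cases
    versus b carriers among n controls."""
    carriers = a + b
    if carriers == 0 or carriers == 2 * n:
        return 1.0
    chi2 = 2 * n * (a - b) ** 2 / (carriers * (2 * n - carriers))
    # For 1 df, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def power(n, exposure, n_tests, reps=1000, seed=1, p_carrier=0.3,
          risk_carrier=0.3, risk_noncarrier=0.2, alpha=0.05):
    """Fraction of simulated studies in which the causal variant passes a
    Bonferroni threshold of alpha / n_tests. All defaults are illustrative."""
    rng = random.Random(seed)
    f_case = carrier_freq(p_carrier, risk_carrier, risk_noncarrier, exposure, True)
    f_ctrl = carrier_freq(p_carrier, risk_carrier, risk_noncarrier, exposure, False)
    hits = 0
    for _ in range(reps):
        a = sum(rng.random() < f_case for _ in range(n))  # carriers among cases
        b = sum(rng.random() < f_ctrl for _ in range(n))  # carriers among controls
        if chi2_pvalue(a, b, n) < alpha / n_tests:
            hits += 1
    return hits / reps


# Candidate gene (1 test) with full and partial exposure, then a
# genome-wide scan (100,000 tests, hypothetical) with full exposure.
full = power(n=250, exposure=1.0, n_tests=1)
partial = power(n=250, exposure=0.2, n_tests=1)
gwas = power(n=250, exposure=1.0, n_tests=100_000)
print(f"candidate, all exposed: {full:.2f}")
print(f"candidate, 20% exposed: {partial:.2f}")
print(f"GWAS, all exposed:      {gwas:.2f}")
```

Under these assumed parameters, lowering the exposed fraction pulls the carrier frequency in controls toward the population value, which shrinks the case-control difference, while the genome-wide threshold penalizes the same variant through multiple-testing correction.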
|