1
Nickchi P, Karunarathna C, Graham J. An exploration of linkage fine-mapping on sequences from case-control studies. Genet Epidemiol 2023; 47:78-94. [PMID: 36047334 PMCID: PMC10087369 DOI: 10.1002/gepi.22502] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 05/30/2022] [Accepted: 08/09/2022] [Indexed: 02/01/2023]
Abstract
Linkage analysis maps genetic loci for a heritable trait by identifying genomic regions with excess relatedness among individuals with similar trait values. Analysis may be conducted on related individuals from families, or on samples of unrelated individuals from a population. For allelically heterogeneous traits, population-based linkage analysis can be more powerful than genotypic-association analysis. Here, we focus on linkage analysis in a population sample, but use sequences rather than individuals as our unit of observation. Earlier investigations of sequence-based linkage mapping relied on known sequence relatedness, whereas we infer relatedness from the sequence data. We propose two ways to associate similarity in relatedness of sequences with similarity in their trait values and compare the resulting linkage methods to two genotypic-association methods. We also introduce a procedure to label case sequences as potential carriers or noncarriers of causal variants after an association has been found. This post hoc labeling of case sequences is based on inferred relatedness to other case sequences. Our simulation results indicate that methods based on sequence relatedness improve localization and perform as well as genotypic-association methods for detecting rare causal variants. Sequence-based linkage analysis therefore has potential to fine-map allelically heterogeneous disease traits.
Affiliation(s)
- Payman Nickchi
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Charith Karunarathna
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada; Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
2
Pluta D, Shen T, Xue G, Chen C, Ombao H, Yu Z. Ridge-penalized adaptive Mantel test and its application in imaging genetics. Stat Med 2021; 40:5313-5332. [PMID: 34216035 DOI: 10.1002/sim.9127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/20/2020] [Revised: 06/01/2021] [Accepted: 06/16/2021] [Indexed: 01/23/2023]
Abstract
We propose a ridge-penalized adaptive Mantel test (AdaMant) for evaluating the association of two high-dimensional sets of features. By introducing a ridge penalty, AdaMant tests the association across many metrics simultaneously. We demonstrate how ridge penalization bridges Euclidean and Mahalanobis distances and their corresponding linear models from the perspective of association measurement and testing. This result is not only theoretically interesting but also has important implications in penalized hypothesis testing, especially in high-dimensional settings such as imaging genetics. Applying the proposed method to an imaging genetic study of visual working memory in healthy adults, we identified interesting associations of brain connectivity (measured by electroencephalogram coherence) with selected genetic features.
Affiliation(s)
- Dustin Pluta
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Tong Shen
- Department of Statistics, University of California, Irvine, Irvine, California, USA
- Gui Xue
- Center for Brain and Learning Science, Beijing Normal University, Beijing, China
- Chuansheng Chen
- Department of Psychology and Social Behavior, University of California, Irvine, Irvine, California, USA
- Hernando Ombao
- Statistics Program, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Zhaoxia Yu
- Department of Statistics, University of California, Irvine, Irvine, California, USA
3
Tzeng JY, Lu W, Hsu FC. Gene-level pharmacogenetic analysis on survival outcomes using gene-trait similarity regression. Ann Appl Stat 2014; 8:1232-1255. [PMID: 25018788 DOI: 10.1214/14-aoas735] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
Abstract
Gene/pathway-based methods are drawing significant attention due to their usefulness in detecting rare and common variants that affect disease susceptibility. The biological mechanism of drug responses indicates that a gene-based analysis has even greater potential in pharmacogenetics. Motivated by a study from the Vitamin Intervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of a gene or pathway on time-to-event outcomes. The similarity regression has a general framework that covers a range of survival models, such as the proportional hazards model and the proportional odds model. The inference procedure developed under the proportional hazards model is robust against model misspecification. We derive the equivalence between the similarity survival regression and a random effects model, which further unifies the current variance-component based methods. We demonstrate the effectiveness of the proposed method through simulation studies. In addition, we apply the method to the VISP trial data to identify the genes that exhibit an association with the risk of a recurrent stroke. TCN2 gene was found to be associated with the recurrent stroke risk in the low-dose arm. This gene may impact recurrent stroke risk in response to cofactor therapy.
Affiliation(s)
- Jung-Ying Tzeng
- North Carolina State University; National Cheng-Kung University
4
Georges A, Cambisano N, Ahariz N, Karim L, Georges M. A genome scan conducted in a multigenerational pedigree with convergent strabismus supports a complex genetic determinism. PLoS One 2014; 8:e83574. [PMID: 24376720 PMCID: PMC3871668 DOI: 10.1371/journal.pone.0083574] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/13/2013] [Accepted: 11/06/2013] [Indexed: 11/26/2022] Open
Abstract
A genome-wide linkage scan was conducted in a Northern-European multigenerational pedigree with nine of 40 related members affected with concomitant strabismus. Twenty-seven members of the pedigree including all affected individuals were genotyped using a SNP array interrogating > 300,000 common SNPs. We conducted parametric and non-parametric linkage analyses assuming segregation of an autosomal dominant mutation, yet allowing for incomplete penetrance and phenocopies. We detected two chromosome regions with near-suggestive evidence for linkage, respectively on chromosomes 8 and 18. The chromosome 8 linkage implied a penetrance of 0.80 and a rate of phenocopy of 0.11, while the chromosome 18 linkage implied a penetrance of 0.64 and a rate of phenocopy of 0. Our analysis excludes a simple genetic determinism of strabismus in this pedigree.
Affiliation(s)
- Anouk Georges
- Department of Ophthalmology, Faculty of Medicine, University of Liège (CHU), Liège, Belgium
- Nadine Cambisano
- Unit of Animal Genomics, GIGA-R & Faculty of Veterinary Medicine, University of Liège (B34), Liège, Belgium
- Naïma Ahariz
- Unit of Animal Genomics, GIGA-R & Faculty of Veterinary Medicine, University of Liège (B34), Liège, Belgium
- Latifa Karim
- GIGA-R Genotranscriptomics Core Facility, University of Liège (B34), Liège, Belgium
- Michel Georges
- Unit of Animal Genomics, GIGA-R & Faculty of Veterinary Medicine, University of Liège (B34), Liège, Belgium
5
Minas C, Curry E, Montana G. A distance-based test of association between paired heterogeneous genomic data. Bioinformatics 2013; 29:2555-63. [PMID: 23918252 DOI: 10.1093/bioinformatics/btt450] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
Motivation: Due to rapid technological advances, a wide range of different measurements can be obtained from a given biological sample including single nucleotide polymorphisms, copy number variation, gene expression levels, DNA methylation and proteomic profiles. Each of these distinct measurements provides the means to characterize a certain aspect of biological diversity, and a fundamental problem of broad interest concerns the discovery of shared patterns of variation across different data types. Such data types are heterogeneous in the sense that they represent measurements taken at different scales or represented by different data structures.
Results: We propose a distance-based statistical test, the generalized RV (GRV) test, to assess whether there is a common and non-random pattern of variability between paired biological measurements obtained from the same random sample. The measurements enter the test through the use of two distance measures, which can be chosen to capture a particular aspect of the data. An approximate null distribution is proposed to compute P-values in closed-form and without the need to perform costly Monte Carlo permutation procedures. Compared with the classical Mantel test for association between distance matrices, the GRV test has been found to be more powerful in a number of simulation settings. We also demonstrate how the GRV test can be used to detect biological pathways in which genetic variability is associated to variation in gene expression levels in an ovarian cancer sample, and present results obtained from two independent cohorts.
Availability: R code to compute the GRV test is freely available from http://www2.imperial.ac.uk/~gmontana
Affiliation(s)
- Christopher Minas
- Department of Imaging Sciences, Institute of Clinical Sciences, Hammersmith Campus, Statistics Section, Department of Mathematics, South Kensington Campus and Department of Surgery and Cancer, Ovarian Cancer Action Research Centre, Hammersmith Campus, Imperial College London, London W12 0NN, UK
6
Jiao S, Hsu L, Bézieau S, Brenner H, Chan AT, Chang-Claude J, Le Marchand L, Lemire M, Newcomb PA, Slattery ML, Peters U. SBERIA: set-based gene-environment interaction test for rare and common variants in complex diseases. Genet Epidemiol 2013; 37:452-64. [PMID: 23720162 PMCID: PMC3713231 DOI: 10.1002/gepi.21735] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 03/05/2013] [Revised: 04/04/2013] [Accepted: 04/30/2013] [Indexed: 01/28/2023]
Abstract
Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burdens and aggregating signals within a set. The major challenge of the signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association studies (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
Affiliation(s)
- Shuo Jiao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
7
Thomas DC. Some surprising twists on the road to discovering the contribution of rare variants to complex diseases. Hum Hered 2013; 74:113-7. [PMID: 23594489 DOI: 10.1159/000347020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
8
Zakharov S, Salim A, Thalamuthu A. Comparison of similarity-based tests and pooling strategies for rare variants. BMC Genomics 2013; 14:50. [PMID: 23343094 PMCID: PMC3600007 DOI: 10.1186/1471-2164-14-50] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/09/2012] [Accepted: 01/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background: As several rare genomic variants have been shown to affect common phenotypes, rare variants association analysis has received considerable attention. Several efficient association tests using genotype and phenotype similarity measures have been proposed in the literature. The major advantages of similarity-based tests are their ability to accommodate multiple types of DNA variations within one association test, and to account for the possible interaction within a region. However, not much work has been done to compare the performance of similarity-based tests on rare variants association scenarios, especially when applied with different rare variants pooling strategies.
Results: Based on the population genetics simulations and analysis of a publicly-available sequencing data set, we compared the performance of four similarity-based tests and two rare variants pooling strategies. We showed that weighting approach outperforms collapsing under the presence of strong effect from rare variants and under the presence of moderate effect from common variants, whereas collapsing of rare variants is preferable when common variants possess a strong effect. We also demonstrated that the difference in statistical power between the two pooling strategies may be substantial. The results also highlighted consistently high power of two similarity-based approaches when applied with an appropriate pooling strategy.
Conclusions: Population genetics simulations and sequencing data set analysis showed high power of two similarity-based tests and a substantial difference in power between the two pooling strategies.
Affiliation(s)
- Sergii Zakharov
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672, Singapore.
9
Sun YV, Sung YJ, Tintle N, Ziegler A. Identification of genetic association of multiple rare variants using collapsing methods. Genet Epidemiol 2012; 35 Suppl 1:S101-6. [PMID: 22128049 DOI: 10.1002/gepi.20658] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
10
Sun YV, Zhao W, Shedden KA, Kardia SL. Identification of genes associated with complex traits by testing the genetic dissimilarity between individuals. BMC Proc 2011; 5 Suppl 9:S120. [PMID: 22373401 PMCID: PMC3287845 DOI: 10.1186/1753-6561-5-s9-s120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/20/2022] Open
Abstract
Using the exome sequencing data from 697 unrelated individuals and their simulated disease phenotypes from Genetic Analysis Workshop 17, we develop and apply a gene-based method to identify the relationship between a gene with multiple rare genetic variants and a phenotype. The method is based on the Mantel test, which assesses the correlation between two distance matrices using a permutation procedure. Using up to 100,000 permutations to estimate the statistical significance in 200 replicate data sets, we found that the method had 5.1% type I error at an α level of 0.05 and had various power to detect genes with simulated genetic associations. FLT1 and KDR had the most significant correlations with Q1 and were replicated 170 and 24 times, respectively, in 200 simulated data sets using a Bonferroni corrected p-value of 0.05 as a threshold. These results suggest that the distance correlation method can be used to identify genotype-phenotype association when multiple rare genetic variants in a gene are involved.
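The Mantel test named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `mantel_test`, the use of a Pearson correlation over the upper triangle, the one-sided (positive-association) p-value, and the default number of permutations are all assumptions made for illustration.

```python
import numpy as np

def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """One-sided permutation Mantel test for two symmetric distance matrices.

    Returns the observed correlation between the off-diagonal entries and a
    permutation p-value (hypothetical helper, illustration only).
    """
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    n = a.shape[0]
    iu = np.triu_indices(n, k=1)  # each unordered pair counted once
    observed = np.corrcoef(a[iu], b[iu])[0, 1]

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        permuted = a[np.ix_(p, p)]  # relabel rows and columns together
        if np.corrcoef(permuted[iu], b[iu])[0, 1] >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value
```

Permuting the rows and columns of one matrix with the same index vector preserves its distance structure while breaking the pairing between the genetic and phenotypic labels, which is what makes the permutation distribution a valid null reference.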
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA 30322, USA.
11
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu FC, Thomas DC, Sullivan PF. Studying gene and gene-environment effects of uncommon and common variants on continuous traits: a marker-set approach using gene-trait similarity regression. Am J Hum Genet 2011; 89:277-88. [PMID: 21835306 DOI: 10.1016/j.ajhg.2011.07.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.0] [Received: 12/07/2010] [Revised: 06/16/2011] [Accepted: 07/13/2011] [Indexed: 11/15/2022] Open
Abstract
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.
12
Yu Z, Wang S. Contrasting linkage disequilibrium as a multilocus family-based association test. Genet Epidemiol 2011; 35:487-98. [PMID: 21769928 DOI: 10.1002/gepi.20598] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/14/2010] [Revised: 04/20/2011] [Accepted: 04/24/2011] [Indexed: 02/04/2023]
Abstract
Linkage disequilibrium (LD) of genetic loci is routinely estimated and graphically illustrated in genetic association studies. It has been suggested that the information in LD is also useful for association mapping and genetic association can be detected by comparing LD patterns between cases and controls. Here, we extend this idea to analyze case-parents data by comparing LD patterns between transmitted and nontransmitted genotypes. We provide the condition when contrasting LD is valid for testing gene-gene interactions. A permutation procedure is given to assess statistical significance. One advantage of our proposed methods is that haplotype information is not required. Thus, the implementation of our methods is straightforward and the resulting tests are free from potential bias caused by assumptions made to estimate haplotypes in silico. Since our test statistics use pairwise LD measurements, they are less affected by missing data than many other multilocus methods. With simulated data, we demonstrate that examining LD patterns of case-parents data is a useful multilocus association mapping strategy and it complements existing association mapping methods. The application of our methods to a Crohn's disease data set shows that our methods can detect multilocus association that might be missed by other association methods. Our permutation procedure can also be modified to allow multiple offspring from a family to be analyzed.
Affiliation(s)
- Zhaoxia Yu
- Department of Statistics, University of California-Irvine, CA 92697, USA.
13
Liang Y, Kelemen A. Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification. Stat Anal Data Min 2011; 4:301-312. [PMID: 21666834 DOI: 10.1002/sam.10110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/18/2023]
Abstract
Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with embedded entropy algorithm to deal with the redundancy for the selection of the SNPs that have best prediction performance of diseases. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that on the average, our proposed method outperforms the well known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression based SNP selections for disease classification.
Affiliation(s)
- Yulan Liang
- Department of Family and Community Health, University of Maryland, Baltimore, 655 W. Lombard Street, Baltimore, MD 21201-1579
14
Zhang F, Deng HW. Confounding from cryptic relatedness in haplotype-based association studies. Genetica 2010; 138:945-50. [PMID: 20680405 DOI: 10.1007/s10709-010-9476-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/15/2009] [Accepted: 07/12/2010] [Indexed: 10/19/2022]
Abstract
Cryptic relatedness was suggested to be an important source of confounding in population-based association studies (PBAS). The impact of cryptic relatedness on the performance of haplotype phase inference and haplotype-based association tests is not clear. In this study, we used the Hapmap genetic data to simulate a set of related samples. We evaluated the accuracy of haplotype phase inferred by PHASE 2.1 and calculated the power, type I error rates, accuracy and positive prediction value (PPV) of haplotype frequency-based association tests (HFAT) and haplotype similarity-based association tests (HSAT) under various scenarios, considering relatedness levels, disease models and sample sizes. Cryptic relatedness appeared to slightly increase the accuracy of haplotype phase inference. We observed significant negative effect of cryptic relatedness on the performance of HFAT and HSAT. Ignoring cryptic relatedness may increase spurious association results in haplotype-based PBAS.
Affiliation(s)
- Feng Zhang
- College of Medicine, Xi'an Jiaotong University, 710061 Xi'an, People's Republic of China.
15
Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum Hered 2010; 70:109-31. [PMID: 20610906 DOI: 10.1159/000312641] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Received: 08/02/2009] [Accepted: 03/09/2010] [Indexed: 01/05/2023] Open
Abstract
Measures of genomic similarity are the basis of many statistical analytic methods. We review the mathematical and statistical basis of similarity methods, particularly based on kernel methods. A kernel function converts information for a pair of subjects to a quantitative value representing either similarity (larger values meaning more similar) or distance (smaller values meaning more similar), with the requirement that it must create a positive semidefinite matrix when applied to all pairs of subjects. This review emphasizes the wide range of statistical methods and software that can be used when similarity is based on kernel methods, such as nonparametric regression, linear mixed models and generalized linear mixed models, hierarchical models, score statistics, and support vector machines. The mathematical rigor for these methods is summarized, as is the mathematical framework for making kernels. This review provides a framework to move from intuitive and heuristic approaches to define genomic similarities to more rigorous methods that can take advantage of powerful statistical modeling and existing software. A companion paper reviews novel approaches to creating kernels that might be useful for genomic analyses, providing insights with examples [1].
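The positive semidefinite requirement mentioned in this abstract can be checked numerically. The sketch below is only an illustration under stated assumptions: genotypes are coded as minor-allele counts (0/1/2), the kernel is a centered linear kernel, and the helper names `linear_kernel` and `is_positive_semidefinite` are hypothetical rather than taken from the review.

```python
import numpy as np

def linear_kernel(genotypes):
    """Centered linear kernel K = G G^T for a subjects-by-markers matrix."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)  # center each marker column
    return g @ g.T

def is_positive_semidefinite(k, tol=1e-8):
    """True if k is symmetric and its smallest eigenvalue is >= -tol."""
    k = np.asarray(k, dtype=float)
    if not np.allclose(k, k.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh(k).min() >= -tol)
```

Any matrix of the form G Gᵀ passes this check by construction, whereas an ad hoc similarity table (for example, one with zeros on the diagonal and ones off the diagonal) can fail it, which is one way to see why heuristic similarity scores may not be usable directly in kernel-based models.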
Affiliation(s)
- Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minn., USA
16
Mukhopadhyay I, Feingold E, Weeks DE, Thalamuthu A. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet Epidemiol 2010; 34:213-21. [PMID: 19697357 DOI: 10.1002/gepi.20451] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Indexed: 01/16/2023]
Abstract
In a genetic association study, it is often desirable to perform an overall test of whether any or all single-nucleotide polymorphisms (SNPs) in a gene are associated with a phenotype. Several such tests exist, but most of them are powerful only under very specific assumptions about the genetic effects of the individual SNPs. In addition, some of the existing tests assume that the direction of the effect of each SNP is known, which is a highly unlikely scenario. Here, we propose a new kernel-based association test of joint association of several SNPs. Our test is non-parametric and robust, and does not make any assumption about the directions of individual SNP effects. It can be used to test multiple correlated SNPs within a gene and can also be used to test independent SNPs or genes in a biological pathway. Our test uses an analysis of variance paradigm to compare variation between cases and controls to the variation within the groups. The variation is measured using kernel functions for each marker, and then a composite statistic is constructed to combine the markers into a single test. We present simulation results comparing our statistic to the U-statistic-based method by Schaid et al. ([2005] Am. J. Hum. Genet. 76:780-793) and another statistic by Wessel and Schork ([2006] Am. J. Hum. Genet. 79:792-806). We consider a variety of different disease models and assumptions about how many SNPs within the gene are actually associated with disease. Our results indicate that our statistic has higher power than other statistics under most realistic conditions.
17
Schulz A, Fischer C, Chang-Claude J, Beckmann L. Entropy-supported marker selection and Mantel statistics for haplotype sharing analysis. Genet Epidemiol 2010; 34:354-63. [DOI: 10.1002/gepi.20491] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
18
Thomas A. Assessment of SNP streak statistics using gene drop simulation with linkage disequilibrium. Genet Epidemiol 2010; 34:119-24. [PMID: 19582786 PMCID: PMC2811755 DOI: 10.1002/gepi.20440] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
Abstract
We describe methods and programs for simulating the genotypes of individuals in a pedigree at large numbers of linked loci when the alleles of the founders are under linkage disequilibrium. Both simulation and estimation of linkage disequilibrium models are shown to be feasible on a genome wide scale. The methods are applied to evaluate the statistical significance of streaks of loci at which sets of related individuals share a common allele. The effects of properly allowing for linkage disequilibrium are shown to be important as they explain many of the large observations. This is illustrated by reanalysis of a previously reported linkage of prostate cancer to chromosome 1p23.
Affiliation(s)
- Alun Thomas
- Department of Biomedical Informatics, University of Utah, USA.
|
19
|
Ziegler A, Ewhida A, Brendel M, Kleensang A. More powerful haplotype sharing by accounting for the mode of inheritance. Genet Epidemiol 2009; 33:228-36. [PMID: 18839399 DOI: 10.1002/gepi.20373] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 02/04/2023]
Abstract
The concept of haplotype sharing (HS) has received considerable attention recently, and several haplotype association methods have been proposed. Here, we extend the work of Beckmann and colleagues [2005 Hum. Hered. 59:67-78] who derived an HS statistic (BHS) as special case of Mantel's space-time clustering approach. The Mantel-type HS statistic correlates genetic similarity with phenotypic similarity across pairs of individuals. While phenotypic similarity is measured as the mean-corrected cross product of phenotypes, we propose to incorporate information of the underlying genetic model in the measurement of the genetic similarity. Specifically, for the recessive and dominant modes of inheritance we suggest the use of the minimum and maximum of shared length of haplotypes around a marker locus for pairs of individuals. If the underlying genetic model is unknown, we propose a model-free HS Mantel statistic using the max-test approach. We compare our novel HS statistics to BHS using simulated case-control data and illustrate its use by re-analyzing data from a candidate region of chromosome 18q from the Rheumatoid Arthritis (RA) Consortium. We demonstrate that our approach is point-wise valid and superior to BHS. In the re-analysis of the RA data, we identified three regions with point-wise P-values<0.005 containing six known genes (PMIP1, MC4R, PIGN, KIAA1468, TNFRSF11A and ZCCHC2) which might be worth follow-up.
Affiliation(s)
- Andreas Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany.
|
20
|
Tzeng JY, Zhang D, Chang SM, Thomas DC, Davidian M. Gene-trait similarity regression for multimarker-based association analysis. Biometrics 2009; 65:822-32. [PMID: 19210740 DOI: 10.1111/j.1541-0420.2008.01176.x] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 12/14/2022]
Abstract
We propose a similarity-based regression method to detect associations between traits and multimarker genotypes. The model regresses similarity in traits for pairs of "unrelated" individuals on their haplotype similarities, and detects the significance by a score test for which the limiting distribution is derived. The proposed method allows for covariates, uses phase-independent similarity measures to bypass the needs to impute phase information, and is applicable to traits of general types (e.g., quantitative and qualitative traits). We also show that the gene-trait similarity regression is closely connected with random effects haplotype analysis, although commonly they are considered as separate modeling tools. This connection unites the classic haplotype sharing methods with the variance-component approaches, which enables direct derivation of analytical properties of the sharing statistics even when the similarity regression model becomes analytically challenging.
Affiliation(s)
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
|
21
|
Marquard V, Beckmann L, Heid IM, Lamina C, Chang-Claude J. Impact of genotyping errors on the type I error rate and the power of haplotype-based association methods. BMC Genet 2009; 10:3. [PMID: 19178712 PMCID: PMC2648998 DOI: 10.1186/1471-2156-10-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 09/30/2008] [Accepted: 01/29/2009] [Indexed: 11/14/2022]
Abstract
Background: We investigated the influence of genotyping errors on the type I error rate and empirical power of two haplotype-based association methods applied to candidate regions. We compared the performance of the Mantel Statistic Using Haplotype Sharing and the haplotype frequency-based score test with that of the Armitage trend test. Our study is based on 1000 replications of simulated case-control data settings with 500 cases and 500 controls. One of the examined markers was set to be the disease locus with a simulated odds ratio of 3. Differential and non-differential genotyping errors were introduced following a misclassification model with varying mean error rates per locus in the range of 0.2% to 15.6%. Results: We found that the type I error rates of all three test statistics held the nominal significance level in the presence of non-differential genotyping errors and low error rates. For high and differential error rates, the type I error rates of all three test statistics were inflated, even when genetic markers not in Hardy-Weinberg equilibrium were removed. The empirical power of all three association test statistics remained high at around 89% to 94% when genotyping error rates were low, but decreased to 48% to 80% for high and non-differential genotyping error rates. Conclusion: Currently realistic genotyping error rates for candidate gene analysis (mean error rate per locus of 0.2%) pose no significant problem for the type I error rate or the power of any of the three investigated test statistics.
Affiliation(s)
- Vivien Marquard
- Department of Cancer Epidemiology, German Cancer Research Center, Heidelberg, Germany.
|
22
|
Sauter W, Rosenberger A, Beckmann L, Kropp S, Mittelstrass K, Timofeeva M, Wölke G, Steinwachs A, Scheiner D, Meese E, Sybrecht G, Kronenberg F, Dienemann H, Chang-Claude J, Illig T, Wichmann HE, Bickeböller H, Risch A. Matrix metalloproteinase 1 (MMP1) is associated with early-onset lung cancer. Cancer Epidemiol Biomarkers Prev 2008; 17:1127-35. [PMID: 18483334 DOI: 10.1158/1055-9965.epi-07-2840] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Indexed: 12/21/2022]
Abstract
Matrix metalloproteinases (MMPs) play a key role in the breakdown of extracellular matrix and in inflammatory processes. MMP1 is the most highly expressed interstitial collagenase degrading fibrillar collagens. Overexpression of MMP1 has been shown in tumor tissues and has been suggested to be associated with tumor invasion and metastasis. Nine haplotype-tagging and two additional intronic single nucleotide polymorphisms (SNPs) of MMP1 were genotyped in a case-control sample consisting of 635 lung cancer cases with onset of disease below 51 years of age and 1,300 age- and sex-matched cancer-free controls. Two regions of linkage disequilibrium (LD) of MMP1 could be observed: a region of low LD comprising the 5' region including the promoter and a region of high LD starting from exon 1 to the end of the gene and including the 3' flanking region. Several SNPs were identified to be individually significantly associated with risk of early-onset lung cancer. The most significant effects were seen for rs1938901 (P = 0.0089), rs193008 (P = 0.0108), and rs996999 (P = 0.0459). For rs996999, significance vanished after correction for multiple testing. For each of these SNPs, the major allele was associated with an increase in risk with an odds ratio between 1.2 and 1.3 (95% confidence interval, 1.0-1.5). The haplotype analysis supported these findings, especially for subgroups with high smoking intensity. In summary, we identified MMP1 to be associated with an increased risk for lung cancer, which was modified by smoking.
Affiliation(s)
- Wiebke Sauter
- Institute of Epidemiology, GSF-National Research Center for Environment and Health, D-85764 Neuherberg, Germany.
|
23
|
Estimation of pairwise identity by descent from dense genetic marker data in a population sample of haplotypes. Genetics 2008; 178:2123-32. [PMID: 18430938 DOI: 10.1534/genetics.107.084624] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Indexed: 12/20/2022]
Abstract
I present a new approach for calculating probabilities of identity by descent for pairs of haplotypes. The approach is based on a joint hidden Markov model for haplotype frequencies and identity by descent (IBD). This model allows for linkage disequilibrium, and the method can be applied to very dense marker data. The method has high power for detecting IBD tracts of genetic length of 1 cM, with the use of sufficiently dense markers. This enables detection of pairwise IBD between haplotypes from individuals whose most recent common ancestor lived up to 50 generations ago.
|
24
|
Liu Q, Yang J, Chen Z, Yang MQ, Sung AH, Huang X. Supervised learning-based tagSNP selection for genome-wide disease classifications. BMC Genomics 2008; 9 Suppl 1:S6. [PMID: 18366619 PMCID: PMC2386071 DOI: 10.1186/1471-2164-9-s1-s6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022]
Abstract
Background: Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predicting power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers. Results: We have developed a feature selection method named Supervised Recursive Feature Addition (SRFA). This method combines supervised learning and statistical measures for the chosen candidate features/SNPs to reconcile the redundancy information and, in doing so, improve the classification performance in association studies. Additionally, we have proposed a Support Vector based Recursive Feature Addition (SVRFA) scheme in SNP-disease association analysis. Conclusions: We have proposed using SRFA with different statistical learning classifiers and SVRFA for both SNP selection and disease classification and then applying them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method of Support Vector Machine Recursive Feature Elimination and logic regression-based SNP selection for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.
Affiliation(s)
- Qingzhong Liu
- Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA.
|
25
|
Tiwari HK, Barnholtz-Sloan J, Wineinger N, Padilla MA, Vaughan LK, Allison DB. Review and evaluation of methods correcting for population stratification with a focus on underlying statistical principles. Hum Hered 2008; 66:67-86. [PMID: 18382087 PMCID: PMC2803696 DOI: 10.1159/000119107] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 01/06/2023]
Abstract
When two or more populations have been separated by geographic or cultural boundaries for many generations, drift, spontaneous mutations, differential selection pressures and other factors may lead to allele frequency differences among populations. If these 'parental' populations subsequently come together and begin inter-mating, disequilibrium among linked markers may span a greater genetic distance than it typically does among populations under panmixia [see glossary]. This extended disequilibrium can make association studies highly effective and more economical than disequilibrium mapping in panmictic populations since fewer marker loci are needed to detect regions of the genome that harbor phenotype-influencing loci. However, under some circumstances, this process of intermating (as well as other processes) can produce disequilibrium between pairs of unlinked loci and thus create the possibility of confounding or spurious associations due to this population stratification. Accordingly, researchers are advised to employ valid statistical tests for linkage disequilibrium mapping that allow genetic association studies to control for such confounding. Many recent papers have addressed this need. We provide a comprehensive review of advances made in recent years in correcting for population stratification and then evaluate and synthesize these methods based on statistical principles such as (1) randomization, (2) conditioning on sufficient statistics, and (3) identifying whether the method is based on testing the genotype-phenotype covariance (conditional upon familial information) and/or testing departures of the marginal distribution from the expected genotypic frequencies.
Affiliation(s)
- Hemant K Tiwari
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
|
26
|
Thomas A, Camp NJ, Farnham JM, Allen-Brady K, Cannon-Albright LA. Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays. Ann Hum Genet 2008; 72:279-87. [PMID: 18093282 PMCID: PMC2964273 DOI: 10.1111/j.1469-1809.2007.00406.x] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Indexed: 11/30/2022]
Abstract
We examine the utility of high density genotype assays for predisposition gene localization using extended pedigrees. Results for the distribution of the number and length of genomic segments shared identical by descent among relatives previously derived in the context of genomic mismatch scanning are reviewed in the context of dense single nucleotide polymorphism maps. We use long runs of loci at which cases share a common allele identically by state to localize hypothesized predisposition genes. The distribution of such runs under the hypothesis of no genetic effect is evaluated by simulation. Methods are illustrated by analysis of an extended prostate cancer pedigree previously reported to show significant linkage to chromosome 1p23. Our analysis establishes that runs of simple single locus statistics can be powerful, tractable and robust for finding DNA shared between relatives, and that extended pedigrees offer powerful designs for gene detection based on these statistics.
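The run statistic described in this abstract can be illustrated with a minimal sketch. It is not the authors' program. It assumes biallelic SNPs coded as counts (0, 1, 2) of one allele, no missing genotypes, and the reading that a set of individuals shares a common allele identically by state at a locus unless both homozygous genotypes occur among them. The function names `shares_allele` and `longest_sharing_run` are hypothetical.

```python
from typing import List, Tuple


def shares_allele(genotypes_at_locus: List[int]) -> bool:
    """True if all individuals carry at least one copy of a common allele.

    Assumption: genotypes are 0/1/2 allele counts at a biallelic SNP, so
    sharing fails only when both homozygotes (0 and 2) are present.
    """
    return not (0 in genotypes_at_locus and 2 in genotypes_at_locus)


def longest_sharing_run(genotypes: List[List[int]]) -> Tuple[int, int]:
    """Return (start_locus, length) of the longest run of consecutive
    loci at which every individual shares a common allele.

    genotypes[i][j] is the allele count of individual i at locus j.
    """
    n_loci = len(genotypes[0])
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for j in range(n_loci):
        column = [individual[j] for individual in genotypes]
        if shares_allele(column):
            if run_len == 0:
                run_start = j
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # sharing is broken at this locus
    return best_start, best_len


# Hypothetical genotypes for three cases at six loci.
cases = [
    [0, 1, 1, 2, 0, 1],
    [2, 1, 0, 1, 0, 2],
    [1, 0, 1, 1, 1, 0],
]
start, length = longest_sharing_run(cases)
```

The abstract states that the null distribution of such runs is evaluated by simulation; that step is not shown here.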
Affiliation(s)
- A Thomas
- Department of Biomedical Informatics, University of Utah, 391 Chipeta Way, Salt Lake City, UT 84108, USA.
|
27
|
Liang Y, Kelemen A. Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. Statistics Surveys 2008. [DOI: 10.1214/07-ss026] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]
|
28
|
Abstract
Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.
Affiliation(s)
- Nianjun Liu
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
29
|
Marquard V, Beckmann L, Bermejo JL, Fischer C, Chang-Claude J. Comparison of measures for haplotype similarity. BMC Proc 2007; 1 Suppl 1:S128. [PMID: 18466470 PMCID: PMC2367614 DOI: 10.1186/1753-6561-1-s1-s128] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
Measuring the association of haplotype similarities with phenotype similarities has been used to develop statistical tests of genetic association. Previously, we applied the general approach of Mantel statistics to correlate genetic and phenotype similarity, where genetic similarity was defined by the number of intervals flanked by markers identical by state for pairs of haplotypes. Here we investigated in the case-control study design the effect on power of the Mantel statistics for five different measures of genetic similarity based on haplotypes: 1) the number of shared intervals, 2) the physical length of the shared intervals, 3) the genetic length of the shared intervals in centimorgans, 4) the genetic length of the shared intervals in linkage disequilibrium units (LDU) and 5) Yu's measure that attaches more weight to the sharing of rare than common alleles. With prior knowledge of the answers of Genetic Analysis Workshop 15 Problem 3, we analyzed the simulated data sets in two genomic regions surrounding the disease loci on chromosomes 6 and 18. For the dense map on chromosome 6, all methods showed a very high power of comparable magnitude. For chromosome 18, we observed a power between 19% and 99% at the pointwise 5% significance level using 1000 cases and 1000 controls for all methods except Yu's measure. While it yielded a much lower power, Yu's measure had 80% power around the disease locus.
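The first two similarity measures listed in this abstract (number of shared intervals and physical length of the shared stretch) can be sketched as follows. This is one plausible reading of the definitions, not the authors' code: it assumes the shared stretch is the maximal run of contiguous markers, containing a focal marker, at which two haplotypes carry identical alleles, and that similarity is zero when the focal marker itself differs. The function names and the example haplotypes and positions are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def shared_stretch(h1: Sequence[int], h2: Sequence[int],
                   focal: int) -> Optional[Tuple[int, int]]:
    """Return (left, right) marker indices of the maximal identical-by-state
    stretch around the focal marker, or None if the focal marker differs."""
    if h1[focal] != h2[focal]:
        return None
    left = focal
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = focal
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return left, right


def shared_intervals(h1: Sequence[int], h2: Sequence[int], focal: int) -> int:
    """Number of marker intervals flanked by matching markers."""
    stretch = shared_stretch(h1, h2, focal)
    return 0 if stretch is None else stretch[1] - stretch[0]


def shared_length(h1: Sequence[int], h2: Sequence[int], focal: int,
                  positions: Sequence[float]) -> float:
    """Length of the shared stretch in the units of `positions`
    (base pairs, centimorgans, or LD units, depending on the map used)."""
    stretch = shared_stretch(h1, h2, focal)
    return 0.0 if stretch is None else positions[stretch[1]] - positions[stretch[0]]


# Hypothetical haplotypes at six markers with physical positions.
hap_a = [0, 1, 1, 0, 1, 0]
hap_b = [1, 1, 1, 0, 0, 0]
positions = [100, 250, 400, 900, 1500, 1600]
```

Swapping the `positions` vector for a genetic or LD-unit map gives the third and fourth measures under the same assumptions; Yu's allele-frequency weighting is not covered by this sketch.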
Affiliation(s)
- Vivien Marquard
- Cancer Epidemiology, German Cancer Research Center DKFZ, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Lars Beckmann
- Cancer Epidemiology, German Cancer Research Center DKFZ, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Justo L Bermejo
- Molecular Genetic Epidemiology, German Cancer Research Center DKFZ, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Christine Fischer
- Institute of Human Genetics, University of Heidelberg, Im Neuenheimer Feld 366, 69120 Heidelberg, Germany
- Jenny Chang-Claude
- Cancer Epidemiology, German Cancer Research Center DKFZ, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
|
30
|
Dempfle A, Hein R, Beckmann L, Scherag A, Nguyen TT, Schäfer H, Chang-Claude J. Comparison of the power of haplotype-based versus single- and multilocus association methods for gene x environment (gene x sex) interactions and application to gene x smoking and gene x sex interactions in rheumatoid arthritis. BMC Proc 2007; 1 Suppl 1:S73. [PMID: 18466575 PMCID: PMC2367597 DOI: 10.1186/1753-6561-1-s1-s73] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Accounting for interactions with environmental factors in association studies may improve the power to detect genetic effects and may help identify important environmental effect modifiers. The power of unphased genotype- versus haplotype-based methods in regions with high linkage disequilibrium (LD), as measured by D', for analyzing gene x environment (gene x sex) interactions was compared using the Genetic Analysis Workshop 15 (GAW15) simulated data on rheumatoid arthritis with prior knowledge of the answers. Stepwise and regular conditional logistic regression (CLR) were performed using a matched case-control sample for an HLA region interacting with sex. Haplotype-based analyses were performed using a haplotype-sharing-based Mantel statistic and a test for haplotype-trait association in a general linear model framework. A step-down minP algorithm was applied to derive adjusted p-values and to allow for power comparisons. These methods were also applied to the GAW15 real data set for PTPN22. For markers in strong LD, stepwise CLR performed poorly because of the correlation/collinearity between the predictors in the model. The power was high for detecting genetic main effects using simple CLR models and haplotype-based methods and for detecting joint effects using CLR and Mantel statistics. Only the haplotype-trait association test had high power to detect the gene x sex interaction. In the PTPN22 region with markers characterized by strong LD, all methods indicated a significant genotype x sex interaction in a sample of about 1000 subjects. The previously reported R620W single-nucleotide polymorphism was identified using logistic regression, but the haplotype-based methods did not provide any precise location information.
Affiliation(s)
- Astrid Dempfle
- Institute of Medical Biometry and Epidemiology, Philipps-University Marburg, 35037 Marburg, Germany
- Rebecca Hein
- Division of Cancer Epidemiology, German Cancer Research Center DKFZ, 69120 Heidelberg, Germany
- Lars Beckmann
- Division of Cancer Epidemiology, German Cancer Research Center DKFZ, 69120 Heidelberg, Germany
- André Scherag
- Institute of Medical Biometry and Epidemiology, Philipps-University Marburg, 35037 Marburg, Germany; Institute of Medical Informatics, Biometry and Epidemiology, University of Duisburg-Essen, 45122 Essen, Germany
- Thuy Trang Nguyen
- Institute of Medical Biometry and Epidemiology, Philipps-University Marburg, 35037 Marburg, Germany
- Helmut Schäfer
- Institute of Medical Biometry and Epidemiology, Philipps-University Marburg, 35037 Marburg, Germany
- Jenny Chang-Claude
- Division of Cancer Epidemiology, German Cancer Research Center DKFZ, 69120 Heidelberg, Germany
|
31
|
Gasbarra D, Pirinen M, Sillanpää MJ, Arjas E. Estimating genealogies from linked marker data: a Bayesian approach. BMC Bioinformatics 2007; 8:411. [PMID: 17961219 PMCID: PMC2233650 DOI: 10.1186/1471-2105-8-411] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 03/20/2007] [Accepted: 10/25/2007] [Indexed: 02/02/2023]
Abstract
Background: Answers to several fundamental questions in statistical genetics would ideally require knowledge of the ancestral pedigree and of the gene flow therein. A few examples of such questions are haplotype estimation, relatedness and relationship estimation, gene mapping by combining pedigree and linkage disequilibrium information, and estimation of population structure. Results: We present a probabilistic method for genealogy reconstruction. Starting with a group of genotyped individuals from some population isolate, we explore the state space of their possible ancestral histories under our Bayesian model by using Markov chain Monte Carlo (MCMC) sampling techniques. The main contribution of our work is the development of sampling algorithms in the resulting vast state space with highly dependent variables. The main drawback is the computational complexity that limits the time horizon within which explicit reconstructions can be carried out in practice. Conclusion: The estimates for IBD (identity-by-descent) and haplotype distributions are tested in several settings using simulated data. The results appear to be promising for a further development of the method.
Affiliation(s)
- Dario Gasbarra
- Department of Mathematics and Statistics, University of Helsinki, Finland.
|
32
|
Abstract
Multi-locus association analyses, including haplotype-based analyses, can sometimes provide greater power than single-locus analyses for detecting disease susceptibility loci. This potential gain, however, can be compromised by the large number of degrees of freedom caused by irrelevant markers. Exhaustive search for the optimal set of markers might be possible for a small number of markers, yet it is computationally inefficient. In this paper, we present a sequential haplotype scan method to search for combinations of adjacent markers that are jointly associated with disease status. When evaluating each marker, we add markers close to it in a sequential manner: a marker is added if its contribution to the haplotype association with disease is warranted, conditional on current haplotypes. This conditional evaluation is based on the well-known Mantel-Haenszel statistic. We propose two permutation-based methods to evaluate the growing haplotypes: a haplotype method for the combined markers, and a summary method that sums conditional statistics. We compared our proposed methods, the single-locus method, and a sliding window method using simulated data. We also applied our sequential haplotype scan algorithm to experimental data for CYP2D6. The results indicate that the sequential scan procedure can identify a set of adjacent markers whose haplotypes might have strong genetic effects or be in linkage disequilibrium with disease-predisposing variants. As a result, our methods can achieve greater power than the single-locus method, yet are much more computationally efficient than sliding window methods.
Affiliation(s)
- Zhaoxia Yu
- Division of Biostatistics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota 55905, USA
|
33
|
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007; 81:559-75. [PMID: 17701901 PMCID: PMC1950838 DOI: 10.1086/519795] [Citation(s) in RCA: 22464] [Impact Index Per Article: 1321.4] [Received: 02/06/2007] [Accepted: 05/02/2007] [Indexed: 12/30/2022]
Abstract
Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.
Affiliation(s)
- Shaun Purcell
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA.
|
34
|
Abstract
BACKGROUND: Haplotype sharing statistics have been introduced in an ad-hoc way, often relying heavily on permutation testing. As a result, applying these approaches to whole genome association studies or evaluating their properties in extensive simulation experiments is problematic. Further, permutation testing may be inappropriate in the presence of phase ambiguity and population stratification. AIMS: To present a simple framework for a class of haplotype sharing statistics useful for association mapping in case-parent trio data. This framework allows derivation of novel haplotype sharing tests as well as simple variance estimators and asymptotic distributions for haplotype sharing tests. RESULTS AND CONCLUSIONS: We validated that our approach is appropriately sized using simulated data, and illustrate the methodology by analyzing a Crohn's disease dataset. We find that haplotype-based analyses are much more powerful than single-locus analyses for these data.
Affiliation(s)
- Andrew S Allen
- Department of Bioinformatics and Biostatistics, Duke University, North Carolina, USA
|
35
|
|
36
|
Abstract
Although genetic association studies have been with us for many years, even for the simplest analyses there is little consensus on the most appropriate statistical procedures. Here I give an overview of statistical approaches to population association studies, including preliminary analyses (Hardy-Weinberg equilibrium testing, inference of phase and missing data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to outline the key methods with a brief discussion of problems (population structure and multiple testing), avenues for solutions and some ongoing developments.
Affiliation(s)
- David J Balding
- Department of Epidemiology and Public Health, Imperial College, St Marys Campus, Norfolk Place, London W2 1PG, UK.
|
37
|
Beckmann L, Fischer C, Obreiter M, Rabes M, Chang-Claude J. Haplotype-sharing analysis using Mantel statistics for combined genetic effects. BMC Genet 2005; 6 Suppl 1:S70. [PMID: 16451684 PMCID: PMC1866711 DOI: 10.1186/1471-2156-6-s1-s70] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
We applied a new approach based on Mantel statistics to analyze the Genetic Analysis Workshop 14 simulated data with prior knowledge of the answers. The method was developed in order to improve the power of a haplotype sharing analysis for gene mapping in complex disease. The new statistic correlates genetic similarity and phenotypic similarity across pairs of haplotypes from case-control studies. The genetic similarity is measured as the shared length between haplotype pairs around a genetic marker. The phenotypic similarity is measured as the mean-corrected cross-product based on the respective phenotypes. Cases with phenotype P1 and unrelated controls were drawn from the population of Danacaa. Power to detect main effects was compared to the χ² test for association based on 3-marker haplotypes and a global permutation test for haplotype association to test for main effects. Power to detect gene × gene interaction was compared to unconditional logistic regression. The results suggest that the Mantel statistics might be more powerful than alternative tests.
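The statistic described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that haplotypes are equal-length sequences of allele codes, that shared length is counted in contiguous identical markers around the focal marker (rather than in physical or genetic distance), and that each haplotype carries the case/control status of its source individual. The function names `shared_length` and `mantel_statistic` are hypothetical.

```python
def shared_length(hap_a, hap_b, marker):
    """Count contiguous markers around `marker` where two haplotypes match.

    Assumption: length is measured in number of markers, not map distance.
    Returns 0 if the haplotypes differ at the focal marker itself.
    """
    if hap_a[marker] != hap_b[marker]:
        return 0
    left = marker
    while left > 0 and hap_a[left - 1] == hap_b[left - 1]:
        left -= 1
    right = marker
    while right < len(hap_a) - 1 and hap_a[right + 1] == hap_b[right + 1]:
        right += 1
    return right - left + 1


def mantel_statistic(haplotypes, phenotypes, marker):
    """Sum over haplotype pairs of genetic similarity times phenotypic similarity.

    Genetic similarity: shared length around `marker`.
    Phenotypic similarity: mean-corrected cross-product of the two phenotypes.
    """
    mean = sum(phenotypes) / len(phenotypes)
    centered = [p - mean for p in phenotypes]
    total = 0.0
    for i in range(len(haplotypes)):
        for j in range(i + 1, len(haplotypes)):
            genetic = shared_length(haplotypes[i], haplotypes[j], marker)
            total += genetic * centered[i] * centered[j]
    return total


# Toy data: three haplotypes over four markers; 1 = case, 0 = control.
toy_haplotypes = [[0, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 0]]
toy_phenotypes = [1, 1, 0]
toy_value = mantel_statistic(toy_haplotypes, toy_phenotypes, marker=2)
```

A large positive value indicates that haplotypes with similar phenotypes tend to share longer segments around the marker. In practice, significance would typically be assessed by comparing the observed value with values obtained after permuting phenotype labels; the abstract does not specify that step, so it is omitted here.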
Affiliation(s)
- Lars Beckmann
- German Cancer Research Center DKFZ, Heidelberg, Germany
- Christine Fischer
- Institute of Human Genetics, University of Heidelberg, Heidelberg, Germany
- Michael Rabes
- German Cancer Research Center DKFZ, Heidelberg, Germany
|
38
|
Beckmann L, Ziegler A, Duggal P, Bailey-Wilson JE. Haplotypes and haplotype-tagging single-nucleotide polymorphism: Presentation Group 8 of Genetic Analysis Workshop 14. Genet Epidemiol 2005; 29 Suppl 1:S59-71. [PMID: 16342175 DOI: 10.1002/gepi.20111] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Moderately dense maps of single-nucleotide polymorphism (SNP) markers across the human genome for both the simulated data set and data from the Collaborative Study of the Genetics of Alcoholism were available at Genetic Analysis Workshop 14 for the first time. This allowed examination of various novel and existing methods for haplotype analyses. Three contributors applied Mantel statistics in different ways for both linkage and association analysis by using the shared length between two haplotypes at a marker locus as a measure of genetic similarity. The results indicate that haplotype-sharing based on Mantel statistics can be a powerful approach and needs further methodological evaluation. Four contributors investigated haplotype-tagging SNP (htSNP) selection procedures, two contributors examined the use of multilocus haplotypes compared to single loci in association tests, and two contributors compared the accuracy of various methods for reconstructing haplotypes and estimating haplotype frequencies for both pedigree data and data from unrelated individuals. For all three different tasks, software packages and procedures gave similar results in regions of high linkage disequilibrium (LD). However, they were not as consistent in regions of moderate to low LD. One coalescence-based approach for estimating haplotype frequencies, coupled with a Markov chain Monte Carlo technique, outperformed the other haplotype frequency estimation methods in regions of low LD. In conclusion, regardless of the task, results were similar in chromosomal regions of high LD. However, based on the differing results observed here, methodological improvements are required for chromosomal regions of low to moderate LD.
Affiliation(s)
- Lars Beckmann
- German Cancer Research Center (Deutsches Krebsforschungszentrum) DKFZ, Heidelberg, Germany
|