1
Yang CJ, Ladejobi O, Mott R, Powell W, Mackay I. Analysis of historical selection in winter wheat. Theor Appl Genet 2022; 135:3005-3023. [PMID: 35864201 PMCID: PMC9482581 DOI: 10.1007/s00122-022-04163-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2022] [Accepted: 06/22/2022] [Indexed: 06/15/2023]
Abstract
KEY MESSAGE Modeling of the distribution of allele frequency over year of variety release identifies major loci involved in historical breeding of winter wheat. Winter wheat is a major crop with a rich selection history in the modern era of crop breeding. Genetic gains across economically important traits like yield have been well characterized and are the major force driving its production. Winter wheat is also an excellent model for analyzing historical genetic selection. As a proof of concept, we analyze two major collections of winter wheat varieties that were bred in Western Europe from 1916 to 2010, namely the Triticeae Genome (TG) and WAGTAIL panels, which include 333 and 403 varieties, respectively. We develop and apply a selection mapping approach, Regression of Alleles on Years (RALLY), in these panels, as well as in simulated populations. RALLY maps loci under sustained historical selection by using a simple logistic model to regress allele counts on years of variety release. To control for drift-induced allele frequency change, we develop a hybrid approach of genomic control and delta control. Within the TG panel, we identify 22 significant RALLY quantitative selection loci (QSLs) and estimate the local heritabilities for 12 traits across these QSLs. By correlating predicted marker effects with RALLY regression estimates, we show that alleles whose frequencies have increased over time are heavily biased toward conferring a positive effect on yield, but negative effects on flowering time, lodging, plant height and grain protein content. Altogether, our results (1) demonstrate the use of RALLY to identify selected genomic regions while controlling for drift, and (2) reveal key patterns in the historical selection in winter wheat and guide its future breeding.
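The core regression step described in this abstract (a logistic model of allele state on year of variety release) can be illustrated with a minimal sketch. This is not the authors' RALLY implementation: it fits a plain two-parameter logistic regression by Newton-Raphson for a single marker, omits the genomic control and delta control steps, and uses invented toy data. The function name `fit_logistic` and the 0/1 allele coding are assumptions made for illustration.

```python
import math

def fit_logistic(years, alleles, max_iter=50, tol=1e-10):
    """Fit logit P(allele = 1) = b0 + b1 * (year - mean_year) by Newton-Raphson.

    Returns (b0, b1); b1 is the change in log-odds of carrying the allele per year.
    """
    mean_year = sum(years) / len(years)
    x = [y - mean_year for y in years]  # centre years for numerical stability
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(a - pi for a, pi in zip(alleles, p))
        g1 = sum((a - pi) * xi for a, pi, xi in zip(alleles, p, x))
        # Information matrix (negative Hessian), a symmetric 2 x 2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        # Solve the 2 x 2 system H * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Toy data (invented): one marker scored 0/1 in ten hypothetical varieties.
years = [1920, 1920, 1940, 1940, 1960, 1960, 1980, 1980, 2000, 2000]
alleles = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
b0, b1 = fit_logistic(years, alleles)
print(f"slope per year (log-odds): {b1:.4f}")
```

A positive slope indicates an allele whose frequency rose over release years. In a genome scan, a test statistic for the slope would be computed per marker and then adjusted for drift, which this sketch does not attempt.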
Affiliation(s)
- Chin Jian Yang
  - Scotland's Rural College (SRUC), Kings Buildings, West Mains Road, Edinburgh, EH9 3JG, UK
- Olufunmilayo Ladejobi
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
- Richard Mott
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
- Wayne Powell
  - Scotland's Rural College (SRUC), Kings Buildings, West Mains Road, Edinburgh, EH9 3JG, UK
- Ian Mackay
  - Scotland's Rural College (SRUC), Kings Buildings, West Mains Road, Edinburgh, EH9 3JG, UK
  - IMplant Consultancy Ltd, Chelmsford, UK
2
Chen Y, Liang KY, Tong P, Beaty TH, Barnes KC, Linda Kao WH. A pseudolikelihood approach for assessing genetic association in case-control studies with unmeasured population structure. Stat Methods Med Res 2020; 29:3153-3165. [PMID: 32393154 DOI: 10.1177/0962280220921212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The case-control study design is one of the main tools for detecting associations between genetic markers and diseases. It is well known that population substructure can lead to spurious association between disease status and a genetic marker if the prevalence of disease and the marker allele frequency vary across subpopulations. In this paper, we propose a novel statistical method to estimate the association in case-control studies with unmeasured population substructure. The proposed method takes two steps. First, the information on genomic markers and disease status is used to infer the population substructure; second, the association between the disease and the test marker, adjusted for the population substructure, is modeled and estimated parametrically through polytomous logistic regression. The performance of the proposed method relative to existing methods, in terms of bias, coverage probability and computational time, is assessed through simulations. The method is applied to an end-stage renal disease study in an African American population.
Affiliation(s)
- Yong Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, USA
- Pan Tong
  - Department of Bioinformatics & Computational Biology, University of Texas, Houston, USA
- Terri H Beaty
  - Department of Epidemiology, Johns Hopkins University, Baltimore, USA
- Kathleen C Barnes
  - University of Colorado Denver - Anschutz Medical Campus, Aurora, USA
- W H Linda Kao
  - Department of Epidemiology, Johns Hopkins University, Baltimore, USA
3
Tsepilov YA, Ried JS, Strauch K, Grallert H, van Duijn CM, Axenovich TI, Aulchenko YS. Development and application of genomic control methods for genome-wide association studies using non-additive models. PLoS One 2013; 8:e81431. [PMID: 24358113 PMCID: PMC3864791 DOI: 10.1371/journal.pone.0081431] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 07/30/2013] [Accepted: 10/12/2013] [Indexed: 11/18/2022]
Abstract
Genome-wide association studies (GWAS) comprise a powerful tool for mapping genes of complex traits. However, an inflation of the test statistic can occur because of population substructure or cryptic relatedness, which could cause spurious associations. If information on a large number of genetic markers is available, adjusting the analysis results by using the method of genomic control (GC) is possible. GC was originally proposed to correct the Cochran-Armitage additive trend test. For non-additive models, correction has been shown to depend on allele frequencies. Therefore, usage of GC is limited to situations where allele frequencies of null markers and candidate markers are matched. In this work, we extended the capabilities of the GC method for non-additive models, which allows us to use null markers with arbitrary allele frequencies for GC. Analytical expressions for the inflation of a test statistic describing its dependency on allele frequency and several population parameters were obtained for recessive, dominant, and over-dominant models of inheritance. We proposed a method to estimate these required population parameters. Furthermore, we suggested a GC method based on approximation of the correction coefficient by a polynomial of allele frequency and described procedures to correct the genotypic (two degrees of freedom) test for cases when the model of inheritance is unknown. Statistical properties of the described methods were investigated using simulated and real data. We demonstrated that all considered methods were effective in controlling type 1 error in the presence of genetic substructure. The proposed GC methods can be applied to statistical tests for GWAS with various models of inheritance. All methods developed and tested in this work were implemented using R language as a part of the GenABEL package.
Affiliation(s)
- Yakov A. Tsepilov
  - Institute of Cytology and Genetics SD RAS, Novosibirsk, Russia
  - Novosibirsk State University, Novosibirsk, Russia
- Janina S. Ried
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Konstantin Strauch
  - Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Medical Informatics, Biometry and Epidemiology, Chair of Genetic Epidemiology, Ludwig-Maximilians-Universität, Munich, Germany
- Harald Grallert
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Tatiana I. Axenovich
  - Institute of Cytology and Genetics SD RAS, Novosibirsk, Russia
  - Novosibirsk State University, Novosibirsk, Russia
- Yurii S. Aulchenko
  - Institute of Cytology and Genetics SD RAS, Novosibirsk, Russia
  - Novosibirsk State University, Novosibirsk, Russia
  - Centre for Population Health Sciences, University of Edinburgh, Edinburgh, United Kingdom
4
Edge MD, Gorroochurn P, Rosenberg NA. Windfalls and pitfalls: Applications of population genetics to the search for disease genes. Evol Med Public Health 2013; 2013:254-72. [PMID: 24481204 PMCID: PMC3868415 DOI: 10.1093/emph/eot021] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022]
Abstract
Association mapping can be viewed as an application of population genetics and evolutionary biology to the problem of identifying genes causally connected to phenotypes. However, some population-genetic principles important to the design and analysis of association studies have not been widely understood or have even been generally misunderstood. Some of these principles underlie techniques that can aid in the discovery of genetic variants that influence phenotypes (‘windfalls’), whereas others can interfere with study design or interpretation of results (‘pitfalls’). Here, considering examples involving genetic variant discovery, linkage disequilibrium, power to detect associations, population stratification and genotype imputation, we address misunderstandings in the application of population genetics to association studies, and we illuminate how some surprising results in association contexts can be easily explained when considered from evolutionary and population-genetic perspectives. Through our examples, we argue that population-genetic thinking—which takes a theoretical view of the evolutionary forces that guide the emergence and propagation of genetic variants—substantially informs the design and interpretation of genetic association studies. In particular, population-genetic thinking sheds light on genetic confounding, on the relationships between association signals of typed markers and causal variants, and on the advantages and disadvantages of particular strategies for measuring genetic variation in association studies.
Affiliation(s)
- Michael D Edge
  - Department of Biology, Stanford University, Stanford, CA 94305-5020, USA and Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
5
Correcting for differential genotyping error in genetic association analysis. J Hum Genet 2013; 58:657-66. [DOI: 10.1038/jhg.2013.74] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/24/2013] [Revised: 05/06/2013] [Accepted: 06/02/2013] [Indexed: 11/08/2022]
6
Deo AJ, Huang YY, Hodgkinson CA, Xin Y, Oquendo MA, Dwork AJ, Arango V, Brent DA, Goldman D, Mann JJ, Haghighi F. A large-scale candidate gene analysis of mood disorders: evidence of neurotrophic tyrosine kinase receptor and opioid receptor signaling dysfunction. Psychiatr Genet 2013; 23:47-55. [PMID: 23277131 PMCID: PMC3869619 DOI: 10.1097/ypg.0b013e32835d7028] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 12/25/2022]
Abstract
BACKGROUND Despite proven heritability, little is known about the genetic architecture of mood disorders. Although a number of family and case-control studies have examined the genetics of mood disorders, none have carried out joint linkage-association studies and sought to validate the results with gene expression analyses in an independent cohort. METHODS We present findings from a large candidate gene study that combines linkage and association analyses using families and singletons, providing a systematic candidate gene investigation of mood disorder. For this study, 876 individuals were recruited, including 83 families with 313 individuals and 563 singletons. This large-scale candidate gene analysis included 130 candidate genes implicated in addictive and other psychiatric disorders. These data showed significant genetic associations for 28 of these candidate genes, although none remained significant after correction for multiple testing. To evaluate the functional significance of these 28 candidate genes in mood disorders, we examined the transcriptional profiles of these genes within the dorsolateral prefrontal cortex and anterior cingulate for 21 cases with mood disorders and 25 nonpsychiatric controls, and carried out a pathway analysis to identify points of high connectivity suggestive of particular molecular pathways that may be dysregulated. RESULTS Two primary gene candidates were supported by the linkage-association, gene expression profiling, and network analysis: neurotrophic tyrosine kinase receptor, type 2 (NTRK2), and the opioid receptor, κ1 (OPRK1). CONCLUSION This study supports a role for NTRK2 and OPRK1 signaling in the pathophysiology of mood disorder. The unique approach incorporating evidence from multiple experimental and computational modalities enhances confidence in these findings.
Affiliation(s)
- Anthony J. Deo
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Yung-yu Huang
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Colin A. Hodgkinson
  - Section of Human Neurogenetics, Laboratory of Neurogenetics, National Institute on Alcohol Abuse and Alcoholism, Rockville, Maryland
- Yurong Xin
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Maria A. Oquendo
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Andrew J. Dwork
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Victoria Arango
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- David A. Brent
  - Department of Child and Adolescent Psychiatry, Western Psychiatric Institute and Clinic, Pittsburgh, Pennsylvania, USA
- David Goldman
  - Section of Human Neurogenetics, Laboratory of Neurogenetics, National Institute on Alcohol Abuse and Alcoholism, Rockville, Maryland
- J. John Mann
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
- Fatemeh Haghighi
  - Department of Psychiatry, Division of Molecular Imaging and Neuropathology, New York State Psychiatric Institute, Columbia University, New York, New York
7
Gorroochurn P, Hodge SE, Heiman GA, Greenberg DA. An improved delta-centralization method for population stratification. Hum Hered 2011; 71:180-5. [PMID: 21778737 DOI: 10.1159/000327728] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/10/2010] [Accepted: 03/21/2011] [Indexed: 11/19/2022]
Abstract
Dadd et al. [Hum Hered 2010;69:285-294] recently criticized our delta-centralization (DC) method of controlling for population stratification (PS) and concluded that DC does not work. To explore our method, the authors simulated data under the Balding-Nichols (BN) model, which is more general than the model we had used in our simulations. They determined that the DC method underestimated the PS parameter (δ) and inflated the type I error rates when applied to BN-simulated data, and from this they concluded that the DC method is invalid. However, we argue that this conclusion is premature. In this paper, we (1) show why δ is underestimated and type I error rates are inflated when BN-simulated data are used, and (2) present a simple adjustment to DC that works reasonably well for data from both kinds of simulations. We also show that the adjusted DC method has appropriate power under a range of scenarios.
Affiliation(s)
- Prakash Gorroochurn
  - Division of Statistical Genetics, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
8
Yang Y, Remmers EF, Ogunwole CB, Kastner DL, Gregersen PK, Li W. Effective sample size: Quick estimation of the effect of related samples in genetic case-control association analyses. Comput Biol Chem 2011; 35:40-9. [PMID: 21333602 DOI: 10.1016/j.compbiolchem.2010.12.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 11/23/2010] [Revised: 12/28/2010] [Accepted: 12/29/2010] [Indexed: 01/21/2023]
Abstract
Affected relatives are essential for pedigree linkage analysis; however, they violate the independent-sample assumption in case-control association studies. To avoid the correlation between samples, a common practice is to take only one affected sample per pedigree in association analysis. Although several methods exist for handling correlated samples, they are still not widely used, in part because they are not easily implemented or not widely known. We advocate the effective sample size method as a simple and accessible approach for case-control association analysis with correlated samples. This method modifies the chi-square test statistic, p-value, and 95% confidence interval of the odds-ratio by replacing the apparent number of allele or genotype counts with the effective ones in the standard formula, without the need for specialized computer programs. We present a simple formula for calculating effective sample size for many types of relative pairs and relative sets. For allele frequency estimation, the effective sample size method captures the variance inflation exactly. For genotype frequency, simulations showed that effective sample size provides a satisfactory approximation. A gene previously identified as a type 1 diabetes susceptibility locus, the interferon-induced helicase gene (IFIH1), is shown to be significantly associated with rheumatoid arthritis when the effective sample size method is applied. This significant association is not established if only one affected sib per pedigree is used in the association analysis. The relationships between the effective sample size method and other methods (the generalized estimation equation, variance of eigenvalues for correlation matrices, and genomic controls) are discussed.
Affiliation(s)
- Yaning Yang
  - Department of Statistics and Finance, University of Science and Technology of China, Anhui, Hefei, China
9
Sperrin M, Jaki T. Direct effects testing: A two-stage procedure to test for effect size and variable importance for correlated binary predictors and a binary response. Stat Med 2010; 29:2544-56. [DOI: 10.1002/sim.4014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
10
Li T, Li Z, Ying Z, Zhang H. Influence of population stratification on population-based marker-disease association analysis. Ann Hum Genet 2010; 74:351-60. [PMID: 20529080 DOI: 10.1111/j.1469-1809.2010.00588.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
Population-based genetic association analysis may suffer from the failure to control for confounders such as population stratification (PS). There has been extensive study on the influence of PS on candidate gene-disease association analysis, but much less attention has been paid to its influence on marker-disease association analysis. In this paper, we focus on the Pearson χ² test and the trend test for marker-disease association analysis. The mean and variance of the test statistics are derived in the presence of PS, so that the power and inflated type I error rate can be evaluated. It is shown that the bias and the variance distortion are not zero in the presence of both PS and penetrance heterogeneity (PH). Unlike candidate gene-disease association analysis, when PS is present, the bias is not zero whether or not PH is present. This work generalises the published results, where only the fully recessive penetrance model is considered and only the bias is calculated. It is shown that candidate gene-disease association analysis can be treated as a special case of marker-disease association analysis. Consequently, our results extend previous studies on candidate gene-disease association analysis. A simulation study confirms the theoretical findings.
Affiliation(s)
- Tengfei Li
  - Department of Mathematics, Fudan University, 220 Handan Road, Shanghai 200433, PR China
11
Gorroochurn P, Hodge SE, Heiman GA, Greenberg DA. Comments on 'Delta-centralization fails to control for population stratification in genetic association studies'. Hum Hered 2010; 69:295. [PMID: 20389098 DOI: 10.1159/000298766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
12
Gray-McGuire C, Bochud M, Goodloe R, Elston RC. Genetic association tests: a method for the joint analysis of family and case-control data. Hum Genomics 2010; 4:2-20. [PMID: 19951892 PMCID: PMC2874328 DOI: 10.1186/1479-7364-4-1-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
With the trend in molecular epidemiology towards both genome-wide association studies and complex modelling, the need for large sample sizes to detect small effects and to allow for the estimation of many parameters within a model continues to increase. Unfortunately, most methods of association analysis have been restricted to either a family-based or a case-control design, resulting in the lack of synthesis of data from multiple studies. Transmission disequilibrium-type methods for detecting linkage disequilibrium from family data were developed as an effective way of preventing the detection of association due to population stratification. Because these methods condition on parental genotype, however, they have precluded the joint analysis of family and case-control data, although methods for case-control data may not protect against population stratification and do not allow for familial correlations. We present here an extension of a family-based association analysis method for continuous traits that will simultaneously test for, and if necessary control for, population stratification. We further extend this method to analyse binary traits (and therefore family and case-control data together) and accurately to estimate genetic effects in the population, even when using an ascertained family sample. Finally, we present the power of this binary extension for both family-only and joint family and case-control data, and demonstrate the accuracy of the association parameter and variance components in an ascertained family sample.
Affiliation(s)
- Courtney Gray-McGuire
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
13
Yan T, Hou B, Yang Y. Correcting for cryptic relatedness by a regression-based genomic control method. BMC Genet 2009; 10:78. [PMID: 19954543 PMCID: PMC3087514 DOI: 10.1186/1471-2156-10-78] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/25/2009] [Accepted: 12/02/2009] [Indexed: 11/10/2022]
Abstract
Background The genomic control (GC) method is a useful tool to correct for cryptic relatedness in population-based association studies. It was originally proposed for correcting the variance inflation of the Cochran-Armitage additive trend test by using information from unlinked null markers, and was later generalized to other tests with the additional requirement that the null markers are matched with the candidate marker in allele frequencies. However, matching allele frequencies limits the number of available null markers and thus limits the applicability of the GC method. On the other hand, errors in genotype/allele frequencies may cause further bias and variance inflation and thereby aggravate the effect of GC correction. Results In this paper, we propose a regression-based GC method using null markers that are not necessarily matched in allele frequencies with the candidate marker. Variation of allele frequencies of the null markers is adjusted by a regression method. Conclusion The proposed method can be readily applied to Cochran-Armitage trend tests other than the additive trend test, the Pearson chi-square test and other robust efficiency tests. Simulation results show that the proposed method is effective in controlling type I error in the presence of population substructure.
Affiliation(s)
- Ting Yan
  - Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, PR China
14
Guan W, Liang L, Boehnke M, Abecasis GR. Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genet Epidemiol 2009; 33:508-17. [PMID: 19170134 DOI: 10.1002/gepi.20403] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022]
Abstract
Genome-wide association studies are helping to dissect the etiology of complex diseases. Although case-control association tests are generally more powerful than family-based association tests, population stratification can lead to spurious disease-marker association or mask a true association. Several methods have been proposed to match cases and controls prior to genotyping, using family information or epidemiological data, or using genotype data for a modest number of genetic markers. Here, we describe a genetic similarity score matching (GSM) method for efficient matched analysis of cases and controls in a genome-wide or large-scale candidate gene association study. GSM comprises three steps: (1) calculating similarity scores for pairs of individuals using the genotype data; (2) matching sets of cases and controls based on the similarity scores so that matched cases and controls have similar genetic background; and (3) using conditional logistic regression to perform association tests. Through computer simulation we show that GSM correctly controls false-positive rates and improves power to detect true disease predisposing variants. We compare GSM to genomic control using computer simulations, and find improved power using GSM. We suggest that initial matching of cases and controls prior to genotyping combined with careful re-matching after genotyping is a method of choice for genome-wide association studies.
Affiliation(s)
- Weihua Guan
  - Department of Biostatistics and Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan 48109-2029, USA
15
Dadd T, Weale ME, Lewis CM. A critical evaluation of genomic control methods for genetic association studies. Genet Epidemiol 2009; 33:290-8. [PMID: 19051284 DOI: 10.1002/gepi.20379] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 12/16/2022]
Abstract
Population stratification is an important potential confounder of genetic case-control association studies. For replication studies, limited availability of samples may lead to imbalanced sampling from heterogeneous populations. Genomic control (GC) can be used to correct χ² test statistics which are presumed to be inflated by a factor λ; this may be estimated by a summary χ² value (λ_median or λ_mean) from a set of unlinked markers. Many studies applying GC methods have used fewer than 50 unlinked markers and an important question is whether this can adequately correct for population stratification. We assess the behavior of GC methods in imbalanced case-control studies using simulation. SNPs are sampled from two subpopulations with intra-continental levels of FST (≤0.005) and sampling schemata ranging from balanced to completely imbalanced between subpopulations. The sampling properties of λ_median and λ_mean are explored using 6-1,600 unlinked markers to estimate Type 1 error and power empirically. GC corrections based on the χ²-distribution (GC_median or GC_mean) can be anti-conservative even when more than 100 single nucleotide polymorphisms (SNPs) are genotyped and realistic levels of population stratification exist. The GCF procedure performs well over a wider range of conditions, only becoming anti-conservative at low levels of alpha and with fewer than 25 SNPs genotyped. A substantial loss of power can arise when population stratification is present, but this is largely independent of the number of SNPs used. A literature survey shows that most studies applying GC have used GC_median or GC_mean, rather than GCF, which is the most appropriate GC correction method.
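As a rough illustration of the median- and mean-based corrections named in this abstract, the sketch below estimates the inflation factor (lambda) from 1-df chi-square statistics at null markers and divides a candidate statistic by it. It is a minimal sketch under stated assumptions, not the authors' code: the function names `gc_lambda` and `gc_correct` are hypothetical, lambda is floored at 1 by a common convention, and the F-distribution-based GCF procedure that the abstract recommends is not implemented here.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def gc_lambda(null_stats, method="median"):
    """Estimate the inflation factor from null-marker chi-square (1 df) statistics."""
    if method == "median":
        lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    elif method == "mean":
        # The mean of a 1-df chi-square variable is 1, so no rescaling is needed.
        lam = statistics.fmean(null_stats)
    else:
        raise ValueError("method must be 'median' or 'mean'")
    # Assumed convention: never deflate statistics (lambda is at least 1).
    return max(lam, 1.0)

def gc_correct(stat, lam):
    """Return the genomic-control-adjusted test statistic."""
    return stat / lam

# Toy null statistics (invented for illustration).
null_stats = [0.2, 0.9, 1.5, 0.6, 2.4, 0.8, 1.1]
lam = gc_lambda(null_stats, method="median")
adjusted = gc_correct(12.0, lam)
```

The adjusted statistic would then be compared against the chi-square distribution as usual. With few null markers the estimate of lambda is itself noisy, which is the situation the abstract warns can make these corrections anti-conservative.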
Affiliation(s)
- Tony Dadd
  - Unilever Research Colworth, Sharnbrook, Bedfordshire, UK
16
Zheng G, Li Z, Gail MH, Gastwirth JL. Impact of Population Substructure on Trend Tests for Genetic Case-Control Association Studies. Biometrics 2009; 66:196-204. [DOI: 10.1111/j.1541-0420.2009.01264.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
17
Li Z, Zhang H, Zheng G, Gastwirth JL, Gail MH. Excess false positive rate caused by population stratification and disease rate heterogeneity in case–control association studies. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.02.021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/25/2022]
18
Gorroochurn P. Perils in the Use of Linkage Disequilibrium for Fine Gene Mapping: Simple Insights from Population Genetics. Cancer Epidemiol Biomarkers Prev 2008; 17:3292-7. [DOI: 10.1158/1055-9965.epi-08-0717] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
19
|
Rodriguez-Murillo L, Greenberg DA. Genetic association analysis: a primer on how it works, its strengths and its weaknesses. Int J Androl 2008; 31:546-56. [PMID: 18522673 DOI: 10.1111/j.1365-2605.2008.00896.x] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Indexed: 12/27/2022]
Affiliation(s)
- Laura Rodriguez-Murillo
- Division of Statistical Genetics, Department of Biostatistics, New York State Psychiatric Institute, Columbia University Medical Center, New York, NY 10032, USA.
|
20
|
He Y, Jiang R, Fu W, Bergen AW, Swan GE, Jin L. Correlation of population parameters leading to power differences in association studies with population stratification. Ann Hum Genet 2008; 72:801-11. [PMID: 18652602 DOI: 10.1111/j.1469-1809.2008.00465.x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
The power of statistical tests to measure effect sizes in the presence of population stratification is an important issue for the design and analysis of population-based association studies. Comparisons of statistical tests have shown that the power of different statistical approaches varies in different genetic scenarios. However, the impact of stratified population parameters on statistical power is not yet understood in a general statistical framework, particularly the impact of correlated population parameters. To investigate such impact in detail, we implemented a genetic model for population-based association studies with stratified samples and evaluated the impact on power with different genetic scenarios. The investigation shows that correlation between disease prevalence and risk allele frequency among subpopulations impacts statistical power. In a model with five subpopulations and moderate population divergence (Fst = 0.01), the correlation accounts for more than 85% of the power difference. Our results also show that the estimation of genetic effect for candidate loci is biased by population divergence. Beneficial alleles could be wrongly characterized as risk alleles when prevalence differences and divergences of risk loci are large among subpopulations.
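The bias mechanism described in this abstract, where prevalence and allele frequency vary together across subpopulations, can be shown with a small expected-value calculation. This is an illustrative sketch only: the function name and all parameter values are hypothetical, it uses two subpopulations rather than the paper's five, and the allele is assumed to have no effect within either subpopulation.

```python
def pooled_allelic_odds_ratio(weights, allele_freqs, prevalences):
    """Expected allelic odds ratio when subpopulations are pooled.

    weights      -- share of each subpopulation in the sampled population
    allele_freqs -- frequency of the tested allele in each subpopulation
    prevalences  -- disease prevalence in each subpopulation
    The allele has no effect on disease within any subpopulation.
    """
    # Share of cases and of controls contributed by each subpopulation.
    case_mass = [w * d for w, d in zip(weights, prevalences)]
    control_mass = [w * (1 - d) for w, d in zip(weights, prevalences)]
    # Allele frequency among pooled cases and pooled controls.
    p_case = sum(m * p for m, p in zip(case_mass, allele_freqs)) / sum(case_mass)
    p_control = sum(m * p for m, p in zip(control_mass, allele_freqs)) / sum(control_mass)
    return (p_case / (1 - p_case)) / (p_control / (1 - p_control))


# Prevalence and allele frequency are both higher in the second subpopulation.
confounded = pooled_allelic_odds_ratio([0.5, 0.5], [0.2, 0.6], [0.05, 0.15])
# Same allele frequencies, but equal prevalence: no spurious signal.
balanced = pooled_allelic_odds_ratio([0.5, 0.5], [0.2, 0.6], [0.10, 0.10])
print(confounded, balanced)  # about 1.57 and 1.0
```

Reversing the direction of the prevalence difference would push the pooled odds ratio below 1, which is how a neutral or beneficial allele can appear as a risk allele, or the reverse.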
Affiliation(s)
- Y He
- Center for Health Sciences, SRI International, Menlo Park, CA 94025, USA.
|
21
|
Abstract
This protocol describes how to appropriately design a genetic association case-control study, either focusing on a candidate gene (CG) or region or implementing a genome-wide approach. The steps described involve: (i) defining the case phenotype in adequate detail; (ii) checking the heritability of the disease in question; (iii) considering whether a population-based study is the appropriate design for the research question; (iv) the appropriate selection of controls; (v) sample size calculations and (vi) giving due consideration to whether it is a de novo or replication study. General guidelines are given, as well as specific examples of a CG and a genome-wide association study into type 2 diabetes. Software and websites used in this protocol include the International HapMap Consortium website, Genetic Power Calculator, CaT, and SNPSpD. Running each of the programs takes only a few seconds; the rate-limiting steps involve thinking through the designs and parameters in the disease models.
|
22
|
Zheng G, Ng HKT. Genetic model selection in two-phase analysis for case-control association studies. Biostatistics 2007; 9:391-9. [PMID: 18003629 DOI: 10.1093/biostatistics/kxm039] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Indexed: 12/19/2022]
Abstract
The Cochran-Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case-control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case-control design. In the first phase, we use the difference of Hardy-Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
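As a rough illustration of the trend test this abstract builds on, the sketch below computes the 1-df Cochran-Armitage statistic from a 2x3 genotype table. The function name, the genotype ordering and the example counts are assumptions made for illustration. The score vectors (0, 0, 1), (0, 0.5, 1) and (0, 1, 1) are the conventional choices for recessive, additive and dominant models. The paper's first-phase model selection from Hardy-Weinberg disequilibrium coefficients and its Type I error adjustment are not implemented here.

```python
def catt_chi2(cases, controls, scores=(0.0, 0.5, 1.0)):
    """Cochran-Armitage trend statistic (1 df) for a 2x3 genotype table.

    cases, controls -- genotype counts ordered (aa, Aa, AA), A being the
                       putative risk allele (ordering assumed for this sketch)
    scores          -- genotype scores defining the assumed genetic model
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    numerator = n * (n * sum_xr - n_cases * sum_xn) ** 2
    denominator = n_cases * n_controls * (n * sum_x2n - sum_xn ** 2)
    return numerator / denominator


# Hypothetical counts: 100 cases and 100 controls.
cases = (30, 50, 20)
controls = (50, 40, 10)
additive = catt_chi2(cases, controls, scores=(0.0, 0.5, 1.0))
recessive = catt_chi2(cases, controls, scores=(0.0, 0.0, 1.0))
dominant = catt_chi2(cases, controls, scores=(0.0, 1.0, 1.0))
print(additive, recessive, dominant)
```

The statistic does not change when the scores are rescaled linearly, so (0, 1, 2) and (0, 0.5, 1) give the same additive test.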
Affiliation(s)
- Gang Zheng
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, Bethesda, MD 20892-7931, USA
|
23
|
Cheng KF, Lin WJ. Simultaneously correcting for population stratification and for genotyping error in case-control association studies. Am J Hum Genet 2007; 81:726-43. [PMID: 17846998 PMCID: PMC2227923 DOI: 10.1086/520962] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 05/29/2007] [Accepted: 06/18/2007] [Indexed: 11/03/2022]
Abstract
In population-based case-control association studies, the regular χ² test is often used to investigate association between a candidate locus and disease. However, it is well known that this test may be biased in the presence of population stratification and/or genotyping error. Unlike some other biases, this bias will not go away with increasing sample size. On the contrary, the false-positive rate will be much larger when the sample size is increased. The usual family-based designs are robust against population stratification, but they are sensitive to genotype error. In this article, we propose a novel method of simultaneously correcting for the bias arising from population stratification and/or for the genotyping error in case-control studies. The appropriate corrections depend on sample odds ratios of the standard 2×3 tables of genotype by case and control from null loci. Therefore, the test is simple to apply. The corrected test is robust against misspecification of the genetic model. If the null hypothesis of no association is rejected, the corrections can be further used to estimate the effect of the genetic factor. We considered a simulation study to investigate the performance of the new method, using parameter values similar to those found in real-data examples. The results show that the corrected test approximately maintains the expected type I error rate under various simulation conditions. It also improves the power of the association test in the presence of population stratification and/or genotyping error. The discrepancy in power between the tests with correction and those without correction tends to be more extreme as the magnitude of the bias becomes larger. Therefore, the bias-correction method proposed in this article should be useful for the genetic analysis of complex traits.
Affiliation(s)
- K F Cheng
- Biostatistics Center and Department of Public Health, China Medical University, Taiwan, China.
|
24
|
Abstract
In the past, to study Mendelian diseases, segregating families have been carefully ascertained for segregation analysis, followed by collecting extended multiplex families for linkage analysis. This would then be followed by association studies, using independent case-control samples and/or additional family data. Recently, for complex diseases, the initial sampling has been for a genome-wide linkage analysis, often using independent sib-pairs or nuclear families, to identify candidate regions for follow-up with association studies, again using case-control samples and/or additional family data. We now have the ability to conduct genome-wide association studies using 100,000-500,000 diallelic genetic markers. For such studies we focus especially on efficient two-stage association sampling designs, which can retain nearly optimal statistical power at about half the genotyping cost. Similarly, beginning an association study by genotyping pooled samples may also be a viable option if the cost of accurately pooling DNA samples outweighs genotyping costs. Finally, we note that the sampling of family data for linkage analysis is not a practice that should be automatically discontinued.
Affiliation(s)
- Robert C Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44106, USA.
|
25
|
Gorroochurn P, Hodge SE, Heiman GA, Durner M, Greenberg DA. Non-replication of association studies: "pseudo-failures" to replicate? Genet Med 2007; 9:325-31. [PMID: 17575498 DOI: 10.1097/gim.0b013e3180676d79] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.9] [Indexed: 11/25/2022]
Abstract
Recently, serious doubts have been cast on the usefulness of association studies as a means to genetically dissect complex diseases because most initial findings fail to replicate in subsequent studies. The reasons usually invoked are population stratification, genetic heterogeneity, and inflated Type I errors. In this article, we argue that, even when these problems are addressed, the scientific community usually has unreasonably high expectations on replication success, based on initial low P values, a phenomenon known as the replication fallacy. We present a modified formula that gives the replication power of a second association study based on the P value of an initial study. When both studies have similar sample sizes, this formula shows that: (1) a P value only slightly lower than the nominal alpha results in only approximately 50% replication power; (2) very low P values are required to achieve a replication power of at least 80% (e.g., at alpha = 0.05, a P value of <0.005 is required). Because many initially significant findings result in low replication power, replication failure should not be surprising or be interpreted as necessarily refuting the initial findings. We refer to replication failures for which the replication power is low as "pseudo-failures."
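The two numerical claims in this abstract can be reproduced approximately with a simple plug-in calculation. The sketch below is not the authors' modified formula. It assumes a two-sided z-test, equal sample sizes in both studies, a true effect equal to the initial estimate, and it ignores the negligible chance of significance in the wrong direction. The function name is hypothetical.

```python
from statistics import NormalDist


def replication_power(p_initial, alpha=0.05):
    """Approximate power of an equally sized replication study.

    Plug-in assumption: the true standardized effect equals the z-score
    implied by the two-sided P value of the initial study.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_initial / 2)
    z_critical = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(z_observed - z_critical)


print(replication_power(0.05))   # 0.5: a P value at the threshold
print(replication_power(0.005))  # about 0.80
```

Under these assumptions an initial P value sitting exactly at alpha gives 50% replication power, and a P value near 0.005 is needed to reach roughly 80% at alpha = 0.05, in line with the figures quoted in the abstract.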
Affiliation(s)
- Prakash Gorroochurn
- Division of Statistical Genetics, Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York 10032, USA.
|
26
|
Gorroochurn P, Hodge SE, Heiman GA, Greenberg DA. A unified approach for quantifying, testing and correcting population stratification in case-control association studies. Hum Hered 2007; 64:149-59. [PMID: 17536209 PMCID: PMC2874730 DOI: 10.1159/000102988] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 11/27/2006] [Accepted: 03/21/2007] [Indexed: 11/19/2022]
Abstract
The HapMap project has given case-control association studies a unique opportunity to uncover the genetic basis of complex diseases. However, persistent issues in such studies remain the proper quantification of, testing for, and correction for population stratification (PS). In this paper, we present the first unified paradigm that addresses all three fundamental issues within one statistical framework. Our unified approach makes use of an omnibus quantity (delta), which can be estimated in a case-control study from suitable null loci. We show how this estimated value can be used to quantify PS, to statistically test for PS, and to correct for PS, all in the context of case-control studies. Moreover, we provide guidelines for interpreting values of delta in association studies (e.g., at alpha = 0.05, a delta of size 0.416 is small, a delta of size 0.653 is medium, and a delta of size 1.115 is large). A novel feature of our testing procedure is its ability to test for either strictly any PS or only 'practically important' PS. We also performed simulations to compare our correction procedure with Genomic Control (GC). Our results show that, unlike GC, it maintains good Type I error rates and power across all levels of PS.
Affiliation(s)
- Prakash Gorroochurn
- Division of Statistical Genetics, Department of Biostatistics, Columbia University, New York, NY 10032, USA.
|
27
|
Zang Y, Zhang H, Yang Y, Zheng G. Robust genomic control and robust delta centralization tests for case-control association studies. Hum Hered 2007; 63:187-95. [PMID: 17310128 DOI: 10.1159/000099831] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 08/19/2006] [Accepted: 12/08/2006] [Indexed: 11/19/2022]
Abstract
The population-based case-control design is a powerful approach for detecting susceptibility markers of a complex disease. However, this approach may lead to spurious association when there is population substructure: population stratification (PS) or cryptic relatedness (CR). Two simple approaches to correct for the population substructure are genomic control (GC) and delta centralization (DC). GC uses the variance inflation factor to correct for the variance distortion of a test statistic, and the DC centralizes the non-central chi-square distribution of the test statistic. Both GC and DC have been studied for case-control association studies mainly under a specific genetic model (e.g. recessive, additive or dominant), under which an optimal trend test is available. The genetic model is usually unknown for many complex diseases. In this situation, we study the performance of three robust tests based on the GC and DC corrections in the presence of the population substructure. Our results show that, when the genetic model is unknown, the DC- (or GC-) corrected maximum and Pearson's association test are robust and have good control of Type I error and high power relative to the optimal trend tests in the presence of PS (or CR).
Affiliation(s)
- Yong Zang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, PR China
|
28
|
Pharoah PDP, Tyrer J, Dunning AM, Easton DF, Ponder BAJ. Association between common variation in 120 candidate genes and breast cancer risk. PLoS Genet 2007; 3:e42. [PMID: 17367212 PMCID: PMC1828694 DOI: 10.1371/journal.pgen.0030042] [Citation(s) in RCA: 117] [Impact Index Per Article: 6.9] [Received: 09/22/2006] [Accepted: 02/02/2007] [Indexed: 01/20/2023]
Abstract
Association studies in candidate genes have been widely used to search for common low penetrance susceptibility alleles, but few definite associations have been established. We have conducted association studies in breast cancer using an empirical single nucleotide polymorphism (SNP) tagging approach to capture common genetic variation in genes that are candidates for breast cancer based on their known function. We genotyped 710 SNPs in 120 candidate genes in up to 4,400 breast cancer cases and 4,400 controls using a staged design. Correction for population stratification was done using the genomic control method, on the basis of data from 280 genomic control SNPs. Evidence for association with each SNP was assessed using a Cochran–Armitage trend test (p-trend) and a two-degrees of freedom χ² test for heterogeneity (p-het). The most significant single SNP (p-trend = 8 × 10⁻⁵) was not significant at a nominal 5% level after adjusting for population stratification and multiple testing. To evaluate the overall evidence for an excess of positive associations over the proportion expected by chance, we applied two global tests: the admixture maximum likelihood (AML) test and the rank truncated product (RTP) test corrected for population stratification. The admixture maximum likelihood experiment-wise test for association was significant for both the heterogeneity test (p = 0.0031) and the trend test (p = 0.017), but no association was observed using the rank truncated product method for either the heterogeneity test or the trend test (p = 0.12 and p = 0.24, respectively). Genes in the cell-cycle control pathway and genes involved in steroid hormone metabolism and signalling were the main contributors to the association. These results suggest that a proportion of SNPs in these candidate genes are associated with breast cancer risk, but that the effects of individual SNPs are likely to be small. Large sample sizes from multicentre collaboration will be needed to identify associated SNPs with certainty.
The polygenic model of cancer susceptibility suggests that multiple alleles contribute to the excess familial risk of most common cancers. Candidate gene association studies have been a commonly used approach in the search for such alleles. We have investigated over 700 common variants in genes that are candidates for breast cancer susceptibility in a large case-control study of breast cancer, but no single variant was identified at an appropriate level of statistical significance. The purpose of this study was to consider these data as a whole, using a novel method, the admixture maximum likelihood test, to test the hypothesis that a proportion (unknown) of the variants we investigated are associated with breast cancer. After adjusting for population substructure, we found evidence for association that was robust to all but the most extreme assumptions about the degree of population stratification. Genes in the cell-cycle control and steroid hormone metabolism and signalling pathways were the main contributors. These results suggest that a proportion of single nucleotide polymorphisms (SNPs) in these candidate genes are associated with breast cancer risk, but that the effects of individual SNPs are likely to be small. Large sample sizes from multicentre collaboration will be needed to identify associated SNPs with certainty.
Affiliation(s)
- Paul D P Pharoah
- Department of Oncology, University of Cambridge, Cambridge, United Kingdom.
|
29
|
Abstract
Genetic association studies are increasingly used in the search for susceptibility variants for human traits. While many of the statistical tools available for such studies are well established, the field is advancing rapidly, as biological and technological developments allow investigators to generate vast amounts of detailed genetic data. This chapter gives an overview of the statistical evaluation of genetic data in both unrelated individuals and families. A brief introduction to fundamental population genetics concepts is followed by detailed examinations of measures of linkage disequilibrium and single-marker and haplotype association tests. Emphasis is given to the historical development of family-based tests to provide the context for more recent advancements. The chapter concludes with a discussion of design strategies for genetic association studies with dense genotyping of hundreds or thousands of markers, such as those planned for follow up of a linkage-candidate region or genome-wide association studies.
Affiliation(s)
- Carl D Langefeld
- Department of Biostatistical Sciences, Wake Forest University Health Sciences, Winston-Salem, NC, USA
|
30
|
Durner M, Gorroochurn P, Marini C, Guerrini R. Can we increase the likelihood of success for future association studies in epilepsy? Epilepsia 2006; 47:1617-21; author reply 1757-8. [PMID: 17054682 DOI: 10.1111/j.1528-1167.2006.00842.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
|
31
|
Abstract
The past 25 years have seen an explosion in the number of genetic markers that can be measured on DNA samples at an ever decreasing cost. Although basic statistical methods for analysing such data gathered on samples of either independent individuals or family members, one or two markers at a time, were already well developed before this explosion occurred, there has been a corresponding burst in activity to develop multiple marker models to find disease-causing gene variants, capitalizing on the data that have become available, to increase the power of such methods. This has required the concomitant development of faster algorithms to speed up the computation of various likelihoods. For linkage analysis, to obtain the approximate locations for genes of interest, Mendelian segregation models have been extended to be more realistic and statistical models that do not assume specific modes of inheritance have been extended to allow for the analysis of larger pedigree structures. For association analysis, to obtain more precise locations for genes of interest, the recent completion of the first stage of the HapMap project has spurred the development, still underway, of novel experimental designs and analytical methods to combat the curse of dimensionality and the resulting multiple testing problem. Perhaps the greatest current challenge concerns how best to gather and synthesize the many lines of evidence possible in order to discover the genetic determinants underlying complex diseases.
Affiliation(s)
- Robert C Elston
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA.
|