1
Abstract
This unit provides an overview of the design and analysis of population-based case-control studies of genetic risk factors for complex disease. Considerations specific to genetic studies are emphasized. The unit reviews basic study designs, differentiating case-control studies from others, presents different genetic association strategies (candidate gene, genome-wide association, and high-throughput sequencing), introduces basic methods of statistical analysis for case-control data and approaches to combining case-control studies, and discusses measures of association and impact. Admixed populations, controlling for confounding (including population stratification), consideration of multiple loci and environmental risk factors, and complementary analyses of haplotypes, genes, and pathways are briefly discussed. Readers are referred to basic texts on epidemiology for more details on the general conduct of case-control studies.
Affiliation(s)
- Dana B Hancock
- Research Triangle Institute International, Research Triangle Park, North Carolina, USA
2
Hancock DB, Scott WK. Population-based case-control association studies. Curr Protoc Hum Genet 2008; Chapter 1:Unit 1.17. [PMID: 18428402 DOI: 10.1002/0471142905.hg0117s52]
Abstract
This unit provides an overview of the design and analysis of population-based case-control studies of genetic risk factors for complex disease. Considerations specific to genetic studies are emphasized. The unit reviews basic study designs, differentiating case-control studies from others, discusses selection of genetic markers for use in studies, introduces basic methods of analysis of case-control data, and discusses measures of association and impact. Controlling for confounding (including population stratification), consideration of multiple loci, and haplotype analysis are briefly discussed. Readers are referred to basic texts on epidemiology for more details on general conduct of case-control studies.
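As a rough illustration of the kind of association measure this unit discusses, the sketch below computes an odds ratio with a Woolf (log-scale) 95% confidence interval from a 2x2 table of carrier status by case-control status. The function name `odds_ratio_ci` and the counts are hypothetical and are not taken from the unit.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 table.

    a: carrier cases      b: non-carrier cases
    c: carrier controls   d: non-carrier controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Woolf interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
or_hat, ci_low, ci_high = odds_ratio_ci(120, 80, 90, 110)
print(f"OR = {or_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes 1 suggests an association at roughly the 5% level, but it says nothing about confounding such as population stratification, which needs separate handling.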
Affiliation(s)
- Dana B Hancock
- Duke University Medical Center, Durham, North Carolina, USA
3
Wollstein A, Herrmann A, Wittig M, Nothnagel M, Franke A, Nürnberg P, Schreiber S, Krawczak M, Hampe J. Efficacy assessment of SNP sets for genome-wide disease association studies. Nucleic Acids Res 2007; 35:e113. [PMID: 17726055 PMCID: PMC2034459 DOI: 10.1093/nar/gkm621]
Abstract
The power of a genome-wide disease association study depends critically upon the properties of the marker set used, particularly the number and physical spacing of markers, and the level of inter-marker association due to linkage disequilibrium. Extending our previously devised theoretical framework for the entropy-based selection of genetic markers, we have developed a local measure of the efficacy of a marker set, relative to including a maximally polymorphic single nucleotide polymorphism (SNP) at the map position of interest. Using this quantitative criterion, we evaluated five currently available SNP sets, namely Affymetrix 100K and 500K, and Illumina 100K, 300K and 550K in the CEU, YRI and JPT + CHB HapMap populations. At 50% relative efficacy, the commercial marker sets cover between 19 and 68% of the human genome, depending upon the population under study. An optimal technology-independent 500K marker set constructed from HapMap for Caucasians, in contrast, would achieve 73% coverage at the same relative efficacy.
Affiliation(s)
- Andreas Wollstein, Alexander Herrmann, Michael Wittig, Michael Nothnagel, Andre Franke, Peter Nürnberg, Stefan Schreiber, Michael Krawczak, Jochen Hampe
- Cologne Center for Genomics, Cologne, Institute of Clinical Molecular Biology, Christian-Albrechts University, Ist Department of Medicine and Institute of Medical Informatics and Statistics, Christian-Albrechts University, University Hospital Schleswig-Holstein Campus Kiel, Kiel, Germany
- *To whom correspondence should be addressed. +49 431 597 1246; +49 431 597 1842
4
Nicolas P, Sun F, Li LM. A model-based approach to selection of tag SNPs. BMC Bioinformatics 2006; 7:303. [PMID: 16776821 PMCID: PMC1525207 DOI: 10.1186/1471-2105-7-303] [Received: 01/13/2006] [Accepted: 06/15/2006]
Abstract
Background: Single nucleotide polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of linkage disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets.
Results: Here, we compute the description code-lengths of SNP data for an array of models, and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several respects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection.
Conclusion: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. We also show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software implementing our approach is available.
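The entropy-maximization idea can be sketched in a heavily simplified form. The example below greedily adds the SNP that most increases the empirical joint entropy of the selected columns in a toy set of phased haplotypes. It uses raw empirical frequencies rather than the Li and Stephens hidden Markov model the abstract describes, and the function names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def joint_entropy(haplotypes, columns):
    """Empirical Shannon entropy (bits) of haplotypes restricted to `columns`."""
    n = len(haplotypes)
    counts = Counter(tuple(h[c] for c in columns) for h in haplotypes)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def greedy_tag_snps(haplotypes, n_tags):
    """Greedily pick up to `n_tags` SNP indices that maximize joint entropy."""
    n_snps = len(haplotypes[0])
    selected = []
    for _ in range(min(n_tags, n_snps)):
        remaining = [c for c in range(n_snps) if c not in selected]
        # Add the SNP giving the largest joint entropy with those already chosen.
        best = max(remaining, key=lambda c: joint_entropy(haplotypes, selected + [c]))
        selected.append(best)
    return selected


# Toy phased haplotypes: SNPs 0 and 1 are in perfect LD, SNP 2 is independent.
toy_haplotypes = ["000", "000", "110", "110", "001", "001", "111", "111"]
print(greedy_tag_snps(toy_haplotypes, 2))
```

In this toy data the second pick skips SNP 1, because a SNP in perfect LD with one already chosen adds no entropy.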
Affiliation(s)
- Pierre Nicolas
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, USA
- Mathématique, Informatique et Génome, INRA, Jouy-en-Josas, France
- Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, USA
- Lei M Li
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, USA
- Department of Mathematics, University of Southern California, Los Angeles, USA
5
Burkett KM, Ghadessi M, McNeney B, Graham J, Daley D. A comparison of five methods for selecting tagging single-nucleotide polymorphisms. BMC Genet 2005; 6 Suppl 1:S71. [PMID: 16451685 PMCID: PMC1866710 DOI: 10.1186/1471-2156-6-s1-s71]
Abstract
Our goal was to compare methods for tagging single-nucleotide polymorphisms (tagSNPs) with respect to the power to detect disease association under differing haplotype-disease association models. We were also interested in the effect that SNP selection samples, consisting of either cases, controls, or a mixture, would have on power. We investigated five previously described algorithms for choosing tagSNPs: two that picked SNPs based on haplotype structure (Chapman-haplotypic and Stram), two that picked SNPs based on pair-wise allelic association (Chapman-allelic and Cousin), and one control method that chose equally spaced SNPs (Zhai). In two disease-associated regions from the Genetic Analysis Workshop 14 simulated data, we tested the association between tagSNP genotype and disease over the tagSNP sets chosen by each method for each sampling scheme. This was repeated for 100 replicates to estimate power. The two allelic methods chose essentially all SNPs in the region and had nearly optimal power. The two haplotypic methods chose about half as many SNPs. The haplotypic methods had poor performance compared to the allelic methods in both regions. We expected an improvement in power when the selection sample contained cases; however, there was only moderate variation in power between the sampling approaches for each method. Finally, when compared to the haplotypic methods, the reference method performed as well or worse in the region with ancestral disease haplotype structure.
Affiliation(s)
- Kelly M Burkett
- The James Hogg-iCAPTURE Centre for Cardiovascular and Pulmonary Research, University of British Columbia, St. Paul's Hospital, Vancouver, BC V6Z 1Y6, Canada
- Mercedeh Ghadessi
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Brad McNeney
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Jinko Graham
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Denise Daley
- The James Hogg-iCAPTURE Centre for Cardiovascular and Pulmonary Research, University of British Columbia, St. Paul's Hospital, Vancouver, BC V6Z 1Y6, Canada
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA
6
Nothnagel M, Rohde K. The effect of single-nucleotide polymorphism marker selection on patterns of haplotype blocks and haplotype frequency estimates. Am J Hum Genet 2005; 77:988-98. [PMID: 16380910 PMCID: PMC1285181 DOI: 10.1086/498175] [Received: 06/13/2004] [Accepted: 09/16/2004]
Abstract
The definition of haplotype blocks of single-nucleotide polymorphisms (SNPs) has been proposed so that the haplotypes can be used as markers in association studies and to efficiently describe human genetic variation. The International Haplotype Map (HapMap) project to construct a comprehensive catalog of haplotypic variation in humans is underway. However, a number of factors have already been shown to influence the definition of blocks, including the population studied and the sample SNP density. Here, we examine the effect that marker selection has on the definition of blocks and the pattern of haplotypes by using comparable but complementary SNP sets and a number of block definition methods in various genomic regions and populations that were provided by the Encyclopedia of DNA Elements (ENCODE) project. We find that the chosen SNP set has a profound effect on the block-covered sequence and block borders, even at high marker densities. Our results question the very concept of discrete haplotype blocks and the possibility of generalizing block findings from the HapMap project. We comparatively apply the block-free tagging-SNP approach and discuss both the haplotype approach and the tagging-SNP approach as means to efficiently catalog genetic variation.
Affiliation(s)
- Michael Nothnagel
- Department of Bioinformatics, Max Delbrück Center for Molecular Medicine, Berlin, Germany.
7
Zhang K, Sun F. Assessing the power of tag SNPs in the mapping of quantitative trait loci (QTL) with extremal and random samples. BMC Genet 2005; 6:51. [PMID: 16236175 PMCID: PMC1274312 DOI: 10.1186/1471-2156-6-51] [Received: 06/08/2005] [Accepted: 10/19/2005]
Abstract
Background: Recent studies have indicated that the human genome could be divided into regions with low haplotype diversity interspersed with regions of high haplotype diversity. In regions of low haplotype diversity, a small fraction of SNPs (tag SNPs) are sufficient to account for most of the haplotype diversity of the human genome. These tag SNPs can be extremely useful for testing the association of a marker locus with a qualitative or quantitative trait locus in that it may not be necessary to genotype all the SNPs. When tag SNPs are used to reduce the genotyping effort in association studies, it is important to know how much power is lost. It is also important to know how much power is gained when tag SNPs are used instead of the same number of randomly chosen SNPs.
Results: We design a simulation study to tackle these problems for a variety of quantitative association tests using either case-parent samples or unrelated population samples. First, the samples are generated based on the quantitative trait model with the assumption of either an extremal sampling scheme or a random sampling scheme. Second, a small number of samples are selected to determine the haplotype blocks and the tag SNPs. Third, the statistical power of the tests is evaluated using four kinds of data: (1) all the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, (3) the same number of evenly spaced SNPs with minor allele frequency greater than a threshold and the corresponding haplotypes, and (4) the same number of randomly chosen SNPs and their corresponding haplotypes.
Conclusion: Our results suggest that in most situations genotyping efforts can be significantly reduced by using tag SNPs for mapping the QTL in association studies without much loss of power, which is consistent with previous studies on association mapping of qualitative traits. For all situations considered, two-locus haplotype analyses using tag SNPs are more powerful than those using the same number of randomly selected SNPs, but the degree of such power differences depends upon the sampling scheme and the population history.
Affiliation(s)
- Kui Zhang
- Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
8
Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, Demeo DL, Murphy A, Su J, Datta S, Rosenow C, Christman M, Silverman EK, Laird NM, Weiss ST, Lange C. Genomic screening and replication using the same data set in family-based association testing. Nat Genet 2005; 37:683-91. [PMID: 15937480 DOI: 10.1038/ng1582] [Received: 11/29/2004] [Accepted: 04/25/2005]
Abstract
The Human Genome Project and its spin-offs are making it increasingly feasible to determine the genetic basis of complex traits using genome-wide association studies. The statistical challenge of analyzing such studies stems from the severe multiple-comparison problem resulting from the analysis of thousands of SNPs. Our methodology for genome-wide family-based association studies, using single SNPs or haplotypes, can identify associations that achieve genome-wide significance. In relation to developing guidelines for our screening tools, we determined lower bounds for the estimated power to detect the gene underlying the disease-susceptibility locus, which hold regardless of the linkage disequilibrium structure present in the data. We also assessed the power of our approach in the presence of multiple disease-susceptibility loci. Our screening tools accommodate genomic control and use the concept of haplotype-tagging SNPs. Our methods use the entire sample and do not require separate screening and validation samples to establish genome-wide significance, as population-based designs do.
Affiliation(s)
- Kristel Van Steen
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, USA.
9
De La Vega FM, Isaac H, Collins A, Scafe CR, Halldórsson BV, Su X, Lippert RA, Wang Y, Laig-Webster M, Koehler RT, Ziegle JS, Wogan LT, Stevens JF, Leinen KM, Olson SJ, Guegler KJ, You X, Xu LH, Hemken HG, Kalush F, Itakura M, Zheng Y, de Thé G, O'Brien SJ, Clark AG, Istrail S, Hunkapiller MW, Spier EG, Gilbert DA. The linkage disequilibrium maps of three human chromosomes across four populations reflect their demographic history and a common underlying recombination pattern. Genome Res 2005; 15:454-62. [PMID: 15781572 PMCID: PMC1074360 DOI: 10.1101/gr.3241705]
Abstract
The extent and patterns of linkage disequilibrium (LD) determine the feasibility of association studies to map genes that underlie complex traits. Here we present a comparison of the patterns of LD across four major human populations (African-American, Caucasian, Chinese, and Japanese) with a high-resolution single-nucleotide polymorphism (SNP) map covering almost the entire length of chromosomes 6, 21, and 22. We constructed metric LD maps formulated such that the units measure the extent of useful LD for association mapping. LD reaches almost twice as far in chromosome 6 as in chromosomes 21 or 22, in agreement with their differences in recombination rates. By all measures used, out-of-Africa populations showed over a third more LD than African-Americans, highlighting the role of the population's demography in shaping the patterns of LD. Despite those differences, the long-range contour of the LD maps is remarkably similar across the four populations, presumably reflecting common localization of recombination hot spots. Our results have practical implications for the rational design and selection of SNPs for disease association studies.
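For readers less familiar with how pairwise LD is quantified, the sketch below computes the standard coefficients D, |D'|, and r² from counts of the four two-locus haplotypes. These are the common pairwise measures, not the metric LD maps constructed in the paper, and the function name and counts are hypothetical.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic loci from phased haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n  # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n  # frequency of allele B at locus 2
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = n_AB / n - p_A * p_B
    # The largest attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, d * d / denom


# Hypothetical haplotype counts, for illustration only.
d, d_prime, r2 = ld_measures(40, 10, 10, 40)
print(f"D = {d:.3f}, |D'| = {d_prime:.2f}, r^2 = {r2:.2f}")
```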
10
Halldórsson BV, Istrail S, De La Vega FM. Optimal Selection of SNP Markers for Disease Association Studies. Hum Hered 2005; 58:190-202. [PMID: 15812176 DOI: 10.1159/000083546]
Abstract
Genetic association studies with population samples hold the promise of uncovering the susceptibility genes underlying the heritability of complex or common disease. Most association studies rely on the use of surrogate markers, with single-nucleotide polymorphisms (SNPs) being the most suitable because of their abundance and ease of scoring. SNP marker selection aims to increase the chances that at least one typed SNP is in linkage disequilibrium (LD) with the disease-causing variant, while controlling the cost of the study in terms of the number of markers genotyped and samples. Empirical studies reporting block-like segments in the genome with high LD and low haplotype diversity have motivated a marker selection strategy whereby subsets of SNPs that 'tag' the common haplotypes of a region are picked for genotyping, avoiding typing redundant SNPs. Based on these initial observations, a plethora of 'tagging' algorithms for selecting minimum informative subsets of SNPs has recently appeared in the literature. These differ mostly in two major aspects: the quality or correlation measure used to define tagging, and the algorithm used to minimize the final number of tagging SNPs. In this review we describe the available tagging algorithms using a 3-step unifying framework, point out their methodological and conceptual differences, and assess their assumptions, performance, and scalability.
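The two aspects named above, a correlation measure and a minimization step, can be illustrated with one simple combination: pairwise r² as the measure and a greedy set-cover heuristic as the minimizer. This is only a sketch of one possible pairing, not any specific algorithm surveyed in the review; the function names, the 0.8 threshold, and the toy haplotypes are assumptions.

```python
def r_squared(x, y):
    """Pairwise r^2 between two SNPs coded as 0/1 alleles across haplotypes."""
    n = len(x)
    p_x, p_y = sum(x) / n, sum(y) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = sum(a * b for a, b in zip(x, y)) / n - p_x * p_y
    return d * d / denom


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Greedy set cover: repeatedly pick the SNP that tags the most untagged SNPs."""
    n_snps = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_snps)]
    # covers[j] holds every SNP that j tags at the threshold (always including j).
    covers = {
        j: {k for k in range(n_snps)
            if k == j or r_squared(cols[j], cols[k]) >= threshold}
        for j in range(n_snps)
    }
    untagged, tags = set(range(n_snps)), []
    while untagged:
        best = max(sorted(untagged), key=lambda j: len(covers[j] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy haplotypes: SNPs 0 and 1 are perfectly correlated, SNP 2 is independent.
toy = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0),
       (0, 0, 1), (0, 0, 1), (1, 1, 1), (1, 1, 1)]
print(greedy_pairwise_tags(toy))
```

Greedy set cover does not guarantee the smallest possible tag set, which is one reason the choice of minimization algorithm matters.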
11
Beckmann L, Ziegler A, Duggal P, Bailey-Wilson JE. Haplotypes and haplotype-tagging single-nucleotide polymorphism: Presentation Group 8 of Genetic Analysis Workshop 14. Genet Epidemiol 2005; 29 Suppl 1:S59-71. [PMID: 16342175 DOI: 10.1002/gepi.20111]
Abstract
Moderately dense maps of single-nucleotide polymorphism (SNP) markers across the human genome for both the simulated data set and data from the Collaborative Study of the Genetics of Alcoholism were available at Genetic Analysis Workshop 14 for the first time. This allowed examination of various novel and existing methods for haplotype analyses. Three contributors applied Mantel statistics in different ways for both linkage and association analysis by using the shared length between two haplotypes at a marker locus as a measure of genetic similarity. The results indicate that haplotype-sharing based on Mantel statistics can be a powerful approach and needs further methodological evaluation. Four contributors investigated haplotype-tagging SNP (htSNP) selection procedures, two contributors examined the use of multilocus haplotypes compared to single loci in association tests, and two contributors compared the accuracy of various methods for reconstructing haplotypes and estimating haplotype frequencies for both pedigree data and data from unrelated individuals. For all three different tasks, software packages and procedures gave similar results in regions of high linkage disequilibrium (LD). However, they were not as consistent in regions of moderate to low LD. One coalescence-based approach for estimating haplotype frequencies, coupled with a Markov chain Monte Carlo technique, outperformed the other haplotype frequency estimation methods in regions of low LD. In conclusion, regardless of the task, results were similar in chromosomal regions of high LD. However, based on the differing results observed here, methodological improvements are required for chromosomal regions of low to moderate LD.
Affiliation(s)
- Lars Beckmann
- German Cancer Research Center (Deutsches Krebsforschungszentrum) DKFZ, Heidelberg, Germany
12
Abstract
Human geneticists working on systems for which it is possible to make a strong case for a set of candidate genes face the problem of whether it is necessary to consider the variation in those genes as phased haplotypes, or whether the one-SNP-at-a-time approach might perform as well. There are three reasons why the phased haplotype route should be an improvement. First, the protein products of the candidate genes occur in polypeptide chains whose folding and other properties may depend on particular combinations of amino acids. Second, population genetic principles show us that variation in populations is inherently structured into haplotypes. Third, the statistical power of association tests with phased data is likely to be improved because of the reduction in dimension. However, in reality it takes a great deal of extra work to obtain valid haplotype phase information, and inferred phase information may simply compound the errors. In addition, if the causal connection between SNPs and a phenotype is truly driven by just a single SNP, then the haplotype-based approach may perform worse than the one-SNP-at-a-time approach. Here we examine some of the factors that affect haplotype patterns in genes, how haplotypes may be inferred, and how haplotypes have been useful in the context of testing association between candidate genes and complex traits.
Affiliation(s)
- Andrew G Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA.