1
|
Sánchez Lasheras JE, Suárez Gómez SL, Santos JD, Castaño-Vinyals G, Pérez-Gómez B, Tardón A. A multivariate regression approach for identification of SNPs importance in prostate cancer. J EXP THEOR ARTIF IN 2018. [DOI: 10.1080/0952813x.2018.1552319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Gemma Castaño-Vinyals
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Madrid, Spain
- ISGlobal, Centre for Research in Environmental Epidemiology (CREAL), Barcelona, Spain
- Beatriz Pérez-Gómez
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Madrid, Spain
- Cancer and Environmental Epidemiology Unit, Carlos III Institute of Health, National Center for Epidemiology, Madrid, Spain
- Adonina Tardón
- Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), Madrid, Spain
- Universitary Institute of Oncology of Asturias (IUOPA), University of Oviedo, Oviedo, Spain
|
2
|
Woo HJ, Yu C, Kumar K, Gold B, Reifman J. Genotype distribution-based inference of collective effects in genome-wide association studies: insights to age-related macular degeneration disease mechanism. BMC Genomics 2016; 17:695. [PMID: 27576376 PMCID: PMC5006276 DOI: 10.1186/s12864-016-2871-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/18/2015] [Accepted: 07/01/2016] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Genome-wide association studies provide important insights to the genetic component of disease risks. However, an existing challenge is how to incorporate collective effects of interactions beyond the level of independent single nucleotide polymorphism (SNP) tests. While methods considering each SNP pair separately have provided insights, a large portion of expected heritability may reside in higher-order interaction effects. RESULTS We describe an inference approach (discrete discriminant analysis; DDA) designed to probe collective interactions while treating both genotypes and phenotypes as random variables. The genotype distributions in case and control groups are modeled separately based on empirical allele frequency and covariance data, whose differences yield disease risk parameters. We compared pairwise tests and collective inference methods, the latter based both on DDA and logistic regression. Analyses using simulated data demonstrated that significantly higher sensitivity and specificity can be achieved with collective inference in comparison to pairwise tests, and with DDA in comparison to logistic regression. Using age-related macular degeneration (AMD) data, we demonstrated two possible applications of DDA. In the first application, a genome-wide SNP set is reduced into a small number (∼100) of variants via filtering and SNP pairs with significant interactions are identified. We found that interactions between SNPs with highest AMD association were epigenetically active in the liver, adipocytes, and mesenchymal stem cells. In the other application, multiple groups of SNPs were formed from the genome-wide data and their relative strengths of association were compared using cross-validation. This analysis allowed us to discover novel collections of loci for which interactions between SNPs play significant roles in their disease association. 
In particular, we considered pathway-based groups of SNPs containing up to ∼10,000 variants in each group. In addition to pathways related to complement activation, our collective inference pointed to pathway groups involved in phospholipid synthesis, oxidative stress, and apoptosis, consistent with the AMD pathogenesis mechanism where the dysfunction of retinal pigment epithelium cells plays central roles. CONCLUSIONS The simultaneous inference of collective interaction effects within a set of SNPs has the potential to reveal novel aspects of disease association.
Affiliation(s)
- Hyung Jun Woo
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
- Chenggang Yu
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
- Kamal Kumar
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA
- Bert Gold
- Laboratory of Genomic Diversity, National Cancer Institute, Frederick, Maryland, USA
- Jaques Reifman
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland, USA.
|
3
|
Pszczola M, Strabel T, Mulder HA, Calus MPL. Reliability of direct genomic values for animals with different relationships within and to the reference population. J Dairy Sci 2012; 95:389-400. [PMID: 22192218 DOI: 10.3168/jds.2011-4338] [Citation(s) in RCA: 208] [Impact Index Per Article: 17.3] [Received: 03/04/2011] [Accepted: 09/18/2011] [Indexed: 11/19/2022]
Abstract
Accuracy of genomic selection depends on the accuracy of prediction of single nucleotide polymorphism effects and the proportion of genetic variance explained by markers. Design of the reference population with respect to its family structure may influence the accuracy of genomic selection. The objective of this study was to investigate the effect of various relationship levels within the reference population and different levels of relationship of evaluated animals to the reference population on the reliability of direct genomic breeding values (DGV). The DGV reliabilities, expressed as squared correlation between estimated and true breeding value, were calculated for evaluated animals at 3 heritability levels. To emulate a trait that is difficult or expensive to measure, such as methane emission, reference populations were kept small and consisted of females with own performance records. A population reflecting a dairy cattle population structure was simulated. Four chosen reference populations consisted of all females available in the first genotyped generation. They consisted of highly (HR), moderately (MR), or lowly (LR) related animals, by selecting paternal half-sib families of decreasing size, or consisted of randomly chosen animals (RND). Of those 4 reference populations, RND had the lowest average relationship. Three sets of evaluated animals were chosen from 3 consecutive generations of genotyped animals, starting from the same generation as the reference population. Reliabilities of DGV predictions were calculated deterministically using selection index theory. The randomly chosen reference population had the lowest average relationship within the reference population. Average reliabilities increased when average relationship within the reference population decreased and the highest average reliabilities were achieved for RND (e.g., from 0.53 in HR to 0.61 in RND for a heritability of 0.30). 
A higher relationship to the reference population resulted in higher reliability values. At the average squared relationship of evaluated animals to the reference population of 0.005, reliabilities were, on average, 0.49 (HR) and 0.63 (RND) for a heritability of 0.30; 0.20 (HR) and 0.27 (RND) for a heritability of 0.05; and 0.07 (HR) and 0.09 (RND) for a heritability of 0.01. Substantial decrease in the reliability was observed when the number of generations to the reference population increased [e.g., for heritability of 0.30, the decrease from evaluated set I (chosen from the same generation as the reference population) to II (one generation younger than the reference population) was 0.04 for HR, and 0.07 for RND]. In this study, the importance of the design of a reference population consisting of cows was shown and optimal designs of the reference population for genomic prediction were suggested.
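The abstract above defines DGV reliability as the squared correlation between estimated and true breeding values. As a minimal illustration of that definition only (the study itself computed reliabilities deterministically from selection index theory, which this sketch does not attempt), the quantity can be computed from paired values as follows. The function name and the toy numbers are hypothetical.

```python
def reliability(estimated, true):
    """Squared Pearson correlation between estimated and true breeding values."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_t = sum(true) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, true))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov * cov / (var_e * var_t)


# Hypothetical toy values, not data from the study.
example = reliability([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Empirical use of this definition requires known true breeding values, which in practice means simulated data.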
Affiliation(s)
- M Pszczola
- Animal Breeding and Genomics Centre, Wageningen UR Livestock Research, Lelystad, the Netherlands.
|
4
|
Shriner D. Moving toward System Genetics through Multiple Trait Analysis in Genome-Wide Association Studies. Front Genet 2012; 3:1. [PMID: 22303408 PMCID: PMC3266611 DOI: 10.3389/fgene.2012.00001] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Received: 11/04/2011] [Accepted: 01/01/2012] [Indexed: 02/05/2023] Open
Abstract
Association studies are a staple of genotype–phenotype mapping studies, whether they are based on single markers, haplotypes, candidate genes, genome-wide genotypes, or whole genome sequences. Although genetic epidemiological studies typically contain data collected on multiple traits which themselves are often correlated, most analyses have been performed on single traits. Here, I review several methods that have been developed to perform multiple trait analysis. These methods range from traditional multivariate models for systems of equations to recently developed graphical approaches based on network theory. The application of network theory to genetics is termed systems genetics and has the potential to address long-standing questions in genetics about complex processes such as coordinate regulation, homeostasis, and pleiotropy.
Affiliation(s)
- Daniel Shriner
- Center for Research on Genomics and Global Health, National Human Genome Research Institute Bethesda, MD, USA
|
5
|
Bardel C, Danjean V, Morange P, Génin E, Darlu P. On the use of phylogeny-based tests to detect association between quantitative traits and haplotypes. Genet Epidemiol 2010; 33:729-39. [PMID: 19399905 DOI: 10.1002/gepi.20425] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
Abstract
With the increasing availability of genetic data, several SNPs in a candidate gene can be combined into haplotypes to test for association with a quantitative trait. When the number of SNPs increases, the number of haplotypes can become very large and there is a need to group them together. The use of the phylogenetic relationships between haplotypes provides a natural and efficient way of grouping. Moreover, it allows us to identify disease or quantitative trait-related loci. In this article, we describe ALTree-q, a phylogeny-based approach to test for association between quantitative traits and haplotypes and to identify putative quantitative trait nucleotides (QTN). This study focuses on the ALTree-q association test, which is based on one-way analyses of variance (ANOVA) performed at the different levels of the tree. The statistical properties (type-one error and power rates) were estimated through simulations under different genetic models and were compared to another phylogeny-based test, TreeScan (Templeton, 2005), and to a haplotypic omnibus test consisting of a one-way ANOVA between all haplotypes. For dominant and additive models, ALTree-q is usually the most powerful test, whereas TreeScan performs better under a recessive model. However, power depends strongly on the recurrence rate of the QTN, on the QTN allele frequency, and on the linkage disequilibrium between the QTN and other markers. An application of the method on Thrombin Activatable Fibrinolysis Inhibitor Antigen levels in European and African samples confirms a possible association with polymorphisms of the CPB2 gene and identifies several QTNs.
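The haplotypic omnibus test mentioned in this abstract is a one-way ANOVA of the trait across haplotype groups. A minimal sketch of the F statistic for that comparison is given below. It is not ALTree-q itself, which applies such tests at successive levels of a haplotype tree. The function name and the trait values are hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; `groups` is a list of lists of trait values."""
    k = len(groups)                      # number of haplotype groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical trait values for carriers of three haplotypes.
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

The statistic would be compared against an F distribution with k-1 and n-k degrees of freedom, or against a permutation distribution when the normality assumption is doubtful.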
|
6
|
Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of P-values. Genet Epidemiol 2010; 33:700-9. [PMID: 19333968 DOI: 10.1002/gepi.20422] [Citation(s) in RCA: 222] [Impact Index Per Article: 15.9] [Indexed: 11/08/2022]
Abstract
It is increasingly recognized that pathway analysis, a joint test of association between the outcome and a group of single nucleotide polymorphisms (SNPs) within a biological pathway, could potentially complement single-SNP analysis and provide additional insights into the genetic architecture of complex diseases. Building upon existing P-value combining methods, we propose a class of highly flexible pathway analysis approaches based on an adaptive rank truncated product statistic that can effectively combine evidence of associations over different SNPs and genes within a pathway. The statistical significance of the pathway-level test statistics is evaluated using a highly efficient permutation algorithm that remains computationally feasible irrespective of the size of the pathway and complexity of the underlying test statistics for summarizing SNP- and gene-level associations. We demonstrate through simulation studies that a gene-based analysis that treats the underlying genes, as opposed to the underlying SNPs, as the basic units for hypothesis testing, is a very robust and powerful approach to pathway-based association testing. We also illustrate the advantage of the proposed methods using a study of the association between the nicotinic receptor pathway and cigarette smoking behaviors.
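The idea of an adaptive rank truncated product can be sketched at the SNP level as follows. For each candidate truncation point k, the k smallest P-values are multiplied together, the significance of that product is estimated from permutations, and the smallest of those estimates is then itself calibrated against the same permutations. This is a simplified, assumption-laden illustration, not the authors' efficient algorithm or their gene-based variant. The function names, the truncation points, and the toy P-values are all hypothetical.

```python
import math
import random


def rtp_stat(pvals, k):
    """Log of the product of the k smallest P-values (more negative = stronger)."""
    return sum(math.log(p) for p in sorted(pvals)[:k])


def artp(observed, permuted, truncation_points):
    """Adaptive rank truncated product P-value.

    observed: m P-values from the real data.
    permuted: B lists of m P-values, one per permutation of the outcome.
    """
    rows = [observed] + permuted         # row 0 holds the observed data
    n = len(rows)
    min_p = [1.0] * n
    for k in truncation_points:
        stats = [rtp_stat(r, k) for r in rows]
        for i, s in enumerate(stats):
            # Empirical P-value of row i at this k, using all rows as reference.
            p_k = sum(1 for t in stats if t <= s) / n
            min_p[i] = min(min_p[i], p_k)
    # Second layer: calibrate the best-over-k P-value against the permutations.
    return sum(1 for v in min_p if v <= min_p[0]) / n


# Toy data: two strong signals among four SNPs, 99 simulated null permutations.
random.seed(0)
observed = [1e-6, 1e-5, 0.5, 0.8]
permuted = [[random.uniform(0.01, 1.0) for _ in observed] for _ in range(99)]
p_value = artp(observed, permuted, truncation_points=(1, 2, 4))
```

Reusing one set of permutations for both layers avoids a nested permutation loop. The smallest attainable P-value is 1/(B+1), so B bounds the resolution of the test.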
Affiliation(s)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland 20892, USA.
|
7
|
The diverse applications of cladistic analysis of molecular evolution, with special reference to nested clade analysis. Int J Mol Sci 2010; 11:124-39. [PMID: 20162005 PMCID: PMC2820993 DOI: 10.3390/ijms11010124] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 11/25/2009] [Revised: 01/06/2010] [Accepted: 01/06/2010] [Indexed: 11/17/2022] Open
Abstract
The genetic variation found in small regions of the genomes of many species can be arranged into haplotype trees that reflect the evolutionary genealogy of the DNA lineages found in that region and the accumulation of mutations on those lineages. This review demonstrates some of the many ways in which clades (branches) of haplotype trees have been applied in recent years, including the study of genotype/phenotype associations at candidate loci and in genome-wide association studies, the phylogeographic history of species, human evolution, the conservation of endangered species, and the identification of species.
|
8
|
|
9
|
Yu K, Wheeler W, Li Q, Bergen AW, Caporaso N, Chatterjee N, Chen J. A partially linear tree-based regression model for multivariate outcomes. Biometrics 2009; 66:89-96. [PMID: 19432770 DOI: 10.1111/j.1541-0420.2009.01235.x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Abstract
In the genetic study of complex traits, especially behavior related ones, such as smoking and alcoholism, usually several phenotypic measurements are obtained for the description of the complex trait, but no single measurement can quantify fully the complicated characteristics of the symptom because of our lack of understanding of the underlying etiology. If those phenotypes share a common genetic mechanism, rather than studying each individual phenotype separately, it is more advantageous to analyze them jointly as a multivariate trait to enhance the power to identify associated genes. We propose a multilocus association test for the study of multivariate traits. The test is derived from a partially linear tree-based regression model for multiple outcomes. This novel tree-based model provides a formal statistical testing framework for the evaluation of the association between a multivariate outcome and a set of candidate predictors, such as markers within a gene or pathway, while accommodating adjustment for other covariates. Through simulation studies we show that the proposed method has an acceptable type I error rate and improved power over the univariate outcome analysis, which studies each component of the complex trait separately with multiple-comparison adjustment. A candidate gene association study of multiple smoking-related phenotypes is used to demonstrate the application and advantages of this new method. The proposed method is general enough to be used for the assessment of the joint effect of a set of multiple risk factors on a multivariate outcome in other biomedical research settings.
Affiliation(s)
- Kai Yu
- Division of Cancer Epidemiology and Genetics, NCI, Rockville, Maryland 20892, USA.
|
10
|
Calus MPL, Meuwissen THE, Windig JJ, Knol EF, Schrooten C, Vereijken ALJ, Veerkamp RF. Effects of the number of markers per haplotype and clustering of haplotypes on the accuracy of QTL mapping and prediction of genomic breeding values. Genet Sel Evol 2009; 41:11. [PMID: 19284677 PMCID: PMC3225874 DOI: 10.1186/1297-9686-41-11] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Received: 12/17/2008] [Accepted: 01/15/2009] [Indexed: 11/26/2022] Open
Abstract
The aim of this paper was to compare the effect of haplotype definition on the precision of QTL-mapping and on the accuracy of predicted genomic breeding values. In a multiple QTL model using identity-by-descent (IBD) probabilities between haplotypes, various haplotype definitions were tested, i.e., including 2, 6, 12 or 20 marker alleles and clustering base haplotypes related with an IBD probability of > 0.55, 0.75 or 0.95. Simulated data contained 1100 animals with known genotypes and phenotypes and 1000 animals with known genotypes and unknown phenotypes. Genomes comprising 3 Morgan were simulated and contained 74 polymorphic QTL and 383 polymorphic SNP markers with an average r² value of 0.14 between adjacent markers. The total number of haplotypes decreased up to 50% when the window size was increased from two to 20 markers and decreased by at least 50% when haplotypes related with an IBD probability of > 0.55 instead of > 0.95 were clustered. An intermediate window size led to more precise QTL mapping. Window size and clustering had a limited effect on the accuracy of predicted total breeding values, ranging from 0.79 to 0.81. Our conclusion is that different optimal window sizes should be used in QTL-mapping versus genome-wide breeding value prediction.
Affiliation(s)
- Mario P L Calus
- Animal Breeding and Genomics Centre, Animal Sciences Group, Wageningen University and Research Centre, Lelystad, The Netherlands.
|
11
|
Abo R, Knight S, Wong J, Cox A, Camp NJ. hapConstructor: automatic construction and testing of haplotypes in a Monte Carlo framework. Bioinformatics 2008; 24:2105-7. [PMID: 18653522 PMCID: PMC2530882 DOI: 10.1093/bioinformatics/btn359] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
Summary: Haplotypes carry important information that can direct investigators towards underlying susceptibility variants, and hence multiple tagging single nucleotide polymorphisms (tSNPs) are usually studied in candidate gene association studies. However, it is often unknown which SNPs should be included in haplotype analyses, or which tests should be performed for maximum power. We have developed a program, hapConstructor, which automatically builds multi-locus SNP sets to test for association in a case-control framework. The multi-SNP sets considered need not be contiguous; they are built based on significance. An important feature is that the missing data imputation is carried out based on the full data, for maximal information and consistency. HapConstructor is implemented in a Monte Carlo framework and naturally extends to allow for significance testing and false discovery rates that account for the construction process and to related individuals. HapConstructor is a useful tool for exploring multi-locus associations in candidate genes and regions. Availability: http://www-genepi.med.utah.edu/Genie Contact: ryan.abo@hsc.utah.edu
Affiliation(s)
- Ryan Abo
- Department of Biomedical Informatics, University of Utah, UT, USA.
|
12
|
Tiwari HK, Barnholtz-Sloan J, Wineinger N, Padilla MA, Vaughan LK, Allison DB. Review and evaluation of methods correcting for population stratification with a focus on underlying statistical principles. Hum Hered 2008; 66:67-86. [PMID: 18382087 PMCID: PMC2803696 DOI: 10.1159/000119107] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 01/06/2023] Open
Abstract
When two or more populations have been separated by geographic or cultural boundaries for many generations, drift, spontaneous mutations, differential selection pressures and other factors may lead to allele frequency differences among populations. If these 'parental' populations subsequently come together and begin inter-mating, disequilibrium among linked markers may span a greater genetic distance than it typically does among populations under panmixia [see glossary]. This extended disequilibrium can make association studies highly effective and more economical than disequilibrium mapping in panmictic populations since fewer marker loci are needed to detect regions of the genome that harbor phenotype-influencing loci. However, under some circumstances, this process of intermating (as well as other processes) can produce disequilibrium between pairs of unlinked loci and thus create the possibility of confounding or spurious associations due to this population stratification. Accordingly, researchers are advised to employ valid statistical tests for linkage disequilibrium mapping allowing conduct of genetic association studies that control for such confounding. Many recent papers have addressed this need. We provide a comprehensive review of advances made in recent years in correcting for population stratification and then evaluate and synthesize these methods based on statistical principles such as (1) randomization, (2) conditioning on sufficient statistics, and (3) identifying whether the method is based on testing the genotype-phenotype covariance (conditional upon familial information) and/or testing departures of the marginal distribution from the expected genotypic frequencies.
Affiliation(s)
- Hemant K Tiwari
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
|
13
|
Abstract
Association methods based on linkage disequilibrium (LD) offer a promising approach for detecting genetic variations that are responsible for complex human diseases. Although methods based on individual single nucleotide polymorphisms (SNPs) may lead to significant findings, methods based on haplotypes comprising multiple SNPs on the same inherited chromosome may provide additional power for mapping disease genes and also provide insight on factors influencing the dependency among genetic markers. Such insights may provide information essential for understanding human evolution and also for identifying cis-interactions between two or more causal variants. Because obtaining haplotype information directly from experiments can be cost prohibitive in most studies, especially in large scale studies, haplotype analysis presents many unique challenges. In this chapter, we focus on two main issues: haplotype inference and haplotype-association analysis. We first provide a detailed review of methods for haplotype inference using unrelated individuals as well as related individuals from pedigrees. We then cover a number of statistical methods that employ haplotype information in association analysis. In addition, we discuss the advantages and limitations of different methods.
Affiliation(s)
- Nianjun Liu
- Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
14
|
Liu J, Papasian C, Deng HW. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies. PLoS Genet 2007; 3:e46. [PMID: 17381242 PMCID: PMC1829402 DOI: 10.1371/journal.pgen.0030046] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 08/22/2006] [Accepted: 02/13/2007] [Indexed: 11/21/2022] Open
Abstract
In case-control studies, genetic associations for complex diseases may be probed either with single-locus tests or with haplotype-based tests. Although there are different views on the relative merits and preferences of the two test strategies, haplotype-based analyses are generally believed to be more powerful to detect genes with modest effects. However, a main drawback of haplotype-based association tests is the large number of distinct haplotypes, which increases the degrees of freedom for corresponding test statistics and thus reduces the statistical power. To decrease the degrees of freedom and enhance the efficiency and power of haplotype analysis, we propose an improved haplotype clustering method that is based on the haplotype cladistic analysis developed by Durrant et al. In our method, we attempt to combine the strengths of single-locus analysis and haplotype-based analysis into one single test framework. Novel in our method is that we develop a more informative haplotype similarity measurement by using p-values obtained from single-locus association tests to construct a measure of weight, which to some extent incorporates the information of disease outcomes. The weights are then used in computation of similarity measures to construct distance metrics between haplotype pairs in haplotype cladistic analysis. To assess our proposed new method, we performed simulation analyses to compare the relative performances of (1) conventional haplotype-based analysis using original haplotype, (2) single-locus allele-based analysis, (3) original haplotype cladistic analysis (CLADHC) by Durrant et al., and (4) our weighted haplotype cladistic analysis method, under different scenarios. Our weighted cladistic analysis method shows an increased statistical power and robustness, compared with the methods of haplotype cladistic analysis, single-locus test, and the traditional haplotype-based analyses. 
The real data analyses also show that our proposed method has practical significance in the human genetics field. Methods of haplotype-based analysis and single-locus analysis are widely used in genetic association studies. There is no consensus as to the best strategy for the performance of the two methods. Although haplotype-based analysis is a powerful tool, the large number of distinct haplotypes may reduce its efficiency. Haplotype clustering analysis is a promising way of decreasing haplotype dimensionality. A potential limitation of many existing clustering methods is that they do not allow the clustering to adapt to the position of the underlying trait locus. In this study, we proposed a weighted haplotype cladistic analysis method by incorporating a single-locus test into haplotype clustering. Under this framework, relationships between single loci and the disease outcomes can be considered when creating the hierarchical tree of haplotypes. The extensive simulations show that our method is robust against varied simulation conditions and is more powerful than either the original unweighted cladistic analysis method or single-locus analysis methods in case-control studies. Our hybrid method combining haplotype-based and single-locus analyses can be readily extended to whole genome association studies.
Affiliation(s)
- Jianfeng Liu
- Department of Orthopedic Surgery, School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Department of Basic Medical Science, School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Chris Papasian
- Department of Basic Medical Science, School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Hong-Wen Deng
- Department of Orthopedic Surgery, School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Department of Basic Medical Science, School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan, People's Republic of China
- The Key Laboratory of Biomedical Information Engineering of Ministry of Education and Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, People's Republic of China
|
15
|
Chen J, Yu K, Hsing A, Therneau TM. A partially linear tree-based regression model for assessing complex joint gene-gene and gene-environment effects. Genet Epidemiol 2007; 31:238-51. [PMID: 17266115 DOI: 10.1002/gepi.20205] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/08/2022]
Abstract
The success of genetic dissection of complex diseases may greatly benefit from judicious exploration of joint gene effects, which, in turn, critically depends on the power of statistical tools. Standard regression models are convenient for assessing main effects and low-order gene-gene interactions but not for exploring complex higher-order interactions. Tree-based methodology is an attractive alternative for disentangling possible interactions, but it has difficulty in modeling additive main effects. This work proposes a new class of semiparametric regression models, termed partially linear tree-based regression (PLTR) models, which exhibit the advantages of both generalized linear regression and tree models. A PLTR model quantifies joint effects of genes and other risk factors by a combination of linear main effects and a non-parametric tree structure. We propose an iterative algorithm to fit the PLTR model, and a unified resampling approach for identifying and testing the significance of the optimal "pruned" tree nested within the tree resultant from the fitting algorithm. Simulation studies showed that the resampling procedure maintained the correct type I error rate. We applied the PLTR model to assess the association between biliary stone risk and 53 single nucleotide polymorphisms (SNPs) in the inflammation pathway in a population-based case-control study. The analysis yielded an interesting parsimonious summary of the joint effect of all SNPs. The proposed model is also useful for exploring gene-environment interactions and has broad implications for applying the tree methodology to genetic epidemiology research.
Affiliation(s)
- Jinbo Chen
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104, USA.
|
16
|
Yu K, Martin R, Rothman N, Zheng T, Lan Q. Two-sample comparison based on prediction error, with applications to candidate gene association studies. Ann Hum Genet 2007; 71:107-18. [PMID: 17227481 DOI: 10.1111/j.1469-1809.2006.00306.x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 01/01/2023]
Abstract
To take advantage of the increasingly available high-density SNP maps across the genome, various tests that compare multilocus genotypes or estimated haplotypes between cases and controls have been developed for candidate gene association studies. Here we view this two-sample testing problem from the perspective of supervised machine learning and propose a new association test. The approach adopts the flexible and easy-to-understand classification tree model as the learning machine, and uses the estimated prediction error of the resulting prediction rule as the test statistic. This procedure not only provides an association test but also generates a prediction rule that can be useful in understanding the mechanisms underlying complex disease. Under the set-up of a haplotype-based transmission/disequilibrium test (TDT) type of analysis, we find through simulation studies that the proposed procedure has the correct type I error rates and is robust to population stratification. The power of the proposed procedure is sensitive to the chosen prediction error estimator. Among commonly used prediction error estimators, the .632+ estimator results in a test that has the best overall performance. We also find that the test using the .632+ estimator is more powerful than the standard single-point TDT analysis, the Pearson's goodness-of-fit test based on estimated haplotype frequencies, and two haplotype-based global tests implemented in the genetic analysis package FBAT. To illustrate the application of the proposed method in population-based association studies, we use the procedure to study the association between non-Hodgkin lymphoma and the IL10 gene.
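The .632+ estimator named in this abstract blends the optimistic training error with the pessimistic out-of-bootstrap error, shifting weight toward the latter as overfitting grows. The sketch below follows the form commonly given in textbook treatments and assumes the three input error rates have already been computed. The function and argument names are hypothetical, and clipping the relative overfitting rate to [0, 1] is a safeguard added here rather than a detail taken from the paper.

```python
def err_632_plus(err_train, err_oob, gamma):
    """.632+ prediction error estimate.

    err_train: error of the rule on its own training data.
    err_oob:   average error on observations left out of each bootstrap sample.
    gamma:     no-information error rate (error if labels and predictors were unrelated).
    """
    if gamma > err_train and err_oob > err_train:
        # Relative overfitting rate, clipped to 1 (a safeguard assumed here).
        r = min((err_oob - err_train) / (gamma - err_train), 1.0)
    else:
        r = 0.0
    # The weight moves from 0.632 (no overfitting) toward 1 (severe overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err_oob


# Hypothetical error rates for a classification tree.
estimate = err_632_plus(err_train=0.1, err_oob=0.3, gamma=0.5)
```

In the testing procedure described above, such an estimate serves as the test statistic, with its null distribution obtained by recomputing it after permuting case and control labels.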
Affiliation(s)
- K Yu
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
|