1
|
Wang DR, Guadagno CR, Mao X, Mackay DS, Pleban JR, Baker RL, Weinig C, Jannink JL, Ewers BE. A framework for genomics-informed ecophysiological modeling in plants. J Exp Bot 2019; 70:2561-2574. [PMID: 30825375 PMCID: PMC6487588 DOI: 10.1093/jxb/erz090] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/14/2018] [Accepted: 02/18/2019] [Indexed: 05/06/2023]
Abstract
Dynamic process-based plant models capture complex physiological response across time, carrying the potential to extend simulations out to novel environments and lend mechanistic insight to observed phenotypes. Despite the translational opportunities for varietal crop improvement that could be unlocked by linking natural genetic variation to first principles-based modeling, these models are challenging to apply to large populations of related individuals. Here we use a combination of model development, experimental evaluation, and genomic prediction in Brassica rapa L. to set the stage for future large-scale process-based modeling of intraspecific variation. We develop a new canopy growth submodel for B. rapa within the process-based model Terrestrial Regional Ecosystem Exchange Simulator (TREES), test input parameters for feasibility of direct estimation with observed phenotypes across cultivated morphotypes and indirect estimation using genomic prediction on a recombinant inbred line population, and explore model performance on an in silico population under non-stressed and mild water-stressed conditions. We find evidence that the updated whole-plant model has the capacity to distill genotype by environment interaction (G×E) into tractable components. The framework presented offers a means to link genetic variation with environment-modulated plant response and serves as a stepping stone towards large-scale prediction of unphenotyped, genetically related individuals under untested environmental scenarios.
Affiliation(s)
- Diane R Wang
- Geography Department, University at Buffalo, Buffalo, NY, USA
- Xiaowei Mao
- Plant Breeding and Genetics Section, Cornell University, Ithaca, NY, USA
- D Scott Mackay
- Geography Department, University at Buffalo, Buffalo, NY, USA
- Cynthia Weinig
- Botany Department, University of Wyoming, Laramie, WY, USA
- Jean-Luc Jannink
- Plant Breeding and Genetics Section, Cornell University, Ithaca, NY, USA
- USDA-ARS, Ithaca, NY, USA
- Brent E Ewers
- Botany Department, University of Wyoming, Laramie, WY, USA
|
2
|
Kayondo SI, Pino Del Carpio D, Lozano R, Ozimati A, Wolfe M, Baguma Y, Gracen V, Offei S, Ferguson M, Kawuki R, Jannink JL. Genome-wide association mapping and genomic prediction for CBSD resistance in Manihot esculenta. Sci Rep 2018; 8:1549. [PMID: 29367617 PMCID: PMC5784162 DOI: 10.1038/s41598-018-19696-1] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 07/28/2017] [Accepted: 01/08/2018] [Indexed: 12/04/2022] Open
Abstract
Cassava (Manihot esculenta Crantz) is an important food security crop that faces severe yield losses due to cassava brown streak disease (CBSD). Motivated by the slow progress of conventional breeding, genetic improvement of cassava is undergoing rapid change due to the implementation of quantitative trait loci mapping, genome-wide association mapping (GWAS), and genomic selection (GS). In this study, two breeding panels were genotyped for SNP markers using genotyping by sequencing and phenotyped for foliar and CBSD root symptoms at five locations in Uganda. Our GWAS found two regions associated with CBSD, one on chromosome 4, which co-localizes with a Manihot glaziovii introgression segment, and one on chromosome 11, which contains a cluster of nucleotide-binding site-leucine-rich repeat (NBS-LRR) genes. We evaluated the potential of GS to improve CBSD resistance by assessing the accuracy of seven prediction models. Predictive accuracy values varied between CBSD foliar severity traits at 3 months after planting (MAP) (0.27-0.32), 6 MAP (0.40-0.42) and root severity (0.31-0.42). For all traits, Random Forest and reproducing kernel Hilbert spaces regression showed the highest predictive accuracies. Our results provide an insight into the genetics of CBSD resistance to guide CBSD marker-assisted breeding and highlight the potential of GS to improve cassava breeding.
Affiliation(s)
- Siraj Ismail Kayondo
- National Crop Resources Research Institute (NaCRRI), P.O. Box 7084, Kampala, Uganda
- West Africa Center for Crop Improvement (WACCI), University of Ghana, Accra, Ghana
- Dunia Pino Del Carpio
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- Roberto Lozano
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- Alfred Ozimati
- National Crop Resources Research Institute (NaCRRI), P.O. Box 7084, Kampala, Uganda
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- Marnin Wolfe
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- Yona Baguma
- National Crop Resources Research Institute (NaCRRI), P.O. Box 7084, Kampala, Uganda
- Vernon Gracen
- West Africa Center for Crop Improvement (WACCI), University of Ghana, Accra, Ghana
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- Samuel Offei
- West Africa Center for Crop Improvement (WACCI), University of Ghana, Accra, Ghana
- Morag Ferguson
- International Institute for Tropical Agriculture (IITA), Nairobi, Kenya
- Robert Kawuki
- National Crop Resources Research Institute (NaCRRI), P.O. Box 7084, Kampala, Uganda
- Jean-Luc Jannink
- School of Integrative Plant Sciences, Section of Plant Breeding and Genetics, Cornell University, Ithaca, New York, USA
- US Department of Agriculture, Agricultural Research Service (USDA-ARS), Ithaca, New York, USA
|
3
|
Stephan J, Stegle O, Beyer A. A random forest approach to capture genetic effects in the presence of population structure. Nat Commun 2015; 6:7432. [DOI: 10.1038/ncomms8432] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 07/27/2014] [Accepted: 05/08/2015] [Indexed: 01/07/2023] Open
|
4
|
Beam AL, Motsinger-Reif A, Doyle J. Bayesian neural networks for detecting epistasis in genetic association studies. BMC Bioinformatics 2014; 15:368. [PMID: 25413600 PMCID: PMC4256933 DOI: 10.1186/s12859-014-0368-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/25/2014] [Accepted: 10/30/2014] [Indexed: 12/02/2022] Open
Abstract
Background Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. Results A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal they are able to efficiently and accurately determine which variants are involved in determining case-control status. By using graphics processing units (GPUs) the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. Conclusions The proposed framework is shown to be a powerful method for detecting causal SNPs while being computationally efficient enough to handle large datasets.
Affiliation(s)
- Andrew L Beam
- Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Alison Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.
- Department of Statistics, North Carolina State University, Raleigh, NC, USA.
- Jon Doyle
- Department of Computer Science, North Carolina State University, Raleigh, NC, USA.
|
5
|
Li CF, Luo FT, Zeng YX, Jia WH. Weighted risk score-based multifactor dimensionality reduction to detect gene-gene interactions in nasopharyngeal carcinoma. Int J Mol Sci 2014; 15:10724-37. [PMID: 24933637 PMCID: PMC4100176 DOI: 10.3390/ijms150610724] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 03/04/2014] [Revised: 04/21/2014] [Accepted: 06/03/2014] [Indexed: 12/02/2022] Open
Abstract
Determining the complex relationships between diseases, polymorphisms in human genes and environmental factors is challenging. Multifactor dimensionality reduction (MDR) has been proven to be capable of effectively detecting the statistical patterns of epistasis, although classification accuracy is required for this approach. The imbalanced dataset can cause seriously negative effects on classification accuracy. Moreover, MDR methods cannot quantitatively assess the disease risk of genotype combinations. Hence, we introduce a novel weighted risk score-based multifactor dimensionality reduction (WRSMDR) method that uses the Bayesian posterior probability of polymorphism combinations as a new quantitative measure of disease risk. First, we compared the WRSMDR to the MDR method in simulated datasets. Our results showed that the WRSMDR method had reasonable power to identify high-order gene-gene interactions, and it was more effective than MDR at detecting four-locus models. Moreover, WRSMDR reveals more information regarding the effect of genotype combination on the disease risk, and the result was easier to determine and apply than with MDR. Finally, we applied WRSMDR to a nasopharyngeal carcinoma (NPC) case-control study and identified a statistically significant high-order interaction among three polymorphisms: rs2860580, rs11865086 and rs2305806.
Affiliation(s)
- Chao-Feng Li
- Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou 510080, China.
- Fu-Tian Luo
- Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou 510080, China.
- Yi-Xin Zeng
- State Key Laboratory of Oncology in South China, Sun Yat-sen University Cancer Center, Guangzhou 510060, China.
- Wei-Hua Jia
- State Key Laboratory of Oncology in South China, Sun Yat-sen University Cancer Center, Guangzhou 510060, China.
|
6
|
Impact of natural genetic variation on gene expression dynamics. PLoS Genet 2013; 9:e1003514. [PMID: 23754949 PMCID: PMC3674999 DOI: 10.1371/journal.pgen.1003514] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 10/30/2012] [Accepted: 04/04/2013] [Indexed: 01/03/2023] Open
Abstract
DNA sequence variation causes changes in gene expression, which in turn has profound effects on cellular states. These variations affect tissue development and may ultimately lead to pathological phenotypes. A genetic locus containing a sequence variation that affects gene expression is called an “expression quantitative trait locus” (eQTL). Whereas the impact of cellular context on expression levels in general is well established, a lot less is known about the cell-state specificity of eQTL. Previous studies differed with respect to how “dynamic eQTL” were defined. Here, we propose a unified framework distinguishing static, conditional and dynamic eQTL and suggest strategies for mapping these eQTL classes. Further, we introduce a new approach to simultaneously infer eQTL from different cell types. By using murine mRNA expression data from four stages of hematopoiesis and 14 related cellular traits, we demonstrate that static, conditional and dynamic eQTL, although derived from the same expression data, represent functionally distinct types of eQTL. While static eQTL affect generic cellular processes, non-static eQTL are more often involved in hematopoiesis and immune response. Our analysis revealed substantial effects of individual genetic variation on cell type-specific expression regulation. Among a total number of 3,941 eQTL we detected 2,729 static eQTL, 1,187 eQTL were conditionally active in one or several cell types, and 70 eQTL affected expression changes during cell type transitions. We also found evidence for feedback control mechanisms reverting the effect of an eQTL specifically in certain cell types. Loci correlated with hematological traits were enriched for conditional eQTL, thus, demonstrating the importance of conditional eQTL for understanding molecular mechanisms underlying physiological trait variation. The classification proposed here has the potential to streamline and unify future analysis of conditional and dynamic eQTL as well as many other kinds of QTL data.
Complex physiological traits are affected through subtle changes of molecular traits like gene expression in the relevant tissues, which in turn are caused by genetic variation. A genetic locus containing a sequence variation affecting gene expression is called an expression quantitative trait locus (eQTL). Understanding the tissue and cell type specificity of eQTL effects is essential for revealing the molecular mechanisms underlying disease phenotypes. However, so far the cell-state dependence of eQTL is poorly understood. In order to systematically assess the importance of cell state-specific eQTL, we propose to distinguish static, conditional and dynamic eQTL and suggest strategies for mapping these eQTL classes. We applied our framework to mouse gene expression data from four hematopoietic stages and related cellular traits. The different eQTL classes, although derived from the same expression data, represent functionally distinct types of eQTL. Importantly, conditional eQTL are well correlated with relevant hematological traits. These findings emphasize the condition specificity of many regulatory relationships, even if the conditions under study are related. This calls for due caution when transferring conclusions about regulatory mechanisms across cell types or tissues. The proposed classification will also help to unravel dynamic behaviors in many other kinds of QTL data.
|
7
|
Gory JJ, Sweeney HC, Reif DM, Motsinger-Reif AA. A comparison of internal model validation methods for multifactor dimensionality reduction in the case of genetic heterogeneity. BMC Res Notes 2012; 5:623. [PMID: 23126544 PMCID: PMC3599301 DOI: 10.1186/1756-0500-5-623] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 05/07/2012] [Accepted: 10/29/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Determining the genes responsible for certain human traits can be challenging when the underlying genetic model takes a complicated form such as heterogeneity (in which different genetic models can result in the same trait) or epistasis (in which genes interact with other genes and the environment). Multifactor Dimensionality Reduction (MDR) is a widely used method that effectively detects epistasis; however, it does not perform well in the presence of heterogeneity partly due to its reliance on cross-validation for internal model validation. Cross-validation allows for only one "best" model and is therefore inadequate when more than one model could cause the same trait. We hypothesize that another internal model validation method known as a three-way split will be better at detecting heterogeneity models. RESULTS In this study, we test this hypothesis by performing a simulation study to compare the performance of MDR to detect models of heterogeneity with the two different internal model validation techniques. We simulated a range of disease models with both main effects and gene-gene interactions with a range of effect sizes. We assessed the performance of each method using a range of definitions of power. CONCLUSIONS Overall, the power of MDR to detect heterogeneity models was relatively poor, especially under more conservative (strict) definitions of power. While the overall power was low, our results show that the cross-validation approach greatly outperformed the three-way split approach in detecting heterogeneity. This would motivate using cross-validation with MDR in studies where heterogeneity might be present. These results also emphasize the challenge of detecting heterogeneity models and the need for further methods development.
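The two internal validation schemes compared in this abstract differ in how samples are partitioned. The following is a minimal sketch of that difference only, not the authors' implementation: the function names, the fold count, and the 50/25/25 proportions are assumptions chosen for illustration, and no MDR model fitting is shown.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def three_way_split(n_samples, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return disjoint (training, testing, validation) index lists.

    The fractions are illustrative assumptions, not values from the paper.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(fractions[0] * n_samples)
    n_test = int(fractions[1] * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])
```

With cross-validation every sample contributes to both fitting and evaluation across the folds, whereas the three-way split fits on only part of the data once; that difference in usable sample size is one plausible reading of why the two schemes can differ in power.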
Affiliation(s)
- Jeffrey J Gory
- Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
|
8
|
Urbanowicz RJ, Kiralis J, Fisher JM, Moore JH. Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection. BioData Min 2012; 5:15. [PMID: 23014095 PMCID: PMC3549792 DOI: 10.1186/1756-0381-5-15] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 04/24/2012] [Accepted: 09/14/2012] [Indexed: 11/30/2022] Open
Abstract
Background Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection. Results We evaluate three metrics as predictors of relative model detection difficulty derived from previous works: (1) Penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model’s EDM and COR are each stronger predictors of model detection success than heritability. Conclusions This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.
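The abstract describes metrics "calculated from the penetrance values and respective genotype frequencies" of a model. As a hedged illustration of that kind of calculation, the sketch below computes prevalence, heritability, and single-locus marginal penetrances for a two-locus penetrance table. It assumes Hardy-Weinberg genotype frequencies and the standard definition of heritability as penetrance variance divided by K(1 - K); it does not implement PTV, COR, or EDM, whose exact formulas are not given in the abstract. All names and the example table are hypothetical.

```python
from itertools import product


def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1, 2 copies of the minor allele."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def model_summary(penetrance, maf_a, maf_b):
    """Summarise a 3x3 two-locus penetrance table.

    Returns (prevalence, heritability, marginals_a, marginals_b).
    """
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    cells = [(fa[i] * fb[j], penetrance[i][j])
             for i, j in product(range(3), repeat=2)]
    prevalence = sum(f * p for f, p in cells)
    variance = sum(f * (p - prevalence) ** 2 for f, p in cells)
    heritability = variance / (prevalence * (1.0 - prevalence))
    # Marginal penetrance of each single-locus genotype; if these all equal
    # the prevalence, neither locus has a main effect on its own.
    marginals_a = [sum(fb[j] * penetrance[i][j] for j in range(3))
                   for i in range(3)]
    marginals_b = [sum(fa[i] * penetrance[i][j] for i in range(3))
                   for j in range(3)]
    return prevalence, heritability, marginals_a, marginals_b


# Illustrative table (not from the paper): with both minor allele
# frequencies at 0.5, every marginal penetrance equals the prevalence.
example_table = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
```

Calling `model_summary(example_table, 0.5, 0.5)` gives a model whose signal is visible only when both loci are considered jointly, which is the situation the abstract's detection-difficulty metrics are meant to rank.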
Affiliation(s)
- Ryan J Urbanowicz
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Dartmouth Medical School, Lebanon, NH, USA.
|
9
|
Che R, Motsinger-Reif AA. A new explained-variance based genetic risk score for predictive modeling of disease risk. Stat Appl Genet Mol Biol 2012; 11:Article 15. [PMID: 23023697 DOI: 10.1515/1544-6115.1796] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]
Abstract
The goal of association mapping is to identify genetic variants that predict disease, and as the field of human genetics matures, the number of successful association studies is increasing. Many such studies have shown that for many diseases, risk is explained by a reasonably large number of variants that each explains a very small amount of disease risk. This is prompting the use of genetic risk scores in building predictive models, where information across several variants is combined for predictive modeling. In the current study, we compare the performance of four previously proposed genetic risk score methods and present a new method for constructing genetic risk score that incorporates explained variance information. The methods compared include: a simple count Genetic Risk Score, an odds ratio weighted Genetic Risk Score, a direct logistic regression Genetic Risk Score, a polygenic Genetic Risk Score, and the new explained variance weighted Genetic Risk Score. We compare the methods using a wide range of simulations in two steps, with a range of the number of deleterious single nucleotide polymorphisms (SNPs) explaining disease risk, genetic modes, baseline penetrances, sample sizes, relative risks (RR) and minor allele frequencies (MAF). Several measures of model performance were compared including overall power, C-statistic and Akaike's Information Criterion. Our results show the relative performance of methods differs significantly, with the new explained variance weighted GRS (EV-GRS) generally performing favorably to the other methods.
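Two of the simpler scores named in this abstract can be written down directly. The sketch below assumes the commonly used definitions: a count of risk alleles across variants, and a sum of risk-allele counts weighted by the natural log of each variant's odds ratio. The abstract itself gives no formulas, so these definitions and the function names are assumptions; the explained-variance weighting is not reproduced because the abstract does not define it.

```python
import math


def simple_count_grs(risk_allele_counts):
    """Unweighted score: total number of risk alleles carried (0, 1 or 2 per SNP)."""
    return sum(risk_allele_counts)


def odds_ratio_weighted_grs(risk_allele_counts, odds_ratios):
    """Weighted score: each SNP's risk-allele count times ln(odds ratio)."""
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("need one odds ratio per SNP")
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))
```

For an individual with risk-allele counts `[0, 1, 2]` at three SNPs, the simple count is 3 regardless of effect size, while the weighted score gives more influence to SNPs with larger odds ratios.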
|
10
|
Canela-Xandri O, Julià A, Gelpí JL, Marsal S. Unveiling case-control relationships in designing a simple and powerful method for detecting gene-gene interactions. Genet Epidemiol 2012; 36:710-6. [PMID: 22886951 DOI: 10.1002/gepi.21665] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/31/2012] [Revised: 06/01/2012] [Accepted: 06/14/2012] [Indexed: 11/10/2022]
Abstract
The detection of gene-gene interactions (i.e., epistasis) in the human genome is becoming decisive for the complete characterization of the genetic factors associated with complex binary traits. Although many methods have been developed to address this challenging issue, their performance remains insufficient. We show how case and control groups store complementary information regarding interactions, and we use this fundamental property in the design of a new, rapid, and highly powerful epistasis analysis method. Unlike previous approaches where statistical methods are tested over a very limited range of situations, we have performed an exhaustive evaluation of the power of our new method. To this end, we also propose a more comprehensive interpretation of epistasis in which genotype interactions may confer risk, be protective, or be neutral. In this extended view of genetic interactions, we demonstrate that our method outperforms existing approaches, thus providing a highly powerful tool for the identification of gene-gene interactions associated with binary traits.
Affiliation(s)
- Oriol Canela-Xandri
- Rheumatology Research Group, Vall d'Hebron Research Institute, Barcelona, Spain
|
11
|
Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM. SNP interaction detection with Random Forests in high-dimensional genetic data. BMC Bioinformatics 2012. [PMID: 22793366 DOI: 10.1186/1471-2105-13-164] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional. RESULTS RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions. CONCLUSIONS While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
Affiliation(s)
- Stacey J Winham
- Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA.
|
12
|
Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM. SNP interaction detection with Random Forests in high-dimensional genetic data. BMC Bioinformatics 2012; 13:164. [PMID: 22793366 PMCID: PMC3463421 DOI: 10.1186/1471-2105-13-164] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Received: 12/21/2011] [Accepted: 04/30/2012] [Indexed: 11/26/2022] Open
Abstract
Background Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional. Results RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions. Conclusions While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
Affiliation(s)
- Stacey J Winham
- Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA.
|
13
|
High-order SNP combinations associated with complex diseases: efficient discovery, statistical power and functional interactions. PLoS One 2012; 7:e33531. [PMID: 22536319 PMCID: PMC3334940 DOI: 10.1371/journal.pone.0033531] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Received: 12/07/2011] [Accepted: 02/10/2012] [Indexed: 11/19/2022] Open
Abstract
There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search which only handles a small number of SNPs (up to few hundreds), or use heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees-of-freedom and the huge number of hypotheses tested during combinatorial search. Due to these challenges, functional interactions in high-order combinations have not been systematically explored. We leverage discriminative-pattern-mining algorithms from the data-mining community to search for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousands of SNPs allows the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset in detail to provide novel insights on the complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases for which the current focus is on individually rare variations.
|
14
|
Wang Y, Liu G, Feng M, Wong L. An empirical comparison of several recent epistatic interaction detection methods. Bioinformatics 2011; 27:2936-43. [DOI: 10.1093/bioinformatics/btr512] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Indexed: 11/12/2022] Open
|
15
|
Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y. Comparative analysis of methods for detecting interacting loci. BMC Genomics 2011; 12:344. [PMID: 21729295 PMCID: PMC3161015 DOI: 10.1186/1471-2164-12-344] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 02/25/2011] [Accepted: 07/05/2011] [Indexed: 12/20/2022]
Abstract
Background: Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.
Results: We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs.
Conclusion: This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulation-tool-bmc-ms9169818735220977/downloads/list.
Affiliation(s)
- Li Chen
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
|
16
|
Lehr T, Yuan J, Zeumer D, Jayadev S, Ritchie MD. Rule based classifier for the analysis of gene-gene and gene-environment interactions in genetic association studies. BioData Min 2011; 4:4. [PMID: 21362183 PMCID: PMC3060133 DOI: 10.1186/1756-0381-4-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 11/06/2009] [Accepted: 03/01/2011] [Indexed: 11/10/2022]
Abstract
Background: Several methods have been presented for the analysis of complex interactions between genetic polymorphisms and/or environmental factors. Despite the available methods, there is still a need for alternative methods, because no single method will perform well in all scenarios. The aim of this work was to evaluate the performance of three selected rule based classifier algorithms, RIPPER, RIDOR and PART, for the analysis of genetic association studies.
Methods: Overall, 42 datasets were simulated with three different case-control models, a varying number of subjects (300, 600), SNPs (500, 1500, 3000) and noise (5%, 10%, 20%). The algorithms were applied to each of the datasets with a set of algorithm-specific settings. Results were further investigated with respect to a) the Model, b) the Rules, and c) the Attribute level. Data analysis was performed using WEKA, SAS and PERL.
Results: The RIPPER algorithm discovered the true case-control model at least once in >33% of the datasets. The RIDOR and PART algorithms performed poorly for model detection. The RIPPER, RIDOR and PART algorithms discovered the true case-control rules in more than 83%, 83% and 44% of the datasets, respectively. All three algorithms were able to detect the attributes utilized in the respective case-control models in most datasets.
Conclusions: The current analyses substantiate the utility of rule based classifiers such as RIPPER, RIDOR and PART for the detection of gene-gene/gene-environment interactions in genetic association studies. These classifiers could provide a valuable new method, complementing existing approaches, in the analysis of genetic association studies. The methods provide an advantage in being able to handle both categorical and continuous variable types. Further, because the outputs of the analyses are easy to interpret, the rule based classifier approach could quickly generate testable hypotheses for additional evaluation. Since the algorithms are computationally inexpensive, they may serve as valuable tools for preselection of attributes to be used in more complex, computationally intensive approaches. Whether used in isolation or in conjunction with other tools, rule based classifiers are an important addition to the armamentarium of tools available for analyses of complex genetic association studies.
Affiliation(s)
- Thorsten Lehr
- Boehringer Ingelheim Pharma GmbH & Co, KG, Department of Drug Metabolism and Pharmacokinetics, 88397 Biberach an der Riss, Germany.
|
17
|
Grady BJ, Ritchie MD. Statistical Optimization of Pharmacogenomics Association Studies: Key Considerations from Study Design to Analysis. CURRENT PHARMACOGENOMICS AND PERSONALIZED MEDICINE 2011; 9:41-66. [PMID: 21887206 PMCID: PMC3163263 DOI: 10.2174/187569211794728805] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 01/11/2023]
Abstract
Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Affiliation(s)
- Benjamin J. Grady
- Department of Molecular Physiology & Biophysics, Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
- Marylyn D. Ritchie
- Department of Molecular Physiology & Biophysics, Center for Human Genetics Research, Vanderbilt University, Nashville, TN, USA
|
18
|
Ritchie MD. Using biological knowledge to uncover the mystery in the search for epistasis in genome-wide association studies. Ann Hum Genet 2011; 75:172-82. [PMID: 21158748 DOI: 10.1111/j.1469-1809.2010.00630.x] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.6] [Indexed: 01/11/2023]
Abstract
The search for the missing heritability in genome-wide association studies (GWAS) has become an important focus for the human genetics community. One suspected location of these genetic effects is in gene-gene interactions, or epistasis. The computational burden of exploring gene-gene interactions in the wealth of data generated in GWAS, along with small to moderate sample sizes, has led to epistasis being an afterthought, rather than a primary focus of GWAS analyses. In this review, I discuss some potential approaches to filter a GWAS dataset to a smaller, more manageable dataset where searching for epistasis is considerably more feasible. I describe a number of alternative approaches, but primarily focus on the use of prior biological knowledge from databases in the public domain to guide the search for epistasis. Prior knowledge can be incorporated into a GWA study in many ways, and these data can be extracted from a variety of database sources. I discuss a number of these approaches and propose that a comprehensive approach will likely be most fruitful for searching for epistasis in current state-of-the-art large-scale genomic studies and into the future.
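The computational argument for filtering can be made concrete with a small arithmetic sketch. The marker counts below (500,000 genome-wide SNPs, 5,000 after filtering) are hypothetical values chosen for illustration and do not come from the review; `pair_count` is likewise an illustrative helper name.

```python
from math import comb

def pair_count(n_snps):
    """Number of distinct two-locus combinations among n_snps markers."""
    return comb(n_snps, 2)

# Hypothetical sizes: a genome-wide panel and a knowledge-filtered subset.
genome_wide = pair_count(500_000)
filtered = pair_count(5_000)

# Ratio of pairwise tests before and after filtering.
reduction = genome_wide / filtered
```

Under these assumed sizes, a 100-fold reduction in markers cuts the number of pairwise tests roughly 10,000-fold, because the pair count grows quadratically with the number of SNPs.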
Affiliation(s)
- Marylyn D Ritchie
- Department of Molecular Physiology, Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232-0700, USA.
|
19
|
A comparison of multifactor dimensionality reduction and L1-penalized regression to identify gene-gene interactions in genetic association studies. Stat Appl Genet Mol Biol 2011; 10:Article 4. [PMID: 21291414 DOI: 10.2202/1544-6115.1613] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Recently, the amount of high-dimensional data has exploded, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may be due to complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods to evaluate potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). Additionally, a number of penalized regression techniques, such as Lasso, are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) to detect gene-gene interactions through a broad range of simulations. We find that each method has both advantages and disadvantages, and relative performance is context dependent. TL1 frequently over-fits, identifying false positive as well as true positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both Lasso methods, main effects tend to dominate. For purely epistatic models, GL1 has the best performance for lower minor allele frequencies, but MDR performs best for higher frequencies. These results provide guidance on when each approach might be best suited for detecting and characterizing interactions with different mechanisms.
|
20
|
Winham SJ, Motsinger-Reif AA. The effect of retrospective sampling on estimates of prediction error for multifactor dimensionality reduction. Ann Hum Genet 2011; 75:46-61. [PMID: 20560921 PMCID: PMC2955770 DOI: 10.1111/j.1469-1809.2010.00587.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/28/2023]
Abstract
The standard in genetic association studies of complex diseases is replication and validation of positive results, with an emphasis on assessing the predictive value of associations. In response to this need, a number of analytical approaches have been developed to identify predictive models that account for complex genetic etiologies. Multifactor Dimensionality Reduction (MDR) is a commonly used, highly successful method designed to evaluate potential gene-gene interactions. MDR relies on classification error in a cross-validation framework to rank and evaluate potentially predictive models. Previous work has demonstrated the high power of MDR, but has not considered the accuracy and variance of the MDR prediction error estimate. Here, we evaluate the bias and variance of the MDR error estimate as both a retrospective and prospective estimator and show that MDR can both underestimate and overestimate error. We argue that a prospective error estimate is necessary if MDR models are used for prediction, and propose a bootstrap resampling estimate, integrating population prevalence, to accurately estimate prospective error. We demonstrate that this bootstrap estimate is preferable for prediction to the error estimate currently produced by MDR. While demonstrated with MDR, the proposed estimation is applicable to all data-mining methods that use similar estimates.
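The idea of folding population prevalence into a bootstrap error estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' estimator: it assumes the prospective error is the prevalence-weighted mix of the false-negative rate among cases and the false-positive rate among controls, and the names `weighted_error` and `bootstrap_prospective_error`, the number of resamples, and the seed are all hypothetical.

```python
import random

def weighted_error(fnr, fpr, prevalence):
    """Prevalence-weighted error: cases contribute fnr, controls contribute fpr."""
    return prevalence * fnr + (1 - prevalence) * fpr

def bootstrap_prospective_error(case_preds, control_preds, prevalence,
                                n_boot=200, seed=0):
    """Average the prevalence-weighted error over bootstrap resamples.

    case_preds / control_preds: predicted labels (1 = case, 0 = control)
    for subjects who are truly cases / truly controls.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample cases and controls separately, with replacement.
        cases = rng.choices(case_preds, k=len(case_preds))
        controls = rng.choices(control_preds, k=len(control_preds))
        fnr = sum(p == 0 for p in cases) / len(cases)
        fpr = sum(p == 1 for p in controls) / len(controls)
        estimates.append(weighted_error(fnr, fpr, prevalence))
    return sum(estimates) / n_boot

# In a balanced case-control sample, fnr = 0.2 and fpr = 0.1 give a
# retrospective error of 0.15; at an assumed prevalence of 5% the
# weighted error is 0.05 * 0.2 + 0.95 * 0.1.
example = weighted_error(0.2, 0.1, 0.05)
```

The gap between the balanced-sample error and the prevalence-weighted value in this toy case shows why an error estimate from a retrospectively sampled study can differ from the error expected when the model is applied prospectively.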
Affiliation(s)
- Stacey J Winham
- Department of Statistics, North Carolina State University, Raleigh, 27695, USA
|
21
|
Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A robust multifactor dimensionality reduction method for detecting gene-gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann Hum Genet 2010; 75:20-8. [PMID: 21091664 DOI: 10.1111/j.1469-1809.2010.00624.x] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Indexed: 01/27/2023]
Abstract
A central goal of human genetics is to identify susceptibility genes for common human diseases. An important challenge is modelling gene-gene interaction or epistasis that can result in nonadditivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modelling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher's Exact Test rather than a predetermined threshold. The advantage of this approach is that only statistically significant genotype combinations are considered in the MDR analysis. We use simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
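The use of Fisher's Exact Test to decide which genotype combinations enter the analysis can be sketched as below. This is an illustrative sketch only, not the RMDR implementation: the 2x2 table layout (cell versus all other cells), the `alpha` cut-off of 0.05, the `"unknown"` label for non-significant cells, and the function names are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

def label_cell(cell_cases, cell_controls, total_cases, total_controls, alpha=0.05):
    """Label one genotype combination 'high', 'low', or 'unknown' (not significant)."""
    p = fisher_exact_two_sided(cell_cases, cell_controls,
                               total_cases - cell_cases,
                               total_controls - cell_controls)
    if p >= alpha:
        return "unknown"
    # Compare the cell's case:control ratio with the overall ratio.
    if cell_cases * total_controls > cell_controls * total_cases:
        return "high"
    return "low"

# Toy counts: 10 cases and 10 controls in total.
strong = label_cell(9, 1, 10, 10)
balanced = label_cell(5, 5, 10, 10)
```

In this toy setting only the strongly skewed cell receives a risk label; the balanced cell is left out of the pooled attribute rather than being forced into one group by a fixed threshold.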
Affiliation(s)
- Jiang Gui
- Dartmouth Medical School, Lebanon, NH 03756, USA
|
22
|
Michaelson JJ, Alberts R, Schughart K, Beyer A. Data-driven assessment of eQTL mapping methods. BMC Genomics 2010; 11:502. [PMID: 20849587 PMCID: PMC2996998 DOI: 10.1186/1471-2164-11-502] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Received: 04/13/2010] [Accepted: 09/17/2010] [Indexed: 11/10/2022]
Abstract
Background: The analysis of expression quantitative trait loci (eQTL) is a potentially powerful way to detect transcriptional regulatory relationships at the genomic scale. However, eQTL data sets often go underexploited because legacy QTL methods are used to map the relationship between the expression trait and genotype. Often these methods are inappropriate for complex traits such as gene expression, particularly in the case of epistasis.
Results: Here we compare legacy QTL mapping methods with several modern multi-locus methods and evaluate their ability to produce eQTL that agree with independent external data in a systematic way. We found that the modern multi-locus methods (Random Forests, sparse partial least squares, lasso, and elastic net) clearly outperformed the legacy QTL methods (Haley-Knott regression and composite interval mapping) in terms of biological relevance of the mapped eQTL. In particular, we found that our new approach, based on Random Forests, showed superior performance among the multi-locus methods.
Conclusions: Benchmarks based on the recapitulation of experimental findings provide valuable insight when selecting the appropriate eQTL mapping method. Our battery of tests suggests that Random Forests map eQTL that are more likely to be validated by independent data, when compared to competing multi-locus and legacy eQTL mapping methods.
Affiliation(s)
- Jacob J Michaelson
- Cellular Networks and Systems Biology, Biotechnology Center - TU Dresden, Dresden, Germany
|
23
|
Winham SJ, Slater AJ, Motsinger-Reif AA. A comparison of internal validation techniques for multifactor dimensionality reduction. BMC Bioinformatics 2010; 11:394. [PMID: 20650002 PMCID: PMC2920275 DOI: 10.1186/1471-2105-11-394] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 08/14/2009] [Accepted: 07/22/2010] [Indexed: 11/24/2022]
Abstract
BACKGROUND: It is hypothesized that common, complex diseases may be due to complex interactions between genetic and environmental factors, which are difficult to detect in high-dimensional data using traditional statistical approaches. Multifactor Dimensionality Reduction (MDR) is the most commonly used data-mining method to detect epistatic interactions. In all data-mining methods, it is important to consider internal validation procedures to obtain prediction estimates to prevent model over-fitting and reduce potential false positive findings. Currently, MDR utilizes cross-validation for internal validation. In this study, we incorporate the use of a three-way split (3WS) of the data in combination with a post-hoc pruning procedure as an alternative to cross-validation for internal model validation to reduce computation time without impairing performance. We compare the power to detect true disease-causing loci using MDR with both 5- and 10-fold cross-validation to MDR with 3WS for a range of single-locus and epistatic disease models. Additionally, we analyze a dataset in HIV immunogenetics to demonstrate the results of the two strategies on real data.
RESULTS: MDR with 3WS is computationally approximately five times faster than 5-fold cross-validation. The power to find the exact true disease loci without detecting false positive loci is higher with 5-fold cross-validation than with 3WS before pruning. However, the power to find the true disease-causing loci in addition to false positive loci is equivalent to the 3WS. With the incorporation of a pruning procedure after the 3WS, the power of the 3WS approach to detect only the exact disease loci is equivalent to that of MDR with cross-validation. In the real data application, the cross-validation and 3WS analyses indicate the same two-locus model.
CONCLUSIONS: Our results reveal that the performance of the two internal validation methods is equivalent with the use of pruning procedures. The specific pruning procedure should be chosen understanding the trade-off between identifying all relevant genetic effects but including false positives and missing important genetic factors. This implies 3WS may be a powerful and computationally efficient approach to screen for epistatic effects, and could be used to identify candidate interactions in large-scale genetic studies.
Affiliation(s)
- Stacey J Winham
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Andrew J Slater
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
- Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA
- Alison A Motsinger-Reif
- Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA
|
24
|
Hua X, Zhang H, Zhang H, Yang Y, Kuk AYC. Testing multiple gene interactions by the ordered combinatorial partitioning method in case-control studies. Bioinformatics 2010; 26:1871-8. [PMID: 20538724 DOI: 10.1093/bioinformatics/btq290] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 01/27/2023]
Abstract
MOTIVATION: The multifactor-dimensionality reduction (MDR) method has been widely used in multi-locus interaction analysis. It reduces dimensionality by partitioning the multi-locus genotypes into a high-risk group and a low-risk group according to whether the genotype-specific risk ratio exceeds a fixed threshold or not. Alternatively, one can maximize the χ² value exhaustively over all possible ways of partitioning the multi-locus genotypes into two groups, and we aim to show that this is computationally feasible.
METHODS: We advocate finding the optimal MDR (OMDR) that would have resulted from an exhaustive search over all possible ways of partitioning the multi-locus genotypes into two groups. It is shown that this optimal MDR can be obtained efficiently using an ordered combinatorial partitioning (OCP) method, which differs from the existing MDR method in the use of a data-driven rather than fixed threshold. The generalized extreme value distribution (GEVD) theory is applied to find the optimal order of gene combination and assess statistical significance of interactions.
RESULTS: The computational complexity of the OCP strategy is linear in the number of multi-locus genotypes in contrast with an exponential order for the naive exhaustive search strategy. Simulation studies show that OMDR can be more powerful than MDR with substantial power gain possible when the partitioning of OMDR is different from that of MDR. The analysis results of a breast cancer dataset show that the use of GEVD accelerates the determination of interaction order and reduces the time cost for P-value calculation by more than 10-fold.
AVAILABILITY: A C++ program is available at http://home.ustc.edu.cn/~zhanghan/ocp/ocp.html
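The fixed-threshold pooling step that this abstract attributes to standard MDR can be sketched as follows. This is a minimal illustration and not the authors' C++ program or the OCP method: it uses the case:control ratio within each multi-locus genotype as the risk measure, and the function names, the default threshold of 1.0, and the toy data are assumptions.

```python
from collections import Counter

def mdr_partition(genotypes, status, threshold=1.0):
    """Label each multi-locus genotype high-risk (1) or low-risk (0).

    genotypes: list of tuples, one multi-locus genotype per subject
    status:    list of 0/1 labels (1 = case, 0 = control)
    threshold: hypothetical fixed case:control ratio cut-off
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    labels = {}
    for g in set(cases) | set(controls):
        # A cell with cases but no controls is treated as high-risk.
        if controls[g] == 0:
            ratio = float("inf")
        else:
            ratio = cases[g] / controls[g]
        labels[g] = 1 if ratio > threshold else 0
    return labels

def classification_error(genotypes, status, labels):
    """Fraction of subjects whose pooled label disagrees with their status."""
    wrong = sum(labels[g] != s for g, s in zip(genotypes, status))
    return wrong / len(status)

# Toy two-locus data (genotypes coded 0/1/2 per locus); purely illustrative.
genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1), (2, 0)]
status = [0, 0, 1, 0, 1, 1, 0, 1]
labels = mdr_partition(genotypes, status)
error = classification_error(genotypes, status, labels)
```

Replacing the fixed `threshold` with a cut-off chosen from the data is the kind of change the abstract describes for the optimal partition, although the search procedure itself is not reproduced here.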
Affiliation(s)
- Xing Hua
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui, China
|
25
|
Cattaert T, Urrea V, Naj AC, De Lobel L, De Wit V, Fu M, Mahachie John JM, Shen H, Calle ML, Ritchie MD, Edwards TL, Van Steen K. FAM-MDR: a flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS One 2010; 5:e10304. [PMID: 20421984 PMCID: PMC2858665 DOI: 10.1371/journal.pone.0010304] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 01/12/2010] [Accepted: 03/01/2010] [Indexed: 12/05/2022]
Abstract
We propose a novel multifactor dimensionality reduction method for epistasis detection in small or extended pedigrees, FAM-MDR. It combines features of the Genome-wide Rapid Association using Mixed Model And Regression approach (GRAMMAR) with Model-Based MDR (MB-MDR). We focus on continuous traits, although the method is general and can be used for outcomes of any type, including binary and censored traits. When comparing FAM-MDR with Pedigree-based Generalized MDR (PGMDR), which is a generalization of Multifactor Dimensionality Reduction (MDR) to continuous traits and related individuals, FAM-MDR was found to outperform PGMDR in terms of power, in most of the considered simulated scenarios. Additional simulations revealed that PGMDR does not appropriately deal with multiple testing and consequently gives rise to overly optimistic results. FAM-MDR adequately deals with multiple testing in epistasis screens and is in contrast rather conservative, by construction. Furthermore, simulations show that correcting for lower order (main) effects is of utmost importance when claiming epistasis. As Type 2 Diabetes Mellitus (T2DM) is a complex phenotype likely influenced by gene-gene interactions, we applied FAM-MDR to examine data on glucose area-under-the-curve (GAUC), an endophenotype of T2DM for which multiple independent genetic associations have been observed, in the Amish Family Diabetes Study (AFDS). This application reveals that FAM-MDR makes more efficient use of the available data than PGMDR and can deal with multi-generational pedigrees more easily. In conclusion, we have validated FAM-MDR and compared it to PGMDR, the current state-of-the-art MDR method for family data, using both simulations and a practical dataset. FAM-MDR is found to outperform PGMDR in that it handles the multiple testing issue more correctly, has increased power, and efficiently uses all available information.
Affiliation(s)
- Tom Cattaert
- Montefiore Institute, University of Liège, Liège, Belgium.
|
26
|
Li MD, Xu Q, Lou XY, Payne TJ, Niu T, Ma JZ. Association and interaction analysis of variants in CHRNA5/CHRNA3/CHRNB4 gene cluster with nicotine dependence in African and European Americans. Am J Med Genet B Neuropsychiatr Genet 2010; 153B:745-56. [PMID: 19859904 PMCID: PMC2924635 DOI: 10.1002/ajmg.b.31043] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 12/25/2022]
Abstract
Several previous genome-wide and targeted association studies revealed that variants in the CHRNA5-CHRNA3-CHRNB4 (CHRNA5/A3/B4) gene cluster on chromosome 15 that encode the alpha5, alpha3, and beta4 subunits of the nicotinic acetylcholine receptors (nAChRs) are associated with nicotine dependence (ND) in European Americans (EAs) or others of European origin. Considering the distinct linkage disequilibrium patterns in European and other ethnic populations such as African Americans (AAs), it would be interesting to determine whether such associations exist in other ethnic populations. We performed a comprehensive association and interaction analysis of the CHRNA5/A3/B4 cluster in two ethnic samples to investigate the role of variants in the risk for ND, which was assessed by Smoking Quantity, Heaviness Smoking Index, and Fagerström test for ND. Using a family-based association test, we found a nominal association of single nucleotide polymorphisms (SNPs) rs1317286 and rs8040868 in CHRNA3 with ND in the AA and combined AA and EA samples. Furthermore, we found that several haplotypes in CHRNA5 and CHRNA3 are nominally associated with ND in AA, EA, and pooled samples. However, none of these associations remained significant after correction for multiple testing. In addition, we performed interaction analysis of SNPs within the CHRNA5/A3/B4 cluster using the pedigree-based generalized multifactor dimensionality reduction method and found significant interactions within CHRNA3 and among the three subunit genes in the AA and pooled samples. Together, these results indicate that variants within CHRNA3 and among CHRNA5, CHRNA3, and CHRNB4 contribute significantly to the etiology of ND through gene-gene interactions, although the association of each subunit gene with ND is weak in both the AA and EA samples.
Affiliation(s)
- Ming D Li
- Department of Psychiatry and Neurobehavioral Sciences, Section of Neurobiology, University of Virginia, 1670 Discovery Drive, Suite 110, Charlottesville, VA 22911, USA.
|
27
|
Abstract
Because little is currently known about how genes interact with environmental factors in human diseases, and because of the large number of possible interactions between and within genetic and environmental factors, it is difficult to simulate samples for a disease caused by multiple interacting genetic and environmental factors. A recent article by Amato and colleagues in BMC Bioinformatics describes a mathematical model to characterize gene-environment interactions and a computer program that simulates them using biologically meaningful inputs. Here, I evaluate the advantages and limitations of the authors' approach in terms of its usefulness for simulating genetic samples for real-world studies of gene-environment interactions in complex human diseases.
Affiliation(s)
- Bo Peng
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
|
28
|
de la Paz MP, Villaverde-Hueso A, Alonso V, János S, Zurriaga O, Pollán M, Abaitua-Borda I. Rare diseases epidemiology research. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2010; 686:17-39. [PMID: 20824437 DOI: 10.1007/978-90-481-9485-8_2] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 12/21/2022]
Abstract
Rare diseases epidemiology is a new and still largely unexplored field. However, rare diseases are a topic of growing interest worldwide. The aims of this chapter are to review useful epidemiological tools, to define areas where epidemiology can help improve knowledge of rare diseases, and to facilitate policy decisions that take into account the real burden of rare diseases in society. This chapter also seeks to describe the problems of coding and classification of diseases, measuring disease frequency, study designs and association studies, causality, the evolution from descriptive to epigenetic epidemiology, and the natural history of disease. One of the major challenges facing analytical epidemiology and clinical epidemiological research into rare diseases is that genes can be involved in both aetiology and prognosis. Despite the many similarities between genetic association studies and classic observational epidemiological studies, the former pose several specific limitations, including an unprecedented volume of new data and the likelihood of very small individual effects, as well as other limitations. Selecting the appropriate pathway from among all those available, i.e. the one that best relates genes from the various known regions and disease mechanisms, is crucial for the success of this type of study.
Affiliation(s)
- Manuel Posada de la Paz
- Instituto de Investigación en Enfermedades Raras (IIER), Instituto de Salud Carlos III and Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), Madrid, Spain.
|
29
|
Holzinger ER, Buchanan CC, Dudek SM, Torstenson EC, Turner SD, Ritchie MD. Initialization Parameter Sweep in ATHENA: Optimizing Neural Networks for Detecting Gene-Gene Interactions in the Presence of Small Main Effects. GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE : [PROCEEDINGS]. GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE 2010; 12:203-210. [PMID: 21152364 DOI: 10.1145/1830483.1830519] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/17/2023]
Abstract
Recent advances in genotyping technology have led to the generation of an enormous quantity of genetic data. Traditional methods of statistical analysis have proved insufficient in extracting all of the information about the genetic components of common, complex human diseases. A contributing factor to the problem of analysis is that amongst the small main effects of each single gene on disease susceptibility, there are non-linear, gene-gene interactions that can be difficult for traditional, parametric analyses to detect. In addition, exhaustively searching all multi-locus combinations has proved computationally impractical. Novel strategies for analysis have been developed to address these issues. The Analysis Tool for Heritable and Environmental Network Associations (ATHENA) is an analytical tool that incorporates grammatical evolution neural networks (GENN) to detect interactions among genetic factors. Initial parameters define how the evolutionary process will be implemented. This research addresses how different parameter settings affect detection of disease models involving interactions. In the current study, we iterate over multiple parameter values to determine which combinations appear optimal for detecting interactions in simulated data for multiple genetic models. Our results indicate that the factors that have the greatest influence on detection are: input variable encoding, population size, and parallel computation.
Affiliation(s)
- Emily R Holzinger
- Ctr. for Human Genetics Research Dept. of Molecular Physiology & Biophysics; Vanderbilt University Nashville, TN 37232
|
30
|
Günther F, Wawro N, Bammann K. Neural networks for modeling gene-gene interactions in association studies. BMC Genet 2009; 10:87. [PMID: 20030838 PMCID: PMC2817696 DOI: 10.1186/1471-2156-10-87] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 03/04/2009] [Accepted: 12/23/2009] [Indexed: 01/17/2023]
Abstract
Background Our aim is to investigate the ability of neural networks to model different two-locus disease models. We conduct a simulation study to compare neural networks with two standard methods, namely logistic regression models and multifactor dimensionality reduction. One hundred data sets are generated for each of six two-locus disease models, which are considered in a low and in a high risk scenario. Two models represent independence, one is a multiplicative model, and three models are epistatic. For each data set, six neural networks (with up to five hidden neurons) and five logistic regression models (the null model, three main effect models, and the full model) with two different codings for the genotype information are fitted. Additionally, the multifactor dimensionality reduction approach is applied. Results The results show that neural networks are more successful in modeling the structure of the underlying disease model than logistic regression models in most of the investigated situations. In our simulation study, neither logistic regression nor multifactor dimensionality reduction is able to correctly identify biological interaction. Conclusions Neural networks are a promising tool to handle complex data situations. However, further research is necessary concerning the interpretation of their parameters.
Affiliation(s)
- Frauke Günther
- University of Bremen, Bremen Institute for Prevention Research and Social Medicine, Linzer Strasse 10, 28359 Bremen, Germany.
|
32
|
He H, Oetting WS, Brott MJ, Basu S. Power of multifactor dimensionality reduction and penalized logistic regression for detecting gene-gene interaction in a case-control study. BMC MEDICAL GENETICS 2009; 10:127. [PMID: 19961594 PMCID: PMC2800840 DOI: 10.1186/1471-2350-10-127] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 01/15/2009] [Accepted: 12/04/2009] [Indexed: 11/13/2022]
Abstract
BACKGROUND There is a growing awareness that interactions between multiple genes play an important role in the risk of common, complex multi-factorial diseases. Many common diseases are affected by certain genotype combinations (associated with some genes and their interactions). The identification and characterization of these susceptibility genes and gene-gene interactions have been limited by small sample sizes and the large number of potential interactions between genes. Several methods have been proposed to detect gene-gene interaction in a case-control study. Penalized logistic regression (PLR), a variant of logistic regression with L2 regularization, is a parametric approach to detecting gene-gene interaction. Multifactor Dimensionality Reduction (MDR), on the other hand, is a nonparametric and genetic model-free approach to detecting genotype combinations associated with disease risk. METHODS We compared the power of MDR and PLR for detecting two-way and three-way interactions in a case-control study through extensive simulations. We generated several interaction models with different magnitudes of interaction effect. For each model, we simulated 100 datasets, each with 200 cases and 200 controls and 20 SNPs. We considered a wide variety of models, such as models with just main effects, models with only interaction effects, and models with both main and interaction effects. We also compared the performance of MDR and PLR in detecting gene-gene interaction associated with acute rejection (AR) in kidney transplant patients. RESULTS In this paper, we have studied the power of MDR and PLR for detecting gene-gene interaction in a case-control study through extensive simulation. We have compared their performance for different two-way and three-way interaction models. We have studied the effect of different allele frequencies on these methods. We have also evaluated their performance on a real dataset. As expected, neither method was consistently better across all data scenarios, but MDR generally outperformed PLR for more complex models. The ROC analysis on the real dataset suggests that MDR outperforms PLR in detecting gene-gene interaction. CONCLUSION As one might expect, the relative success of each method is context dependent. This study demonstrates the strengths and weaknesses of the methods to detect gene-gene interaction.
Affiliation(s)
- Hua He
- Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, USA
- William S Oetting
- Department of experimental and clinical pharmacology, College of Pharmacy and Institute of Human Genetics, University of Minnesota, Minnesota, USA
- Marcia J Brott
- Department of experimental and clinical pharmacology, College of Pharmacy and Institute of Human Genetics, University of Minnesota, Minnesota, USA
- Saonli Basu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, USA
|
33
|
Michaelson JJ, Loguercio S, Beyer A. Detection and interpretation of expression quantitative trait loci (eQTL). Methods 2009; 48:265-76. [DOI: 10.1016/j.ymeth.2009.03.004] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.1] [Received: 12/12/2008] [Revised: 03/05/2009] [Accepted: 03/07/2009] [Indexed: 10/21/2022]
|
34
|
Pechlivanis S, Bermejo JL, Pardini B, Naccarati A, Vodickova L, Novotny J, Hemminki K, Vodicka P, Försti A. Genetic variation in adipokine genes and risk of colorectal cancer. Eur J Endocrinol 2009; 160:933-40. [PMID: 19273568 DOI: 10.1530/eje-09-0039] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Indexed: 01/24/2023]
Abstract
OBJECTIVE Obesity has been related to an increased risk of colorectal cancer (CRC). Adipokines produced by the adipose tissue are directly linked to obesity and may thus contribute to the pathogenesis of CRC. We hypothesized that potentially functional polymorphisms in the adipokine genes leptin (LEP), leptin receptor (LEPR), resistin (RETN), and adiponectin (ADIPOQ) may be associated with CRC. DESIGN AND METHODS We studied the association of four putatively functional single nucleotide polymorphisms (SNPs) with CRC risk using a hospital-based study design with 702 cases and 752 controls from the Czech Republic. We used likelihood ratio tests to select the best model to represent the relationship between genotypes and risk of CRC. Age-adjusted odds ratios (ORs) under the best model were calculated for each SNP. Previous genotyping data on insulin (INS)-related genes were used to explore interactions between genes in obesity- and diabetes-related pathways by using two independent methods, logistic regression and multifactor-dimensionality reduction. RESULTS A trend towards association between the RETN SNP rs1862513 (C-420G) and CRC risk was observed (per-allele OR 1.18, 95% confidence interval 0.99-1.40). Statistically significant interactions were observed between the INS SNP rs3842754 (+1127INSPstI) genotypes and both the LEPR SNP rs1137101 (Q223R) and the ADIPOQ SNP rs266729 (C-11374G) genotypes. CONCLUSIONS Our results suggest that variants in the adipokine genes may affect CRC risk in combination with variants in diabetes-related genes.
Affiliation(s)
- Sonali Pechlivanis
- Division of Molecular Genetic Epidemiology C050, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany
|
35
|
Edwards TL, Lewis K, Velez DR, Dudek S, Ritchie MD. Exploring the performance of Multifactor Dimensionality Reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models. Hum Hered 2008; 67:183-92. [PMID: 19077437 DOI: 10.1159/000181157] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Received: 05/02/2008] [Accepted: 07/01/2008] [Indexed: 01/27/2023]
Abstract
BACKGROUND/AIMS In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models. METHODS To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes. RESULTS These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci. CONCLUSIONS A study in which a population with high heritability estimates is sampled predisposes the MDR study to success more than a larger ascertainment in a population with smaller estimates.
Affiliation(s)
- Todd L Edwards
- Center for Human Genetics Research, Vanderbilt University Medical Center, Nashville, Tenn., USA
|
36
|
Nonyane BAS, Foulkes AS. Application of two machine learning algorithms to genetic association studies in the presence of covariates. BMC Genet 2008; 9:71. [PMID: 19014573 PMCID: PMC2620353 DOI: 10.1186/1471-2156-9-71] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 04/01/2008] [Accepted: 11/14/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. METHODS AND RESULTS In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. CONCLUSION Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
Affiliation(s)
- Bareng AS Nonyane
- Division of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, MA, USA
- Andrea S Foulkes
- Division of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, MA, USA
|