51
Abstract
Although many genome-wide association studies have been performed, the identification of disease polymorphisms remains important. It is now suspected that many rare disease variants induce the association signal of common variants in linkage disequilibrium (LD). Based on recent development of genetic models, the current study provides explanations of the existence of rare variants with high impacts and common variants with low impacts. Disease variants are neither necessary nor sufficient due to gene–gene or gene–environment interactions. A new method was developed based on theoretical aspects to identify both rare and common disease variants by their genotypes. Common disease variants were identified with relatively small odds ratios and relatively small sample sizes, except for specific situations in which the disease variants were in strong LD with a variant with a higher frequency. Rare disease variants with small impacts were difficult to identify without increasing sample sizes; however, the method was reasonably accurate for rare disease variants with high impacts. For rare variants, dominant variants generally showed better Type II error rates than recessive variants; however, the trend was reversed for common variants. Type II error rates increased in gene regions containing more than two disease variants because the more common variant, rather than both disease variants, was usually identified. The proposed method would be useful for identifying common disease variants with small impacts and rare disease variants with large impacts when disease variants have the same effects on disease presentation.
52
Grinberg NF, Lovatt A, Hegarty M, Lovatt A, Skøt KP, Kelly R, Blackmore T, Thorogood D, King RD, Armstead I, Powell W, Skøt L. Implementation of Genomic Prediction in Lolium perenne (L.) Breeding Populations. Front Plant Sci 2016; 7:133. [PMID: 26904088 PMCID: PMC4751346 DOI: 10.3389/fpls.2016.00133] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 09/11/2015] [Accepted: 01/25/2016] [Indexed: 05/23/2023]
Abstract
Perennial ryegrass (Lolium perenne L.) is one of the most widely grown forage grasses in temperate agriculture. In order to maintain and increase its usage as forage in livestock agriculture, there is a continued need for improvement in biomass yield, quality, disease resistance, and seed yield. Genetic gain for traits such as biomass yield has been relatively modest. This has been attributed to its long breeding cycle, and the necessity to use population based breeding methods. Thanks to recent advances in genotyping techniques, there is increasing interest in genomic selection (GS), from which genomically estimated breeding values are derived. In this paper we compare the classical RRBLUP model with state-of-the-art machine learning techniques that should lend themselves easily to use in GS and demonstrate their application to predicting quantitative traits in a breeding population of L. perenne. Prediction accuracies varied from 0 to 0.59 depending on trait, prediction model and composition of the training population. The BLUP model produced the highest prediction accuracies for most traits and training populations. Forage quality traits had the highest accuracies compared to yield related traits. There appeared to be no clear pattern to the effect of the training population composition on the prediction accuracies. The heritability of the forage quality traits was generally higher than for the yield related traits, and could partly explain the difference in accuracy. Some population structure was evident in the breeding populations, and probably contributed to the varying effects of training population on the predictions. The average linkage disequilibrium between adjacent markers ranged from 0.121 to 0.215. Higher marker density and a larger training population closely related to the test population are likely to improve the prediction accuracy.
Affiliation(s)
- Alan Lovatt
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Matt Hegarty
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Andi Lovatt
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Kirsten P. Skøt
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Rhys Kelly
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Tina Blackmore
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Danny Thorogood
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Ross D. King
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Ian Armstead
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Wayne Powell
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- CGIAR Consortium, CGIAR Consortium Office, Montpellier, France
- Leif Skøt
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
53
Local True Discovery Rate Weighted Polygenic Scores Using GWAS Summary Data. Behav Genet 2016; 46:573-82. [DOI: 10.1007/s10519-015-9770-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 05/12/2015] [Accepted: 11/04/2015] [Indexed: 11/26/2022]
54
Winham SJ, Jenkins GD, Biernacka JM. Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias. Genet Epidemiol 2015; 40:123-32. [PMID: 26639183 DOI: 10.1002/gepi.21946] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/22/2015] [Revised: 10/29/2015] [Accepted: 10/29/2015] [Indexed: 12/12/2022]
Abstract
Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
Affiliation(s)
- Stacey J Winham
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Gregory D Jenkins
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Joanna M Biernacka
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America; Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota, United States of America
55
Li J, Zhong W, Li R, Wu R. A fast algorithm for detecting gene-gene interactions in genome-wide association studies. Ann Appl Stat 2014; 8:2292-2318. [PMID: 26457126 DOI: 10.1214/14-aoas771] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 12/22/2022]
Abstract
With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which entails the development of efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene-gene interactions among different SNPs. Genetic interactions, also known as epistasis, have been recognized to play a pivotal role in contributing to the genetic variation of phenotypic traits. However, because of an extremely large number of SNP-SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection methods are capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS study. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure and generate a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by an independence screening. Regularization regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has an outstanding finite sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for the body mass index variation. It shows the capability of our procedure to resolve the complexity of genetic control.
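The two-stage screening idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' TS-SIS implementation: the function names are hypothetical, marginal Pearson correlation stands in for whatever screening statistic the paper uses, and restricting the second stage to pairs of SNPs that survive the first stage is an assumption made here for simplicity. The RATE thresholding and the LASSO/SCAD step are omitted.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def screen(columns, y, keep):
    """Rank named predictor columns by |marginal correlation| with y; keep the top ones."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], y)), reverse=True)
    return ranked[:keep]


def two_stage_screen(snps, y, keep_main, keep_pairs):
    """Stage 1 screens single SNPs; stage 2 screens products of retained SNP pairs.

    Assumption (not from the paper): interactions are only formed among
    the SNPs retained in stage 1.
    """
    main = screen(snps, y, keep_main)
    pairs = {
        (a, b): [u * v for u, v in zip(snps[a], snps[b])]
        for a, b in combinations(main, 2)
    }
    return main, screen(pairs, y, keep_pairs)
```

Here `snps` is assumed to map SNP names to genotype codes (for example 0/1/2) and `y` is the phenotype vector. The retained main effects and interactions would then be passed to a penalized regression, as the abstract describes.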
Affiliation(s)
- Jiahan Li
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, Indiana 46556, USA
- Wei Zhong
- Institute for Studies in Economics, Department of Statistics, School of Economics, Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen, Fujian 361005, China
- Runze Li
- The Methodology Center, Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Rongling Wu
- Center for Statistical Genetics, Pennsylvania State University, Hershey, Pennsylvania 17033, USA
56
Zeng P, Zhao Y, Qian C, Zhang L, Zhang R, Gou J, Liu J, Liu L, Chen F. Statistical analysis for genome-wide association study. J Biomed Res 2014; 29:285-97. [PMID: 26243515 PMCID: PMC4547377 DOI: 10.7555/jbr.29.20140007] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Received: 01/15/2014] [Revised: 06/07/2014] [Accepted: 09/27/2014] [Indexed: 12/19/2022]
Abstract
In the past few years, genome-wide association studies (GWAS) have had great success in identifying genetic susceptibility loci underlying many complex diseases and traits. The findings provide important genetic insights into the pathogenesis of diseases. In this paper, we present an overview of widely used approaches and strategies for the analysis of GWAS and offer general considerations for dealing with GWAS data. The issues regarding data quality control, population structure, association analysis, multiple comparison and visual presentation of GWAS results are discussed; other advanced topics including the issue of missing heritability, meta-analysis, set-based association analysis, copy number variation analysis and GWAS cohort analysis are also briefly introduced.
Affiliation(s)
- Ping Zeng
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China; Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu 221004, China
- Yang Zhao
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Cheng Qian
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Liwei Zhang
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Ruyang Zhang
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jianwei Gou
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jin Liu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Liya Liu
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Feng Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 211166, China
57
Okser S, Pahikkala T, Airola A, Salakoski T, Ripatti S, Aittokallio T. Regularized machine learning in the genetic prediction of complex traits. PLoS Genet 2014; 10:e1004754. [PMID: 25393026 PMCID: PMC4230844 DOI: 10.1371/journal.pgen.1004754] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Indexed: 01/14/2023]
Affiliation(s)
- Sebastian Okser
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Tapio Pahikkala
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Antti Airola
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Tapio Salakoski
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Samuli Ripatti
- Hjelt Institute, University of Helsinki, Helsinki, Finland
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
- Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Tero Aittokallio
- Turku Centre for Computer Science (TUCS), University of Turku and Åbo Akademi University, Turku, Finland
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
58
de Maturana EL, Chanok SJ, Picornell AC, Rothman N, Herranz J, Calle ML, García-Closas M, Marenne G, Brand A, Tardón A, Carrato A, Silverman DT, Kogevinas M, Gianola D, Real FX, Malats N. Whole genome prediction of bladder cancer risk with the Bayesian LASSO. Genet Epidemiol 2014; 38:467-76. [PMID: 24796258 DOI: 10.1002/gepi.21809] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/13/2013] [Revised: 03/05/2014] [Accepted: 03/20/2014] [Indexed: 11/11/2022]
Abstract
To build a predictive model for urothelial carcinoma of the bladder (UCB) risk combining both genomic and nongenomic data, 1,127 cases and 1,090 controls from the Spanish Bladder Cancer/EPICURO study were genotyped using the HumanHap 1M SNP array. After quality control filters, genotypes from 475,290 variants were available. Nongenomic information comprised age, gender, region, and smoking status. Three Bayesian threshold models were implemented including: (1) only genomic information, (2) only nongenomic data, and (3) both sources of information. The three models were applied to the whole population, to only nonsmokers, to male smokers, and to extreme phenotypes to potentiate the UCB genetic component. The area under the ROC curve allowed evaluating the predictive ability of each model in a 10-fold cross-validation scenario. Smoking status showed the highest predictive ability of UCB risk (AUCtest = 0.62). On the other hand, the AUC of all genetic variants was poorer (0.53). When the extreme phenotype approach was applied, the predictive ability of the genomic model improved 15%. This study represents a first attempt to build a predictive model for UCB risk combining both genomic and nongenomic data and applying state-of-the-art statistical approaches. However, the lack of genetic relatedness among individuals, the complexity of UCB etiology, as well as a relatively small statistical power, may explain the low predictive ability for UCB risk. The study confirms the difficulty of predicting complex diseases using genetic data, and suggests the limited translational potential of findings from this type of data into public health interventions.
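For readers unfamiliar with the evaluation metric in the abstract above, the following is a minimal sketch of how an area under the ROC curve can be computed and averaged over cross-validation test folds. It is a generic illustration, not the study's code: the function names are hypothetical, and averaging per-fold AUCs is one common convention that the abstract does not confirm.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0); ties count one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def cross_validated_auc(scores_by_fold, labels_by_fold):
    """Mean test-fold AUC, given held-out scores and labels for each fold."""
    aucs = [auc(s, y) for s, y in zip(scores_by_fold, labels_by_fold)]
    return sum(aucs) / len(aucs)
```

An AUC of 0.5 corresponds to ranking cases and controls no better than chance, which puts the reported values of 0.53 for the genetic variants and 0.62 for smoking status in context.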
59
Robinson MR, Wray NR, Visscher PM. Explaining additional genetic variation in complex traits. Trends Genet 2014; 30:124-32. [PMID: 24629526 DOI: 10.1016/j.tig.2014.02.003] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 12/16/2013] [Revised: 02/10/2014] [Accepted: 02/12/2014] [Indexed: 12/11/2022]
Abstract
Genome-wide association studies (GWAS) have provided valuable insights into the genetic basis of complex traits, discovering >6000 variants associated with >500 quantitative traits and common complex diseases in humans. The associations identified so far represent only a fraction of those that influence phenotype, because there are likely to be many variants across the entire frequency spectrum, each of which influences multiple traits, with only a small average contribution to the phenotypic variance. This presents a considerable challenge to further dissection of the remaining unexplained genetic variance within populations, which limits our ability to predict disease risk, identify new drug targets, improve and maintain food sources, and understand natural diversity. This challenge will be met within the current framework through larger sample size, better phenotyping, including recording of nongenetic risk factors, focused study designs, and an integration of multiple sources of phenotypic and genetic information. The current evidence supports the application of quantitative genetic approaches, and we argue that one should retain simpler theories until simplicity can be traded for greater explanatory power.
Affiliation(s)
- Matthew R Robinson
- The Queensland Brain Institute, The University of Queensland, St Lucia, QLD 4072, Australia
- Naomi R Wray
- The Queensland Brain Institute, The University of Queensland, St Lucia, QLD 4072, Australia
- Peter M Visscher
- The Queensland Brain Institute, The University of Queensland, St Lucia, QLD 4072, Australia; The University of Queensland Diamantina Institute, The University of Queensland, Translational Research Institute, Brisbane, QLD 4102, Australia
60
Application of multi-SNP approaches Bayesian LASSO and AUC-RF to detect main effects of inflammatory-gene variants associated with bladder cancer risk. PLoS One 2013; 8:e83745. [PMID: 24391818 PMCID: PMC3877090 DOI: 10.1371/journal.pone.0083745] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 06/25/2013] [Accepted: 11/07/2013] [Indexed: 12/18/2022]
Abstract
The relationship between inflammation and cancer is well established in several tumor types, including bladder cancer. We performed an association study between 886 inflammatory-gene variants and bladder cancer risk in 1,047 cases and 988 controls from the Spanish Bladder Cancer (SBC)/EPICURO Study. A preliminary exploration with the widely used univariate logistic regression approach did not identify any significant SNP after correcting for multiple testing. We further applied two more comprehensive methods to capture the complexity of bladder cancer genetic susceptibility: Bayesian Threshold LASSO (BTL), a regularized regression method, and AUC-Random Forest (AUC-RF), a machine-learning algorithm. Both approaches explore the joint effect of markers. BTL analysis identified a signature of 37 SNPs in 34 genes showing an association with bladder cancer. AUC-RF detected an optimal predictive subset of 56 SNPs. Thirteen SNPs were identified by both methods in the total population. Using resources from the Texas Bladder Cancer study we were able to replicate 30% of the SNPs assessed. The associations between inflammatory SNPs and bladder cancer were reexamined among non-smokers to eliminate the effect of tobacco, one of the strongest and most prevalent environmental risk factors for this tumor. A 9-SNP signature was detected by BTL. Here we report, for the first time, a set of SNPs in inflammatory genes jointly associated with bladder cancer risk. These results highlight the importance of the complex structure of genetic susceptibility associated with cancer risk.
61
Waldmann P, Mészáros G, Gredler B, Fuerst C, Sölkner J. Evaluation of the lasso and the elastic net in genome-wide association studies. Front Genet 2013; 4:270. [PMID: 24363662 PMCID: PMC3850240 DOI: 10.3389/fgene.2013.00270] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Received: 06/19/2013] [Accepted: 11/18/2013] [Indexed: 01/23/2023]
Abstract
The number of publications performing genome-wide association studies (GWAS) has increased dramatically. Penalized regression approaches have been developed to overcome the challenges caused by the high dimensional data, but these methods are relatively new in the GWAS field. In this study we have compared the statistical performance of two methods (the least absolute shrinkage and selection operator, or lasso, and the elastic net) on two simulated data sets and one real data set from a 50 K genome-wide single nucleotide polymorphism (SNP) panel of 5570 Fleckvieh bulls. The first simulated data set displays moderate to high linkage disequilibrium between SNPs, whereas the second simulated data set from the QTLMAS 2010 workshop is biologically more complex. We used cross-validation to find the optimal value of the regularization parameter λ under two criteria: minimum MSE (minMSE) and minimum MSE plus one standard error of the minimum MSE (minMSE + 1SE). The optimal λ values were used for variable selection. Based on the first simulated data, we found that the minMSE in general picked up too many SNPs. At minMSE + 1SE, the lasso did not acquire any false positives, but selected too few correct SNPs. The elastic net provided the best compromise between few false positives and many correct selections when the penalty weight α was around 0.1. However, in our simulation setting, this α value did not result in the lowest minMSE + 1SE. After correction for population structure, the numbers of SNPs selected from the QTLMAS 2010 data were 82 and 161 for the lasso and the elastic net, respectively. In the Fleckvieh data set, after population structure correction, the lasso and the elastic net identified from 1291 to 1966 important SNPs for milk fat content, with major peaks on chromosomes 5, 14, 15, and 20. Hence, we can conclude that it is important to analyze GWAS data with both the lasso and the elastic net, and that an alternative tuning criterion to minimum MSE is needed for variable selection.
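The two tuning criteria compared in the abstract above can be sketched as follows. This is a generic illustration of the minimum-MSE and one-standard-error rules as they are commonly defined, assuming per-fold cross-validation errors are already available for each candidate λ; the function names are hypothetical and the model fitting itself is not shown.

```python
from math import sqrt


def mean_and_se(values):
    """Mean and standard error of the per-fold errors for one lambda."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, sqrt(var / n)


def choose_lambda(lambdas, fold_mse):
    """Return (lambda at minimum mean CV MSE, lambda under the one-SE rule).

    fold_mse[i] holds the per-fold MSEs obtained with lambdas[i].
    """
    stats = [mean_and_se(f) for f in fold_mse]
    best = min(range(len(lambdas)), key=lambda i: stats[i][0])
    threshold = stats[best][0] + stats[best][1]
    # One-SE rule: the largest (most penalizing) lambda whose mean error
    # is still within one standard error of the minimum.
    one_se = max(lam for lam, (m, _) in zip(lambdas, stats) if m <= threshold)
    return lambdas[best], one_se
```

Because a larger λ shrinks more coefficients to zero, the one-SE choice yields a sparser model than the minimum-MSE choice, which is consistent with the abstract's observation that minMSE tended to select too many SNPs.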
Affiliation(s)
- Patrik Waldmann
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences, Vienna, Austria; Division of Statistics, Department of Computer and Information Science, Linköping University, Linköping, Sweden
- Gábor Mészáros
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences, Vienna, Austria
- Johann Sölkner
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences, Vienna, Austria
62
Nelson RM, Pettersson ME, Carlborg Ö. A century after Fisher: time for a new paradigm in quantitative genetics. Trends Genet 2013; 29:669-76. [PMID: 24161664 DOI: 10.1016/j.tig.2013.09.006] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Received: 06/25/2013] [Revised: 09/17/2013] [Accepted: 09/19/2013] [Indexed: 10/26/2022]
Abstract
Quantitative genetics traces its roots back through more than a century of theory, largely formed in the absence of directly observable genotype data, and has remained essentially unchanged for decades. By contrast, molecular genetics arose from direct observations and is currently undergoing rapid changes, making the amount of available data ever greater. Thus, the two disciplines are disparate both in their origins and their current states, yet they address the same fundamental question: how does the genotype affect the phenotype? The rapidly accumulating genomic data necessitate sophisticated analysis, but many of the current tools are adaptations of methods designed during the early days of quantitative genetics. We argue here that the present analysis paradigm in quantitative genetics is at its limits in regards to unraveling complex traits and it is necessary to re-evaluate the direction that genetic research is taking for the field to realize its full potential.
Affiliation(s)
- Ronald M Nelson
- Swedish University of Agricultural Sciences, Department of Clinical Sciences, Division of Computational Genetics, Box 7078, SE-750 07 Uppsala, Sweden
63
Roetker NS, Page CD, Yonker JA, Chang V, Roan CL, Herd P, Hauser TS, Hauser RM, Atwood CS. Assessment of genetic and nongenetic interactions for the prediction of depressive symptomatology: an analysis of the Wisconsin Longitudinal Study using machine learning algorithms. Am J Public Health 2013; 103 Suppl 1:S136-44. [PMID: 23927508 DOI: 10.2105/ajph.2012.301141] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/04/2022]
Abstract
OBJECTIVES: We examined depression within a multidimensional framework consisting of genetic, environmental, and sociobehavioral factors and, using machine learning algorithms, explored interactions among these factors that might better explain the etiology of depressive symptoms. METHODS: We measured current depressive symptoms using the Center for Epidemiologic Studies Depression Scale (n = 6378 participants in the Wisconsin Longitudinal Study). Genetic factors were 78 single nucleotide polymorphisms (SNPs); environmental factors were 13 stressful life events (SLEs), plus a composite proportion of SLEs index; and sociobehavioral factors were 18 personality, intelligence, and other health or behavioral measures. We performed traditional SNP associations via logistic regression likelihood ratio testing and explored interactions with support vector machines and Bayesian networks. RESULTS: After correction for multiple testing, we found no significant single genotypic associations with depressive symptoms. Machine learning algorithms showed no evidence of interactions. Naïve Bayes produced the best models in both subsets and included only environmental and sociobehavioral factors. CONCLUSIONS: We found no single or interactive associations with genetic factors and depressive symptoms. Various environmental and sociobehavioral factors were more predictive of depressive symptoms, yet their impacts were independent of one another. A genome-wide analysis of genetic alterations using machine learning methodologies will provide a framework for identifying genetic-environmental-sociobehavioral interactions in depressive symptoms.
Affiliation(s)
- Nicholas S Roetker
- Nicholas S. Roetker, James A. Yonker, Vicky Chang, Carol L. Roan, Pamela Herd, Taissa S. Hauser, and Robert M. Hauser are with the Department of Sociology, University of Wisconsin-Madison. Pamela Herd is also with La Follette School of Public Affairs, University of Wisconsin-Madison. C. David Page is with the Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison. Craig S. Atwood is with the Geriatric Research, Education and Clinical Center, William S. Middleton Memorial Veterans Hospital, Madison, WI, and the Department of Medicine, University of Wisconsin-Madison School of Medicine and Public Health
64
Leung RKK, Wang Y, Ma RCW, Luk AOY, Lam V, Ng M, So WY, Tsui SKW, Chan JCN. Using a multi-staged strategy based on machine learning and mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: a prospective case-control cohort analysis. BMC Nephrol 2013; 14:162. [PMID: 23879411 PMCID: PMC3726338 DOI: 10.1186/1471-2369-14-162] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 04/24/2012] [Accepted: 07/18/2013] [Indexed: 11/22/2022]
Abstract
Background: Multi-causality and heterogeneity of phenotypes and genotypes characterize complex diseases. In a database with comprehensive collection of phenotypes and genotypes, we compared the performance of common machine learning methods to generate mathematical models to predict diabetic kidney disease (DKD). Methods: In a prospective cohort of type 2 diabetic patients, we selected 119 subjects with DKD and 554 without DKD at enrolment and after a median follow-up period of 7.8 years for model training, testing and validation using seven machine learning methods (partial least squares regression, the classification and regression tree, the C5.0 decision tree, random forest, naïve Bayes classification, neural network and support vector machine). We used 17 clinical attributes and 70 single nucleotide polymorphisms (SNPs) of 54 candidate genes to build different models. The top attributes selected by the best-performing models were then used to build models with performance comparable to those using the entire dataset. Results: Age, age of diagnosis, systolic blood pressure and genetic polymorphisms of uteroglobin and lipid metabolism were selected by most methods. Models generated by support vector machine (svmRadial) and random forest (cforest) had the best prediction accuracy whereas models derived from naïve Bayes classifier and partial least squares regression had the least optimal performance. Using 10 clinical attributes (systolic and diastolic blood pressure, age, age of diagnosis, triglyceride, white blood cell count, total cholesterol, waist to hip ratio, LDL cholesterol, and alcohol intake) and 5 genetic attributes (UGB G38A, LIPC -514C > T, APOB Thr71Ile, APOC3 3206T > G and APOC3 1100C > T), selected most often by SVM and cforest, we were able to build high-performance models. Conclusions: Amongst different machine learning methods, svmRadial and cforest had the best performance. Genetic polymorphisms related to inflammation and lipid metabolism warrant further investigation for their associations with DKD.
Affiliation(s)
- Ross K K Leung
- Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Hong Kong, SAR, China
65
Lu C, Latourelle J, O'Connor GT, Dupuis J, Kolaczyk ED. Network-guided sparse regression modeling for detection of gene-by-gene interactions. Bioinformatics 2013; 29:1241-9. [PMID: 23599501 DOI: 10.1093/bioinformatics/btt139] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/12/2023]
Abstract
MOTIVATION Genetic variants identified by genome-wide association studies to date explain only a small fraction of total heritability. Gene-by-gene interaction is one important potential source of unexplained total heritability. We propose a novel approach to detect such interactions that uses penalized regression and sparse estimation principles, and incorporates outside biological knowledge through a network-based penalty. RESULTS We tested our new method on simulated and real data. Simulation showed that with reasonable outside biological knowledge, our method performs noticeably better than stage-wise strategies (i.e. selecting main effects first, and interactions second, from those main effects selected) in finding true interactions, especially when the marginal strength of main effects is weak. We applied our method to Framingham Heart Study data on total plasma immunoglobulin E (IgE) concentrations and found a number of interactions among different classes of human leukocyte antigen genes that may interact to influence the risk of developing IgE dysregulation and allergy. AVAILABILITY The proposed method is implemented in R and available at http://math.bu.edu/people/kolaczyk/software.html. CONTACT chenlu@bu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chen Lu
- Department of Biostatistics, Boston University School of Public Health, Pulmonary Center, Department of Medicine and Department of Neurology, Boston University School of Medicine, Boston, MA, USA.
66
Kamm L, Bogdanov D, Laur S, Vilo J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 2013; 29:886-93. [PMID: 23413435 PMCID: PMC3605601 DOI: 10.1093/bioinformatics/btt066] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Indexed: 12/21/2022]
Abstract
Motivation: Increased availability of various genotyping techniques has initiated a race for finding genetic markers that can be used in diagnostics and personalized medicine. Although many genetic risk factors are known, key causes of common diseases with complex heritage patterns are still unknown. Identification of such complex traits requires a targeted study over a large collection of data. Ideally, such studies bring together data from many biobanks. However, data aggregation on such a large scale raises many privacy issues. Results: We show how to conduct such studies without violating privacy of individual donors and without leaking the data to third parties. The presented solution has provable security guarantees. Contact: jaak.vilo@ut.ee Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Liina Kamm
- Institute of Computer Science, University of Tartu, Liivi 2, Tartu 50409, Estonia
67
Mittag F, Büchel F, Saad M, Jahn A, Schulte C, Bochdanovits Z, Simón-Sánchez J, Nalls MA, Keller M, Hernandez DG, Gibbs JR, Lesage S, Brice A, Heutink P, Martinez M, Wood NW, Hardy J, Singleton AB, Zell A, Gasser T, Sharma M. Use of support vector machines for disease risk prediction in genome-wide association studies: concerns and opportunities. Hum Mutat 2012; 33:1708-18. [PMID: 22777693 PMCID: PMC5968822 DOI: 10.1002/humu.22161] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 02/16/2012] [Accepted: 06/18/2012] [Indexed: 01/29/2023]
Abstract
The success of genome-wide association studies (GWAS) in deciphering the genetic architecture of complex diseases has raised the expectation that individual risk can also be quantified from that architecture. So far, disease risk prediction based on top-validated single-nucleotide polymorphisms (SNPs) has shown little predictive value. Here, we applied a support vector machine (SVM) to Parkinson disease (PD) and type 1 diabetes (T1D) to show that, apart from the magnitude of the effect size of risk variants, heritability of the disease also plays an important role in disease risk prediction. Furthermore, we performed a simulation study to show the role of uncommon (frequency 1-5%) as well as rare variants (frequency <1%) in the etiology of complex diseases. Using a cross-validation model, we were able to achieve predictions with an area under the receiver operating characteristic curve (AUC) of ~0.88 for T1D, highlighting the strong heritable component (~90%). This is in contrast to PD, where we were unable to achieve a satisfactory prediction (AUC ~0.56; heritability ~38%). Our simulations showed that simultaneous inclusion of uncommon and rare variants in GWAS would eventually lead to feasible disease risk prediction for complex diseases such as PD. The software used is available at http://www.ra.cs.uni-tuebingen.de/software/MACLEAPS/.
Affiliation(s)
- Florian Mittag
- Center for Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tubingen, Germany
- Finja Büchel
- Center for Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tubingen, Germany
- Mohamad Saad
- Institut National de la Sante et de la Recherche Medicale, UMR 1043, Centre de Physiopathologie de Toulouse-Purpan, Toulouse, France
- Département des Sciences du Vivant, Paul Sabatier University, Toulouse, France
- Andreas Jahn
- Center for Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tubingen, Germany
- Claudia Schulte
- Department for Neurodegenerative Diseases, Hertie Institute for Clinical Brain Research, University of Tübingen, and DZNE, German Centre for Neurodegenerative Diseases, Tübingen, Germany
- Zoltan Bochdanovits
- Department of Clinical Genetics, Section of Medical Genomics, VU University Medical Centre, Amsterdam, The Netherlands
- Javier Simón-Sánchez
- Department of Clinical Genetics, Section of Medical Genomics, VU University Medical Centre, Amsterdam, The Netherlands
- Mike A. Nalls
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland
- Margaux Keller
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland
- Department of Biological Anthropology, Temple University, Philadelphia, Pennsylvania
- Dena G. Hernandez
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland
- Department of Molecular Neuroscience, Institute of Neurology, University College London, London, UK
- J. Raphael Gibbs
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland
- Department of Molecular Neuroscience, Institute of Neurology, University College London, London, UK
- Suzanne Lesage
- Université Pierre et Marie Curie-Paris, Centre de Recherche de l’Institut du Cerveau et de la Moelle Epinière, UMR-S975, Paris, France
- Institut National de la Sante et de la Recherche Medicale, UMR_S975 CRicm, Paris, France
- Centre National de la Recherche Scientifique, UMR 7225, Paris, France
- Alexis Brice
- Université Pierre et Marie Curie-Paris, Centre de Recherche de l’Institut du Cerveau et de la Moelle Epinière, UMR-S975, Paris, France
- AP-HP, Hôpital de la Salpêtrière, Département de Génétique et Cytogénétique, Paris, France
- Institut National de la Sante et de la Recherche Medicale, UMR_S975 CRicm, Paris, France
- Centre National de la Recherche Scientifique, UMR 7225, Paris, France
- Peter Heutink
- Department of Clinical Genetics, Section of Medical Genomics, VU University Medical Centre, Amsterdam, The Netherlands
- Maria Martinez
- Institut National de la Sante et de la Recherche Medicale, UMR 1043, Centre de Physiopathologie de Toulouse-Purpan, Toulouse, France
- Département des Sciences du Vivant, Paul Sabatier University, Toulouse, France
- Nicholas W Wood
- Department of Molecular Neuroscience, Institute of Neurology, University College London, London, UK
- John Hardy
- Department of Molecular Neuroscience, Institute of Neurology, University College London, London, UK
- Andrew B. Singleton
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland
- Andreas Zell
- Center for Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tubingen, Germany
- Thomas Gasser
- Department for Neurodegenerative Diseases, Hertie Institute for Clinical Brain Research, University of Tübingen, and DZNE, German Centre for Neurodegenerative Diseases, Tübingen, Germany
- Manu Sharma
- Department for Neurodegenerative Diseases, Hertie Institute for Clinical Brain Research, University of Tübingen, and DZNE, German Centre for Neurodegenerative Diseases, Tübingen, Germany
68
Walters R, Laurin C, Lubke GH. An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data. Bioinformatics 2012; 28:2615-23. [PMID: 22847933 DOI: 10.1093/bioinformatics/bts483] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/08/2023]
Abstract
MOTIVATION There is growing momentum to develop statistical learning (SL) methods as an alternative to conventional genome-wide association studies (GWAS). Methods such as random forests (RF) and gradient boosting machine (GBM) result in variable importance measures that indicate how well each single-nucleotide polymorphism (SNP) predicts the phenotype. For RF, it has been shown that variable importance measures are systematically affected by minor allele frequency (MAF) and linkage disequilibrium (LD). To establish RF and GBM as viable alternatives for analyzing genome-wide data, it is necessary to address this potential bias and show that SL methods do not significantly under-perform conventional GWAS methods. RESULTS Both LD and MAF have a significant impact on the variable importance measures commonly used in RF and GBM. Dividing SNPs into overlapping subsets with approximate linkage equilibrium and applying SL methods to each subset successfully reduces the impact of LD. A welcome side effect of this approach is a dramatic reduction in parallel computing time, increasing the feasibility of applying SL methods to large datasets. The created subsets also facilitate a potential correction for the effect of MAF using pseudocovariates. Simulations using simulated SNPs embedded in empirical data (assessing varying effect sizes, minor allele frequencies and LD patterns) suggest that the sensitivity to detect effects is often improved by subsetting and does not significantly under-perform the Armitage trend test, even under ideal conditions for the trend test. AVAILABILITY Code for the LD subsetting algorithm and pseudocovariate correction is available at http://www.nd.edu/~glubke/code.html.
Affiliation(s)
- Raymond Walters
- Department of Psychology, University of Notre Dame, Notre Dame, IN 46556, USA
69
Pahikkala T, Okser S, Airola A, Salakoski T, Aittokallio T. Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithms Mol Biol 2012; 7:11. [PMID: 22551170 PMCID: PMC3606421 DOI: 10.1186/1748-7188-7-11] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 06/08/2011] [Accepted: 04/23/2012] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible. RESULTS We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension - UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. 
On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS. CONCLUSIONS Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.
Affiliation(s)
- Tapio Pahikkala
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
- Sebastian Okser
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
- Antti Airola
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
- Tapio Salakoski
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
- Tero Aittokallio
- Turku Centre for Computer Science, Turku, Finland
- Department of Mathematics, University of Turku, Turku, Finland
- Data Mining and Modeling group, Turku Centre for Biotechnology, Turku, Finland
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
70
Sun YV, Sung YJ, Tintle N, Ziegler A. Identification of genetic association of multiple rare variants using collapsing methods. Genet Epidemiol 2012; 35 Suppl 1:S101-6. [PMID: 22128049 DOI: 10.1002/gepi.20658] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
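The idea of collapsing, as described in the abstract above, can be sketched roughly as follows. This is only an illustration: the abstract does not say which collapsing methods were tested, so the carrier-indicator collapse, the 1% frequency threshold, the one-sided Fisher's exact test, the function names and the toy data are all assumptions made for the example.

```python
from math import comb

def collapse_rare_variants(genotypes, mafs, maf_threshold=0.01):
    """Return 1 for each individual carrying at least one rare minor allele, else 0."""
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher p-value for the table [[a, b], [c, d]].

    a, b: carriers / non-carriers among cases
    c, d: carriers / non-carriers among controls
    Tests for an excess of carriers among cases.
    """
    n_cases, n_carriers, total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_cases, k) * comb(total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_carriers)

# Hypothetical toy gene: 8 individuals x 4 variants, minor-allele counts 0/1/2.
genotypes = [
    [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 0],  # cases
    [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 2, 0], [0, 1, 0, 0],  # controls
]
mafs = [0.004, 0.008, 0.25, 0.006]   # the third variant is common and is ignored
is_case = [1, 1, 1, 1, 0, 0, 0, 0]

burden = collapse_rare_variants(genotypes, mafs)
a = sum(1 for x, y in zip(burden, is_case) if x == 1 and y == 1)
b = sum(1 for x, y in zip(burden, is_case) if x == 0 and y == 1)
c = sum(1 for x, y in zip(burden, is_case) if x == 1 and y == 0)
d = sum(1 for x, y in zip(burden, is_case) if x == 0 and y == 0)
p_value = fisher_exact_one_sided(a, b, c, d)
print(burden, round(p_value, 4))
```

Collapsing replaces several sparse per-variant tests with a single gene-level test, which is where the gain in power comes from; it also means that a noncausal rare variant in the same gene contributes to the signal just as a causal one would.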
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
71
Sepe DM, McWilliams T, Chen J, Kershenbaum A, Zhao H, La M, Devidas M, Lange B, Rebbeck TR, Aplenc R. Germline genetic variation and treatment response on CCG-1891. Pediatr Blood Cancer 2012; 58:695-700. [PMID: 21618417 PMCID: PMC3165089 DOI: 10.1002/pbc.23192] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 02/15/2011] [Accepted: 04/12/2011] [Indexed: 01/30/2023]
Abstract
BACKGROUND Recent studies suggest that polymorphisms in genes encoding enzymes involved in drug detoxification and metabolism may influence disease outcome in pediatric acute lymphoblastic leukemia (ALL). We sought to extend current knowledge by using standard and novel statistical methodology to examine polymorphic variants of genes and relapse risk, toxicity, and drug dose delivery in standard risk ALL. PROCEDURE We genotyped and abstracted chemotherapy drug dose data from treatment roadmaps on 557 patients on the Children's Cancer Group ALL study, CCG-1891. Fourteen common polymorphisms in genes involved in folate metabolism and/or phase I and II drug detoxification were evaluated individually and clique-finding methodology was employed for detection of significant gene-gene interactions. RESULTS After controlling for known risk factors, polymorphisms in four genes: GSTP1*B (HR = 1.94, P = 0.047), MTHFR (HR = 1.61, P = 0.034), MTRR (HR = 1.95, P = 0.01), and TS (3R/4R, HR = 3.69, P = 0.007) were found to significantly increase relapse risk. One gene-gene pair, MTRR A/G and GSTM1 null genotype, significantly increased the risk of relapse after correction for multiple comparisons (P = 0.012). Multiple polymorphisms were associated with various toxicities and there was no significant difference in dose of chemotherapy delivered by genotypes. CONCLUSIONS These data suggest that various polymorphisms play a role in relapse risk and toxicity during childhood ALL therapy and that genotype does not play a role in adjustment of drug dose administered. Additionally, gene-gene interactions may increase the risk of relapse in childhood ALL and the clique method may have utility in further exploring these interactions.
Affiliation(s)
- Dana M. Sepe
- The Children’s Hospital of Philadelphia, Philadelphia, PA, University of Pennsylvania School of Medicine, Philadelphia, PA
- Jinbo Chen
- University of Pennsylvania School of Medicine, Philadelphia, PA
- Huaqing Zhao
- The Children’s Hospital of Philadelphia, Philadelphia, PA
- Mei La
- Children’s Oncology Group, Arcadia, CA
- Meenakshi Devidas
- Children’s Oncology Group Statistics and Data Center and University of Florida College of Medicine, Gainesville, FL
- Beverly Lange
- The Children’s Hospital of Philadelphia, Philadelphia, PA, University of Pennsylvania School of Medicine, Philadelphia, PA
- Richard Aplenc
- The Children’s Hospital of Philadelphia, Philadelphia, PA, University of Pennsylvania School of Medicine, Philadelphia, PA
72
Dasgupta A, Sun YV, König IR, Bailey-Wilson JE, Malley JD. Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience. Genet Epidemiol 2012; 35 Suppl 1:S5-11. [PMID: 22128059 DOI: 10.1002/gepi.20642] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Indexed: 11/07/2022]
Abstract
Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression-based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high-dimension, low-sample-size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression-based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree-based methods (e.g., decision trees and random forests), were used for variable selection (selecting genetic and clinical features most associated or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case-control status or quantitative trait value. We include a discussion of cross-validation for model selection and assessment, and a description of available software resources for these methods.
Affiliation(s)
- Abhijit Dasgupta
- Clinical Sciences Section, National Institute of Arthritis, Musculoskeletal, and Skin Diseases, National Institutes of Health, Bethesda, MD 21224, USA
73
Fontanarosa JB, Dai Y. Using LASSO regression to detect predictive aggregate effects in genetic studies. BMC Proc 2011; 5 Suppl 9:S69. [PMID: 22373537 PMCID: PMC3287908 DOI: 10.1186/1753-6561-5-s9-s69] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022] Open
Abstract
We use least absolute shrinkage and selection operator (LASSO) regression to select genetic markers and phenotypic features that are most informative with respect to a trait of interest. We compare several strategies for applying LASSO methods in risk prediction models, using the Genetic Analysis Workshop 17 exome simulation data consisting of 697 individuals with information on genotypic and phenotypic features (smoking, age, sex) in a 5-fold cross-validated fashion. The cross-validated averages of the area under the receiver operating characteristic curve range from 0.45 to 0.63 for different strategies using only genotypic markers. The same values are improved to 0.69–0.87 when both genotypic and phenotypic information are used. The ability of the LASSO method to find true causal markers is limited, but the method was able to discover several common variants (e.g., FLT1) under certain conditions.
Affiliation(s)
- Joel B Fontanarosa
- Bioinformatics Program, Department of Bioengineering (MC 063), University of Illinois at Chicago, 851 S. Morgan Street, 218 SEO, Chicago, IL 60607-7052, USA.
74
Molinaro AM, Carriero N, Bjornson R, Hartge P, Rothman N, Chatterjee N. Power of data mining methods to detect genetic associations and interactions. Hum Hered 2011; 72:85-97. [PMID: 21934324 DOI: 10.1159/000330579] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 01/06/2011] [Accepted: 07/04/2011] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR). METHODS We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma. RESULTS The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest. CONCLUSIONS Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
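The permutation step described in the METHODS of the abstract above can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' procedure: the importance statistic here is a deliberately simple stand-in (absolute difference in mean allele count between cases and controls) rather than an RF, MCLR or MDR variable importance measure, and the function names, seed, permutation count and toy data are hypothetical.

```python
import random

def importance(genotype, labels):
    """Stand-in importance statistic: |mean allele count in cases - in controls|."""
    cases = [g for g, y in zip(genotype, labels) if y == 1]
    controls = [g for g, y in zip(genotype, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_p_value(genotype, labels, n_perm=1000, seed=0):
    """Estimate a p-value by shuffling case-control labels to build the null distribution."""
    rng = random.Random(seed)
    observed = importance(genotype, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if importance(genotype, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy SNP: minor-allele counts for 6 cases followed by 6 controls.
genotype = [2, 2, 1, 2, 1, 2, 0, 0, 1, 0, 0, 0]
labels = [1] * 6 + [0] * 6

p_value = permutation_p_value(genotype, labels)
print(p_value)
```

Because the null distribution is generated from the data themselves, the same wrapper applies to any importance measure whose sampling distribution is unknown; only the `importance` function would need to be swapped out.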
Affiliation(s)
- Annette M Molinaro
- Division of Biostatistics, School of Public Health, Yale University, New Haven, Conn., USA. annette.molinaro@yale.edu
75
Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y. Comparative analysis of methods for detecting interacting loci. BMC Genomics 2011; 12:344. [PMID: 21729295 PMCID: PMC3161015 DOI: 10.1186/1471-2164-12-344] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 02/25/2011] [Accepted: 07/05/2011] [Indexed: 12/20/2022] Open
Abstract
Background Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted. Results We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. 
Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs. Conclusion This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulation-tool-bmc-ms9169818735220977/downloads/list.
Affiliation(s)
- Li Chen
- Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
76
Alekseyenko AV, Lytkin NI, Ai J, Ding B, Padyukov L, Aliferis CF, Statnikov A. Causal graph-based analysis of genome-wide association data in rheumatoid arthritis. Biol Direct 2011; 6:25. [PMID: 21592391 PMCID: PMC3118953 DOI: 10.1186/1745-6150-6-25] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 12/13/2010] [Accepted: 05/18/2011] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease. RESULTS Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype. 
CONCLUSIONS Our experiments demonstrate that application of TIE* captures maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery. REVIEWERS This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.
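The classification accuracies in this abstract are reported as area under the ROC curve. As a minimal sketch of what that number measures (not the authors' evaluation code), the AUC can be computed as the fraction of case/control pairs in which the case receives the higher predicted score. The labels and scores below are made-up illustration values, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pair counting (Mann-Whitney identity).

    labels: 1 for cases, 0 for controls; scores: higher means "more likely a case".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # A case/control pair counts 1 if the case scores higher, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three cases, three controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 case/control pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.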
Affiliation(s)
- Alexander V Alekseyenko
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY 10016, USA.
|
77
|
Cosgun E, Limdi NA, Duarte CW. High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics 2011; 27:1384-9. [PMID: 21450715 DOI: 10.1093/bioinformatics/btr159] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Indexed: 12/30/2022]
Abstract
MOTIVATION With complex traits and diseases having potential genetic contributions of thousands of genetic factors, and with current genotyping arrays consisting of millions of single nucleotide polymorphisms (SNPs), powerful high-dimensional statistical techniques are needed to comprehensively model the genetic variance. Machine learning techniques have many advantages including lack of parametric assumptions, and high power and flexibility. RESULTS We have applied three machine learning approaches: Random Forest Regression (RFR), Boosted Regression Tree (BRT) and Support Vector Regression (SVR) to the prediction of warfarin maintenance dose in a cohort of African Americans. We have developed a multi-step approach that selects SNPs, builds prediction models with different subsets of selected SNPs along with known associated genetic and environmental variables and tests the discovered models in a cross-validation framework. Preliminary results indicate that our modeling approach gives much higher accuracy than previous models for warfarin dose prediction. A model size of 200 SNPs (in addition to the known genetic and environmental variables) gives the best accuracy. The R² between the predicted and actual square root of warfarin dose in this model was on average 66.4% for RFR, 57.8% for SVR and 56.9% for BRT. Thus RFR had the best accuracy, but all three techniques achieved better performance than the current published R² of 43% in a sample of mixed ethnicity, and 27% in an African American sample. In summary, machine learning approaches for high-dimensional pharmacogenetic prediction, and for prediction of clinical continuous traits of interest, hold great promise and warrant further research.
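The abstract reports accuracy as the R² between predicted and actual square-root dose. The sketch below shows one common reading of that statistic, the coefficient of determination 1 - SS_res/SS_tot on the square-root scale. This is an assumption: the abstract does not say whether R² is defined this way or as a squared correlation. The doses and predictions are invented illustration values.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical doses (arbitrary units), transformed to the square-root scale.
doses = [25.0, 36.0, 49.0, 64.0]
sqrt_actual = [math.sqrt(d) for d in doses]   # 5, 6, 7, 8
sqrt_predicted = [5.5, 6.0, 7.0, 7.5]         # hypothetical model output
r2 = r_squared(sqrt_actual, sqrt_predicted)   # 1 - 0.5 / 5.0 = 0.9
```

In a cross-validation framework such as the one the abstract describes, this value would be computed on held-out folds and averaged.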
Affiliation(s)
- Erdal Cosgun
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
78
|
González-Recio O, Forni S. Genome-wide prediction of discrete traits using Bayesian regressions and machine learning. Genet Sel Evol 2011; 43:7. [PMID: 21329522 PMCID: PMC3400433 DOI: 10.1186/1297-9686-43-7] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Received: 05/31/2010] [Accepted: 02/17/2011] [Indexed: 01/18/2023]
Abstract
BACKGROUND Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates) small n (number of observations) problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context. METHODS This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO) and two machine learning algorithms (boosting and random forest) to analyze discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be observed records. Performances were compared based on the models' predictive ability. RESULTS The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals, with a phi correlation up to 81% greater than that of Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. CONCLUSIONS The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals.
Among the different alternatives proposed to analyze discrete traits, machine learning showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber loss function showed high accuracy, whereas Random Forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be case-dependent, and an initial evaluation of different methods is recommended to deal with a particular problem.
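The phi correlation used above to compare classification performance is the Pearson correlation between two binary variables, here predicted and observed class. A minimal sketch from a 2x2 confusion table follows; the counts are invented for illustration and the function name `phi_coefficient` is hypothetical.

```python
import math


def phi_coefficient(tp, fn, fp, tn):
    """Phi correlation between predicted and observed binary classes.

    tp/fn/fp/tn are the four cells of a 2x2 confusion table.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 50 susceptible and 50 resistant animals.
phi = phi_coefficient(tp=40, fn=10, fp=5, tn=45)
```

Phi ranges from -1 to 1, with 0 meaning the predictions carry no information about the observed class.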
|
79
|
L2-Boosting algorithm applied to high-dimensional problems in genomic selection. Genet Res (Camb) 2010; 92:227-37. [PMID: 20667166 DOI: 10.1017/s0016672310000261] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 11/07/2022]
Abstract
The L2-Boosting algorithm is one of the most promising machine-learning techniques that has appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet to be observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression, were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006, respectively) data sets. In the dairy cattle data set, the BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set.
L2-Boosting with a suitable learner was found to be a competitive alternative for genomic selection applications, providing high accuracy and low bias in genomic-assisted evaluations with a relatively short computational time.
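To make the idea of boosting with an OLS weak learner concrete, here is a minimal sketch of componentwise L2-Boosting: at each step, the current residuals are regressed on every single SNP, the SNP that reduces the residual sum of squares most is chosen, and a shrunken fraction of its fit is added to the model. This is a generic textbook variant and an assumption about the setup, not the paper's exact procedure; the function name `l2_boost`, the shrinkage value `nu`, and the toy genotypes are hypothetical.

```python
def l2_boost(X, y, n_iter=200, nu=0.1):
    """Componentwise L2-Boosting with single-predictor OLS as the weak learner.

    X: list of rows (e.g. genotypes coded 0/1/2); y: list of phenotypes.
    Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    means = [sum(c) / n for c in cols]
    centered = [[v - m for v in c] for c, m in zip(cols, means)]
    sumsq = [sum(v * v for v in c) for c in centered]

    intercept = sum(y) / n
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]

    for _ in range(n_iter):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            if sumsq[j] == 0.0:
                continue  # monomorphic marker carries no information
            b = sum(v * r for v, r in zip(centered[j], resid)) / sumsq[j]
            gain = b * b * sumsq[j]  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:
            break  # residuals are already orthogonal to every predictor
        coef[best_j] += nu * best_b
        resid = [r - nu * best_b * v for r, v in zip(resid, centered[best_j])]

    # Undo the centering so the model applies to raw genotype codes.
    intercept -= sum(c * m for c, m in zip(coef, means))
    return intercept, coef


# Hypothetical toy data: the phenotype depends only on the first marker.
X = [[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]]
y = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
intercept, coef = l2_boost(X, y)
```

In practice the number of iterations acts as the regularization parameter and would be tuned on held-out data, since running boosting to convergence with many markers tends to overfit.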
|
80
|
Park L. Identifying disease polymorphisms from case-control genetic association data. Genetica 2010; 138:1147-59. [PMID: 20949309 DOI: 10.1007/s10709-010-9505-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/05/2010] [Accepted: 09/27/2010] [Indexed: 12/18/2022]
Abstract
In case-control association studies, it is typical to observe several associated polymorphisms in a gene region. Often the most significantly associated polymorphism is considered to be the disease polymorphism; however, it is not clear whether it is the disease polymorphism or there is more than one disease polymorphism in the gene region. Currently, there is no method that can handle these problems based on the linkage disequilibrium (LD) relationship between polymorphisms. To distinguish real disease polymorphisms from markers in LD, a method that can detect disease polymorphisms in a gene region has been developed. Relying on the LD between polymorphisms in controls, the proposed method utilizes model-based likelihood ratio tests to find disease polymorphisms. This method shows reliable Type I and Type II error rates when sample sizes are large enough, and works better with re-sequenced data. Applying this method to fine mapping using re-sequencing or dense genotyping data would provide important information regarding the genetic architecture of complex traits.
Affiliation(s)
- L Park
- Natural Science Research Institute, Yonsei University, 134 Shinchon-Dong, Seodaemun-Ku, Seoul 120-749, Korea.
|
81
|
Abstract
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.
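The first step described above, grouping SNPs into sets by proximity to genes, can be sketched as a simple interval lookup. This illustrates only the grouping step, not the kernel-machine test itself. The gene coordinates, SNP identifiers, and the `flank` parameter are all hypothetical.

```python
def group_snps_by_gene(snps, genes, flank=0):
    """Assign each SNP to every gene whose (flank-extended) interval contains it.

    snps:  list of (snp_id, chrom, position)
    genes: list of (gene_name, chrom, start, end)
    flank: number of bases added on each side of a gene
    """
    snp_sets = {name: [] for name, _, _, _ in genes}
    for snp_id, chrom, pos in snps:
        for name, g_chrom, start, end in genes:
            if chrom == g_chrom and start - flank <= pos <= end + flank:
                snp_sets[name].append(snp_id)
    return snp_sets


# Hypothetical annotation and genotyped SNPs.
genes = [("GENE_A", "1", 1000, 2000), ("GENE_B", "1", 5000, 6000)]
snps = [("snp1", "1", 900), ("snp2", "1", 1500), ("snp3", "1", 5500), ("snp4", "2", 1500)]
snp_sets = group_snps_by_gene(snps, genes, flank=200)
```

Each resulting set would then be tested jointly, so the number of tests scales with the number of genes rather than the number of SNPs.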
|
82
|
|