1
Hu X, Meng Z. Using potential variable to study gene-gene and gene-environment interaction effects with genetic model uncertainty. Ann Hum Genet 2022; 86:257-267. [PMID: 35582845 DOI: 10.1111/ahg.12470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Revised: 03/02/2022] [Accepted: 04/08/2022] [Indexed: 11/28/2022]
Abstract
One of the critical issues in genetic association studies is to evaluate the risk of a disease associated with gene-gene or gene-environment interactions. The commonly employed procedures are derived by assigning a particular set of scores to genotypes. However, the underlying genetic models of inheritance are rarely known in practice. Misspecifying a genetic model may result in power loss. By using some potential genetic variables to separate the genotype coding and genetic model parameter, we construct a model-embedded score test (MEST). Our test is free of assumption of gene-environment independence and allows for covariates in the model. An effective sequential optimization algorithm is developed. Extensive simulations show the proposed MEST is robust and powerful in most of scenarios. Finally, we apply the proposed method to rheumatoid arthritis data from the Genetic Analysis Workshop 16 to further investigate the potential interaction effects.
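To illustrate what "assigning a particular set of scores to genotypes" means in practice, the following minimal Python sketch lists the conventional score sets for the recessive, additive, and dominant models of inheritance. It is not the MEST procedure described in the abstract; the names `GENETIC_MODEL_SCORES` and `code_genotypes`, and the 0/0.5/1 scaling of the additive model, are assumptions chosen for illustration only.

```python
# Conventional genotype scores, indexed by the number of risk alleles (0, 1, 2).
# These score sets are a common textbook convention, not taken from the paper.
GENETIC_MODEL_SCORES = {
    "recessive": (0.0, 0.0, 1.0),  # risk only with two copies
    "additive": (0.0, 0.5, 1.0),   # risk grows linearly with copies
    "dominant": (0.0, 1.0, 1.0),   # one copy is enough
}


def code_genotypes(risk_allele_counts, model):
    """Map risk-allele counts (0, 1 or 2) to scores under the chosen model."""
    scores = GENETIC_MODEL_SCORES[model]
    return [scores[count] for count in risk_allele_counts]


if __name__ == "__main__":
    genotypes = [0, 1, 2, 1]
    for model in GENETIC_MODEL_SCORES:
        print(model, code_genotypes(genotypes, model))
```

The same genotypes receive different scores under each model, which is why a test built on the wrong score set can lose power when the true mode of inheritance is unknown.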
Affiliation(s)
- Xiaonan Hu
  - NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Zhen Meng
  - School of Statistics, Capital University of Economics and Business, Beijing, China
2
Gola D, König IR. Empowering individual trait prediction using interactions for precision medicine. BMC Bioinformatics 2021; 22:74. [PMID: 33602124 PMCID: PMC7890638 DOI: 10.1186/s12859-021-04011-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/10/2019] [Accepted: 02/08/2021] [Indexed: 11/11/2022] Open
Abstract
Background One component of precision medicine is to construct prediction models with predictive ability as high as possible, e.g. to enable individual risk prediction. In genetic epidemiology, complex diseases like coronary artery disease, rheumatoid arthritis, and type 2 diabetes have a polygenic basis, and a common assumption is that biological and genetic features affect the outcome under consideration via interactions. In the case of omics data, the use of standard approaches such as generalized linear models may be suboptimal, and machine learning methods are appealing for making individual predictions. However, most of these algorithms focus on main or marginal effects of the single features in a dataset. On the other hand, the detection of interacting features is an active area of research in the realm of genetic epidemiology. One big class of algorithms to detect interacting features is based on the multifactor dimensionality reduction (MDR). Here, we further develop the model-based MDR (MB-MDR), a powerful extension of the original MDR algorithm, to enable interaction-empowered individual prediction. Results Using a comprehensive simulation study we show that our new algorithm (median AUC: 0.66) can use information hidden in interactions and outperforms two other state-of-the-art algorithms, namely the Random Forest (median AUC: 0.54) and Elastic Net (median AUC: 0.50), if interactions are present in a scenario of two pairs of two features having small effects. The performance of these algorithms is comparable if no interactions are present. Further, we show that our new algorithm is applicable to real data by comparing the performance of the three algorithms on a dataset of rheumatoid arthritis cases and healthy controls.
As our new algorithm is not only applicable to biological/genetic data but to all datasets with discrete features, it may have practical implications in other research fields where interactions between features have to be considered as well, and we made our method available as an R package (https://github.com/imbs-hl/MBMDRClassifieR). Conclusions The explicit use of interactions between features can improve the prediction performance and thus should be included in further attempts to move precision medicine forward.
Affiliation(s)
- Damian Gola
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
3
Kwak M. Genome-wide association study using truncated likelihood with incomplete information for stratum specific missingness. J Korean Stat Soc 2020. [DOI: 10.1007/s42952-020-00064-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
4
Wright MN, König IR. Splitting on categorical predictors in random forests. PeerJ 2019; 7:e6339. [PMID: 30746306 PMCID: PMC6368971 DOI: 10.7717/peerj.6339] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 06/26/2018] [Accepted: 12/22/2018] [Indexed: 11/20/2022] Open
Abstract
One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k − 1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster.
We recommend using this approach as the default in RFs.
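The difference between the two split counts mentioned in this abstract can be made concrete with a short Python sketch. It assumes the usual ordering rule for regression and binary (0/1) outcomes, namely sorting the categories by their mean outcome; the function names are hypothetical and the code is not taken from the paper or from any random forest package.

```python
from collections import defaultdict


def n_partition_splits(k):
    """Number of 2-partitions of k unordered categories: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1


def n_ordered_splits(k):
    """Number of cut points once the k categories are ordered: k - 1."""
    return k - 1


def order_categories_by_mean(categories, outcomes):
    """Order categories by mean outcome (assumed rule for regression
    or a 0/1-coded binary outcome)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, outcome in zip(categories, outcomes):
        sums[category] += outcome
        counts[category] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])


if __name__ == "__main__":
    for k in (3, 10, 30):
        print(k, n_partition_splits(k), n_ordered_splits(k))
    print(order_categories_by_mean(["a", "b", "a", "c"], [1, 0, 1, 0.5]))
```

For 30 categories the exhaustive approach already faces more than 500 million candidate partitions, while the ordered approach needs only 29 cut points.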
Affiliation(s)
- Marvin N Wright
  - Leibniz Institute for Prevention Research and Epidemiology-BIPS, Bremen, Germany
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
5
Xue Y, Wang J, Ding J, Zhang S, Li Q. A powerful test for ordinal trait genetic association analysis. Stat Appl Genet Mol Biol 2019; 18:/j/sagmb.ahead-of-print/sagmb-2017-0066/sagmb-2017-0066.xml. [PMID: 30685746 DOI: 10.1515/sagmb-2017-0066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Response-selective sampling designs are commonly adopted in genetic epidemiologic studies because, compared with prospective designs, they can substantially reduce time cost and increase the power to identify deleterious genetic variants that predispose to human complex diseases. The proportional odds model (POM) can be used to fit data obtained by this design. Unlike with the logistic regression model, the genetic effect estimated from the POM by treating the data as if they had been enrolled prospectively is inconsistent, so the power of the resulting Wald test is not satisfactory. The modified POM is suitable for this type of data; however, the corresponding Wald test is not optimal when the genetic effect is small. Here, we propose a new association test to handle this issue. Simulation studies show that the proposed test controls the type I error rate correctly and is more powerful than two existing methods. Finally, we applied the three tests to anti-cyclic citrullinated protein antibody data from Genetic Analysis Workshop 16.
Affiliation(s)
- Yuan Xue
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Jinjuan Wang
  - LSC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Juan Ding
  - School of Mathematics and Statistics, Guangxi Normal University, Guilin, China
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
- Qizhai Li
  - LSC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Zhongguancun East Road, 55, Beijing 100190, China, Phone: +86-10-82541839
6
Saad MN, Mabrouk MS, Eldeib AM, Shaker OG. Studying the effects of haplotype partitioning methods on the RA-associated genomic results from the North American Rheumatoid Arthritis Consortium (NARAC) dataset. J Adv Res 2019; 18:113-126. [PMID: 30891314 PMCID: PMC6403413 DOI: 10.1016/j.jare.2019.01.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/05/2018] [Revised: 01/03/2019] [Accepted: 01/14/2019] [Indexed: 12/16/2022] Open
Abstract
Haplotype block methods play a complementary role to the single-SNP approaches. CIT, FGT, SSLD, and single-SNP methods should be applied to discover the markers. The method selected for the association analysis has an impact on the biomarkers detected. The SSLD method detected more significant SNPs than the CIT, FGT, and single-SNP methods. The 383 SNPs discovered by all methods are significantly associated with RA.
The human genome, which includes thousands of genes, represents a big data challenge. Rheumatoid arthritis (RA) is a complex autoimmune disease with a genetic basis. Many single-nucleotide polymorphism (SNP) association methods partition a genome into haplotype blocks. The aim of this genome wide association study (GWAS) was to select the most appropriate haplotype block partitioning method for the North American Rheumatoid Arthritis Consortium (NARAC) dataset. The methods used for the NARAC dataset were the individual SNP approach and the following haplotype block methods: the four-gamete test (FGT), confidence interval test (CIT), and solid spine of linkage disequilibrium (SSLD). The measured parameters that reflect the strength of the association between the biomarker and RA were the P-value after Bonferroni correction and other parameters used to compare the output of each haplotype block method. This work presents a comparison among the individual SNP approach and the three haplotype block methods to select the method that can detect all the significant SNPs when applied alone. The GWAS results from the NARAC dataset obtained with the different methods are presented. The individual SNP, CIT, FGT, and SSLD methods detected 541, 1516, 1551, and 1831 RA-associated SNPs respectively, and the individual SNP, FGT, CIT, and SSLD methods detected 65, 156, 159, and 450 significant SNPs respectively, that were not detected by the other methods. Three hundred eighty-three SNPs were discovered by the haplotype block methods and the individual SNP approach, while 1021 SNPs were discovered by all three haplotype block methods. The 383 SNPs detected by all the methods are promising candidates for studying RA susceptibility. A hybrid technique involving all four methods should be applied to detect the significant SNPs associated with RA in the NARAC dataset, but the SSLD method may be preferred because of its advantages when only one method was used.
Affiliation(s)
- Mohamed N Saad
  - Biomedical Engineering Department, Faculty of Engineering, Minia University, Minia, Egypt
- Mai S Mabrouk
  - Biomedical Engineering Department, Faculty of Engineering, Misr University for Science and Technology, 6th of October City, Egypt
- Ayman M Eldeib
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
- Olfat G Shaker
  - Medical Biochemistry and Molecular Biology Department, Faculty of Medicine, Cairo University, Cairo, Egypt
7
Saad MN, Mabrouk MS, Eldeib AM, Shaker OG. Comparative study for haplotype block partitioning methods - Evidence from chromosome 6 of the North American Rheumatoid Arthritis Consortium (NARAC) dataset. PLoS One 2019; 13:e0209603. [PMID: 30596705 PMCID: PMC6312333 DOI: 10.1371/journal.pone.0209603] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/08/2018] [Accepted: 12/07/2018] [Indexed: 11/19/2022] Open
Abstract
Haplotype-based methods compete with “one-SNP-at-a-time” approaches on being preferred for association studies. Chromosome 6 contains most of the known genetic biomarkers for rheumatoid arthritis (RA) disease. Therefore, chromosome 6 serves as a benchmark for the haplotype methods testing. The aim of this study is to test the North American Rheumatoid Arthritis Consortium (NARAC) dataset to find out if haplotype block methods or single-locus approaches alone can sufficiently provide the significant single nucleotide polymorphisms (SNPs) associated with RA. In addition, could we be satisfied with only one method of the haplotype block methods for partitioning chromosome 6 of the NARAC dataset? In the NARAC dataset, chromosome 6 comprises 35,574 SNPs for 2,062 individuals (868 cases, 1,194 controls). Individual SNP approach and three haplotype block methods were applied to the NARAC dataset to identify the RA biomarkers. We employed three haplotype partitioning methods which are confidence interval test (CIT), four gamete test (FGT), and solid spine of linkage disequilibrium (SSLD). P-values after stringent Bonferroni correction for multiple testing were measured to assess the strength of association between the genetic variants and RA susceptibility. Moreover, the block size (in base pairs (bp) and number of SNPs included), number of blocks, percentage of uncovered SNPs by the block method, percentage of significant blocks from the total number of blocks, number of significant haplotypes and SNPs were used to compare among the three haplotype block methods. Individual SNP, CIT, FGT, and SSLD methods detected 432, 1,086, 1,099, and 1,322 associated SNPs, respectively. Each method identified significant SNPs that were not detected by any other method (Individual SNP: 12, FGT: 37, CIT: 55, and SSLD: 189 SNPs). 916 SNPs were discovered by all the three haplotype block methods. 367 SNPs were discovered by the haplotype block methods and the individual SNP approach. 
The P-values of these 367 SNPs were lower than those of the SNPs uniquely detected by only one method. The 367 SNPs detected by all the methods represent promising candidates for RA susceptibility. They should be further investigated for the European population. A hybrid technique including the four methods should be applied to detect the significant SNPs associated with RA for chromosome 6 of the NARAC dataset. Moreover, SSLD method may be preferred for its favored benefits in case of selecting only one method.
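The Bonferroni correction referred to in this abstract can be sketched in a few lines of Python. The example assumes a family-wise significance level of 0.05 and uses the 35,574 chromosome 6 SNPs as the number of tests; both are assumptions made for illustration, since the abstract does not state the level used or how many tests entered the correction for the haplotype block methods. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni: alpha / n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that stays significant after correcting for
    len(p_values) tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Assumed setting: alpha = 0.05 across the 35,574 chromosome 6 SNPs,
    # which gives a per-SNP threshold of roughly 1.4e-6.
    print(bonferroni_threshold(0.05, 35574))
    print(significant_after_bonferroni([0.01, 0.03, 0.001]))
```

Comparing each raw P-value with `alpha / n_tests` is equivalent to multiplying the P-value by the number of tests (capped at 1) and comparing the result with `alpha`.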
Affiliation(s)
- Mohamed N. Saad
  - Biomedical Engineering Department, Faculty of Engineering, Minia University, Minia, Egypt
- Mai S. Mabrouk
  - Biomedical Engineering Department, Faculty of Engineering, Misr University for Science and Technology (MUST), 6th of October City, Egypt
- Ayman M. Eldeib
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
- Olfat G. Shaker
  - Medical Biochemistry and Molecular Biology Department, Faculty of Medicine, Cairo University, Cairo, Egypt
8
Bao M, Wang K. Genome-wide association studies using a penalized moving-window regression. Bioinformatics 2017; 33:3887-3894. [PMID: 28961706 PMCID: PMC5860090 DOI: 10.1093/bioinformatics/btx522] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/20/2017] [Revised: 07/31/2017] [Accepted: 08/15/2017] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) have played an important role in identifying genetic variants underlying human complex traits. However, its success is hindered by weak effect at causal variants and presence of noise at non-causal variants. In an effort to overcome these difficulties, a previous study proposed a regularized regression method that penalizes on the difference of signal strength between two consecutive single-nucleotide polymorphisms (SNPs). RESULTS We provide a generalization to the afore-mentioned method so that more adjacent SNPs can be incorporated. The choice of optimal number of SNPs is studied. Simulation studies indicate that when consecutive SNPs have similar absolute coefficients our method performs better than using LASSO penalty. In other situations, our method is still comparable to using LASSO penalty. The practical utility of the proposed method is demonstrated by applying it to Genetic Analysis Workshop 16 rheumatoid arthritis GWAS data. AVAILABILITY AND IMPLEMENTATION An implementation of the proposed method is provided in R package MWLasso. CONTACT kai-wang@uiowa.edu.
Affiliation(s)
- Minli Bao
  - Interdisciplinary Graduate Program in Applied Mathematical and Computational Sciences, University of Iowa, Iowa City, IA, USA
- Kai Wang
  - Department of Biostatistics, University of Iowa, Iowa City, IA, USA
9
Friedrichs S, Manitz J, Burger P, Amos CI, Risch A, Chang-Claude J, Wichmann HE, Kneib T, Bickeböller H, Hofner B. Pathway-Based Kernel Boosting for the Analysis of Genome-Wide Association Studies. Computational and Mathematical Methods in Medicine 2017; 2017:6742763. [PMID: 28785300 PMCID: PMC5530424 DOI: 10.1155/2017/6742763] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 02/10/2017] [Revised: 04/15/2017] [Accepted: 05/10/2017] [Indexed: 01/24/2023]
Abstract
The analysis of genome-wide association studies (GWAS) benefits from the investigation of biologically meaningful gene sets, such as gene-interaction networks (pathways). We propose an extension to a successful kernel-based pathway analysis approach by integrating kernel functions into a powerful algorithmic framework for variable selection, to enable investigation of multiple pathways simultaneously. We employ genetic similarity kernels from the logistic kernel machine test (LKMT) as base-learners in a boosting algorithm. A model to explain case-control status is created iteratively by selecting pathways that improve its prediction ability. We evaluated our method in simulation studies adopting 50 pathways for different sample sizes and genetic effect strengths. Additionally, we included an exemplary application of kernel boosting to a rheumatoid arthritis and a lung cancer dataset. Simulations indicate that kernel boosting outperforms the LKMT in certain genetic scenarios. Applications to GWAS data on rheumatoid arthritis and lung cancer resulted in sparse models which were based on pathways interpretable in a clinical sense. Kernel boosting is highly flexible in terms of considered variables and overcomes the problem of multiple testing. Additionally, it enables the prediction of clinical outcomes. Thus, kernel boosting constitutes a new, powerful tool in the analysis of GWAS data and towards the understanding of biological processes involved in disease susceptibility.
Affiliation(s)
- Stefanie Friedrichs
  - Institute of Genetic Epidemiology, University Medical Centre, Georg-August University Göttingen, Göttingen, Germany
- Juliane Manitz
  - Department of Statistics and Econometrics, Georg-August University Göttingen, Göttingen, Germany
  - Department of Mathematics and Statistics, Boston University, Boston, MA, USA
- Patricia Burger
  - Institute of Genetic Epidemiology, University Medical Centre, Georg-August University Göttingen, Göttingen, Germany
- Christopher I. Amos
  - Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA
- Angela Risch
  - Division of Molecular Biology, University of Salzburg, Salzburg, Austria
  - Translational Lung Research Center Heidelberg (TLRC-H), Member of the German Center for Lung Research (DZL), Heidelberg, Germany
  - Division of Epigenomics and Cancer Risk Factors, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Jenny Chang-Claude
  - Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Heinz-Erich Wichmann
  - Institute of Medical Informatics, Biometry and Epidemiology, Chair of Epidemiology, Ludwig-Maximilians University, Munich, Germany
  - Helmholtz Center Munich, Institute of Epidemiology II, Munich, Germany
  - Institute of Medical Statistics and Epidemiology, Technical University Munich, Munich, Germany
- Thomas Kneib
  - Department of Statistics and Econometrics, Georg-August University Göttingen, Göttingen, Germany
- Heike Bickeböller
  - Institute of Genetic Epidemiology, University Medical Centre, Georg-August University Göttingen, Göttingen, Germany
- Benjamin Hofner
  - Department of Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Section Biostatistics, Paul-Ehrlich-Institut, Langen, Germany
10
Power Calculation of Multi-step Combined Principal Components with Applications to Genetic Association Studies. Sci Rep 2016; 6:26243. [PMID: 27189724 PMCID: PMC4870571 DOI: 10.1038/srep26243] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/12/2016] [Accepted: 04/28/2016] [Indexed: 12/03/2022] Open
Abstract
Principal component analysis (PCA) is a useful tool to identify important linear combination of correlated variables in multivariate analysis and has been applied to detect association between genetic variants and human complex diseases of interest. How to choose adequate number of principal components (PCs) to represent the original system in an optimal way is a key issue for PCA. Note that the traditional PCA, only using a few top PCs while discarding the other PCs, might significantly lose power in genetic association studies if all the PCs contain non-ignorable signals. In order to make full use of information from all PCs, Aschard and his colleagues have proposed a multi-step combined PCs method (named mCPC) recently, which performs well especially when several traits are highly correlated. However, the power superiority of mCPC has just been illustrated by simulation, while the theoretical power performance of mCPC has not been studied yet. In this work, we attempt to investigate theoretical properties of mCPC and further propose a novel and efficient strategy to combine PCs. Extensive simulation results confirm that the proposed method is more robust than existing procedures. A real data application to detect the association between gene TRAF1-C5 and rheumatoid arthritis further shows good performance of the proposed procedure.
11
Xu J, Yuan Z, Ji J, Zhang X, Li H, Wu X, Xue F, Liu Y. A powerful score-based test statistic for detecting gene-gene co-association. BMC Genet 2016; 17:31. [PMID: 26822525 PMCID: PMC4731962 DOI: 10.1186/s12863-016-0331-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/18/2015] [Accepted: 01/13/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The genetic variants identified by genome-wide association studies (GWAS) can only account for a small proportion of the total heritability of complex diseases. The existence of gene-gene joint effects, which contain the main effects and their co-association, is one of the possible explanations for the "missing heritability" problem. Gene-gene co-association refers to the extent to which the joint effects of two genes differ from the main effects, due not only to the traditional interaction under a nearly independent condition but also to the correlation between genes. Generally, genes tend to work collaboratively within a specific pathway or network contributing to the disease, and the specific disease-associated loci will often be highly correlated (e.g. single nucleotide polymorphisms (SNPs) in linkage disequilibrium). Therefore, we proposed a novel score-based statistic (SBS) as a gene-based method for detecting gene-gene co-association. RESULTS Various simulations illustrate that, under different sample sizes, marginal effects of causal SNPs and co-association levels, the proposed SBS performs better than other existing methods, including the single SNP-based and principal component analysis (PCA)-based logistic regression models, the statistics based on canonical correlations (CCU), kernel canonical correlation analysis (KCCU), partial least squares path modeling (PLSPM) and the delta-square (δ²) statistic. The real data analysis of rheumatoid arthritis (RA) further confirmed its advantages in practice. CONCLUSIONS SBS is a powerful and efficient gene-based method for detecting gene-gene co-association.
Affiliation(s)
- Jing Xu
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Zhongshang Yuan
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Jiadong Ji
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Xiaoshuai Zhang
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Hongkai Li
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Xuesen Wu
  - Department of Epidemiology and Statistics, Bengbu Medical College at Bengbu, Anhui, 233030, China.
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
- Yanxun Liu
  - Department of Biostatistics, School of Public Health, Shandong University, 44 Wen Hua Xi Road, PO Box 100, Jinan, 250012, China.
12
Zhang W, Li H, Li Z, Li Q. A two-phase procedure for non-normal quantitative trait genetic association study. BMC Bioinformatics 2016; 17:52. [PMID: 26821800 PMCID: PMC4730615 DOI: 10.1186/s12859-016-0888-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2015] [Accepted: 01/06/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The nonparametric trend test (NPT) is well suited for identifying the genetic variants associated with quantitative traits when the trait values do not satisfy the normal distribution assumption. If the genetic model, defined according to the mode of inheritance, is known, the NPT derived under the given genetic model is optimal. However, in practice, the genetic model is often unknown beforehand. The NPT derived from an uncorrected model might result in loss of power. When the underlying genetic model is unknown, a robust test is preferred to maintain satisfactory power. RESULTS We propose a two-phase procedure to handle the uncertainty of the genetic model for non-normal quantitative trait genetic association studies. First, a model selection procedure is employed to help choose the genetic model. Then the optimal test derived under the selected model is constructed to test for possible association. To control the type I error rate, we derive the joint distribution of the test statistics developed in the two phases and obtain the proper size. CONCLUSIONS Simulation results show that the proposed method is more robust than existing methods, and an application to the gene DNAH9 from the Genetic Analysis Workshop 16, tested for association with anti-cyclic citrullinated peptide antibody, further demonstrates its performance.
Affiliation(s)
- Wei Zhang
  - Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China.
- Huiyun Li
  - School of Management and Economics, Beijing Institute of Technology, Beijing, 100081, China.
- Zhaohai Li
  - Department of Statistics, George Washington University, Washington, 20052, DC, USA.
- Qizhai Li
  - Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China.
13
Zhang W, Li Q. Incorporating Hardy-Weinberg Equilibrium Law to Enhance the Association Strength for Ordinal Trait Genetic Study. Ann Hum Genet 2015; 80:102-12. [DOI: 10.1111/ahg.12142] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 08/14/2015] [Accepted: 10/21/2015] [Indexed: 11/25/2022]
Affiliation(s)
- Wei Zhang
  - Key Lab of Systems and Control; Academy of Mathematics and Systems Science; CAS; Beijing China
- Qizhai Li
  - Key Lab of Systems and Control; Academy of Mathematics and Systems Science; CAS; Beijing China
14
Zhang W, Zhang Z, Li X, Li Q. Fitting Proportional Odds Model to Case-Control data with Incorporating Hardy-Weinberg Equilibrium. Sci Rep 2015; 5:17286. [PMID: 26607176 PMCID: PMC4660314 DOI: 10.1038/srep17286] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/14/2015] [Accepted: 10/28/2015] [Indexed: 01/07/2023] Open
Abstract
Genetic association studies have proved to be an efficient tool for revealing the aetiology of many complex human diseases and traits. When the phenotype is binary, the logistic regression model is commonly employed to evaluate the association strength of genetic variants that predispose to human diseases, because the maximum likelihood estimator of the odds ratio based on case-control data is equivalent to that from the same model when the data are treated as having arisen prospectively. This equivalence does not hold for the proportional odds model, and using it to analyze case-control data directly often results in substantial bias. By including a parameter for the minor allele frequency in a modified likelihood function, under the condition that the Hardy-Weinberg equilibrium law holds within controls, a consistent estimator is obtained. On this basis, we construct a score test statistic to test whether the genetic variant is associated with the disease. Simulation studies show that the proposed estimator has smaller mean squared error than the existing methods when the genetic effect size is away from zero, and that the proposed test statistic has good control of the type I error rate and is more powerful than the existing procedures. Application to 45 single nucleotide polymorphisms located in the region of the TRAF1-C5 genes for association with the four-level anticyclic citrullinated protein antibody from Genetic Analysis Workshop 16 further demonstrates its performance.
Affiliation(s)
- Wei Zhang
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Zehui Zhang
- Central China Normal University, Wuhan, China
- Qizhai Li
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
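The Hardy-Weinberg equilibrium condition used in the abstract above ties the three genotype frequencies to a single allele frequency. The sketch below computes those frequencies for a biallelic marker. The function name `hwe_genotype_frequencies` is hypothetical, and the code illustrates only the equilibrium law itself, not the modified likelihood or the estimator proposed in the cited paper.

```python
def hwe_genotype_frequencies(maf):
    """Return genotype frequencies (0, 1, 2 copies of the minor allele) under HWE.

    With minor allele frequency p, Hardy-Weinberg equilibrium gives
    (1 - p)^2, 2 p (1 - p) and p^2 for the three genotypes.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    major = 1.0 - maf
    return (major * major, 2.0 * maf * major, maf * maf)


if __name__ == "__main__":
    freqs = hwe_genotype_frequencies(0.2)
    print(freqs, sum(freqs))
```

Because one parameter determines all three frequencies, assuming equilibrium among controls reduces the number of free parameters to be estimated.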
|
15
|
Zhang W, Li Q. Nonparametric Risk and Nonparametric Odds in Quantitative Genetic Association Studies. Sci Rep 2015; 5:12105. [PMID: 26174851 PMCID: PMC5378889 DOI: 10.1038/srep12105] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/22/2014] [Accepted: 06/17/2015] [Indexed: 12/30/2022] Open
Abstract
The coefficient in a linear regression model is commonly employed to evaluate the genetic effect of a single nucleotide polymorphism associated with a quantitative trait under the assumption that the trait value follows a normal distribution or is approximately normally distributed after a certain transformation. When this assumption is violated, distribution-free tests are preferred. In this work, we propose the nonparametric risk (NR) and nonparametric odds (NO), obtain the asymptotic normal distribution of the estimated NR, and then construct confidence intervals. We also define the genetic models using NR and construct a test statistic under a given genetic model as well as a robust test that is free of genetic model uncertainty. Simulation studies show that the proposed confidence intervals have satisfactory coverage probabilities and that the proposed test controls the type I error rate and is more powerful than the existing ones under most of the considered scenarios. Application to the gene PTPN22 and the genomic region 6p21.33 from the Genetic Analysis Workshop 16 for association with the anticyclic citrullinated protein antibody further shows their performance.
Affiliation(s)
- Wei Zhang
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Qizhai Li
- Key Laboratory of Systems Control, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
|
16
|
Li Z, Yuan A, Han G, Gao G, Li Q. Rank-based tests for identifying multiple genetic variants associated with quantitative traits. Ann Hum Genet 2015; 78:306-10. [PMID: 24942081 DOI: 10.1111/ahg.12067] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
Abstract
We consider the analysis of multiple genetic variants within a gene or a region that are expected to confer risks to human complex diseases with quantitative traits, where the trait values do not follow the normal distribution even after some transformations. We rank the phenotypic values, calculate a score to measure the trend effect of a particular allele for each marker, and then construct three statistics based on the quadratic frameworks of Hotelling's $T^2$, the summation of squared univariate statistics, and the inverse of the square root weighted statistics to combine the scores for different marker loci. Simulation results show that the above three test statistics can control the type I error rate well and are more robust than standard tests constructed based on linear regression. Application to GAW16 data for rheumatoid arthritis successfully detects the association between the HLA-DRB1 gene and the anticyclic citrullinated protein measure, while the standard methods based on the normality assumption cannot detect this association.
|
17
|
Freytag S, Manitz J, Schlather M, Kneib T, Amos CI, Risch A, Chang-Claude J, Heinrich J, Bickeböller H. A network-based kernel machine test for the identification of risk pathways in genome-wide association studies. Hum Hered 2014; 76:64-75. [PMID: 24434848 DOI: 10.1159/000357567] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 08/26/2013] [Accepted: 11/26/2013] [Indexed: 02/06/2023] Open
Abstract
Biological pathways provide rich information and biological context on the genetic causes of complex diseases. The logistic kernel machine test integrates prior knowledge on pathways in order to analyze data from genome-wide association studies (GWAS). In this study, the kernel converts the genomic information of 2 individuals into a quantitative value reflecting their genetic similarity. With the selection of the kernel, one implicitly chooses a genetic effect model. Like many other pathway methods, none of the available kernels accounts for the topological structure of the pathway or gene-gene interaction types. However, evidence indicates that connectivity and neighborhood of genes are crucial in the context of GWAS, because genes associated with a disease often interact. Thus, we propose a novel kernel that incorporates the topology of pathways and information on interactions. Using simulation studies, we demonstrate that the proposed method maintains the type I error correctly and can be more effective in the identification of pathways associated with a disease than non-network-based methods. We apply our approach to genome-wide association case-control data on lung cancer and rheumatoid arthritis. We identify some promising new pathways associated with these diseases, which may improve our current understanding of the genetic mechanisms.
Affiliation(s)
- Saskia Freytag
- Institute of Genetic Epidemiology, Medical School, Göttingen, Germany
|
18
|
Li Q, Hu J, Ding J, Zheng G. Fisher's method of combining dependent statistics using generalizations of the gamma distribution with applications to genetic pleiotropic associations. Biostatistics 2013; 15:284-95. [PMID: 24174580 DOI: 10.1093/biostatistics/kxt045] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/14/2022] Open
Abstract
A classical approach to combine independent test statistics is Fisher's combination of $p$-values, which follows the $\chi^2$ distribution. When the test statistics are dependent, the gamma distribution (GD) is commonly used for the Fisher's combination test (FCT). We propose to use two generalizations of the GD: the generalized and the exponentiated GDs. We study some properties of misusing the GD for the FCT to combine dependent statistics when one of the two proposed distributions is true. Our results show that both generalizations have better control of type I error rates than the GD, which tends to have inflated type I error rates at more extreme tails. In practice, common model selection criteria (e.g. Akaike information criterion/Bayesian information criterion) can be used to help select a better distribution to use for the FCT. A simple strategy of the two generalizations of the GD in genome-wide association studies is discussed. Applications of the results to genetic pleiotropic associations are described, where multiple traits are tested for association with a single marker.
Affiliation(s)
- Qizhai Li
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
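For reference, the classical independent case mentioned in the abstract above can be sketched in a few lines. With $k$ independent $p$-values, the statistic $T = -2\sum_i \ln p_i$ follows a $\chi^2$ distribution with $2k$ degrees of freedom under the null hypothesis, and for even degrees of freedom the tail probability has a closed form. The function name `fisher_combined_pvalue` is hypothetical; this sketch covers only the independent case and does not implement the gamma, generalized gamma, or exponentiated gamma fits that the cited paper studies for dependent statistics.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    T = -2 * sum(log p_i) is chi-square with 2k degrees of freedom under the
    null. For even degrees of freedom the upper tail is
    P(T > t) = exp(-t / 2) * sum_{j=0}^{k-1} (t / 2)^j / j!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_t = -sum(math.log(p) for p in pvalues)  # this is T / 2
    term = 1.0   # (T/2)^0 / 0!
    total = 1.0
    for j in range(1, k):
        term *= half_t / j
        total += term
    return math.exp(-half_t) * total


if __name__ == "__main__":
    print(fisher_combined_pvalue([0.05, 0.05]))
```

When the statistics are correlated, this $\chi^2$ reference distribution is no longer valid, which is the situation the abstract addresses.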
|
19
|
Liu J, Huang J, Ma S, Wang K. Incorporating group correlations in genome-wide association studies using smoothed group Lasso. Biostatistics 2013; 14:205-19. [PMID: 22988281 PMCID: PMC3590928 DOI: 10.1093/biostatistics/kxs034] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 10/03/2011] [Revised: 05/30/2012] [Accepted: 08/21/2012] [Indexed: 12/22/2022] Open
Abstract
In genome-wide association studies, penalization is an important approach for identifying genetic markers associated with disease. Motivated by the fact that there exists natural grouping structure in single nucleotide polymorphisms and, more importantly, such groups are correlated, we propose a new penalization method for group variable selection which can properly accommodate the correlation between adjacent groups. This method is based on a combination of the group Lasso penalty and a quadratic penalty on the difference of regression coefficients of adjacent groups. The new method is referred to as smoothed group Lasso (SGL). It encourages group sparsity and smoothes regression coefficients for adjacent groups. Canonical correlations are applied to the weights between groups in the quadratic difference penalty. We first derive a GCD algorithm for computing the solution path with linear regression model. The SGL method is further extended to logistic regression for binary response. With the assistance of the majorize-minimization algorithm, the SGL penalized logistic regression turns out to be an iteratively penalized least-square problem. We also suggest conducting principal component analysis to reduce the dimensionality within groups. Simulation studies are used to evaluate the finite sample performance. Comparison with group Lasso shows that SGL is more effective in selecting true positives. Two datasets are analyzed using the SGL method.
Affiliation(s)
- Jin Liu
- School of Public Health, Yale University, New Haven, CT 06520, USA.
|
20
|
Li Q, Li Z, Zheng G, Gao G, Yu K. Rank-based robust tests for quantitative-trait genetic association studies. Genet Epidemiol 2013; 37:358-65. [PMID: 23526350 DOI: 10.1002/gepi.21723] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 09/27/2012] [Revised: 02/18/2013] [Accepted: 02/20/2013] [Indexed: 11/06/2022]
Abstract
Standard linear regression is commonly used for genetic association studies of quantitative traits. This approach may not be appropriate if the trait, on its original or transformed scales, does not follow a normal distribution. A rank-based nonparametric approach that does not rely on any distributional assumptions can be an attractive alternative. Although several nonparametric tests exist in the literature, their performance in the genetic association setting is not well studied. We evaluate various nonparametric tests for the analysis of quantitative traits and propose a new class of nonparametric tests that have robust performance for traits with various distributions and under different genetic models. We demonstrate the advantage of our proposed methods through simulation study and real data applications.
Affiliation(s)
- Qizhai Li
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China.
|
21
|
Wu CO, Zheng G, Kwak M. A Joint Regression Analysis for Genetic Association Studies with Outcome Stratified Samples. Biometrics 2013; 69:417-26. [DOI: 10.1111/biom.12012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 07/01/2011] [Revised: 10/01/2012] [Accepted: 11/01/2012] [Indexed: 11/30/2022]
Affiliation(s)
- Colin O. Wu
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, Maryland 20892, U.S.A.
- Gang Zheng
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, Maryland 20892, U.S.A.
- Minjung Kwak
- Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, Maryland 20892, U.S.A.
|
22
|
Freytag S, Bickeböller H, Amos CI, Kneib T, Schlather M. A novel kernel for correcting size bias in the logistic kernel machine test with an application to rheumatoid arthritis. Hum Hered 2013; 74:97-108. [PMID: 23466369 DOI: 10.1159/000347188] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/20/2012] [Accepted: 01/17/2013] [Indexed: 01/07/2023] Open
Abstract
OBJECTIVES The logistic kernel machine test (LKMT) is a testing procedure tailored towards high-dimensional genetic data. Its use in pathway analyses of case-control genome-wide association studies results from its computational efficiency and flexibility in incorporating additional information via the kernel. The kernel can be any positive definite function; unfortunately, its form strongly influences the test's power and bias. Most authors have recommended the use of a simple linear kernel. We demonstrate via a simulation that the probability of rejecting the null hypothesis of no association just by chance increases with the number of SNPs or genes in the pathway when applying a simple linear kernel. METHODS We propose a novel kernel that includes an appropriate standardization in order to protect against any inflation of false positive results. Moreover, our novel kernel contains information on gene membership of SNPs in the pathway. RESULTS When applying the novel kernel to data from the North American Rheumatoid Arthritis Consortium, we find that even this basic genomic structure can improve the ability of the LKMT to identify meaningful associations. We also demonstrate that the standardization effectively eliminates problems of size bias. CONCLUSION We recommend the use of our standardized kernel and urge caution when using non-adjusted kernels in the LKMT to conduct pathway analyses.
Affiliation(s)
- Saskia Freytag
- Department of Genetic Epidemiology, Medical School, Georg-August University Göttingen, Göttingen, Germany.
|
23
|
Liu J, Wang K, Ma S, Huang J. Accounting for linkage disequilibrium in genome-wide association studies: A penalized regression method. Statistics and Its Interface 2013; 6:99-115. [PMID: 25258655 PMCID: PMC4172344 DOI: 10.4310/sii.2013.v6.n1.a10] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 06/03/2023]
Abstract
Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as LASSO do not take into account the possible linkage disequilibrium between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates linkage disequilibrium (LD) information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method. This algorithm is efficient in dealing with a large number of SNPs. A multi-split method is used to calculate the p-values of the selected SNPs for assessing their significance. We refer to the proposed penalty function as the smoothed MCP and the proposed approach as the SMCP method. Performance of the proposed SMCP method and its comparison with LASSO and MCP approaches are evaluated through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using heterogeneous stock mice data and a rheumatoid arthritis dataset.
Affiliation(s)
- Jin Liu
- School of Public Health, Yale University, New Haven, CT 06520, USA
- Kai Wang
- Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT 06520, USA
- Jian Huang
- Department of Statistics and Actuarial Science, Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA
|
24
|
Kruppa J, Ziegler A, König IR. Risk estimation and risk prediction using machine-learning methods. Hum Genet 2012; 131:1639-54. [PMID: 22752090 PMCID: PMC3432206 DOI: 10.1007/s00439-012-1194-y] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 02/28/2012] [Accepted: 06/14/2012] [Indexed: 01/02/2023]
Abstract
After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.
Affiliation(s)
- Jochen Kruppa
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Str. 1, 23562 Lübeck, Germany
|
25
|
Zheng G, Wu CO, Kwak M, Jiang W, Joo J, Lima JAC. Joint analysis of binary and quantitative traits with data sharing and outcome-dependent sampling. Genet Epidemiol 2012; 36:263-73. [PMID: 22460626 DOI: 10.1002/gepi.21619] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 10/12/2011] [Revised: 12/23/2011] [Accepted: 01/02/2012] [Indexed: 11/07/2022]
Abstract
We study the analysis of a joint association between a genetic marker with both binary (case-control) and quantitative (continuous) traits, where the quantitative trait values are only available for the cases due to data sharing and outcome-dependent sampling. Data sharing becomes common in genetic association studies, and the outcome-dependent sampling is the consequence of data sharing, under which a phenotype of interest is not measured for some subgroup. The trend test (or Pearson's test) and F-test are often, respectively, used to analyze the binary and quantitative traits. Because of the outcome-dependent sampling, the usual F-test can be applied using the subgroup with the observed quantitative traits. We propose a modified F-test by also incorporating the genotype frequencies of the subgroup whose traits are not observed. Further, a combination of this modified F-test and Pearson's test is proposed by Fisher's combination of their P-values as a joint analysis. Because of the correlation of the two analyses, we propose to use a Gamma (scaled chi-squared) distribution to fit the asymptotic null distribution for the joint analysis. The proposed modified F-test and the joint analysis can also be applied to test single trait association (either binary or quantitative trait). Through simulations, we identify the situations under which the proposed tests are more powerful than the existing ones. Application to a real dataset of rheumatoid arthritis is presented.
Affiliation(s)
- Gang Zheng
- National Heart, Lung and Blood Institute, 6701 Rockledge Drive, Bethesda, MD 20892, USA.
|
26
|
He Y, Li C, Amos CI, Xiong M, Ling H, Jin L. Accelerating haplotype-based genome-wide association study using perfect phylogeny and phase-known reference data. PLoS One 2011; 6:e22097. [PMID: 21789217 PMCID: PMC3137625 DOI: 10.1371/journal.pone.0022097] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 08/17/2010] [Accepted: 06/17/2011] [Indexed: 11/18/2022] Open
Abstract
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by prohibitive time cost for haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.
Affiliation(s)
- Yungang He
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (YH); (LJ)
- Cong Li
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- Christopher I. Amos
- Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Momiao Xiong
- Human Genetics Center, University of Texas School of Public Health, Houston, Texas, United States of America
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Hua Ling
- Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland, United States of America
- Li Jin
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- * E-mail: (YH); (LJ)
|
27
|
Alekseyenko AV, Lytkin NI, Ai J, Ding B, Padyukov L, Aliferis CF, Statnikov A. Causal graph-based analysis of genome-wide association data in rheumatoid arthritis. Biol Direct 2011; 6:25. [PMID: 21592391 PMCID: PMC3118953 DOI: 10.1186/1745-6150-6-25] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 12/13/2010] [Accepted: 05/18/2011] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease. RESULTS Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype. 
CONCLUSIONS Our experiments demonstrate that application of TIE* captures maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery. REVIEWERS This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.
Affiliation(s)
- Alexander V Alekseyenko
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY 10016, USA.
|
28
|
Yang C, Wan X, Yang Q, Xue H, Tang NLS, Yu W. A hidden two-locus disease association pattern in genome-wide association studies. BMC Bioinformatics 2011; 12:156. [PMID: 21569557 PMCID: PMC3116488 DOI: 10.1186/1471-2105-12-156] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/30/2010] [Accepted: 05/14/2011] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Recent association analyses in genome-wide association studies (GWAS) mainly focus on single-locus association tests (marginal tests) and two-locus interaction detection. These analysis methods have provided strong evidence of associations between genetic variants and complex diseases. However, there exists a type of association pattern that often occurs within local regions in the genome and is unlikely to be detected by either marginal tests or interaction tests. This association pattern involves a group of correlated single-nucleotide polymorphisms (SNPs). The correlation among SNPs can lead to weak marginal effects, and interaction does not play a role in this association pattern. This phenomenon is due to the existence of unfaithfulness: the marginal effects of correlated SNPs do not express their significant joint effects faithfully because of correlation cancellation. RESULTS In this paper, we develop a computational method to detect this association pattern masked by unfaithfulness. We have applied our method to analyze seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The analysis of each data set takes about one week to examine all pairs of SNPs. Based on the empirical results from these real data, we show that this type of association masked by unfaithfulness widely exists in GWAS. CONCLUSIONS These newly identified associations enrich the discoveries of GWAS, which may provide new insights both in the analysis of tagSNPs and in the experimental design of GWAS. Since these associations may be easily missed by existing analysis tools, we can only connect some of them to publicly available findings from other association studies. As independent data sets are limited at this moment, we also have difficulty replicating these findings. More biological implications need further investigation. AVAILABILITY The software is freely available at http://bioinformatics.ust.hk/hidden_pattern_finder.zip.
Affiliation(s)
- Can Yang
- Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong.
|
29
|
Shi G, Boerwinkle E, Morrison AC, Gu CC, Chakravarti A, Rao DC. Mining gold dust under the genome wide significance level: a two-stage approach to analysis of GWAS. Genet Epidemiol 2010; 35:111-8. [PMID: 21254218 DOI: 10.1002/gepi.20556] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 06/14/2010] [Revised: 10/27/2010] [Accepted: 11/17/2010] [Indexed: 12/14/2022]
Abstract
We propose a two-stage approach to analyze genome-wide association data in order to identify a set of promising single-nucleotide polymorphisms (SNPs). In stage one, we select a list of top signals from single SNP analyses by controlling false discovery rate. In stage two, we use the least absolute shrinkage and selection operator (LASSO) regression to reduce false positives. The proposed approach was evaluated using simulated quantitative traits based on genome-wide SNP data on 8,861 Caucasian individuals from the Atherosclerosis Risk in Communities (ARIC) Study. Our first stage, targeted at controlling false negatives, yields better power than using Bonferroni-corrected significance level. The LASSO regression reduces the number of significant SNPs in stage two: it reduces false-positive SNPs and it reduces true-positive SNPs also at simulated causal loci due to linkage disequilibrium. Interestingly, the LASSO regression preserves the power from stage one, i.e., the number of causal loci detected from the LASSO regression in stage two is almost the same as in stage one, while reducing false positives further. Real data on systolic blood pressure in the ARIC study was analyzed using our two-stage approach which identified two significant SNPs, one of which was reported to be genome-significant in a meta-analysis containing a much larger sample size. On the other hand, a single SNP association scan did not yield any significant results.
Affiliation(s)
- Gang Shi
- Division of Biostatistics, Washington University School of Medicine, Saint Louis, Missouri 63110-1093, USA.
|
30
|
An P, Mukherjee O, Chanda P, Yao L, Engelman CD, Huang CH, Zheng T, Kovac IP, Dubé MP, Liang X, Li J, de Andrade M, Culverhouse R, Malzahn D, Manning AK, Clarke GM, Jung J, Province MA. The challenge of detecting epistasis (G x G interactions): Genetic Analysis Workshop 16. Genet Epidemiol 2010; 33 Suppl 1:S58-67. [PMID: 19924703 DOI: 10.1002/gepi.20474] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/11/2022]
Abstract
Interest is increasing in epistasis as a possible source of the unexplained variance missed by genome-wide association studies. The Genetic Analysis Workshop 16 Group 9 participants evaluated a wide variety of classical and novel analytical methods for detecting epistasis, in both the statistical and machine learning paradigms, applied to both real and simulated data. Because the magnitude of epistasis is clearly relative to scale of penetrance, and therefore to some extent, to the choice of model framework, it is not surprising that strong interactions under one model might be minimized or even disappear entirely under a different modeling framework.
Affiliation(s)
- Ping An
- Division of Statistical Genomics and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108, USA
|
31
|
Abstract
The complex etiology of common diseases like cardiovascular disease, diabetes, hypertension, and rheumatoid arthritis has led investigators to focus on the genetics of correlated phenotypes and risk factors. Joint analysis of multiple disease-related phenotypes may reveal genes of pleiotropic effect and increase analytical power, but at the cost of increased analytical and computational complexity. All three data sets provided for analysis at the Genetic Analysis Workshop 16 offered multiple quantitative measures of phenotypes related to underlying disease processes as well as discrete measures of affection status. Participants in Group 6 addressed the challenges and possibilities of association analysis of these data sets on multiple levels, including phenotype definition and data reduction, multivariate approaches to gene discovery, analysis of causality and data structure, and development of predictive models. These approaches included combinations of continuous and discrete phenotypes, use of repeated measures in longitudinal data, and models that included multiple phenotypic measures and multiple single-nucleotide polymorphism variants. Most research teams regarded the use of multiple related phenotypes as a tool for increasing analytical power, as well as for clarifying the underlying biology of complex diseases.
Affiliation(s)
- Jack W Kent
- Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78245, USA.
|
32
|
Ziegler A. Genome-wide association studies: quality control and population-based measures. Genet Epidemiol 2010; 33 Suppl 1:S45-50. [PMID: 19924716 DOI: 10.1002/gepi.20472] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 11/12/2022]
Abstract
Genome-wide association studies, using hundreds of thousands of single-nucleotide polymorphism (SNP) markers, have become a standard approach for identifying disease susceptibility genes. The change in the technology poses substantial computational and statistical challenges that have been addressed in the quality control, imputation, and population-based measure groups of the Genetic Analysis Workshop 16. The computational challenges pertain to efficient memory management and computational speed of the statistical procedures, and we discuss an approach for efficient SNP storage. Accuracy and computational speed are relevant for genotype calling, and the results from a comparison of three calling algorithms are discussed. The first statistical challenge is related to statistical quality control, and we discuss two novel quality control procedures. These low-level analyses have an effect on subsequent preparatory steps for high-level analyses, e.g., the quality of genotype imputation approaches. After the conduct of a genome-wide association study with successful replication and/or validation, measures of diagnostic accuracy, including the area under the curve, are investigated. The area under the curve can be constructed from summary data in some situations. Finally, we discuss how the population-attributable risk of a genetic variant that is only measured in a reference data set can be determined.
Affiliation(s)
- Andreas Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Germany.
|
33
|
Abstract
Genome-wide association studies of discrete traits generally use simple methods of analysis based on chi-square tests for contingency tables or logistic regression, at least for an initial scan of the entire genome. Nevertheless, more power might be obtained by using various methods that analyze multiple markers in combination. Methods based on sliding windows, wavelets, Bayesian shrinkage, or penalized likelihood methods, among others, were explored by various participants of Genetic Analysis Workshop 16 Group 1 to combine information across multiple markers within a region, while others used Bayesian variable selection methods for genome-wide multivariate analyses of all markers simultaneously. Imputation can be used to fill in missing markers on individual subjects within a study or in a meta-analysis of studies using different panels. Although multiple imputation theoretically should give more robust tests of association, one participant contribution found little difference between results of single and multiple imputation. Careful control of population stratification is essential, and two contributions found that previously reported associations with two genes disappeared after more precise control. Other issues considered by this group included subgroup analysis, gene-gene interactions, and the use of biomarkers.
Affiliation(s)
- Duncan C Thomas
- Department of Preventive Medicine, University of Southern California, Los Angeles, California 90089-9011, USA.
|
34
|
Sarasua SM, Collins JS, Williamson DM, Satten GA, Allen AS. Effect of population stratification on the identification of significant single-nucleotide polymorphisms in genome-wide association studies. BMC Proc 2009; 3 Suppl 7:S13. [PMID: 20017996 PMCID: PMC2795903 DOI: 10.1186/1753-6561-3-s7-s13] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022] Open
Abstract
The North American Rheumatoid Arthritis Consortium case-control study collected case participants across the United States and control participants from New York. More than 500,000 single-nucleotide polymorphisms (SNPs) were genotyped in the sample of 2000 cases and controls. Careful adjustment for the confounding effect of population stratification must be conducted when analyzing these data; the variance inflation factor (VIF) without adjustment is 1.44. In the primary analyses of these data, a clustering algorithm in the program PLINK was used to reduce the VIF to 1.14, after which genomic control was used to control residual confounding. Here we use stratification scores to achieve a unified and coherent control for confounding. We used the first 10 principal components, calculated genome-wide using a set of 81,500 loci that had been selected to have low pair-wise linkage disequilibrium, as risk factors in a logistic model to calculate the stratification score. We then divided the data into five strata based on quantiles of the stratification score. The VIF of these stratified data is 1.04, indicating substantial control of stratification. However, after control for stratification, we find that there are no significant loci associated with rheumatoid arthritis outside of the HLA region. In particular, we find no evidence for association of TRAF1-C5 with rheumatoid arthritis.
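As a rough illustration of two quantities in this abstract, the sketch below computes an inflation factor and assigns subjects to five strata by quantiles of a stratification score. The abstract does not say how the VIF was computed; the sketch assumes the common genomic-control estimate, the median of the 1-df chi-square association statistics divided by the median of the chi-square distribution with 1 df (about 0.4549). The function names and the toy scores are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Genomic-control style estimate (assumed): observed / expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def quantile_strata(scores, n_strata=5):
    """Assign each subject to one of n_strata near-equal-sized strata by score rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    strata = [0] * n
    for rank, idx in enumerate(order):
        strata[idx] = rank * n_strata // n
    return strata


# Toy usage with made-up stratification scores for ten subjects.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
strata = quantile_strata(scores)
```

A value near 1 indicates little inflation of the test statistics; within-stratum association tests would then be combined across the five strata.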
Affiliation(s)
- Sara M Sarasua
- Department of Genetics and Biochemistry, Clemson University, 100 Jordan Hall, Clemson, South Carolina 29634-0318, USA.
|
35
|
Buil A, Martinez-Perez A, Perera-Lluna A, Rib L, Caminal P, Soria JM. A new gene-based association test for genome-wide association studies. BMC Proc 2009; 3 Suppl 7:S130. [PMID: 20017997 PMCID: PMC2795904 DOI: 10.1186/1753-6561-3-s7-s130] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023] Open
Abstract
Genome-wide association studies are widely used today to discover genetic factors that modify the risk of complex diseases. Usually, these methods work in a SNP-by-SNP fashion. We present a gene-based test that can be applied in the context of genome-wide association studies. We compare both strategies, SNP-based and gene-based, in a sample of cases and controls for rheumatoid arthritis. We obtained different results using each strategy. The SNP-based test found the PTPN22 gene while the gene-based test found the PHF19-TRAF1-C5 region. That suggests that no single strategy performs better than another in all cases and that a certain underlying genetic architecture can be delineated more easily with one strategy rather than with another.
Affiliation(s)
- Alfonso Buil
- Unitat de Genomica de Malalties Complexes, Institut de Recerca de l'Hospital de la Santa Creu i Sant Pau, Barcelona, 08025, Spain.
|
36
|
Cupples LA, Beyene J, Bickeböller H, Daw EW, Fallin MD, Gauderman WJ, Ghosh S, Goode EL, Hauser ER, Hinrichs A, Kent JW, Martin LJ, Martinez M, Neuman RJ, Province M, Szymczak S, Wilcox MA, Ziegler A, MacCluer JW, Almasy L. Genetic Analysis Workshop 16: Strategies for genome-wide association study analyses. BMC Proc 2009; 3 Suppl 7:S1. [PMID: 20017962 PMCID: PMC2795869 DOI: 10.1186/1753-6561-3-s7-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Affiliation(s)
- L Adrienne Cupples
- Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue, Boston, MA 02130 and Framingham Heart Study, Framingham, Massachusetts, USA
- Joseph Beyene
- Research Institute of the Hospital for Sick Children and University of Toronto, 555 University Avenue, Toronto, Ontario M5G 1X8, Canada
- Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Center Göttingen, Humboldtallee 32, 37073 Göttingen, Germany
- E Warwick Daw
- Division of Statistical Genomics, Washington University School of Medicine, 4444 Forest Park Boulevard, Campus Box 8506, St. Louis, Missouri 63108, USA
- M Daniele Fallin
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205, USA
- W James Gauderman
- University of Southern California, Department of Preventive Medicine, Division of Biostatistics, 1540 Alcazar Street, CHP-220, Los Angeles, California 90033, USA
- Saurabh Ghosh
- Human Genetics Unit, Indian Statistical Institute, Kolkata 700018, India
- Ellen L Goode
- Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, Minnesota 55905, USA
- Anthony Hinrichs
- Division of Statistical Genomics, Washington University School of Medicine, 4444 Forest Park Boulevard, Campus Box 8506, St. Louis, Missouri 63108, USA
- Jack W Kent
- Department of Genetics, Southwest Foundation for Biomedical Research, P.O. Box 760549, San Antonio, Texas 78245, USA
- Lisa J Martin
- Division of Biostatistics and Epidemiology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Mail Code 5041, Cincinnati, Ohio 45229, USA
- Maria Martinez
- INSERM, U.563, University Paul-Sabatier, CPTP, Toulouse F-31300, France
- Rosalind J Neuman
- Division of Statistical Genomics, Washington University School of Medicine, 4444 Forest Park Boulevard, Campus Box 8506, St. Louis, Missouri 63108, USA
- Michael Province
- Division of Statistical Genomics, Washington University School of Medicine, 4444 Forest Park Boulevard, Campus Box 8506, St. Louis, Missouri 63108, USA
- Silke Szymczak
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Strasse 1, 23562 Lübeck, Germany
- Marsha A Wilcox
- Johnson & Johnson Pharmaceutical Research and Development, 1125 Trenton-Harbourton Road, Titusville, New Jersey 08560, USA
- Andreas Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Strasse 1, 23562 Lübeck, Germany
- Jean W MacCluer
- Department of Genetics, Southwest Foundation for Biomedical Research, P.O. Box 760549, San Antonio, Texas 78245, USA
- Laura Almasy
- Department of Genetics, Southwest Foundation for Biomedical Research, P.O. Box 760549, San Antonio, Texas 78245, USA
|
37
|
Guo W, Liang CY, Lin S. Haplotype association analysis of North American Rheumatoid Arthritis Consortium data using a generalized linear model with regularization. BMC Proc 2009; 3 Suppl 7:S32. [PMID: 20018023 PMCID: PMC2795930 DOI: 10.1186/1753-6561-3-s7-s32] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/25/2022] Open
Abstract
The Genetic Analysis Workshop 16 rheumatoid arthritis data include a set of 868 cases and 1194 controls genotyped at 545,080 single-nucleotide polymorphisms (SNPs) from the Illumina 550k chip. We focus on investigating chromosomes 6 and 18, which have 35,574 and 16,450 SNPs, respectively. Association studies, including single-SNP and haplotype-based analyses, were applied to the data on those two chromosomes. Specifically, we used a generalized linear model with regularization (rGLM) approach for detecting disease-haplotype association using unphased SNP data. A total of 444 and 43 four-SNP tests were found to be significant at the Bonferroni-corrected 5% significance level on chromosomes 6 and 18, respectively.
Affiliation(s)
- Wei Guo
- Department of Statistics, The Ohio State University, Columbus, Ohio 43210 USA
- Chin-yuan Liang
- Department of Statistics, The Ohio State University, Columbus, Ohio 43210 USA
- Shili Lin
- Department of Statistics, The Ohio State University, Columbus, Ohio 43210 USA
|
38
|
Martin LJ, Gao G, Kang G, Fang Y, Woo JG. Improving the signal-to-noise ratio in genome-wide association studies. Genet Epidemiol 2009; 33 Suppl 1:S29-32. [PMID: 19924719 DOI: 10.1002/gepi.20469] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/05/2023]
Abstract
Genome-wide association studies employ hundreds of thousands of statistical tests to determine which regions of the genome may likely harbor disease-causing alleles. Such large-scale testing simultaneously requires stringent control over type I error and maintenance of sufficient power to detect true associations. These contradictory goals have led some researchers beyond Bonferroni correction of P-values to an exploration of methods to improve the detection of a few true effects in the presence of many unassociated loci. This article reviews how Genetic Analysis Workshop 16 Group 5 investigators proposed to adjust for multiple tests while simultaneously using information about the structure of the genome to improve the detection of true positives.
Affiliation(s)
- Lisa J Martin
- Division of Biostatistics and Epidemiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio 45229, USA.
|
39
|
MacCluer JW, Amos CI, Gregersen PK, Heard-Costa N, Lee M, Kraja AT, Borecki IB, Cupples LA, Almasy L. Genetic Analysis Workshop 16: introduction to workshop summaries. Genet Epidemiol 2009; 33 Suppl 1:S1-7. [PMID: 19924709 DOI: 10.1002/gepi.20464] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
Genetic Analysis Workshop 16 (GAW16) was held on September 17-20, 2008 in St. Louis, Missouri. The focus of GAW16 was on methods and challenges in analysis of single-nucleotide polymorphism data from genome-wide scans. GAW16 attracted 221 participants from 12 countries. The 168 contributions were organized into 17 discussion groups of 6-17 papers each. Three data sets were available for analysis. Two of these were data from ongoing studies, generously provided by the investigators. The North American Rheumatoid Arthritis Consortium provided case-control data on rheumatoid arthritis, and the Framingham Heart Study (FHS) made available information on cardiovascular risk factors for participants in three generations of pedigree data. The third data set included simulated phenotypes for participants in the FHS, using actual pedigree structures and genotypes. This volume includes a paper for each of the 17 discussion groups, summarizing their main findings.
Affiliation(s)
- Jean W MacCluer
- Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78227-5301, USA.
|
40
|
Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV. Machine learning in genome-wide association studies. Genet Epidemiol 2009; 33 Suppl 1:S51-7. [PMID: 19924717 DOI: 10.1002/gepi.20473] [Citation(s) in RCA: 103] [Impact Index Per Article: 6.9] [Indexed: 01/28/2023]
Affiliation(s)
- Silke Szymczak
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany.
|