1
Yang Y, Wang Q, Wang C, Buxbaum J, Ionita-Laza I. KnockoffHybrid: A knockoff framework for hybrid analysis of trio and population designs in genome-wide association studies. Am J Hum Genet 2024; 111:1448-1461. [PMID: 38821058; PMCID: PMC11267528; DOI: 10.1016/j.ajhg.2024.05.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 05/02/2024] [Accepted: 05/06/2024] [Indexed: 06/02/2024]
Abstract
Both trio and population designs are popular study designs for identifying risk genetic variants in genome-wide association studies (GWASs). The trio design, as a family-based design, is robust to confounding due to population structure, whereas the population design is often more powerful due to larger sample sizes. Here, we propose KnockoffHybrid, a knockoff-based statistical method for hybrid analysis of both the trio and population designs. KnockoffHybrid provides a unified framework that brings together the advantages of both designs and produces powerful hybrid analysis while controlling the false discovery rate (FDR) in the presence of linkage disequilibrium and population structure. Furthermore, KnockoffHybrid has the flexibility to leverage different types of summary statistics for hybrid analyses, including expression quantitative trait loci (eQTL) and GWAS summary statistics. We demonstrate in simulations that KnockoffHybrid offers power gains over non-hybrid methods for the trio and population designs with the same number of cases while controlling the FDR with complex correlation among variants and population structure among subjects. In hybrid analyses of three trio cohorts for autism spectrum disorders (ASDs) from the Autism Speaks MSSNG, Autism Sequencing Consortium, and Autism Genome Project with GWAS summary statistics from the iPSYCH project and eQTL summary statistics from the MetaBrain project, KnockoffHybrid outperforms conventional methods by replicating several known risk genes for ASDs and identifying additional associations with variants in other genes, including the PRAME family genes involved in axon guidance and which may act as common targets for human speech/language evolution and related disorders.
Affiliation(s)
- Yi Yang
  - Department of Biostatistics, City University of Hong Kong, Hong Kong SAR, China; School of Data Science, City University of Hong Kong, Hong Kong SAR, China
- Qi Wang
  - School of Data Science, City University of Hong Kong, Hong Kong SAR, China
- Chen Wang
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Joseph Buxbaum
  - Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA; Department of Statistics, Lund University, Lund, Sweden
2
Klau JH, Maj C, Klinkhammer H, Krawitz PM, Mayr A, Hillmer AM, Schumacher J, Heider D. AI-based multi-PRS models outperform classical single-PRS models. Front Genet 2023; 14:1217860. [PMID: 37441549; PMCID: PMC10335560; DOI: 10.3389/fgene.2023.1217860] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/06/2023] [Accepted: 06/13/2023] [Indexed: 07/15/2023]
Abstract
Polygenic risk scores (PRS) calculate the risk for a specific disease based on the weighted sum of associated alleles from different genetic loci in the germline estimated by regression models. Recent advances in genetics made it possible to create polygenic predictors of complex human traits, including risks for many important complex diseases, such as cancer, diabetes, or cardiovascular diseases, typically influenced by many genetic variants, each of which has a negligible effect on overall risk. In the current study, we analyzed whether adding additional PRS from other diseases to the prediction models and replacing the regressions with machine learning models can improve overall predictive performance. Results showed that multi-PRS models outperform single-PRS models significantly on different diseases. Moreover, replacing regression models with machine learning models, i.e., deep learning, can also improve overall accuracy.
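As a minimal illustration of the weighted-sum definition in this abstract, the sketch below computes a single PRS as the sum of risk-allele dosages multiplied by per-variant effect weights. The function name `polygenic_risk_score` and the toy weights and dosages are hypothetical and are not taken from the study.

```python
def polygenic_risk_score(weights, dosages):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    if len(weights) != len(dosages):
        raise ValueError("weights and dosages must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


# Hypothetical per-variant effect weights (e.g., log odds ratios from a
# regression model) and one individual's risk-allele dosages.
toy_weights = [0.1, -0.2, 0.3]
toy_dosages = [2, 0, 1]
score = polygenic_risk_score(toy_weights, toy_dosages)
```

A multi-PRS model in the sense of the abstract would feed several such scores, one per disease, into a downstream predictor; that step is not shown here.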
Affiliation(s)
- Jan Henric Klau
  - Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Carlo Maj
  - Center for Human Genetics, University of Marburg, Marburg, Germany
- Hannah Klinkhammer
  - Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany
  - Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany
- Peter M. Krawitz
  - Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany
- Andreas Mayr
  - Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany
- Axel M. Hillmer
  - Institute of Pathology, Faculty of Medicine, University of Cologne, Cologne, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
3
Yang Y, Wang C, Liu L, Buxbaum J, He Z, Ionita-Laza I. KnockoffTrio: A knockoff framework for the identification of putative causal variants in genome-wide association studies with trio design. Am J Hum Genet 2022; 109:1761-1776. [PMID: 36150388; PMCID: PMC9606389; DOI: 10.1016/j.ajhg.2022.08.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2022] [Accepted: 08/24/2022] [Indexed: 01/25/2023]
Abstract
Family-based designs can eliminate confounding due to population substructure and can distinguish direct from indirect genetic effects, but these designs are underpowered due to limited sample sizes. Here, we propose KnockoffTrio, a statistical method to identify putative causal genetic variants for father-mother-child trio design built upon a recently developed knockoff framework in statistics. KnockoffTrio controls the false discovery rate (FDR) in the presence of arbitrary correlations among tests and is less conservative and thus more powerful than the conventional methods that control the family-wise error rate via Bonferroni correction. Furthermore, KnockoffTrio is not restricted to family-based association tests and can be used in conjunction with more powerful, potentially nonlinear models to improve the power of standard family-based tests. We show, using empirical simulations, that KnockoffTrio can prioritize causal variants over associations due to linkage disequilibrium and can provide protection against confounding due to population stratification. In applications to 14,200 trios from three study cohorts for autism spectrum disorders (ASDs), including AGP, SPARK, and SSC, we show that KnockoffTrio can identify multiple significant associations that are missed by conventional tests applied to the same data. In particular, we replicate known ASD association signals with variants in several genes such as MACROD2, NRXN1, PRKAR1B, CADM2, PCDH9, and DOCK4 and identify additional associations with variants in other genes including ARHGEF10, SLC28A1, ZNF589, and HINT1 at FDR 10%.
Affiliation(s)
- Yi Yang
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA; Department of Biostatistics, City University of Hong Kong, Hong Kong SAR, China; School of Data Science, City University of Hong Kong, Hong Kong SAR, China
- Chen Wang
  - Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Linxi Liu
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Joseph Buxbaum
  - Departments of Psychiatry, Neuroscience, and Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Zihuai He
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA 94305, USA; Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA 94305, USA
4
Wan JY, Goodman DL, Willems EL, Freedland AR, Norden-Krichmar TM, Santorico SA, Edwards KL. Genome-wide association analysis of metabolic syndrome quantitative traits in the GENNID multiethnic family study. Diabetol Metab Syndr 2021; 13:59. [PMID: 34074324; PMCID: PMC8170963; DOI: 10.1186/s13098-021-00670-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/19/2021] [Accepted: 04/28/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: To identify genetic associations of quantitative metabolic syndrome (MetS) traits and characterize heterogeneity across ethnic groups. METHODS: Data were collected from GENetics of Noninsulin dependent Diabetes Mellitus (GENNID), a multiethnic resource of Type 2 diabetic families, and included 1520 subjects in 259 African-American, European-American, Japanese-American, and Mexican-American families. We focused on eight MetS traits: weight, waist circumference, systolic and diastolic blood pressure, high-density lipoprotein, triglycerides, fasting glucose, and insulin. Using genotyped and imputed data from Illumina's Multiethnic array, we conducted genome-wide association analyses with linear mixed models for all ethnicities, except for the smaller Japanese-American group, where we used additive genetic models with gene-dropping. RESULTS: Findings included ethnic-specific genetic associations and heterogeneity across ethnicities. Most significant associations were outside our candidate linkage regions and were coincident within a gene or intergenic region, with two exceptions in European-American families: (a) within a previously identified linkage region on chromosome 2, two significant GLI2-TFCP2L1 associations with weight, and (b) one chromosome 11 variant near CADM1-LINC00900 with pleiotropic blood pressure effects. CONCLUSIONS: This multiethnic family study found genetic heterogeneity and coincident associations (with one case of pleiotropy), highlighting the importance of including diverse populations in genetic research and illustrating the complex genetic architecture underlying MetS.
Affiliation(s)
- Jia Y Wan
  - Department of Epidemiology and Biostatistics, Program in Public Health, University of California, 635 E. Peltason Dr, Mail Code: 7550, Irvine, CA, 92697, USA
- Deborah L Goodman
  - Department of Epidemiology and Biostatistics, Program in Public Health, University of California, 635 E. Peltason Dr, Mail Code: 7550, Irvine, CA, 92697, USA
- Emileigh L Willems
  - Department of Mathematical and Statistical Sciences, University of Colorado, Denver, CO, USA
- Alexis R Freedland
  - Department of Epidemiology and Biostatistics, Program in Public Health, University of California, 635 E. Peltason Dr, Mail Code: 7550, Irvine, CA, 92697, USA
- Trina M Norden-Krichmar
  - Department of Epidemiology and Biostatistics, Program in Public Health, University of California, 635 E. Peltason Dr, Mail Code: 7550, Irvine, CA, 92697, USA
- Stephanie A Santorico
  - Department of Mathematical and Statistical Sciences, University of Colorado, Denver, CO, USA
  - Human Medical Genetics and Genomics Program, University of Colorado, Denver, CO, USA
  - Department of Biostatistics & Informatics, University of Colorado, Denver, CO, USA
  - Division of Biomedical Informatics & Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
- Karen L Edwards
  - Department of Epidemiology and Biostatistics, Program in Public Health, University of California, 635 E. Peltason Dr, Mail Code: 7550, Irvine, CA, 92697, USA
5
Nyangiri OA, Edwige SA, Koffi M, Mewamba E, Simo G, Namulondo J, Mulindwa J, Nassuuna J, Elliott A, Karume K, Mumba D, Corstjens P, Casacuberta-Partal M, van Dam G, Bucheton B, Noyes H, Matovu E. Candidate gene family-based and case-control studies of susceptibility to high Schistosoma mansoni worm burden in African children: a protocol. AAS Open Res 2021; 4:36. [PMID: 35252746; PMCID: PMC8861467; DOI: 10.12688/aasopenres.13203.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 06/04/2021] [Indexed: 11/20/2022]
Abstract
Background: Approximately 25% of the risk of Schistosoma mansoni is associated with host genetic variation. We will test 24 candidate genes, mainly in the Th2 and Th17 pathways, for association with S. mansoni infection intensity in four African countries, using family-based and case-control approaches. Methods: Children aged 5-15 years will be recruited in S. mansoni endemic areas of Ivory Coast, Cameroon, Uganda and the Democratic Republic of Congo (DRC). We will use family-based (study 1) and case-control (study 2) designs. Study 1 will take place in Ivory Coast, Cameroon, Uganda and the DRC. We aim to recruit 100 high worm burden families from each country except Uganda, where a previous study recruited at least 40 families. For phenotyping, cases will be defined as the 20% of children in each community with heaviest worm burdens as measured by the circulating cathodic antigen (CCA) assay. Study 2 will take place in Uganda. We will recruit 500 children in a highly endemic community. For phenotyping, cases will be defined as the 20% of children with heaviest worm burdens as measured by the CAA assay, while controls will be the 20% of infected children with the lightest worm burdens. Deoxyribonucleic acid (DNA) will be genotyped on the Illumina H3Africa SNP (single nucleotide polymorphism) chip and genotypes will be converted to sets of haplotypes that span the gene region for analysis. We have selected 24 genes for genotyping that are mainly in the Th2 and Th17 pathways and that have variants that have been demonstrated to be or could be associated with Schistosoma infection intensity. Analysis: In the family-based design, we will identify SNP haplotypes disproportionately transmitted to children with high worm burden. Case-control analysis will detect overrepresentation of haplotypes in extreme phenotypes with correction for relatedness by using whole genome principal components.
Affiliation(s)
- Oscar A. Nyangiri
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Sokouri A. Edwige
  - Université Jean Lorougnon Guédé (UJLoG) de Daloa, Daloa, Cote d'Ivoire
- Mathurin Koffi
  - Université Jean Lorougnon Guédé (UJLoG) de Daloa, Daloa, Cote d'Ivoire
- Estelle Mewamba
  - Faculty of Science, University of Dschang, Dschang, Cameroon
- Gustave Simo
  - Faculty of Science, University of Dschang, Dschang, Cameroon
- Joyce Namulondo
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Julius Mulindwa
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
- Jacent Nassuuna
  - Medical Research Council/Uganda Virus Research Institute and London School of Hygiene & Tropical Medicine Uganda Research Unit, Entebbe, Uganda
- Alison Elliott
  - Medical Research Council/Uganda Virus Research Institute and London School of Hygiene & Tropical Medicine Uganda Research Unit, Entebbe, Uganda
  - London School of Hygiene and Tropical Medicine, London, WC1E, UK
- Kévin Karume
  - Institut National de Recherche Biomedicale, Kinshasa, Democratic Republic of the Congo
- Dieudonne Mumba
  - Institut National de Recherche Biomedicale, Kinshasa, Democratic Republic of the Congo
- P.L.A.M Corstjens
  - Dept. of Cell and Chemical Biology, Leiden University Medical Center, Leiden, The Netherlands
- G.J. van Dam
  - Dept. of Parasitology, Leiden University Medical Center, Leiden, The Netherlands
- Bruno Bucheton
  - Institut de Recherche pour le Développement (IRD), IRD-CIRAD, Montpellier, France
- Harry Noyes
  - Centre for Genomic Research, University of Liverpool, Liverpool, UK
- Enock Matovu
  - College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, Kampala, Uganda
6
Nyangiri OA, Edwige SA, Koffi M, Mewamba E, Simo G, Namulondo J, Mulindwa J, Nassuuna J, Elliott A, Karume K, Mumba D, Corstjens P, Casacuberta-Partal M, van Dam G, Bucheton B, Noyes H, Matovu E. Candidate gene family-based and case-control studies of susceptibility to high Schistosoma mansoni worm burden in African children: a protocol. AAS Open Res 2021; 4:36. [PMID: 35252746; PMCID: PMC8861467; DOI: 10.12688/aasopenres.13203.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/15/2021] [Indexed: 11/20/2022]
7
Kanzi AM, San JE, Chimukangara B, Wilkinson E, Fish M, Ramsuran V, de Oliveira T. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Front Genet 2020; 11:544162. [PMID: 33193618; PMCID: PMC7649788; DOI: 10.3389/fgene.2020.544162] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 03/20/2020] [Accepted: 09/21/2020] [Indexed: 12/29/2022]
Abstract
Mendelian and complex genetic trait diseases continue to burden society both socially and economically. The lack of effective tests has hampered diagnosis; thus, affected individuals lack a proper prognosis. Mendelian diseases are caused by genetic mutations in a single gene, while complex trait diseases are caused by the accumulation of mutations in either linked or unlinked genomic regions. Significant advances have been made in identifying novel disease-associated mutations, especially with the introduction of next generation and third generation sequencing. Regardless, some diseases are still without diagnosis, as most tests rely on SNP genotyping panels developed from population-based genetic analyses. Analysis of family genetic inheritance using whole genomes, whole exomes or a panel of genes has been shown to be effective in identifying disease-causing mutations. In this review, we discuss next generation and third generation sequencing platforms, bioinformatic tools and genetic resources commonly used to analyze family-based genomic data, with a focus on identifying inherited or novel disease-causing mutations. We also highlight the analytical, ethical and regulatory challenges associated with analyzing personal genomes, which constitute the data used for family genetic inheritance.
Affiliation(s)
- Aquillah M. Kanzi
  - Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
8
Khan MI, CS P. Case-Parent Trio Studies in Cleft Lip and Palate. Glob Med Genet 2020; 7:75-79. [PMID: 33392609; PMCID: PMC7772012; DOI: 10.1055/s-0040-1722097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Cleft lip with or without cleft palate (CL/P) is one of the most common congenital malformations in humans involving various genetic and environmental risk factors. The prevalence of CL/P varies according to geographical location, ethnicity, race, gender, and socioeconomic status, affecting approximately 1 in 800 live births worldwide. Genetic studies aim to understand the mechanisms contributory to a phenotype by measuring the association between genetic variants and also between genetic variants and phenotype population. Genome-wide association studies are standard tools used to discover genetic loci related to a trait of interest. Genetic association studies are generally divided into two main design types: population-based studies and family-based studies. The epidemiological population-based studies comprise unrelated individuals that directly compare the frequency of genetic variants between (usually independent) cases and controls. The alternative to population-based studies (case-control designs) includes various family-based study designs that comprise related individuals. An example of such a study is a case-parent trio design study, which is commonly employed in genetics to identify the variants underlying complex human disease where transmission of alleles from parents to offspring is studied. This article describes the fundamentals of case-parent trio study, trio design and its significances, statistical methods, and limitations of the trio studies.
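The abstract does not name a specific statistic, but one widely used test for case-parent trios is the transmission disequilibrium test (TDT). It compares how often heterozygous parents transmit a given allele to the affected child with how often they do not. The sketch below assumes the transmission counts are already tallied. The function name `tdt` and the counts are hypothetical.

```python
import math


def tdt(transmitted, untransmitted):
    """McNemar-type TDT statistic and its p-value (chi-square, 1 df).

    transmitted / untransmitted: number of heterozygous parents who did /
    did not pass the allele of interest to the affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
stat, p_value = tdt(60, 40)
```

Because the comparison is made within families, the test is not confounded by population stratification, which is the main advantage of the trio design described above.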
Affiliation(s)
- Mahamad Irfanulla Khan
  - Department of Orthodontics & Dentofacial Orthopedics, The Oxford Dental College, Bangalore, Karnataka, India
- Prashanth CS
  - Department of Orthodontics & Dentofacial Orthopedics, DAPM RV Dental College, Bangalore, Karnataka, India
9
Yang JJ, Williams LK, Buu A. Identifying pleiotropic genes in genome-wide association studies from related subjects using the linear mixed model and Fisher combination function. BMC Bioinformatics 2017; 18:376. [PMID: 28836938; PMCID: PMC5571642; DOI: 10.1186/s12859-017-1791-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/29/2016] [Accepted: 08/15/2017] [Indexed: 11/11/2022]
Abstract
Background A multivariate genome-wide association test is proposed for analyzing data on multivariate quantitative phenotypes collected from related subjects. The proposed method is a two-step approach. The first step models the association between the genotype and marginal phenotype using a linear mixed model. The second step uses the correlation between residuals of the linear mixed model to estimate the null distribution of the Fisher combination test statistic. Results The simulation results show that the proposed method controls the type I error rate and is more powerful than the marginal tests across different population structures (admixed or non-admixed) and relatedness (related or independent). The statistical analysis on the database of the Study of Addiction: Genetics and Environment (SAGE) demonstrates that applying the multivariate association test may facilitate identification of the pleiotropic genes contributing to the risk for alcohol dependence commonly expressed by four correlated phenotypes. Conclusions This study proposes a multivariate method for identifying pleiotropic genes while adjusting for cryptic relatedness and population structure between subjects. The two-step approach is not only powerful but also computationally efficient even when the number of subjects and the number of phenotypes are both very large.
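For orientation, the classical Fisher combination statistic is sketched below for the textbook case of independent p-values, where it follows a chi-square distribution with 2k degrees of freedom. This is not the method proposed in the paper, which estimates the null distribution from the correlation between linear mixed model residuals. The function name `fisher_combination` and the inputs are hypothetical.

```python
import math


def fisher_combination(p_values):
    """Fisher's statistic -2 * sum(ln p) and its p-value under independence."""
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    # Chi-square survival function with 2k degrees of freedom has the closed
    # form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term, tail = 1.0, 0.0
    for i in range(k):
        tail += term
        term *= half / (i + 1)
    return stat, math.exp(-half) * tail


# Two hypothetical marginal p-values for the same variant.
stat, combined_p = fisher_combination([0.05, 0.05])
```

When the marginal tests are positively correlated, as with correlated phenotypes, the independence reference distribution is anti-conservative, which is why the abstract estimates the null distribution instead.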
Affiliation(s)
- James J Yang
  - School of Nursing, University of Michigan, Ann Arbor, 48104, Michigan, USA
- L Keoki Williams
  - Department of Internal Medicine, Henry Ford Health System, Detroit, 48202, Michigan, USA
  - The Center for Health Policy and Health Services Research, Henry Ford Health System, Detroit, 48202, Michigan, USA
- Anne Buu
  - Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, 48104, Michigan, USA
10
Identifying Pleiotropic Genes in Genome-Wide Association Studies for Multivariate Phenotypes with Mixed Measurement Scales. PLoS One 2017; 12:e0169893. [PMID: 28081206; PMCID: PMC5231271; DOI: 10.1371/journal.pone.0169893] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/12/2016] [Accepted: 12/22/2016] [Indexed: 11/30/2022]
Abstract
We propose a multivariate genome-wide association test for mixed continuous, binary, and ordinal phenotypes. A latent response model is used to estimate the correlation between phenotypes with different measurement scales so that the empirical distribution of the Fisher’s combination statistic under the null hypothesis is estimated efficiently. The simulation study shows that our proposed correlation estimation methods have high levels of accuracy. More importantly, our approach conservatively estimates the variance of the test statistic so that the type I error rate is controlled. The simulation also shows that the proposed test maintains the power at the level very close to that of the ideal analysis based on known latent phenotypes while controlling the type I error. In contrast, conventional approaches–dichotomizing all observed phenotypes or treating them as continuous variables–could either reduce the power or employ a linear regression model unfit for the data. Furthermore, the statistical analysis on the database of the Study of Addiction: Genetics and Environment (SAGE) demonstrates that conducting a multivariate test on multiple phenotypes can increase the power of identifying markers that may not be, otherwise, chosen using marginal tests. The proposed method also offers a new approach to analyzing the Fagerström Test for Nicotine Dependence as multivariate phenotypes in genome-wide association studies.
11
Abstract
This chapter describes the main issues that genetic epidemiologists usually consider in the design of linkage and association studies. For linkage, we briefly consider the situation of rare highly penetrant alleles showing a disease pattern consistent with Mendelian inheritance investigated through parametric methods in large pedigrees, or with autozygosity mapping in inbred families, and we then turn our focus to the most common design, the affected sibling pair design that is of more relevance for common, complex diseases. Power and sample size calculations are provided as a function of the strength of the genetic effect being investigated. We also discuss the impact of other determinants of statistical power such as disease heterogeneity, pedigree and genotyping errors and the effect of the type and density of genetic markers. For association studies, we consider the popular case-control design for dichotomous phenotypes and we provide power and sample size calculations for one-stage and multistage designs. For candidate genes, guidelines are given on the prioritization of genetic variants, and for genome-wide association studies (GWAS) the issue of choosing an appropriate SNP array is discussed. A warning is issued regarding the danger of designing an underpowered replication study following an initial GWAS. The risk of finding spurious association due to population stratification, cryptic relatedness, and differential bias is underlined.
Affiliation(s)
- Jérémie Nsengimana
  - Section of Epidemiology and Biostatistics, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- D Timothy Bishop
  - Section of Epidemiology and Biostatistics, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
12
Family-based association study of HLA class II with type 1 diabetes in Moroccans. Pathol Biol (Paris) 2014; 63:80-4. [PMID: 25555495; DOI: 10.1016/j.patbio.2014.12.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/02/2014] [Accepted: 12/01/2014] [Indexed: 12/28/2022]
Abstract
BACKGROUND T1D is a multifactorial disease with strong genetic control. The human leukocyte antigen (HLA) system plays a crucial role in the autoimmune process leading to childhood diabetes. About 440,000 of the world's children (1.8 billion children under 14 years of age) have type 1 diabetes, and each year an additional 70,000 develop this disorder. The objective of this study was to investigate the distribution of HLA class II in Moroccan families of diabetic children to identify susceptibility alleles in the Moroccan population. SUBJECTS AND METHODS We included in this study Moroccan families with at least one child with T1D. The age of onset of diabetes was less than 15 years. HLA class II (DRB1* and DQB1*) typing was carried out by molecular biology techniques (PCR-SSP and PCR-SSO). The FBAT test (family-based association test) was used to highlight the association between T1D and the HLA-DRB1* and -DQB1* polymorphism. RESULTS The association of HLA class II (DRB1*, DQB1*) with type 1 diabetes was analyzed in fifty-one Moroccan families, including 90 diabetics. The results revealed that the most susceptible haplotypes are DRB1*03:01-DQB1*02:01 and DRB1*04:05-DQB1*03:02 (Z=3.674, P=0.000239; Z=2.828, P=0.004678, respectively), and that the most protective haplotype is DRB1*15-DQB1*06. CONCLUSION This is the first family-based association study searching for an association between HLA class II and T1D in a Moroccan population. Despite the different ethnic groups forming Morocco, Moroccan diabetics share the most susceptible and protective HLA haplotypes with other Caucasian populations, specifically the European and Mediterranean populations.
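The P values quoted in this abstract can be related to the FBAT Z statistics through the standard normal distribution. The following is a minimal sketch, assuming a two-sided test (the abstract does not state the sidedness, and the function name is illustrative only):

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a Z statistic under a standard normal null."""
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


# Z statistics reported in the abstract for the two susceptibility haplotypes
for z in (3.674, 2.828):
    print(f"Z = {z}: two-sided P = {two_sided_p_from_z(z):.6f}")
```

Under this assumption the computed values agree with the reported P values up to the rounding of Z.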
|
13
|
Fardo DW, Zhang X, Ding L, He H, Kurowski B, Alexander ES, Mersha TB, Pilipenko V, Kottyan L, Nandakumar K, Martin L. On family-based genome-wide association studies with large pedigrees: observations and recommendations. BMC Proc 2014; 8:S26. [PMID: 25519377 PMCID: PMC4143718 DOI: 10.1186/1753-6561-8-s1-s26] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
Family based association studies are employed less often than case-control designs in the search for disease-predisposing genes. The optimal statistical genetic approach for complex pedigrees is unclear when evaluating both common and rare variants. We examined the empirical power and type I error rates of 2 common approaches, the measured genotype approach and family-based association testing, through simulations from a set of multigenerational pedigrees. Overall, these results suggest that much larger sample sizes will be required for family-based studies and that power was better using MGA compared to FBAT. Taking into account computational time and potential bias, a 2-step strategy is recommended with FBAT followed by MGA.
Affiliation(s)
- David W Fardo
- Department of Biostatistics, University of Kentucky College of Public Health, 111 Washington Ave, Lexington, KY 40536, USA
- Xue Zhang
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
- Lili Ding
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH 45229, USA
- Hua He
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
- Brad Kurowski
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH 45229, USA
- Eileen S Alexander
- Department of Environmental Health, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH 45229, USA
- Tesfaye B Mersha
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH 45229, USA
- Valentina Pilipenko
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
- Leah Kottyan
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
- Kannabiran Nandakumar
- Department of Biostatistics, University of Kentucky College of Public Health, 111 Washington Ave, Lexington, KY 40536, USA
- Lisa Martin
- Department of Pediatrics, Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati College of Medicine, 2600 Clifton Ave, Cincinnati, OH 45229, USA
|
14
|
Yu Z, Gillen D, Li CF, Demetriou M. Incorporating parental information into family-based association tests. Biostatistics 2012; 14:556-72. [PMID: 23266418 DOI: 10.1093/biostatistics/kxs048] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Assumptions regarding the true underlying genetic model, or mode of inheritance, are necessary when quantifying genetic associations with disease phenotypes. Here we propose new methods to ascertain the underlying genetic model from parental data in family-based association studies. Specifically, for parental mating-type data, we propose a novel statistic to test whether the underlying genetic model is additive, dominant, or recessive; for parental genotype-phenotype data, we propose three strategies to determine the true mode of inheritance. We illustrate how to incorporate the information gleaned from these strategies into family-based association tests. Because family-based association tests are conducted conditional on parental genotypes, the type I error rate of these procedures is not inflated by the information learned from parental data. This result holds even if such information is weak or when the assumption of Hardy-Weinberg equilibrium is violated. Our simulations demonstrate that incorporating parental data into family-based association tests can improve power under common inheritance models. The application of our proposed methods to a candidate-gene study of type 1 diabetes successfully detects a recessive effect in MGAT5 that would otherwise be missed by conventional family-based association tests.
Affiliation(s)
- Zhaoxia Yu
- Department of Statistics, University of California at Irvine, Irvine, CA 92697, USA.
|
15
|
Statistical Challenges in Sequence-Based Association Studies with Population- and Family-Based Designs. STATISTICS IN BIOSCIENCES 2012. [DOI: 10.1007/s12561-012-9062-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
16
|
Abstract
This chapter describes the main issues that genetic epidemiologists usually consider in the design of linkage and association studies. For linkage, we briefly consider the situation of rare, highly penetrant alleles showing a disease pattern consistent with Mendelian inheritance investigated through parametric methods in large pedigrees or with autozygosity mapping in inbred families, and we then turn our focus to the most common design, affected sibling pairs, of more relevance for common, complex diseases. Theoretical and more practical power and sample size calculations are provided as a function of the strength of the genetic effect being investigated. We also discuss the impact of other determinants of statistical power such as disease heterogeneity, pedigree, and genotyping errors, as well as the effect of the type and density of genetic markers. Linkage studies should be as large as possible to have sufficient power in relation to the expected genetic effect size. Segregation analysis, a formal statistical technique to describe the underlying genetic susceptibility, may assist in the estimation of the relevant parameters to apply, for instance. However, segregation analyses estimate the total genetic component rather than a single-locus effect. Locus heterogeneity should be considered when power is estimated and at the analysis stage, i.e. assuming a smaller locus effect than the total genetic component from segregation studies. Disease heterogeneity should be minimised by considering subtypes if they are well defined or by otherwise collecting known sources of heterogeneity and adjusting for them as covariates; the power will depend upon the relationship between the disease subtype and the underlying genotypes. Ultimately, identifying susceptibility alleles of modest effects (e.g. RR≤1.5) requires a number of families that seems unfeasible in a single study. 
Meta-analysis and data pooling between different research groups can provide a sizeable study, but both approaches require an even higher level of vigilance about locus and disease heterogeneity when data come from different populations. All necessary steps should be taken to minimise pedigree and genotyping errors at the study design stage as they are, for the most part, due to human factors. A two-stage design is more cost-effective than one stage when using short tandem repeats (STRs). However, dense single-nucleotide polymorphism (SNP) arrays offer a more robust alternative, and due to their lower cost per unit, the total cost of studies using SNPs may in the future become comparable to that of studies using STRs in one or two stages. For association studies, we consider the popular case-control design for dichotomous phenotypes, and we provide power and sample size calculations for one-stage and multistage designs. For candidate genes, guidelines are given on the prioritisation of genetic variants, and for genome-wide association studies (GWAS), the issue of choosing an appropriate SNP array is discussed. A warning is issued regarding the danger of designing an underpowered replication study following an initial GWAS. The risk of finding spurious association due to population stratification, cryptic relatedness, and differential bias is underlined. GWAS have a high power to detect common variants of high or moderate effect. For weaker effects (e.g. relative risk<1.2), the power is greatly reduced, particularly for recessive loci. While sample sizes of 10,000 or 20,000 cases are not beyond reach for most common diseases, only meta-analyses and data pooling can allow attaining a study size of this magnitude for many other diseases. It is acknowledged that detecting the effects from rare alleles (i.e. frequency<5%) is not feasible in GWAS, and it is expected that novel methods and technology, such as next-generation resequencing, will fill this gap. 
At the current stage, the choice of which GWAS SNP array to use does not influence the power in populations of European ancestry. A multistage design reduces the study cost but has less power than the standard one-stage design. If one opts for a multistage design, the power can be improved by jointly analysing the data from different stages for the SNPs they share. The estimates of locus contribution to disease risk from genome-wide scans are often biased, and relying on them might result in an underpowered replication study. Population structure has so far caused fewer spurious associations than initially feared, thanks to systematic ethnicity matching and application of standard quality control measures. Differential bias could be a more serious threat and must be minimised by strictly controlling all the aspects of DNA acquisition, storage, and processing.
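The dependence of case-control power on effect size and sample size described in this abstract can be illustrated with a rough calculation. The sketch below is not the chapter's own method: it assumes a per-allele (multiplicative) odds ratio, a two-sided allelic test with a normal approximation, and a genome-wide threshold of 5e-8. The function name and all example values are hypothetical.

```python
from statistics import NormalDist


def allelic_power(n_cases, n_controls, control_freq, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    norm = NormalDist()
    p0 = control_freq
    # Risk-allele frequency in cases implied by a per-allele odds ratio
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    # Each subject contributes two alleles to the frequency estimate
    se = (p1 * (1.0 - p1) / (2 * n_cases)
          + p0 * (1.0 - p0) / (2 * n_controls)) ** 0.5
    ncp = (p1 - p0) / se
    z_crit = -norm.inv_cdf(alpha / 2.0)
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


# Hypothetical scenario: risk-allele frequency 0.3, equal numbers of cases and controls
for n in (2_000, 10_000, 20_000):
    for odds_ratio in (1.1, 1.2, 1.5):
        power = allelic_power(n, n, 0.3, odds_ratio)
        print(f"N = {n:>6} per group, OR = {odds_ratio}: power = {power:.3f}")
```

Under these assumptions the approximate power rises steeply with sample size for weak effects, in line with the abstract's point that detecting relative risks near 1.2 calls for samples on the order of 10,000 cases or more.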
Affiliation(s)
- Jérémie Nsengimana
- Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK.
|
17
|
Alkelai A, Lupoli S, Greenbaum L, Giegling I, Kohn Y, Sarner-Kanyas K, Ben-Asher E, Lancet D, Rujescu D, Macciardi F, Lerer B. Identification of new schizophrenia susceptibility loci in an ethnically homogeneous, family-based, Arab-Israeli sample. FASEB J 2011; 25:4011-23. [PMID: 21795503 DOI: 10.1096/fj.11-184937] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
While the use of population-based samples is a common strategy in genome-wide association studies (GWASs), family-based samples have considerable advantages, such as robustness against population stratification and false-positive associations, better quality control, and the possibility to check for both linkage and association. In a genome-wide linkage study of schizophrenia in Arab-Israeli families with multiple affected individuals, we previously reported significant evidence for a susceptibility locus at chromosome 6q23.2-q24.1 and suggestive evidence at chromosomes 10q22.3-26.3, 2q36.1-37.3 and 7p21.1-22.3. To identify schizophrenia susceptibility genes, we applied a family-based GWAS strategy in an enlarged, ethnically homogeneous, Arab-Israeli family sample. We performed genome-wide single nucleotide polymorphism (SNP) genotyping and single SNP transmission disequilibrium test association analysis and found genome-wide significant association (best value of P=1.22×10^-11) for 8 SNPs within or near highly reasonable functional candidate genes for schizophrenia. Of particular interest are a group of SNPs within and flanking the transcriptional factor LRRFIP1 gene. To determine replicability of the significant associations beyond the Arab-Israeli population, we studied the association of the significant SNPs in a German case-control validation sample and found replication of associations near the UGT1 subfamily and EFHD1 genes. Applying an exploratory homozygosity mapping approach as a complementary strategy to identify schizophrenia susceptibility genes in our Arab-Israeli sample, we identified 8 putative disease loci. Overall, this GWAS, which emphasizes the important contribution of family-based studies, identifies promising candidate genes for schizophrenia.
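For readers unfamiliar with the single-SNP transmission disequilibrium test mentioned in this abstract, the classical biallelic form compares how often heterozygous parents transmit versus do not transmit an allele to affected offspring. This is a minimal sketch of that textbook statistic, not the authors' pipeline; the counts are hypothetical.

```python
import math


def tdt(transmitted: int, untransmitted: int):
    """Classical TDT: McNemar-type chi-square (1 df) and its P value.

    Counts refer to transmissions of the test allele from heterozygous
    parents to affected offspring.
    """
    b, c = transmitted, untransmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X >= x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only
chi2, p_value = tdt(transmitted=60, untransmitted=30)
print(f"chi-square = {chi2:.2f}, P = {p_value:.3g}")
```

Because only transmissions within families are compared, the statistic is insensitive to allele-frequency differences between subpopulations, which is the robustness to stratification the abstract refers to.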
Affiliation(s)
- Anna Alkelai
- Biological Psychiatry Laboratory, Department of Psychiatry, Hadassah-Hebrew University Medical Center, Jerusalem, Israel
|
18
|
Van Steen K. Perspectives on genome-wide multi-stage family-based association studies. Stat Med 2011; 30:2201-21. [DOI: 10.1002/sim.4259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2010] [Accepted: 03/07/2011] [Indexed: 01/03/2023]
|
19
|
Pan Y, Wang KS, Aragam N. NTM and NR3C2 polymorphisms influencing intelligence: family-based association studies. Prog Neuropsychopharmacol Biol Psychiatry 2011; 35:154-60. [PMID: 21036197 DOI: 10.1016/j.pnpbp.2010.10.016] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/04/2010] [Revised: 10/05/2010] [Accepted: 10/22/2010] [Indexed: 11/26/2022]
Abstract
Family, twin, and adoption studies have indicated that human intelligence quotient (IQ) has significant genetic components. We performed a low-density genome-wide association analysis with a family-based association test to identify genetic variants influencing IQ, as measured by Wechsler Adult Intelligence Scale full-score IQ (FSIQ). We examined 11,120 single-nucleotide polymorphisms (SNPs) from the Affymetrix GeneChips 10K mapping array genotyped in 292 nuclear families from Genetic Analysis Workshop 14, a subset from the Collaborative Study on the Genetics of Alcoholism (COGA). A replication analysis was performed using part of the International Multi-Center ADHD Genetics Project (IMAGE) dataset. Twenty-two SNPs were identified as having suggestive associations with IQ (p<10^-3) in the COGA sample, and eleven of the SNPs were located within known genes. In particular, NTM at 11q25 (rs411280, p = 0.000764) and NR3C2 at 4q31.1 (rs3846329, p = 0.000675) were two novel genes which have not been associated with IQ in other studies. It has been reported that NTM might play a role in late-onset Alzheimer disease while NR3C2 may be associated with cognitive function and major depression. The associations of these two genes were well-replicated by single-marker and haplotype analyses in the IMAGE sample. In conclusion, our findings provide evidence that chromosome regions of 11q25 and 4q31.1 contain genes affecting IQ. This study will serve as a resource for replication in other populations.
Affiliation(s)
- Yue Pan
- Department of Mathematics and Statistics, College of Arts and Sciences, East Tennessee State University, Johnson City, TN 37614, USA
|
20
|
Thomas DC, Casey G, Conti DV, Haile RW, Lewinger JP, Stram DO. Methodological Issues in Multistage Genome-wide Association Studies. Stat Sci 2009; 24:414-429. [PMID: 20607129 DOI: 10.1214/09-sts288] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of "promising" SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design to obtain comparable levels of overall type I error and power by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage, the optimal design depending primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed ORs of current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a "replication" panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly-significant associations has been discovered, a truly independent "exact replication" study is needed in a similar population of the same promising SNPs using similar methods. This can then be followed by (1) "generalizability" studies to assess the full scope of replicated associations across different races, different endpoints, different interactions, etc.; (2) fine-mapping or re-sequencing to try to identify the causal variant; and (3) experimental studies of the biological function of these genes. 
Multistage sampling designs may be more useful at this stage, say for selecting subsets of subjects for deep re-sequencing of regions identified in the GWAS.
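The cost argument for the two-stage design in this abstract can be made concrete with simple arithmetic. The sketch below assumes cost is proportional to the number of genotypes and that a stage II genotype costs a fixed multiple of a stage I genotype; the cost ratio of 20 is a hypothetical value, while the sample fraction (one half) and SNP fraction (0.1%) are the figures quoted in the abstract.

```python
def relative_cost(frac_samples_stage1, frac_snps_carried, cost_ratio_stage2):
    """Genotyping cost of a two-stage design relative to a one-stage design.

    The one-stage reference genotypes every subject on the full panel.
    cost_ratio_stage2 is the per-genotype cost in stage II divided by the
    per-genotype cost in stage I.
    """
    # Stage I: a fraction of subjects typed on the full panel
    stage1 = frac_samples_stage1
    # Stage II: remaining subjects typed only on the SNPs carried forward
    stage2 = (1.0 - frac_samples_stage1) * frac_snps_carried * cost_ratio_stage2
    return stage1 + stage2


# Half the sample in stage I, 0.1% of SNPs carried forward,
# stage II genotypes assumed 20 times more expensive each (hypothetical)
cost = relative_cost(0.5, 0.001, 20.0)
print(f"relative cost = {cost:.3f}, saving = {1.0 - cost:.1%}")
```

With so few SNPs carried forward, the stage I term dominates, so the saving stays close to the fraction of subjects left out of stage I unless the per-genotype cost ratio becomes very large.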
Affiliation(s)
- Duncan C Thomas
- Department of Preventive Medicine, University of Southern California
|