1
Mezlini AM, Das S, Goldenberg A. Finding associations in a heterogeneous setting: statistical test for aberration enrichment. Genome Med 2021; 13:68. [PMID: 33892787 PMCID: PMC8066476 DOI: 10.1186/s13073-021-00864-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/04/2020] [Accepted: 03/09/2021] [Indexed: 12/16/2022] Open
Abstract
Most two-group statistical tests find broad patterns such as overall shifts in mean, median, or variance. These tests may not have enough power to detect effects in a small subset of samples, e.g., a drug that works well only on a few patients. We developed a novel statistical test targeting such effects relevant for clinical trials, biomarker discovery, feature selection, etc. We focused on finding meaningful associations in complex genetic diseases in gene expression, miRNA expression, and DNA methylation. Our test outperforms traditional statistical tests in simulated and experimental data and detects potentially disease-relevant genes with heterogeneous effects.
Affiliation(s)
- Aziz M. Mezlini
  - Harvard Medical School, Boston, USA
  - Department of Neurology, Massachusetts General Hospital, Boston, USA
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - Genetics and Genome Biology, Hospital for Sick Children, Toronto, Canada
  - The Vector Institute, Toronto, Canada
  - Evidation Health, Inc., San Mateo, CA, USA
- Sudeshna Das
  - Harvard Medical School, Boston, USA
  - Department of Neurology, Massachusetts General Hospital, Boston, USA
- Anna Goldenberg
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - Genetics and Genome Biology, Hospital for Sick Children, Toronto, Canada
  - The Vector Institute, Toronto, Canada
  - CIFAR, Toronto, Canada
2
Alarcon F, Nuel G. Detecting latent exposure in genome-wide association studies using a breakpoint model for logistic regression. Stat Methods Med Res 2018; 28:1781-1792. [PMID: 29921158 DOI: 10.1177/0962280218776385] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Detecting gene-environment (G × E) interactions in the context of genome-wide association studies (GWAS) is a challenging problem since standard methods generally lack power. An additional difficulty arises from the fact that the causal exposure is seldom observed and only a proxy of this exposure is observed. This leads to a further drop in power and explains the failure of standard methods to detect interactions, even very strong ones. In this article, we consider the latent exposure as a source of heterogeneity and we propose a new powerful method, named "Breakpoint Model for Logistic Regression" (BMLR), based on a breakpoint model, in order to detect G × E interactions when the causal exposure is unobserved. First, the BMLR method is compared through simulations to the ordered-subset analysis for case-control method, which was developed for the same purpose. This highlights the ability of BMLR to detect the heterogeneity and, therefore, to detect interaction with latent exposure. Finally, the BMLR method is compared to standard methods, such as Plink, to perform a GWAS on a published realistic benchmark.
Affiliation(s)
- Flora Alarcon
  - Laboratoire MAP5, Université Paris Descartes and CNRS, Sorbonne Paris Cité, Paris, France
- Gregory Nuel
  - Institute of Mathematics (INSMI), National Center for French Research (CNRS), Paris, France
  - Stochastic and Biology Group, LPSM (CNRS 8001), Sorbonne Université, Paris, France
3
Colak R, Kim T, Kazan H, Oh Y, Cruz M, Valladares-Salgado A, Peralta J, Escobedo J, Parra EJ, Kim PM, Goldenberg A. JBASE: Joint Bayesian Analysis of Subphenotypes and Epistasis. Bioinformatics 2016; 32:203-10. [PMID: 26411870 PMCID: PMC4708100 DOI: 10.1093/bioinformatics/btv504] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 12/22/2014] [Revised: 08/02/2015] [Accepted: 08/24/2015] [Indexed: 01/22/2023] Open
Abstract
MOTIVATION: Rapid advances in genotyping and genome-wide association studies have enabled the discovery of many new genotype-phenotype associations at the resolution of individual markers. However, these associations explain only a small proportion of the theoretically estimated heritability of most diseases. In this work, we propose an integrative mixture model called JBASE: joint Bayesian analysis of subphenotypes and epistasis. JBASE explores two major reasons for missing heritability: interactions between genetic variants, a phenomenon known as epistasis, and phenotypic heterogeneity, addressed via subphenotyping. RESULTS: Our extensive simulations in a wide range of scenarios repeatedly demonstrate that JBASE can identify true underlying subphenotypes, including their associated variants and their interactions, with high precision. In the presence of phenotypic heterogeneity, JBASE has higher power and lower type 1 error than five state-of-the-art approaches. We applied our method to a sample of individuals from Mexico with type 2 diabetes and discovered two novel epistatic modules, each including two loci, that define two subphenotypes characterized by differences in body mass index and waist-to-hip ratio. We successfully replicated these subphenotypes and epistatic modules in an independent dataset from Mexico genotyped with a different platform. AVAILABILITY AND IMPLEMENTATION: JBASE is implemented in C++, supported on Linux and is available at http://www.cs.toronto.edu/~goldenberg/JBASE/jbase.tar.gz. The genotype data underlying this study are available upon approval by the ethics review board of the Medical Centre Siglo XXI. Please contact Dr Miguel Cruz at mcruzl@yahoo.com for assistance with the application. CONTACT: anna.goldenberg@utoronto.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Recep Colak
  - Department of Computer Science, University of Toronto, M5S 2E4, Toronto, ON, Canada, Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, M5S 3E1, Toronto, ON, Canada
- TaeHyung Kim
  - Department of Computer Science, University of Toronto, M5S 2E4, Toronto, ON, Canada, Department of Computer Engineering, Antalya International University, 07190, Antalya, Turkey
- Hilal Kazan
  - Department of Computer Engineering, Antalya International University, 07190, Antalya, Turkey
- Yoomi Oh
  - Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, M5S 3E1, Toronto, ON, Canada, Department of Molecular Genetics, University of Toronto, M5S 1A8, Toronto, ON, Canada
- Miguel Cruz
  - Unidad de Investigación Médica en Bioquímica, Hospital de Especialidades, IMSS, 06720, Mexico City, Mexico
- Adan Valladares-Salgado
  - Unidad de Investigación Médica en Bioquímica, Hospital de Especialidades, IMSS, 06720, Mexico City, Mexico
- Jesus Peralta
  - Unidad de Investigación Médica en Bioquímica, Hospital de Especialidades, IMSS, 06720, Mexico City, Mexico
- Jorge Escobedo
  - Unidad de Investigación en Epidemiología Clínica, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Esteban J Parra
  - Department of Anthropology, University of Toronto, L5L 1C6, Mississauga, ON, Canada
- Philip M Kim
  - Donnelly Centre for Cellular & Biomolecular Research, University of Toronto, M5S 3E1, Toronto, ON, Canada, Department of Molecular Genetics, University of Toronto, M5S 1A8, Toronto, ON, Canada, Genetics and Genome Biology, Hospital for Sick Children, M5G 0A4, Toronto, ON, Canada and Banting and Best Department of Medical Research, University of Toronto, M5G 1L6, Toronto, ON, Canada
- Anna Goldenberg
  - Department of Computer Science, University of Toronto, M5S 2E4, Toronto, ON, Canada, Genetics and Genome Biology, Hospital for Sick Children, M5G 0A4, Toronto, ON, Canada and
4
Szigeti K, Kellermayer B, Lentini JM, Trummer B, Lal D, Doody RS, Yan L, Liu S, Ma C. Ordered subset analysis of copy number variation association with age at onset of Alzheimer's disease. J Alzheimers Dis 2014; 41:1063-71. [PMID: 24787912 PMCID: PMC4866488 DOI: 10.3233/jad-132693] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
Genetic heterogeneity is a common problem for genome-wide association studies of complex human diseases. Ordered-subset analysis (OSA) reduces genetic heterogeneity and optimizes the use of phenotypic information, thus improving power under some disease models. We hypothesized that in a genetically heterogeneous disorder such as Alzheimer's disease (AD), utilizing OSA by age at onset (AAO) of AD may increase the power to detect relevant loci. Using this approach, 8 loci were detected, including the chr15 : 30,44 region harboring CHRFAM7A. The association was replicated in the NIA-LOAD Familial Study dataset. CHRFAM7A is a dominant negative regulator of CHRNA7 function, the receptor that facilitates amyloid-β1-42 internalization through endocytosis and has been implicated in AD. OSA, using AAO as a quantitative trait, optimized power and detected replicable signals suggesting that AD is genetically heterogeneous between AAO subsets.
Affiliation(s)
- Kinga Szigeti
  - Department of Neurology, University at Buffalo, SUNY, Buffalo, NY, USA
  - Correspondence to: Kinga Szigeti, MD, PhD, University of Buffalo SUNY, 100 High Street, Buffalo, NY 14203, USA. Tel.: +1 716 859 3484; Fax: +1 716 859 7833
- Jenna M. Lentini
  - Department of Neurology, University at Buffalo, SUNY, Buffalo, NY, USA
- Brian Trummer
  - Department of Neurology, University at Buffalo, SUNY, Buffalo, NY, USA
- Deepika Lal
  - Department of Neurology, University at Buffalo, SUNY, Buffalo, NY, USA
- Rachelle S. Doody
  - Alzheimer’s Disease and Memory Disorders Center, Department of Neurology, Baylor College of Medicine, Houston, TX, USA
- Li Yan
  - Department of Bioinformatics, University at Buffalo, SUNY, Buffalo, NY, USA
- Song Liu
  - Roswell Park Cancer Institute, Buffalo, NY, USA
- Changxing Ma
  - Department of Bioinformatics, University at Buffalo, SUNY, Buffalo, NY, USA
5
The role of phenotype in gene discovery in the whole genome sequencing era. Hum Genet 2012; 131:1533-40. [PMID: 22722752 DOI: 10.1007/s00439-012-1191-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 04/06/2012] [Accepted: 06/07/2012] [Indexed: 10/28/2022]
Abstract
As whole genome sequence becomes a routine component of gene discovery studies in humans, we will have an exhaustive catalog of genetic variation and the challenge becomes understanding the phenotypic consequences of these variants. Statistical genetic methods and analytical approaches that are concerned with optimizing phenotypes for gene discovery for complex traits offer two general categories of advantages. They may increase power to localize genes of interest and also aid in interpreting associations between genetic variants and disease outcomes by suggesting potential mechanisms and pathways through which genes may affect outcomes. Such phenotype optimization approaches include use of allied phenotypes such as symptoms or ages of onset to reduce genetic heterogeneity within a set of cases, study of quantitative risk factors or endophenotypes, joint analyses of related phenotypes, and derivation of new phenotypes designed to extract independent measures underlying the correlations among a set of related phenotypes through approaches such as principal components. New opportunities are also presented by technological advances that permit efficient collection of hundreds or thousands of phenotypes on an individual, including phenotypes more proximal to the level of gene action such as levels of gene expression, microRNAs, or metabolic and proteomic profiles.
6
Londono D, Buyske S, Finch SJ, Sharma S, Wise CA, Gordon D. TDT-HET: a new transmission disequilibrium test that incorporates locus heterogeneity into the analysis of family-based association data. BMC Bioinformatics 2012; 13:13. [PMID: 22264315 PMCID: PMC3292499 DOI: 10.1186/1471-2105-13-13] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 06/17/2011] [Accepted: 01/20/2012] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: Locus heterogeneity is one of the most documented phenomena in genetics. To date, relatively little work has been done on the development of methods to address locus heterogeneity in genetic association analysis. Motivated by Zhou and Pan's work, we present a mixture model of linked and unlinked trios and develop a statistical method to estimate the probability that a heterozygous parent transmits the disease allele at a di-allelic locus, and the probability that any trio is in the linked group. The purpose here is the development of a test that extends the classic transmission disequilibrium test (TDT) to one that accounts for locus heterogeneity. RESULTS: Our simulations suggest that, for a sufficiently large sample size (1000 trios), our method has good power to detect association even when the proportion of unlinked trios is high (75%). While the median difference (TDT-HET empirical power - TDT empirical power) is approximately 0 for all MOI, there are parameter settings for which the power difference can be substantial. Our multi-locus simulations suggest that our method has good power to detect association as long as the markers are reasonably well-correlated and the genotype relative risks are larger. Results of both single-locus and multi-locus simulations suggest our method maintains the correct type I error rate. Finally, the TDT-HET statistic shows highly significant p-values for most of the idiopathic scoliosis candidate loci, and for some loci, the estimated proportion of unlinked trios approaches or exceeds 50%, suggesting the presence of locus heterogeneity. CONCLUSIONS: We have developed an extension of the TDT statistic (TDT-HET) that allows for locus heterogeneity among coded trios. Benefits of our method include estimates of parameters in the presence of heterogeneity and reasonable power even when the proportion of linked trios is small. Also, we have extended multi-locus methods to TDT-HET and have demonstrated that the empirical power to detect linkage may be high. Last, given that we obtain PPBs, we conjecture that the TDT-HET may be a useful method for correctly identifying linked trios. We anticipate that researchers will find this property increasingly useful as they apply next-generation sequencing data in family-based studies.
Affiliation(s)
- Douglas Londono
  - Department of Genetics and Human Genetics Institute, Rutgers, The State University of New Jersey, 145 Bevier Road, Piscataway, NJ, 08854 USA
- Steven Buyske
  - Department of Genetics and Human Genetics Institute, Rutgers, The State University of New Jersey, 145 Bevier Road, Piscataway, NJ, 08854 USA
  - Department of Statistics & Biostatistics, Hill Center, Rutgers, The State University of New Jersey, 110 Frelinghuysen Road Piscataway, NJ 08854-8019 USA
- Stephen J Finch
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, 11794-3600 USA
- Swarkar Sharma
  - Texas Scottish Rite Hospital for Children, 2222 Welborn Street, Dallas, TX 72519 USA
- Carol A Wise
  - Texas Scottish Rite Hospital for Children, 2222 Welborn Street, Dallas, TX 72519 USA
  - Department of Orthopedic Surgery and McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390 USA
- Derek Gordon
  - Department of Genetics and Human Genetics Institute, Rutgers, The State University of New Jersey, 145 Bevier Road, Piscataway, NJ, 08854 USA
7
IGF-I gene variability is associated with an increased risk for AD. Neurobiol Aging 2010; 32:556.e3-11. [PMID: 21176999 DOI: 10.1016/j.neurobiolaging.2010.10.017] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 03/18/2010] [Revised: 10/18/2010] [Accepted: 10/23/2010] [Indexed: 11/23/2022]
Abstract
Insulin-like growth factor I (IGF-I), a neuroprotective factor with a wide spectrum of actions in the adult brain, is involved in the pathogenesis of Alzheimer's disease (AD). Circulating levels of IGF-I change in AD patients and are implicated in the clearance of brain amyloid beta (Aβ) complexes. To investigate this hypothesis, we screened the IGF-I gene for various well-known single nucleotide polymorphisms (SNPs) covering % of the gene variability in a population of 2352 individuals. Genetic analysis indicated a different distribution of genotypes of 1 single nucleotide polymorphism and 1 extended haplotype in the AD population compared with healthy control subjects. In particular, the frequency of the rs972936 GG genotype was significantly greater in AD patients than in control subjects (63% vs. 55%). The rs972936 GG genotype was associated with an increased risk for disease, independently of apolipoprotein E genotype, and with enhanced circulating levels of IGF-I. These findings suggest that polymorphisms within the IGF-I gene could confer greater risk for AD through their effect on IGF-I levels, and confirm the physiological role of IGF-I in the pathogenesis of AD.