1
Mbatchou J, McPeek MS. JASPER: Fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. Am J Hum Genet 2024:S0002-9297(24)00216-7. [PMID: 39025064 DOI: 10.1016/j.ajhg.2024.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 06/19/2024] [Accepted: 06/20/2024] [Indexed: 07/20/2024] Open
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction, and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks, or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture, and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits, and microbiome abundances. It allows for covariates, ascertainment, and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, most of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
  - Regeneron Genetics Center, Tarrytown, NY 10591, USA; Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA; Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA.
2
Ali B, Huguenin-Bizot B, Laurent M, Chaumont F, Maistriaux LC, Nicolas S, Duborjal H, Welcker C, Tardieu F, Mary-Huard T, Moreau L, Charcosset A, Runcie D, Rincent R. High-dimensional multi-omics measured in controlled conditions are useful for maize platform and field trait predictions. Theor Appl Genet 2024; 137:175. [PMID: 38958724 DOI: 10.1007/s00122-024-04679-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 06/15/2024] [Indexed: 07/04/2024]
Abstract
KEY MESSAGE Transcriptomics and proteomics information collected on a platform can predict additive and non-additive effects for platform traits and additive effects for field traits. The effects of climate change in the form of drought, heat stress, and irregular seasonal changes threaten global crop production. The ability of multi-omics data, such as transcripts and proteins, to reflect a plant's response to such climatic factors can be capitalized in prediction models to maximize crop improvement. Implementing multi-omics characterization in field evaluations is challenging due to high costs. It is, however, possible to do it on reference genotypes in controlled conditions. Using omics measured on a platform, we tested different multi-omics-based prediction approaches, using a high dimensional linear mixed model (MegaLMM) to predict genotypes for platform traits and agronomic field traits in a panel of 244 maize hybrids. We considered two prediction scenarios: in the first one, new hybrids are predicted (CV-NH), and in the second one, partially observed hybrids are predicted (CV-POH). For both scenarios, all hybrids were characterized for omics on the platform. We observed that omics can predict both additive and non-additive genetic effects for the platform traits, resulting in much higher predictive abilities than GBLUP. It highlights their efficiency in capturing regulatory processes in relation to growth conditions. For the field traits, we observed that the additive components of omics only slightly improved predictive abilities for predicting new hybrids (CV-NH, model MegaGAO) and for predicting partially observed hybrids (CV-POH, model GAOxW-BLUP) in comparison to GBLUP. We conclude that measuring the omics in the fields would be of considerable interest in predicting productivity if the costs of omics drop significantly.
Affiliation(s)
- Baber Ali
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France
- Bertrand Huguenin-Bizot
  - Laboratoire Reproduction Et Développement Des Plantes, CNRS, ENS de Lyon-46, Allée d'Italie, 69364, Lyon, France
- Maxime Laurent
  - Louvain Institute of Biomolecular Science and Technology, UCLouvain, Louvain-La-Neuve, Belgium
- François Chaumont
  - Louvain Institute of Biomolecular Science and Technology, UCLouvain, Louvain-La-Neuve, Belgium
- Laurie C Maistriaux
  - Louvain Institute of Biomolecular Science and Technology, UCLouvain, Louvain-La-Neuve, Belgium
- Stéphane Nicolas
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France
- Hervé Duborjal
  - Limagrain, Limagrain Fields Seeds, Research Centre, 63720, Chappes, France
- Tristan Mary-Huard
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France
- Laurence Moreau
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France
- Alain Charcosset
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France
- Daniel Runcie
  - Department of Plant Sciences, University of California Davis, Davis, CA, USA
- Renaud Rincent
  - INRAE, CNRS, AgroParisTech, GQE-Le Moulon, Université Paris-Saclay, 91190, Gif-Sur-Yvette, France.
3
McCaw ZR, Gao J, Lin X, Gronsbell J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat Genet 2024; 56:1527-1536. [PMID: 38872030 DOI: 10.1038/s41588-024-01793-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Accepted: 05/08/2024] [Indexed: 06/15/2024]
Abstract
Within population biobanks, incomplete measurement of certain traits limits the power for genetic discovery. Machine learning is increasingly used to impute the missing values from the available data. However, performing genome-wide association studies (GWAS) on imputed traits can introduce spurious associations, identifying genetic variants that are not associated with the original trait. Here we introduce a new method, synthetic surrogate (SynSurr) analysis, which makes GWAS on imputed phenotypes robust to imputation errors. Rather than replacing missing values, SynSurr jointly analyzes the original and imputed traits. We show that SynSurr estimates the same genetic effect as standard GWAS and improves power in proportion to the quality of the imputations. SynSurr requires a commonly made missing-at-random assumption but relaxes the requirements of existing imputation methods by not requiring correct model specification. We present extensive simulations and ablation analyses to validate SynSurr and apply it to empower the GWAS of dual-energy X-ray absorptiometry traits within the UK Biobank.
Affiliation(s)
- Zachary R McCaw
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Jianhui Gao
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Statistics, Harvard University, Cambridge, MA, USA
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada.
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada.
4
Chen M, Dahl A. A robust model for cell type-specific interindividual variation in single-cell RNA sequencing data. Nat Commun 2024; 15:5229. [PMID: 38898015 PMCID: PMC11186839 DOI: 10.1038/s41467-024-49242-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2023] [Accepted: 05/28/2024] [Indexed: 06/21/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has been widely used to characterize cell types based on their average gene expression profiles. However, most studies do not consider cell type-specific variation across donors. Modelling this cell type-specific inter-individual variation could help elucidate cell type-specific biology and inform genes and cell types underlying complex traits. We therefore develop a new model to detect and quantify cell type-specific variation across individuals called CTMM (Cell Type-specific linear Mixed Model). We use extensive simulations to show that CTMM is powerful and unbiased in realistic settings. We also derive calibrated tests for cell type-specific interindividual variation, which is challenging given the modest sample sizes in scRNA-seq. We apply CTMM to scRNA-seq data from human induced pluripotent stem cells to characterize the transcriptomic variation across donors as cells differentiate into endoderm. We find that almost 100% of transcriptome-wide variability between donors is differentiation stage-specific. CTMM also identifies individual genes with statistically significant stage-specific variability across samples, including 85 genes that do not have significant stage-specific mean expression. Finally, we extend CTMM to partition interindividual covariance between stages, which recapitulates the overall differentiation trajectory. Overall, CTMM is a powerful tool to illuminate cell type-specific biology in scRNA-seq.
Affiliation(s)
- Minhui Chen
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, 60637, USA.
- Andy Dahl
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, 60637, USA.
5
Herrera-Luis E, Benke K, Volk H, Ladd-Acosta C, Wojcik GL. Gene-environment interactions in human health. Nat Rev Genet 2024:10.1038/s41576-024-00731-z. [PMID: 38806721 DOI: 10.1038/s41576-024-00731-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/03/2024] [Indexed: 05/30/2024]
Abstract
Gene-environment interactions (G × E), the interplay of genetic variation with environmental factors, have a pivotal impact on human complex traits and diseases. Statistically, G × E can be assessed by determining the deviation from expectation of predictive models based solely on the phenotypic effects of genetics or environmental exposures. Despite the unprecedented, widespread and diverse use of G × E analytical frameworks, heterogeneity in their application and reporting hinders their applicability in public health. In this Review, we discuss study design considerations as well as G × E analytical frameworks to assess polygenic liability dependent on the environment, to identify specific genetic variants exhibiting G × E, and to characterize environmental context for these dynamics. We conclude with recommendations to address the most common challenges and pitfalls in the conceptualization, methodology and reporting of G × E studies, as well as future directions.
Affiliation(s)
- Esther Herrera-Luis
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Kelly Benke
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Heather Volk
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Christine Ladd-Acosta
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Genevieve L Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
6
Ren J, Pan W. Statistical inference with large-scale trait imputation. Stat Med 2024; 43:625-641. [PMID: 38038193 PMCID: PMC10848238 DOI: 10.1002/sim.9975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 09/26/2023] [Accepted: 11/17/2023] [Indexed: 12/02/2023]
Abstract
Recently a nonparametric method called LS-imputation has been proposed for large-scale trait imputation based on a GWAS summary dataset and a large set of genotyped individuals. The imputed trait values, along with the genotypes, can be treated as an individual-level dataset for downstream genetic analyses, including those that cannot be done with GWAS summary data. However, since the covariance matrix of the imputed trait values is often too large to calculate, the current method imposes a working assumption that the imputed trait values are identically and independently distributed, which is incorrect in truth. Here we propose a "divide and conquer/combine" strategy to estimate and account for the covariance matrix of the imputed trait values via batches, thus relaxing the incorrect working assumption. Applications of the methods to the UK Biobank data for marginal association analysis showed some improvement by the new method in some cases, but overall the original method performed well, which was explained by nearly constant variances of and mostly weak correlations among imputed trait values.
Affiliation(s)
- Jingchen Ren
  - School of Statistics, University of Minnesota, Minneapolis, MN, 55455
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, 55455
- Wei Pan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, 55455
7
Mbatchou J, McPeek MS. JASPER: fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. bioRxiv 2023:2023.12.18.571948. [PMID: 38187553 PMCID: PMC10769254 DOI: 10.1101/2023.12.18.571948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits and microbiome abundances. It allows for covariates, ascertainment and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, some of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
  - Regeneron Genetics Center, Tarrytown, NY 10591, USA
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
  - Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
  - Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA
8
Dahl A, Thompson M, An U, Krebs M, Appadurai V, Border R, Bacanu SA, Werge T, Flint J, Schork AJ, Sankararaman S, Kendler KS, Cai N. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat Genet 2023; 55:2082-2093. [PMID: 37985818 PMCID: PMC10703686 DOI: 10.1038/s41588-023-01559-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Accepted: 09/18/2023] [Indexed: 11/22/2023]
Abstract
Biobanks often contain several phenotypes relevant to diseases such as major depressive disorder (MDD), with partly distinct genetic architectures. Researchers face complex tradeoffs between shallow (large sample size, low specificity/sensitivity) and deep (small sample size, high specificity/sensitivity) phenotypes, and the optimal choices are often unclear. Here we propose to integrate these phenotypes to combine the benefits of each. We use phenotype imputation to integrate information across hundreds of MDD-relevant phenotypes, which significantly increases genome-wide association study (GWAS) power and polygenic risk score (PRS) prediction accuracy of the deepest available MDD phenotype in UK Biobank, LifetimeMDD. We demonstrate that imputation preserves specificity in its genetic architecture using a novel PRS-based pleiotropy metric. We further find that integration via summary statistics also enhances GWAS power and PRS predictions, but can introduce nonspecific genetic effects depending on input. Our work provides a simple and scalable approach to improve genetic studies in large biobanks by integrating shallow and deep phenotypes.
Affiliation(s)
- Andrew Dahl
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA.
- Michael Thompson
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Ulzee An
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Morten Krebs
  - Institute of Biological Psychiatry, Mental Health Center-Sct Hans, Copenhagen University Hospital-Mental Health Services CPH, Copenhagen, Denmark
- Vivek Appadurai
  - Institute of Biological Psychiatry, Mental Health Center-Sct Hans, Copenhagen University Hospital-Mental Health Services CPH, Copenhagen, Denmark
- Richard Border
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Silviu-Alin Bacanu
  - Virginia Institute for Psychiatric and Behavioral Genetics and Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Thomas Werge
  - Institute of Biological Psychiatry, Mental Health Center-Sct Hans, Copenhagen University Hospital-Mental Health Services CPH, Copenhagen, Denmark
  - Lundbeck Foundation GeoGenetics Centre, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
  - Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Jonathan Flint
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Andrew J Schork
  - Institute of Biological Psychiatry, Mental Health Center-Sct Hans, Copenhagen University Hospital-Mental Health Services CPH, Copenhagen, Denmark
  - Neurogenomics Division, The Translational Genomics Research Institute (TGEN), Phoenix, AZ, USA
  - Section for Geogenetics, GLOBE Institute, Faculty of Health and Medical Sciences, Copenhagen University, Copenhagen, Denmark
- Sriram Sankararaman
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Kenneth S Kendler
  - Virginia Institute for Psychiatric and Behavioral Genetics and Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Na Cai
  - Helmholtz Pioneer Campus, Helmholtz Zentrum München, Neuherberg, Germany.
  - Computational Health Centre, Helmholtz Zentrum München, Neuherberg, Germany.
  - School of Medicine, Technical University of Munich, Munich, Germany.
9
An U, Pazokitoroudi A, Alvarez M, Huang L, Bacanu S, Schork AJ, Kendler K, Pajukanta P, Flint J, Zaitlen N, Cai N, Dahl A, Sankararaman S. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat Genet 2023; 55:2269-2276. [PMID: 37985819 PMCID: PMC10703681 DOI: 10.1038/s41588-023-01558-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/01/2022] [Accepted: 10/04/2023] [Indexed: 11/22/2023]
Abstract
Biobanks that collect deep phenotypic and genomic data across many individuals have emerged as a key resource in human genetics. However, phenotypes in biobanks are often missing across many individuals, limiting their utility. We propose AutoComplete, a deep learning-based imputation method to impute or 'fill-in' missing phenotypes in population-scale biobank datasets. When applied to collections of phenotypes measured across ~300,000 individuals from the UK Biobank, AutoComplete substantially improved imputation accuracy over existing methods. On three traits with notable amounts of missingness, we show that AutoComplete yields imputed phenotypes that are genetically similar to the originally observed phenotypes while increasing the effective sample size by about twofold on average. Further, genome-wide association analyses on the resulting imputed phenotypes led to a substantial increase in the number of associated loci. Our results demonstrate the utility of deep learning-based phenotype imputation to increase power for genetic discoveries in existing biobank datasets.
Affiliation(s)
- Ulzee An
  - Computer Science Department, UCLA, Los Angeles, CA, USA.
- Marcus Alvarez
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA
- Lianyun Huang
  - Helmholtz Pioneer Campus, Helmholtz Zentrum München, Neuherberg, Germany
  - Computational Health Centre, Helmholtz Zentrum München, Neuherberg, Germany
  - School of Medicine, Technical University of Munich, Munich, Germany
- Silviu Bacanu
  - Virginia Institute for Psychiatric and Behavioral Genetics and Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Andrew J Schork
  - Institute of Biological Psychiatry, Mental Health Center - Sct Hans, Copenhagen University Hospital, Copenhagen, Denmark
  - Neurogenomics Division, The Translational Genomics Research Institute (TGEN), Phoenix, AZ, USA
  - Section for Geogenetics, GLOBE Institute, Faculty of Health and Medical Sciences, Copenhagen University, Copenhagen, Denmark
- Kenneth Kendler
  - Virginia Institute for Psychiatric and Behavioral Genetics and Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Päivi Pajukanta
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA
  - Institute for Precision Health, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA
- Jonathan Flint
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA
- Noah Zaitlen
  - Neurology Department, UCLA, Los Angeles, CA, USA
- Na Cai
  - Helmholtz Pioneer Campus, Helmholtz Zentrum München, Neuherberg, Germany
  - Computational Health Centre, Helmholtz Zentrum München, Neuherberg, Germany
  - School of Medicine, Technical University of Munich, Munich, Germany
- Andy Dahl
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Sriram Sankararaman
  - Computer Science Department, UCLA, Los Angeles, CA, USA.
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
  - Department of Computational Medicine, UCLA, Los Angeles, CA, USA.
10
Zhang Z, Jung J, Kim A, Suboc N, Gazal S, Mancuso N. A scalable approach to characterize pleiotropy across thousands of human diseases and complex traits using GWAS summary statistics. Am J Hum Genet 2023; 110:1863-1874. [PMID: 37879338 PMCID: PMC10645558 DOI: 10.1016/j.ajhg.2023.09.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 09/26/2023] [Accepted: 09/27/2023] [Indexed: 10/27/2023] Open
Abstract
Genome-wide association studies (GWASs) across thousands of traits have revealed the pervasive pleiotropy of trait-associated genetic variants. While methods have been proposed to characterize pleiotropic components across groups of phenotypes, scaling these approaches to ultra-large-scale biobanks has been challenging. Here, we propose FactorGo, a scalable variational factor analysis model to identify and characterize pleiotropic components using biobank GWAS summary data. In extensive simulations, we observe that FactorGo outperforms the state-of-the-art (model-free) approach tSVD in capturing latent pleiotropic factors across phenotypes while maintaining a similar computational cost. We apply FactorGo to estimate 100 latent pleiotropic factors from GWAS summary data of 2,483 phenotypes measured in European-ancestry Pan-UK BioBank individuals (N = 420,531). Next, we find that factors from FactorGo are more enriched with relevant tissue-specific annotations than those identified by tSVD (p = 2.58E-10) and validate our approach by recapitulating brain-specific enrichment for BMI and the height-related connection between reproductive system and muscular-skeletal growth. Finally, our analyses suggest shared etiologies between rheumatoid arthritis and periodontal condition in addition to alkaline phosphatase as a candidate prognostic biomarker for prostate cancer. Overall, FactorGo improves our biological understanding of shared etiologies across thousands of GWASs.
Affiliation(s)
- Zixuan Zhang
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
- Junghyun Jung
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artem Kim
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Noah Suboc
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Steven Gazal
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Nicholas Mancuso
  - Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA; Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA.
11
Ren J, Lin Z, Pan W. Integrating GWAS summary statistics, individual-level genotypic and omic data to enhance the performance for large-scale trait imputation. Hum Mol Genet 2023; 32:2693-2703. [PMID: 37369060 PMCID: PMC10460491 DOI: 10.1093/hmg/ddad097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 05/23/2023] [Accepted: 06/13/2023] [Indexed: 06/29/2023] Open
Abstract
Recently, a non-parametric method has been proposed to impute the genetic component of a trait for a large set of genotyped individuals based on a separate genome-wide association study (GWAS) summary dataset of the same trait (from the same population). The imputed trait may contain linear, non-linear and epistatic effects of genetic variants, and thus can be used for downstream linear or non-linear association analyses and machine learning tasks. Here, we propose an extension of the method to impute both genetic and environmental components of a trait using both single nucleotide polymorphism (SNP)-trait and omics-trait association summary data. We illustrate an application to a UK Biobank subset of individuals (n ≈ 80K) with both body mass index (BMI) GWAS data and metabolomic data. We divided the whole dataset into two equally sized and non-overlapping training and test datasets; we used the training data to build SNP- and metabolite-BMI association summary data and impute BMI on the test data. We compared the performance of the original and new imputation methods. As with the original method, the BMI values imputed by the new method largely retained SNP-BMI association information; however, the new method's values retained more information about BMI-environment associations and were more highly correlated with the original observed BMI values.
Affiliation(s)
- Jingchen Ren
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Zhaotong Lin
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
|
12
|
Mignogna G, Carey CE, Wedow R, Baya N, Cordioli M, Pirastu N, Bellocco R, Malerbi KF, Nivard MG, Neale BM, Walters RK, Ganna A. Patterns of item nonresponse behaviour to survey questionnaires are systematic and associated with genetic loci. Nat Hum Behav 2023; 7:1371-1387. [PMID: 37386106 PMCID: PMC10444625 DOI: 10.1038/s41562-023-01632-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/05/2022] [Accepted: 05/17/2023] [Indexed: 07/01/2023]
Abstract
Response to survey questionnaires is vital for social and behavioural research, and most analyses assume full and accurate response by participants. However, nonresponse is common and impedes proper interpretation and generalizability of results. We examined item nonresponse behaviour across 109 questionnaire items in the UK Biobank (N = 360,628). Phenotypic factor scores for two participant-selected nonresponse answers, 'Prefer not to answer' (PNA) and 'I don't know' (IDK), each predicted participant nonresponse in follow-up surveys (incremental pseudo-R² = 0.056), even when controlling for education and self-reported health (incremental pseudo-R² = 0.046). After performing genome-wide association studies of our factors, PNA and IDK were highly genetically correlated with one another (rg = 0.73 (s.e. = 0.03)) and with education (rg,PNA = -0.51 (s.e. = 0.03); rg,IDK = -0.38 (s.e. = 0.02)), health (rg,PNA = 0.51 (s.e. = 0.03); rg,IDK = 0.49 (s.e. = 0.02)) and income (rg,PNA = -0.57 (s.e. = 0.04); rg,IDK = -0.46 (s.e. = 0.02)), with additional unique genetic associations observed for both PNA and IDK (P < 5 × 10⁻⁸). We discuss how these associations may bias studies of traits correlated with item nonresponse and demonstrate how this bias may substantially affect genome-wide association studies. While the UK Biobank data are deidentified, we further protected participant privacy by avoiding exploring non-response behaviour to single questions, assuring that no information can be used to associate results with any particular respondents.
Affiliation(s)
- Gianmarco Mignogna
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Caitlin E Carey
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
| | - Robbee Wedow
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Sociology, Purdue University, West Lafayette, IN, USA.
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA.
- AnalytiXIN (Analytics Indiana), Indianapolis, IN, USA.
- Department of Statistics, Purdue University, West Lafayette, IN, USA.
| | - Nikolas Baya
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Mattia Cordioli
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
| | - Nicola Pirastu
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, Scotland
- Fondazione Human Technopole, Viale Rita Levi-Montalcini, Milan, Italy
| | - Rino Bellocco
- Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milan, Italy
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | | | - Michel G Nivard
- Department of Biological Psychiatry, Faculty of Behavioural and Movement Sciences, Vrije Universiteit, Amsterdam, the Netherlands
- Methodology Program, Amsterdam Public Health, Amsterdam, the Netherlands
- Amsterdam Neuroscience - Mood, Anxiety, Psychosis, Stress and Sleep, Amsterdam, the Netherlands
| | - Benjamin M Neale
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Novo Nordisk Foundation for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Raymond K Walters
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Andrea Ganna
- Analytic and Translational Genetics Unit, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
- Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
|
13
|
Gentry AE, Kirkpatrick RM, Peterson RE, Webb BT. Missingness adapted group informed clustered (MAGIC)-LASSO: a novel paradigm for phenotype prediction to improve power for genetic loci discovery. Front Genet 2023; 14:1162690. [PMID: 37547462 PMCID: PMC10399453 DOI: 10.3389/fgene.2023.1162690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 07/03/2023] [Indexed: 08/08/2023] Open
Abstract
Introduction: The availability of large-scale biobanks linking genetic data, rich phenotypes, and biological measures is a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive missingness. While missing data prediction is possible, performance is significantly impaired by block-wise missingness inherent to many biobanks. Methods: To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO, which performs hierarchical clustering of variables based on missingness followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and balance between training and target sets, with final models built using stepwise inclusion of features ranked by completeness. This research has been conducted using the UK Biobank (n > 500 k) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. Results: The phenotypic correlation between measured and predicted total score was 0.67, while the genetic correlation between independent subjects was high (>0.86). Discussion: Phenotypic and genetic correlations in real data application, as well as simulations, demonstrate the method has significant accuracy and utility for increasing power for genetic loci discovery.
Affiliation(s)
- Amanda Elswick Gentry
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States
| | - Robert M. Kirkpatrick
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States
| | - Roseann E. Peterson
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States
- Department of Psychiatry and Behavioral Sciences, Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY, United States
| | - Bradley T. Webb
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, United States
- GenOmics and Translational Research Center, RTI International, Research Triangle Park, NC, United States
|
14
|
Ren J, Lin Z, He R, Shen X, Pan W. Using GWAS summary data to impute traits for genotyped individuals. HGG ADVANCES 2023; 4:100197. [PMID: 37181332 PMCID: PMC10173780 DOI: 10.1016/j.xhgg.2023.100197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Accepted: 04/07/2023] [Indexed: 05/16/2023] Open
Abstract
Genome-wide association study (GWAS) summary data have become extremely useful in daily routine data analysis, largely facilitating new methods development and new applications. However, a severe limitation with the current use of GWAS summary data is its exclusive restriction to only linear single nucleotide polymorphism (SNP)-trait association analyses. To further expand the use of GWAS summary data, along with a large sample of individual-level genotypes, we propose a nonparametric method for large-scale imputation of the genetic component of the trait for the given genotypes. The imputed individual-level trait values, along with the individual-level genotypes, make it possible to conduct any analysis as with individual-level GWAS data, including nonlinear SNP-trait associations and predictions. We use the UK Biobank data to highlight the usefulness and effectiveness of the proposed method in three applications that currently cannot be done with only GWAS summary data (for SNP-trait associations): marginal SNP-trait association analysis under non-additive genetic models, detection of SNP-SNP interactions, and genetic prediction of a trait using a nonlinear model of SNPs.
Affiliation(s)
- Jingchen Ren
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Zhaotong Lin
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Ruoyu He
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
| | - Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455, USA
- Corresponding author
|
15
|
Flint J. The genetic basis of major depressive disorder. Mol Psychiatry 2023; 28:2254-2265. [PMID: 36702864 PMCID: PMC10611584 DOI: 10.1038/s41380-023-01957-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 06/18/2022] [Revised: 12/30/2022] [Accepted: 01/11/2023] [Indexed: 01/27/2023]
Abstract
The genetic dissection of major depressive disorder (MDD) ranks as one of the success stories of psychiatric genetics, with genome-wide association studies (GWAS) identifying 178 genetic risk loci and proposing more than 200 candidate genes. However, the GWAS results derive from the analysis of cohorts in which most cases are diagnosed by minimal phenotyping, a method that has low specificity. I review data indicating that there is a large genetic component unique to MDD that remains inaccessible to minimal phenotyping strategies and that the majority of genetic risk loci identified with minimal phenotyping approaches are unlikely to be MDD risk loci. I show that inventive uses of biobank data, novel imputation methods, combined with more interviewer diagnosed cases, can identify loci that contribute to the episodic severe shifts of mood, and neurovegetative and cognitive changes that are central to MDD. Furthermore, new theories about the nature and causes of MDD, drawing upon advances in neuroscience and psychology, can provide handles on how best to interpret and exploit genetic mapping results.
Affiliation(s)
- Jonathan Flint
- Department of Psychiatry and Biobehavioral Sciences, Billy and Audrey Wilder Endowed Chair in Psychiatry and Neuroscience, Center for Neurobehavioral Genetics, 695 Charles E. Young Drive South, 3357B Gonda, Box 951761, Los Angeles, CA, 90095-1761, USA.
|
16
|
Jarquin D, Roy A, Clarke B, Ghosal S. Combining phenotypic and genomic data to improve prediction of binary traits. J Appl Stat 2023; 51:1497-1523. [PMID: 38863802 PMCID: PMC11164039 DOI: 10.1080/02664763.2023.2208773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Accepted: 04/22/2023] [Indexed: 06/13/2024]
Abstract
Plant breeders want to develop cultivars that outperform existing genotypes. Some characteristics (here 'main traits') of these cultivars are categorical and difficult to measure directly. It is important to predict the main trait of newly developed genotypes accurately. In addition to marker data, breeding programs often have information on secondary traits (or 'phenotypes') that are easy to measure. Our goal is to improve prediction of main traits with interpretable relations by combining the two data types using variable selection techniques. However, the genomic characteristics can overwhelm the set of secondary traits, so a standard technique may fail to select any phenotypic variables. We develop a new statistical technique that ensures appropriate representation from both the secondary traits and the genotypic variables for optimal prediction. When two data types (markers and secondary traits) are available, we achieve improved prediction of a binary trait by two steps that are designed to ensure that a significant intrinsic effect of a phenotype is incorporated in the relation before accounting for extra effects of genotypes. First, we sparsely regress the secondary traits on the markers and replace the secondary traits by their residuals to obtain the effects of phenotypic variables as adjusted by the genotypic variables. Then, we develop a sparse logistic classifier using the markers and residuals so that the adjusted phenotypes may be selected first to avoid being overwhelmed by the genotypic variables due to their numerical advantage. This classifier uses forward selection aided by a penalty term and can be computed effectively by a technique called the one-pass method. It compares favorably with other classifiers on simulated and real data.
Affiliation(s)
- D. Jarquin
- Agronomy, University of Florida, Gainesville, FL, USA
| | - A. Roy
- Biostatistics Department, University of Florida, Gainesville, FL, USA
| | - B. Clarke
- Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA
| | - S. Ghosal
- Statistics, North Carolina State University, Raleigh, NC, USA
|
17
|
Boatwright JL, Sapkota S, Kresovich S. Functional genomic effects of indels using Bayesian genome-phenome wide association studies in sorghum. Front Genet 2023; 14:1143395. [PMID: 37065477 PMCID: PMC10102435 DOI: 10.3389/fgene.2023.1143395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 03/20/2023] [Indexed: 04/03/2023] Open
Abstract
High-throughput genomic and phenomic data have enhanced the ability to detect genotype-to-phenotype associations that can resolve broad pleiotropic effects of mutations on plant phenotypes. As the scale of genotyping and phenotyping has advanced, rigorous methodologies have been developed to accommodate larger datasets and maintain statistical precision. However, determining the functional effects of associated genes/loci is expensive and limited due to the complexity associated with cloning and subsequent characterization. Here, we utilized phenomic imputation of a multi-year, multi-environment dataset using PHENIX which imputes missing data using kinship and correlated traits, and we screened insertions and deletions (InDels) from the recently whole-genome sequenced Sorghum Association Panel for putative loss-of-function effects. Candidate loci from genome-wide association results were screened for potential loss of function using a Bayesian Genome-Phenome Wide Association Study (BGPWAS) model across both functionally characterized and uncharacterized loci. Our approach is designed to facilitate in silico validation of associations beyond traditional candidate gene and literature-search approaches and to facilitate the identification of putative variants for functional analysis and reduce the incidence of false-positive candidates in current functional validation methods. Using this Bayesian GPWAS model, we identified associations for previously characterized genes with known loss-of-function alleles, specific genes falling within known quantitative trait loci, and genes without any previous genome-wide associations while additionally detecting putative pleiotropic effects. In particular, we were able to identify the major tannin haplotypes at the Tan1 locus and effects of InDels on the protein folding. Depending on the haplotype present, heterodimer formation with Tan2 was significantly affected. 
We also identified major effect InDels in Dw2 and Ma1, where proteins were truncated due to frameshift mutations that resulted in early stop codons. These truncated proteins also lost most of their functional domains, suggesting that these indels likely result in loss of function. Here, we show that the Bayesian GPWAS model is able to identify loss-of-function alleles that can have significant effects upon protein structure and folding as well as multimer formation. Our approach to characterize loss-of-function mutations and their functional repercussions will facilitate precision genomics and breeding by identifying key targets for gene editing and trait integration.
Affiliation(s)
- J. Lucas Boatwright
- Department of Plant and Environmental Sciences, Clemson University, Clemson, SC, United States
- Advanced Plant Technology, Clemson University, Clemson, SC, United States
- *Correspondence: J. Lucas Boatwright,
| | - Sirjan Sapkota
- Advanced Plant Technology, Clemson University, Clemson, SC, United States
| | - Stephen Kresovich
- Department of Plant and Environmental Sciences, Clemson University, Clemson, SC, United States
- Advanced Plant Technology, Clemson University, Clemson, SC, United States
- Feed the Future Innovation Lab for Crop Improvement, Cornell University, Ithaca, NY, United States
|
18
|
Zhang Z, Jung J, Kim A, Suboc N, Gazal S, Mancuso N. A scalable variational approach to characterize pleiotropic components across thousands of human diseases and complex traits using GWAS summary statistics. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.03.27.23287801. [PMID: 37034739 PMCID: PMC10081403 DOI: 10.1101/2023.03.27.23287801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/29/2023]
Abstract
Genome-wide association studies (GWAS) across thousands of traits have revealed the pervasive pleiotropy of trait-associated genetic variants. While methods have been proposed to characterize pleiotropic components across groups of phenotypes, scaling these approaches to ultra large-scale biobanks has been challenging. Here, we propose FactorGo, a scalable variational factor analysis model to identify and characterize pleiotropic components using biobank GWAS summary data. In extensive simulations, we observe that FactorGo outperforms the state-of-the-art (model-free) approach tSVD in capturing latent pleiotropic factors across phenotypes, while maintaining a similar computational cost. We apply FactorGo to estimate 100 latent pleiotropic factors from GWAS summary data of 2,483 phenotypes measured in European-ancestry Pan-UK BioBank individuals (N=420,531). Next, we find that factors from FactorGo are more enriched with relevant tissue-specific annotations than those identified by tSVD (P=2.58E-10), and validate our approach by recapitulating brain-specific enrichment for BMI and the height-related connection between reproductive system and muscular-skeletal growth. Finally, our analyses suggest novel shared etiologies between rheumatoid arthritis and periodontal condition, in addition to alkaline phosphatase as a candidate prognostic biomarker for prostate cancer. Overall, FactorGo improves our biological understanding of shared etiologies across thousands of GWAS.
Affiliation(s)
- Zixuan Zhang
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Junghyun Jung
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Artem Kim
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Noah Suboc
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
| | - Steven Gazal
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Department of Quantitative and Computational Biology, University of Southern California
- Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California
| | - Nicholas Mancuso
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California
- Department of Quantitative and Computational Biology, University of Southern California
- Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California
|
19
|
Chen M, Dahl A. A robust model for cell type-specific interindividual variation in single-cell RNA sequencing data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.24.529987. [PMID: 36909553 PMCID: PMC10002707 DOI: 10.1101/2023.02.24.529987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/03/2023]
Abstract
The development of single-cell RNA sequencing (scRNA-seq) offers opportunities to characterize cellular heterogeneity at unprecedented resolution. Although scRNA-seq has been widely used to identify and characterize gene expression variation across cell types and cell states based on their average gene expression profiles, most studies ignore variation across individual donors. Modelling this inter-individual variation could improve statistical power to detect cell type-specific biology and inform the genes and cell types that underlie complex traits. We therefore develop a new model to detect and quantify cell type-specific variation across individuals called CTMM (Cell Type-specific linear Mixed Model). CTMM operates on cell type-specific pseudobulk expression and is fit with efficient methods that scale to hundreds of samples. We use extensive simulations to show that CTMM is powerful and unbiased in realistic settings. We also derive calibrated tests for cell type-specific interindividual variation, which is challenging given the modest sample sizes in scRNA-seq data. We apply CTMM to scRNA-seq data from human induced pluripotent stem cells to characterize the transcriptomic variation across donors as cells differentiate into endoderm. We find that almost 100% of transcriptome-wide variability between donors is differentiation stage-specific. CTMM also identifies individual genes with statistically significant stage-specific variability across samples, including 61 genes that do not have significant stage-specific mean expression. Finally, we extend CTMM to partition interindividual covariance between stages, which recapitulates the overall differentiation trajectory. Overall, CTMM is a powerful tool to characterize a novel dimension of cell type-specific biology in scRNA-seq.
Affiliation(s)
- Minhui Chen
- Section of Genetic Medicine, University of Chicago, Chicago, IL 60637
| | - Andy Dahl
- Section of Genetic Medicine, University of Chicago, Chicago, IL 60637
|
20
|
Wang S, Ge S, Sobkowiak B, Wang L, Grandjean L, Colijn C, Elliott LT. Genome-Wide Association with Uncertainty in the Genetic Similarity Matrix. J Comput Biol 2023; 30:189-203. [PMID: 36374242 DOI: 10.1089/cmb.2022.0067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Genome-wide association studies (GWASs) are often confounded by population stratification and structure. Linear mixed models (LMMs) are a powerful class of methods for uncovering genetic effects, while controlling for such confounding. LMMs include random effects for a genetic similarity matrix, and they assume that a true genetic similarity matrix is known. However, uncertainty about the phylogenetic structure of a study population may degrade the quality of LMM results. This may happen in bacterial studies in which the number of samples or loci is small, or in studies with low-quality genotyping. In this study, we develop methods for linear mixed models in which the genetic similarity matrix is unknown and is derived from Markov chain Monte Carlo estimates of the phylogeny. We apply our model to a GWAS of multidrug resistance in tuberculosis, and illustrate our methods on simulated data.
Affiliation(s)
- Shijia Wang
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
| | - Shufei Ge
- Institute of Mathematical Sciences, ShanghaiTech University, Shanghai, China
| | | | - Liangliang Wang
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
| | - Louis Grandjean
- Department of Infectious Diseases, University College London, London, United Kingdom
| | - Caroline Colijn
- Department of Mathematics, Simon Fraser University, Burnaby, Canada
| | - Lloyd T Elliott
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
|
21
|
Zlobin AS, Volkova NA, Zinovieva NA, Iolchiev BS, Bagirov VA, Borodin PM, Axenovich TI, Tsepilov YA. Loci Associated with Negative Heterosis for Viability and Meat Productivity in Interspecific Sheep Hybrids. Animals (Basel) 2023; 13:ani13010184. [PMID: 36611792 PMCID: PMC9817718 DOI: 10.3390/ani13010184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Revised: 12/15/2022] [Accepted: 12/19/2022] [Indexed: 01/05/2023] Open
Abstract
Negative heterosis can occur in various economically important traits, but the exact biological mechanisms of this phenomenon are still unknown. The present study focuses on determining the genetic factors associated with negative heterosis in interspecific hybrids between domestic sheep (Ovis aries) and argali (Ovis ammon). One locus (rs417431015) associated with viability and two loci (rs413302370, rs402808951) associated with meat productivity were identified. One gene (ARAP2) was prioritized for viability and three for meat productivity (PDE2A, ARAP1, and PCDH15). The loci associated with meat productivity were demonstrated to fit the overdominant inheritance model and could potentially be involved in negative heterosis mechanisms.
Affiliation(s)
- Alexander S. Zlobin
- Kurchatov Genomic Center, Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), 630090 Novosibirsk, Russia
| | - Natalia A. Volkova
- L.K. Ernst Federal Science Center for Animal Husbandry, 101000 Moscow, Russia
| | | | - Baylar S. Iolchiev
- L.K. Ernst Federal Science Center for Animal Husbandry, 101000 Moscow, Russia
| | - Vugar A. Bagirov
- L.K. Ernst Federal Science Center for Animal Husbandry, 101000 Moscow, Russia
| | - Pavel M. Borodin
- Institute of Cytology and Genetics, SB RAS, 630090 Novosibirsk, Russia
| | | | - Yakov A. Tsepilov
- Kurchatov Genomic Center, Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences (SB RAS), 630090 Novosibirsk, Russia
- Correspondence:
|
22
|
Clinical and genotypic analysis in determining dystonia non-motor phenotypic heterogeneity: a UK Biobank study. J Neurol 2022; 269:6436-6451. [PMID: 35925398 PMCID: PMC9618530 DOI: 10.1007/s00415-022-11307-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/28/2022] [Revised: 07/20/2022] [Accepted: 07/21/2022] [Indexed: 11/10/2022]
Abstract
The spectrum of non-motor symptoms in dystonia remains unclear. Using UK Biobank data, we analysed clinical phenotypic and genetic information in the largest dystonia cohort reported to date. Case-control comparison of dystonia and matched control cohort was undertaken to identify domains (psychiatric, pain, sleep and cognition) of increased symptom burden in dystonia. Whole exome data were used to determine the rate and likely pathogenicity of variants in Mendelian inherited dystonia causing genes and linked to clinical data. Within the dystonia cohort, phenotypic and genetic single-nucleotide polymorphism (SNP) data were combined in a mixed model analysis to derive genetically informed phenotypic axes. A total of 1572 individuals with dystonia were identified, including cervical dystonia (n = 775), blepharospasm (n = 131), tremor (n = 488) and dystonia, unspecified (n = 154) groups. Phenotypic patterns highlighted a predominance of psychiatric symptoms (anxiety and depression), excess pain and sleep disturbance. Cognitive impairment was limited to prospective memory and fluid intelligence. Whole exome sequencing identified 798 loss of function variants in dystonia-linked genes, 67 missense variants (MPC > 3) and 305 other forms of non-synonymous variants (including inframe deletion, inframe insertion, stop loss and start loss variants). A single loss of function variant (ANO3) was identified in the dystonia cohort. Combined SNP and clinical data identified multiple genetically informed phenotypic axes with predominance of psychiatric, pain and sleep non-motor domains. An excess of psychiatric, pain and sleep symptoms were evident across all forms of dystonia. Combination with genetic data highlights phenotypic subgroups consistent with the heterogeneity observed in clinical practice.
|
23
|
Sandor C, Millin S, Dahl A, Schalkamp AK, Lawton M, Hubbard L, Rahman N, Williams N, Ben-Shlomo Y, Grosset DG, Hu MT, Marchini J, Webber C. Universal clinical Parkinson's disease axes identify a major influence of neuroinflammation. Genome Med 2022; 14:129. [PMID: 36384636 PMCID: PMC9670420 DOI: 10.1186/s13073-022-01132-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/16/2022] [Accepted: 10/21/2022] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND There is large individual variation in both clinical presentation and progression between Parkinson's disease patients. Generation of deeply and longitudinally phenotyped patient cohorts has enormous potential to identify disease subtypes for prognosis and therapeutic targeting. METHODS Replicating across three large Parkinson's cohorts (Oxford Discovery cohort (n = 842)/Tracking UK Parkinson's study (n = 1807) and Parkinson's Progression Markers Initiative (n = 472)) with clinical observational measures collected longitudinally over 5-10 years, we developed a Bayesian multiple phenotypes mixed model incorporating genetic relationships between individuals able to explain many diverse clinical measurements as a smaller number of continuous underlying factors ("phenotypic axes"). RESULTS When applied to disease severity at diagnosis, the most influential of three phenotypic axes "Axis 1" was characterised by severe non-tremor motor phenotype, anxiety and depression at diagnosis, accompanied by faster progression in cognitive function measures. Axis 1 was associated with increased genetic risk of Alzheimer's disease and reduced CSF Aβ1-42 levels. As observed previously for Alzheimer's disease genetic risk, and in contrast to Parkinson's disease genetic risk, the loci influencing Axis 1 were associated with microglia-expressed genes implicating neuroinflammation. When applied to measures of disease progression for each individual, integration of Alzheimer's disease genetic loci haplotypes improved the accuracy of progression modelling, while integrating Parkinson's disease genetics did not. CONCLUSIONS We identify universal axes of Parkinson's disease phenotypic variation which reveal that Parkinson's patients with high concomitant genetic risk for Alzheimer's disease are more likely to present with severe motor and non-motor features at baseline and progress more rapidly to early dementia.
Affiliation(s)
- Cynthia Sandor
- UK Dementia Research Institute, Cardiff University, Cardiff, CF24 4HQ, UK.
- Stephanie Millin
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, OX1 3PT, UK
- Andrew Dahl
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- Michael Lawton
- School of Social and Community Medicine, University of Bristol, Bristol, BS8 1TH, UK
- Leon Hubbard
- MRC Centre for Neuropsychiatric Genetics and Genomics, Institute of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, CF24 4HQ, UK
- Nabila Rahman
- UK Dementia Research Institute, Cardiff University, Cardiff, CF24 4HQ, UK
- Nigel Williams
- MRC Centre for Neuropsychiatric Genetics and Genomics, Institute of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, CF24 4HQ, UK
- Yoav Ben-Shlomo
- School of Social and Community Medicine, University of Bristol, Bristol, BS8 1TH, UK
- Donald G Grosset
- Department of Neurology, Institute of Neurological Sciences, Queen Elizabeth University Hospital, G51 4LB, Glasgow, UK
- Michele T Hu
- Department of Physiology, Anatomy and Genetics, Le Gros Clark Building, Oxford Parkinson's Disease Centre, University of Oxford, Oxford, OX1 3PT, UK
- Nuffield Department of Clinical Neurosciences, Division of Clinical Neurology, University of Oxford, Oxford, OX3 7LF, UK
- Jonathan Marchini
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
- Department of Statistics, University of Oxford, Oxford, OX1, UK
- Regeneron Genetics Center, Tarrytown, NY, USA
- Caleb Webber
- UK Dementia Research Institute, Cardiff University, Cardiff, CF24 4HQ, UK.
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, OX1 3PT, UK.
|
24
|
Huang YH, Ku HM, Wang CA, Chen LY, He SS, Chen S, Liao PC, Juan PY, Kao CF. A multiple phenotype imputation method for genetic diversity and core collection in Taiwanese vegetable soybean. Front Plant Sci 2022; 13:948349. [PMID: 36119593 PMCID: PMC9480828 DOI: 10.3389/fpls.2022.948349] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/19/2022] [Accepted: 07/25/2022] [Indexed: 06/15/2023]
Abstract
Establishment of vegetable soybean (edamame) [Glycine max (L.) Merr.] germplasms has been highly valued in Asia and the United States owing to the increasing market demand for edamame. The idea of a core collection (CC) is to shorten the breeding program so as to improve the availability of germplasm resources. However, multidimensional phenotypes are typically highly correlated and have different rates of missingness, so they often fail to capture the underlying pattern of germplasms and to select the CC precisely. These problems are commonly observed in correlated samples. To address them, we introduced the "multiple imputation" (MI) method to iteratively impute missing phenotypes for 46 morphological traits and jointly analyzed the high-dimensional imputed phenotypes (EC_impu) to explore population structure and relatedness among 200 Taiwanese vegetable soybean accessions. An advanced maximization strategy with a heuristic algorithm and PowerCore was used to evaluate the morphological diversity among the EC_impu. In total, 36 accessions (denoted as CC_impu) were efficiently selected, representing high diversity and the entire coverage of the EC_impu. Only 4 (8.7%) traits showed slightly significant differences between the CC_impu and EC_impu. Compared to the EC_impu, 96% of traits retained all characteristics or had a slight diversity loss in the CC_impu. The CC_impu exhibited a small percentage of significant mean difference (4.51%), and a large coincidence rate (98.1%), variable rate (138.76%), and coverage (close to 100%), indicating the representativeness of the EC_impu. We noted that the CC_impu outperformed the CC_raw in evaluation properties, suggesting that the multiple phenotype imputation method has the potential to deal with missing phenotypes in correlated samples efficiently and reliably without re-phenotyping accessions. Our results illustrated a significant role of imputed missing phenotypes in support of the MI-based framework for plant-breeding programs.
Affiliation(s)
- Yen-Hsiang Huang
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Hsin-Mei Ku
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Chong-An Wang
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Ling-Yu Chen
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Shan-Syue He
- Department of Agronomy, College of Bioresources and Agriculture, National Taiwan University, Taipei, Taiwan
- Shu Chen
- Plant Germplasm Division, Taiwan Agricultural Research Institute, Taichung, Taiwan
- Po-Chun Liao
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Pin-Yuan Juan
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Chung-Feng Kao
- Department of Agronomy, College of Agriculture and Natural Resources, National Chung Hsing University, Taichung, Taiwan
- Advanced Plant Biotechnology Center, National Chung Hsing University, Taichung, Taiwan
|
25
|
Pedersen EM, Agerbo E, Plana-Ripoll O, Grove J, Dreier JW, Musliner KL, Bækvad-Hansen M, Athanasiadis G, Schork A, Bybjerg-Grauholm J, Hougaard DM, Werge T, Nordentoft M, Mors O, Dalsgaard S, Christensen J, Børglum AD, Mortensen PB, McGrath JJ, Privé F, Vilhjálmsson BJ. Accounting for age of onset and family history improves power in genome-wide association studies. Am J Hum Genet 2022; 109:417-432. [PMID: 35139346 PMCID: PMC8948165 DOI: 10.1016/j.ajhg.2022.01.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/16/2021] [Accepted: 01/07/2022] [Indexed: 11/01/2022] Open
Abstract
Genome-wide association studies (GWASs) have revolutionized human genetics, allowing researchers to identify thousands of disease-related genes and possible drug targets. However, case-control status does not account for the fact that not all controls may have lived through their period of risk for the disorder of interest. This can be quantified by examining the age-of-onset distribution and the age of the controls or the age of onset for cases. The age-of-onset distribution may also depend on information such as sex and birth year. In addition, family history is not routinely included in the assessment of control status. Here, we present LT-FH++, an extension of the liability threshold model conditioned on family history (LT-FH), which jointly accounts for age of onset and sex as well as family history. Using simulations, we show that, when family history and the age-of-onset distribution are available, the proposed approach yields statistically significant power gains over LT-FH and large power gains over genome-wide association study by proxy (GWAX). We applied our method to four psychiatric disorders available in the iPSYCH data and to mortality in the UK Biobank and found 20 genome-wide significant associations with LT-FH++, compared to ten for LT-FH and eight for a standard case-control GWAS. As more genetic data with linked electronic health records become available to researchers, we expect methods that account for additional health information, such as LT-FH++, to become even more beneficial.
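To make the liability threshold idea in this abstract concrete, the following minimal sketch computes the expected liability of cases and of controls under the basic liability threshold model: liability is standard normal, and an individual is a case when liability exceeds a threshold fixed by the prevalence. It does not implement LT-FH or LT-FH++. The function name `expected_liabilities` and both prevalence values are illustrative assumptions, and the second call only gestures at why age matters, by comparing a control against a lower cumulative prevalence.

```python
from statistics import NormalDist


def expected_liabilities(prevalence: float) -> tuple[float, float]:
    """Mean liability of cases and of controls under a basic liability threshold model.

    Assumes liability ~ N(0, 1) and that an individual is a case when liability
    exceeds the threshold T chosen so that P(liability > T) = prevalence.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    density = std_normal.pdf(threshold)
    case_mean = density / prevalence              # E[liability | liability > T]
    control_mean = -density / (1.0 - prevalence)  # E[liability | liability <= T]
    return case_mean, control_mean


# Hypothetical values: 5% lifetime prevalence vs. 1% cumulative prevalence by a young age.
lifetime_case, lifetime_control = expected_liabilities(0.05)
young_case, young_control = expected_liabilities(0.01)
print(f"lifetime control mean liability: {lifetime_control:.3f}")
print(f"young control mean liability:    {young_control:.3f}")
```

In this sketch the young control's expected liability sits closer to zero than the lifetime control's, which is one way to see why treating all controls alike can discard information.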
|
26
|
1H-NMR metabolomics-based surrogates to impute common clinical risk factors and endpoints. EBioMedicine 2021; 75:103764. [PMID: 34942446 PMCID: PMC8703237 DOI: 10.1016/j.ebiom.2021.103764] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/23/2021] [Revised: 12/02/2021] [Accepted: 12/03/2021] [Indexed: 12/31/2022] Open
Abstract
Background Missing or incomplete phenotypic information can severely reduce statistical power in epidemiological studies. High-throughput quantification of small molecules in bio-samples, i.e. ‘metabolomics’, is steadily gaining popularity, as it is highly informative for various phenotypic characteristics. Here we aim to leverage metabolomics to impute missing data in clinical variables routinely assessed in large epidemiological and clinical studies. Methods To this end, we have employed ∼26,000 1H-NMR metabolomics samples from 28 Dutch cohorts collected within the BBMRI-NL consortium, to create 19 metabolomics-based predictors for clinical variables, including diabetes status (5-fold cross-validated AUC = 0·94) and lipid medication usage (5-fold cross-validated AUC = 0·90). Findings Subsequent application in independent cohorts confirmed that our metabolomics-based predictors can indeed be used to impute a wide array of missing clinical variables from a single metabolomics data resource. In addition, application highlighted the potential use of our predictors to explore the effects of totally unobserved confounders in omics association studies. Finally, we show that our predictors can be used to explore risk factor profiles contributing to mortality in older participants. Interpretation To conclude, we provide 1H-NMR metabolomics-based models to impute clinical variables routinely assessed in epidemiological studies and illustrate their merit in scenarios when phenotypic variables are partially incomplete or totally unobserved. Funding BBMRI-NL, X-omics, VOILA, Medical Delta and the Dutch Research Council (NWO-VENI).
|
27
|
Wadon ME, Bailey GA, Yilmaz Z, Hubbard E, AlSaeed M, Robinson A, McLauchlan D, Barbano RL, Marsh L, Factor SA, Fox SH, Adler CH, Rodriguez RL, Comella CL, Reich SG, Severt WL, Goetz CG, Perlmutter JS, Jinnah HA, Harding KE, Sandor C, Peall KJ. Non-motor phenotypic subgroups in adult-onset idiopathic, isolated, focal cervical dystonia. Brain Behav 2021; 11:e2292. [PMID: 34291595 PMCID: PMC8413761 DOI: 10.1002/brb3.2292] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/21/2021] [Revised: 06/15/2021] [Accepted: 07/04/2021] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Non-motor symptoms are well-established phenotypic components of adult-onset idiopathic, isolated, focal cervical dystonia (AOIFCD). However, improved understanding of their clinical heterogeneity is needed to better target therapeutic intervention. Here, we examine non-motor phenotypic features to identify possible AOIFCD subgroups. METHODS Participants diagnosed with AOIFCD were recruited via specialist neurology clinics (Dystonia Wales: n = 114, Dystonia Coalition: n = 183). Non-motor assessment included psychiatric symptoms, pain, sleep disturbance, and quality of life, assessed using self-completed questionnaires or face-to-face assessment. Both cohorts were analyzed independently using cluster analysis and Bayesian multiple-phenotype mixed model analysis to investigate the relationship between non-motor symptoms and determine evidence of phenotypic subgroups. RESULTS Independent cluster analysis of the two cohorts suggests two predominant phenotypic subgroups, one consisting of approximately a third of participants in both cohorts, experiencing increased levels of depression, anxiety, sleep impairment, and pain catastrophizing, as well as decreased quality of life. The Bayesian approach reinforced this: the primary axis, which explained the majority of the variance in each cohort, was associated with psychiatric symptomatology, and also with sleep impairment and pain catastrophizing in the Dystonia Wales cohort. CONCLUSIONS Non-motor symptoms accompanying AOIFCD parse into two predominant phenotypic sub-groups, with differences in psychiatric symptoms, pain catastrophizing, sleep quality, and quality of life. Improved understanding of these symptom groups will enable better targeted pathophysiological investigation and future therapeutic intervention.
Affiliation(s)
- Megan E Wadon
- Neuroscience and Mental Health Research Institute, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
- Grace A Bailey
- Neuroscience and Mental Health Research Institute, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
- Zehra Yilmaz
- Neuroscience and Mental Health Research Institute, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
- Institute of Neurology, University College London, Queen Square, London, WC1N 3BG, UK
- Emily Hubbard
- School of Medicine, Cardiff University, Heath Park Campus, Cardiff, CF14 4YS, UK
- Meshari AlSaeed
- School of Medicine, Cardiff University, Heath Park Campus, Cardiff, CF14 4YS, UK
- Division of Neurology, University of British Columbia, Wesbrook Mall, Vancouver, British Columbia, V6T 2B5, Canada
- Amy Robinson
- School of Medicine, Cardiff University, Heath Park Campus, Cardiff, CF14 4YS, UK
- Duncan McLauchlan
- Neuroscience and Mental Health Research Institute, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
- Richard L Barbano
- Department of Neurology, University of Rochester, Elmwood Avenue, Rochester, New York, NY 14642, USA
- Laura Marsh
- Menninger Department of Psychiatry, Baylor College of Medicine, Butler Boulevard, Houston, Texas, 77030, USA
- Stewart A Factor
- Departments of Neurology & Human Genetics, Emory University, Woodruff Circle, Atlanta, Georgia, 30322, USA
- Susan H Fox
- Edmond J Safra Program in Parkinson Disease, Movement Disorder Clinic, Toronto Western Hospital, Bathurst Street, Toronto, Ontario, M5T 2S8, Canada
- Department of Medicine, University of Toronto, Queen's Park Crescent West, Toronto, Ontario, M5S 3H2, Canada
- Charles H Adler
- The Parkinson's Disease and Movement Disorders Center, Mayo Clinic, Department of Neurology, East Shea Boulevard, Scottsdale, Arizona, 85259, USA
- Ramon L Rodriguez
- Department of Neurology, University of Florida, Newell Drive, Gainesville, Florida, 32611, USA
- Cynthia L Comella
- Department of Neurological Sciences, Rush University Medical Center, West Harrison Street, Chicago, Illinois, 60612, USA
- Stephen G Reich
- Department of Neurology, University of Maryland School of Medicine, South Paca Street, Baltimore, Maryland, 21201, USA
- William L Severt
- Beth Israel Medical Center, First Avenue, New York, New York, 10003, USA
- Christopher G Goetz
- Department of Neurological Sciences, Rush University Medical Center, West Harrison Street, Chicago, Illinois, 60612, USA
- Joel S Perlmutter
- Neurology, Radiology, Neuroscience, Physical Therapy and Occupational Therapy, Washington University School of Medicine, South Euclid Avenue, St. Louis, Missouri, 63110, USA
- Hyder A Jinnah
- Departments of Neurology & Human Genetics, Emory University, Woodruff Circle, Atlanta, Georgia, 30322, USA
- Katharine E Harding
- Department of Neurology, Aneurin Bevan University Health Board, Corporation Road, Newport, NP19 0BH, UK
- Cynthia Sandor
- UK Dementia Research Institute, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
- Kathryn J Peall
- Neuroscience and Mental Health Research Institute, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Maindy Road, Cardiff, CF24 4HQ, UK
|
28
|
Runcie DE, Qu J, Cheng H, Crawford L. MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol 2021; 22:213. [PMID: 34301310 PMCID: PMC8299638 DOI: 10.1186/s13059-021-02416-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/02/2020] [Accepted: 06/23/2021] [Indexed: 12/21/2022] Open
Abstract
Large-scale phenotype data can enhance the power of genomic prediction in plant and animal breeding, as well as human genetics. However, the statistical foundation of multi-trait genomic prediction is based on the multivariate linear mixed effect model, a tool notorious for its fragility when applied to more than a handful of traits. We present MegaLMM, a statistical framework and associated software package for mixed model analyses of a virtually unlimited number of traits. Using three examples with real plant data, we show that MegaLMM can leverage thousands of traits at once to significantly improve genetic value prediction accuracy.
Affiliation(s)
- Daniel E. Runcie
- Department of Plant Sciences, University of California Davis, Davis, CA USA
- Jiayi Qu
- Department of Plant Sciences, University of California Davis, Davis, CA USA
- Hao Cheng
- Department of Plant Sciences, University of California Davis, Davis, CA USA
|
29
|
Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, Benner C, O'Dushlaine C, Barber M, Boutkov B, Habegger L, Ferreira M, Baras A, Reid J, Abecasis G, Maxwell E, Marchini J. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet 2021; 53:1097-1103. [PMID: 34017140 DOI: 10.1038/s41588-021-00870-7] [Citation(s) in RCA: 379] [Impact Index Per Article: 126.3] [Received: 06/21/2020] [Accepted: 04/13/2021] [Indexed: 11/08/2022]
Abstract
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
Affiliation(s)
- Aris Baras
- Regeneron Genetics Center, Tarrytown, NY, USA
|
30
|
Abstract
Disease classification, or nosology, was historically driven by careful examination of clinical features of patients. As technologies to measure and understand human phenotypes advanced, so too did classifications of disease, and the advent of genetic data has led to a surge in genetic subtyping in the past decades. Although the fundamental process of refining disease definitions and subtypes is shared across diverse fields, each field is driven by its own goals and technological expertise, leading to inconsistent and conflicting definitions of disease subtypes. Here, we review several classical and recent subtypes and subtyping approaches and provide concrete definitions to delineate subtypes. In particular, we focus on subtypes with distinct causal disease biology, which are of primary interest to scientists, and subtypes with pragmatic medical benefits, which are of primary interest to physicians. We propose genetic heterogeneity as a gold standard for establishing biologically distinct subtypes of complex polygenic disease. We focus especially on methods to find and validate genetic subtypes, emphasizing common pitfalls and how to avoid them.
Affiliation(s)
- Andy Dahl
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois 60637, USA
- Department of Neurology, University of California, Los Angeles, California 90024, USA
- Department of Computational Medicine, University of California, Los Angeles, California 90095, USA
- Noah Zaitlen
- Department of Neurology, University of California, Los Angeles, California 90024, USA
- Department of Computational Medicine, University of California, Los Angeles, California 90095, USA
|
31
|
Fu L, Wang Y, Li T, Hu YQ. A Novel Approach Integrating Hierarchical Clustering and Weighted Combination for Association Study of Multiple Phenotypes and a Genetic Variant. Front Genet 2021; 12:654804. [PMID: 34220938 PMCID: PMC8249926 DOI: 10.3389/fgene.2021.654804] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/16/2021] [Accepted: 04/20/2021] [Indexed: 11/26/2022] Open
Abstract
As a pivotal research tool, the genome-wide association study has successfully identified numerous genetic variants underlying distinct diseases. However, these identified genetic variants explain only a small proportion of the phenotypic variation for certain diseases, suggesting that there are still more genetic signals to be detected. One reason may be that a one-phenotype, one-variant association study is not efficient at detecting variants with weak effects. It is increasingly recognized that joint analysis of multiple phenotypes may boost the statistical power to detect pathogenic variants with weak genetic effects on complex diseases, providing more clues to their underlying biological mechanisms. We therefore propose a Weighted Combination of multiple phenotypes following Hierarchical Clustering method (WCHC) for simultaneously analyzing multiple phenotypes in association studies. A series of simulations is conducted, and the results show that WCHC is either the most powerful method or comparable with the most powerful competitor in most of the simulation scenarios. Additionally, we evaluated the performance of WCHC in its application to the obesity-related phenotypes from Atherosclerosis Risk in Communities, and several associated variants are reported.
Affiliation(s)
- Liwan Fu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
- Center for Non-communicable Disease Management, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, China
- Yuquan Wang
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
- Tingting Li
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
- Yue-Qing Hu
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
- Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China
|
32
|
Xu H, Schwander K, Brown MR, Wang W, Waken RJ, Boerwinkle E, Cupples LA, de Las Fuentes L, van Heemst D, Osazuwa-Peters O, de Vries PS, van Dijk KW, Sung YJ, Zhang X, Morrison AC, Rao DC, Noordam R, Liu CT. Lifestyle Risk Score: handling missingness of individual lifestyle components in meta-analysis of gene-by-lifestyle interactions. Eur J Hum Genet 2021; 29:839-850. [PMID: 33500576 PMCID: PMC8110957 DOI: 10.1038/s41431-021-00808-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2020] [Revised: 11/30/2020] [Accepted: 01/05/2021] [Indexed: 01/29/2023] Open
Abstract
Recent studies consider lifestyle risk score (LRS), an aggregation of multiple lifestyle exposures, in identifying association of gene-lifestyle interaction with disease traits. However, not all cohorts have data on all lifestyle factors, leading to increased heterogeneity in the environmental exposure in collaborative meta-analyses. We compared and evaluated four approaches (Naïve, Safe, Complete and Moderator Approaches) to handle the missingness in LRS-stratified meta-analyses under various scenarios. Compared to "benchmark" results with all lifestyle factors available for all cohorts, the Complete Approach, which included only cohorts with all lifestyle components, was underpowered due to lower sample size, and the Naïve Approach, which utilized all available data and ignored the missingness, was slightly inflated. The Safe Approach, which used all data in LRS-exposed group and only included cohorts with all lifestyle factors available in the LRS-unexposed group, and the Moderator Approach, which handled missingness via moderator meta-regression, were both slightly conservative and yielded almost identical p values. We also evaluated the performance of the Safe Approach under different scenarios. We observed that the larger the proportion of cohorts without missingness included, the more accurate the results compared to "benchmark" results. In conclusion, we generally recommend the Safe Approach, a straightforward and non-inflated approach, to handle heterogeneity among cohorts in the LRS based genome-wide interaction meta-analyses.
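The power trade-off between the Complete and Naïve Approaches can be illustrated with a minimal fixed-effect, inverse-variance-weighted meta-analysis. This is a sketch under assumptions: the abstract does not say which pooling model the authors used, and the `Cohort` type, the `pool` function, and every effect size below are hypothetical. It shows only the sample-size side of the comparison (dropping cohorts widens the pooled standard error), not the inflation that the abstract attributes to ignoring missingness.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Cohort:
    beta: float            # estimated interaction effect in this cohort
    se: float              # standard error of beta
    has_all_factors: bool  # True if every lifestyle component was measured


def pool(cohorts: list[Cohort]) -> tuple[float, float]:
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    if not cohorts:
        raise ValueError("at least one cohort is required")
    weights = [1.0 / c.se ** 2 for c in cohorts]
    total = sum(weights)
    beta = sum(w * c.beta for w, c in zip(weights, cohorts)) / total
    return beta, 1.0 / sqrt(total)


# Hypothetical cohort-level results.
cohorts = [
    Cohort(beta=0.10, se=0.05, has_all_factors=True),
    Cohort(beta=0.12, se=0.04, has_all_factors=True),
    Cohort(beta=0.08, se=0.05, has_all_factors=False),
]

naive_beta, naive_se = pool(cohorts)  # use every cohort, ignore missingness
complete_beta, complete_se = pool([c for c in cohorts if c.has_all_factors])
print(f"naive:    beta={naive_beta:.4f}, se={naive_se:.4f}")
print(f"complete: beta={complete_beta:.4f}, se={complete_se:.4f}")
```

Restricting to complete cohorts always yields a pooled standard error at least as large as pooling everything, which matches the abstract's remark that the Complete Approach was underpowered.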
Affiliation(s)
- Hanfei Xu
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
- Karen Schwander
- Division of Statistical Genomics, Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- Michael R Brown
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, the University of Texas School of Public health, Houston, TX, USA
- Wenyi Wang
- Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- R J Waken
- Field and Environmental Data Science, Benson Hill Inc, St. Louis, MO, USA
- Eric Boerwinkle
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, the University of Texas School of Public health, Houston, TX, USA
- The Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- L Adrienne Cupples
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- NHLBI and Boston University Framingham Heart Study, Framingham, MA, USA
- Lisa de Las Fuentes
- Department of Medicine, Cardiovascular Division, Washington University School of Medicine, St. Louis, MO, USA
- Diana van Heemst
- Section of Gerontology and Geriatrics, Department of Internal Medicine, Leiden University Medical Center, Leiden, the Netherlands
- Paul S de Vries
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, the University of Texas School of Public health, Houston, TX, USA
- Ko Willems van Dijk
- Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands
- Division of Endocrinology, Department of Internal Medicine, Leiden University Medical Center, Leiden, the Netherlands
- Leiden Laboratory for Experimental Vascular Medicine, Leiden University Medical Center, Leiden, the Netherlands
- Yun Ju Sung
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
- Xiaoyu Zhang
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, the University of Texas School of Public health, Houston, TX, USA
- D C Rao
- Division of Biostatistics, Washington University School of Medicine, St. Louis, MO, USA
- Raymond Noordam
- Section of Gerontology and Geriatrics, Department of Internal Medicine, Leiden University Medical Center, Leiden, the Netherlands
- Ching-Ti Liu
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.
33
Wang S, Ge S, Colijn C, Biller P, Wang L, Elliott LT. Estimating Genetic Similarity Matrices Using Phylogenies. J Comput Biol 2021; 28:587-600. [PMID: 33926225 PMCID: PMC8219189 DOI: 10.1089/cmb.2020.0375] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
Genetic similarity is a measure of the genetic relatedness among individuals. The standard method for computing these matrices involves the inner product of observed genetic variants. Such an approach is inaccurate or impossible if genotypes are not available, or not densely sampled, or of poor quality (e.g., genetic analysis of extinct species). We provide a new method for computing genetic similarities among individuals using phylogenetic trees. Our method can supplement (or stand in for) computations based on genotypes. We provide simulations suggesting that the genetic similarity matrices computed from trees are consistent with those computed from genotypes. With our methods, quantitative analysis on genetic traits and analysis of heritability and coheritability can be conducted directly using genetic similarity matrices and so in the absence of genotype data, or under uncertainty in the phylogenetic tree. We use simulation studies to demonstrate the advantages of our method, and we provide applications to data.
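The abstract above refers to the standard way of computing a genetic similarity matrix, as an inner product of observed genetic variants. The following is a minimal sketch of that standard computation only, not of the phylogeny-based method the paper proposes. It assumes genotypes coded as allele counts (0, 1, 2) and per-variant standardization; the function name `genetic_similarity` and the choice to skip monomorphic variants are illustrative assumptions, not details taken from the paper.

```python
from math import sqrt


def genetic_similarity(genotypes):
    """Return an n x n similarity matrix from an n x m genotype table.

    genotypes: list of n rows (individuals), each a list of m allele
    counts coded 0, 1 or 2 (an assumed coding, for illustration).
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Standardize each variant (column) to mean 0 and variance 1.
    # Variants with no variation carry no information and are skipped.
    columns = []
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        if var == 0:
            continue
        sd = sqrt(var)
        columns.append([(g - mean) / sd for g in col])

    if not columns:
        raise ValueError("no polymorphic variants to compute similarity from")

    # Entry (a, b) is the inner product of the standardized genotype
    # vectors of individuals a and b, averaged over the variants used.
    used = len(columns)
    return [
        [sum(c[a] * c[b] for c in columns) / used for b in range(n)]
        for a in range(n)
    ]


if __name__ == "__main__":
    # Three individuals, three variants; the middle variant is monomorphic.
    K = genetic_similarity([[0, 1, 2], [0, 1, 2], [2, 1, 0]])
    for row in K:
        print([round(x, 3) for x in row])
```

With this standardization, the diagonal of the matrix averages to 1 across individuals, and individuals with identical genotypes are as similar to each other as each is to itself.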
Affiliation(s)
- Shijia Wang
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Shufei Ge
- Institute of Mathematical Sciences, ShanghaiTech University, Shanghai, China
- Caroline Colijn
- Department of Mathematics, Simon Fraser University, Burnaby, Canada
- Priscila Biller
- Department of Mathematics, Simon Fraser University, Burnaby, Canada
- Liangliang Wang
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
- Lloyd T Elliott
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
34
Liu F, Zhou Z, Cai M, Wen Y, Zhang J. AGNEP: An Agglomerative Nesting Clustering Algorithm for Phenotypic Dimension Reduction in Joint Analysis of Multiple Phenotypes. Front Genet 2021; 12:648831. [PMID: 33981331 PMCID: PMC8107386 DOI: 10.3389/fgene.2021.648831] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/02/2021] [Accepted: 04/01/2021] [Indexed: 11/17/2022]
Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex traits and diseases. Compared with analyzing a single phenotype at a time, the joint analysis of multiple phenotypes can improve statistical power by taking into account the information from the phenotypes. However, most established joint algorithms ignore the different levels of correlation among phenotypes and instead analyze all phenotypes simultaneously in a single genetic model. Thus, they may fail to capture the genetic structure of phenotypes and consequently lose statistical power. In this study, we develop a novel method, the agglomerative nesting clustering algorithm for phenotypic dimension reduction analysis (AGNEP), to jointly analyze multiple phenotypes for GWAS. First, AGNEP uses an agglomerative nesting clustering algorithm to group correlated phenotypes and then applies principal component analysis (PCA) to generate representative phenotypes for each group. Finally, multivariate analysis is employed to test associations between genetic variants and the representative phenotypes rather than all phenotypes. We perform three simulation experiments with various genetic structures and a real dataset analysis for 19 Arabidopsis phenotypes. Compared to established methods, AGNEP performs better in terms of statistical power, computing time, and the number of quantitative trait nucleotides (QTNs) detected. The analysis of the Arabidopsis real dataset further illustrates the efficiency of AGNEP for detecting QTNs, which are confirmed by The Arabidopsis Information Resource gene bank.
Affiliation(s)
- Fengrong Liu
- College of Science, Nanjing Agricultural University, Nanjing, China; School of Data Science, University of Science and Technology of China, Hefei, China
- Ziyang Zhou
- College of Science, Nanjing Agricultural University, Nanjing, China
- Mingzhi Cai
- College of Science, Nanjing Agricultural University, Nanjing, China
- Yangjun Wen
- College of Science, Nanjing Agricultural University, Nanjing, China
- Jin Zhang
- College of Science, Nanjing Agricultural University, Nanjing, China; Postdoctoral Research Station of Crop Science, Nanjing Agricultural University, Nanjing, China
35
Zhang J, Chen M, Wen Y, Zhang Y, Lu Y, Wang S, Chen J. A Fast Multi-Locus Ridge Regression Algorithm for High-Dimensional Genome-Wide Association Studies. Front Genet 2021; 12:649196. [PMID: 33854527 PMCID: PMC8041068 DOI: 10.3389/fgene.2021.649196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2021] [Accepted: 03/01/2021] [Indexed: 11/13/2022]
Abstract
The mixed linear model (MLM) has been widely used in genome-wide association studies (GWAS) to dissect quantitative traits in human, animal, and plant genetics. Most methodologies treat all single nucleotide polymorphism (SNP) effects as random effects under the MLM framework, which fails to detect the joint minor effect of multiple genetic markers on a trait. Therefore, polygenes with minor effects remain largely unexplored in today’s big data era. In this study, we developed a new algorithm under the MLM framework, called the fast multi-locus ridge regression (FastRR) algorithm. The FastRR algorithm first whitens the covariance matrix of the polygenic matrix K and environmental noise, then selects, from the large set of markers, potentially related SNPs that have a high correlation with the target trait, and finally analyzes this subset of variables using a multi-locus deshrinking ridge regression for true quantitative trait nucleotide (QTN) detection. Results from the analyses of both simulated and real data show that the FastRR algorithm is more powerful for both large and small QTN detection, more accurate in QTN effect estimation, and has more stable results under various polygenic backgrounds. Moreover, compared with existing methods, the FastRR algorithm has the advantage of high computing speed. In conclusion, the FastRR algorithm provides an alternative algorithm for multi-locus GWAS in high dimensional genomic datasets.
Affiliation(s)
- Jin Zhang
- College of Science, Nanjing Agricultural University, Nanjing, China; Postdoctoral Research Station of Crop Science, Nanjing Agricultural University, Nanjing, China
- Min Chen
- College of Science, Nanjing Agricultural University, Nanjing, China
- Yangjun Wen
- College of Science, Nanjing Agricultural University, Nanjing, China
- Yin Zhang
- College of Science, Nanjing Agricultural University, Nanjing, China
- Yunan Lu
- College of Science, Nanjing Agricultural University, Nanjing, China
- Shengmeng Wang
- College of Science, Nanjing Agricultural University, Nanjing, China
- Juncong Chen
- College of Finance, Nanjing Agricultural University, Nanjing, China
36
Coupled mixed model for joint genetic analysis of complex disorders with two independently collected data sets. BMC Bioinformatics 2021; 22:50. [PMID: 33546598 PMCID: PMC7866684 DOI: 10.1186/s12859-021-03959-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2020] [Accepted: 01/06/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: In the last decade, genome-wide association studies (GWASs) have contributed to decoding the human genome by uncovering many genetic variations associated with various diseases. Many follow-up investigations involve joint analysis of multiple independently generated GWAS data sets. While most of the computational approaches developed for joint analysis are based on summary statistics, joint analysis based on individual-level data with consideration of confounding factors remains a challenge. RESULTS: In this study, we propose a method, called Coupled Mixed Model (CMM), that enables a joint GWAS analysis on two independently collected sets of GWAS data with different phenotypes. The CMM method does not require the data sets to have the same phenotypes, as it aims to infer the unknown phenotypes using a set of multivariate sparse mixed models. Moreover, CMM addresses the confounding variables due to population stratification, family structures, and cryptic relatedness, as well as those arising during data collection, such as batch effects, that frequently appear in joint genetic studies. We evaluate the performance of CMM using simulation experiments. In real data analysis, we illustrate the utility of CMM by applying it to evaluate common genetic associations for Alzheimer's disease and substance use disorder using datasets independently collected for the two complex human disorders. Comparison of the results with those from previous experiments and analyses supports the utility of our method and provides new insights into the diseases. The software is available at https://github.com/HaohanWang/CMM.
37
Gunjača J, Carović-Stanko K, Lazarević B, Vidak M, Petek M, Liber Z, Šatović Z. Genome-Wide Association Studies of Mineral Content in Common Bean. Front Plant Sci 2021; 12:636484. [PMID: 33763096 PMCID: PMC7982862 DOI: 10.3389/fpls.2021.636484] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 12/01/2020] [Accepted: 02/09/2021] [Indexed: 05/15/2023]
Abstract
Micronutrient malnutrition is one of the main public health problems in many parts of the world. This problem draws attention to all valuable sources of micronutrients for the human diet, such as common bean (Phaseolus vulgaris L.). In this research, a panel of 174 accessions representing Croatian common bean landraces was phenotyped for seed content of eight nutrients (N, P, K, Ca, Mg, Fe, Zn, and Mn), and genotyped using 6,311 high-quality DArTseq-derived SNP markers. A genome-wide association study (GWAS) was then performed to identify new genetic sources for improving seed mineral content. Twenty-two quantitative trait nucleotides (QTNs) associated with seed nitrogen content were discovered on chromosomes Pv01, Pv02, Pv03, Pv05, Pv07, Pv08, and Pv10. Five QTNs were associated with seed phosphorus content, four on chromosome Pv07 and one on Pv08. A single significant QTN was found for seed calcium content on chromosome Pv09 and another for seed magnesium content on Pv08. Finally, two QTNs associated with seed zinc content were identified on Pv06, while no QTNs were found to be associated with seed potassium, iron, or manganese content. Our results demonstrate the utility of GWAS for understanding the genetic architecture of seed nutritional traits in common bean and have utility for future enrichment of seed with macro- and micronutrients through genomics-assisted breeding.
Affiliation(s)
- Jerko Gunjača
- Department of Plant Breeding, Genetics and Biometrics, Faculty of Agriculture, University of Zagreb, Zagreb, Croatia
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Klaudija Carović-Stanko
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Department of Seed Science and Technology, Faculty of Agriculture, University of Zagreb, Zagreb, Croatia
- *Correspondence: Klaudija Carović-Stanko
- Boris Lazarević
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Department of Plant Nutrition, Faculty of Agriculture, University of Zagreb, Zagreb, Croatia
- Monika Vidak
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Marko Petek
- Department of Plant Nutrition, Faculty of Agriculture, University of Zagreb, Zagreb, Croatia
- Zlatko Liber
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Department of Biology, Faculty of Science, University of Zagreb, Zagreb, Croatia
- Zlatko Šatović
- Centre of Excellence for Biodiversity and Molecular Plant Breeding (CoE CroP-BioDiv), Zagreb, Croatia
- Department of Seed Science and Technology, Faculty of Agriculture, University of Zagreb, Zagreb, Croatia
38
Scott MF, Ladejobi O, Amer S, Bentley AR, Biernaskie J, Boden SA, Clark M, Dell'Acqua M, Dixon LE, Filippi CV, Fradgley N, Gardner KA, Mackay IJ, O'Sullivan D, Percival-Alwyn L, Roorkiwal M, Singh RK, Thudi M, Varshney RK, Venturini L, Whan A, Cockram J, Mott R. Multi-parent populations in crops: a toolbox integrating genomics and genetic mapping with breeding. Heredity (Edinb) 2020; 125:396-416. [PMID: 32616877 PMCID: PMC7784848 DOI: 10.1038/s41437-020-0336-6] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 01/27/2020] [Revised: 06/16/2020] [Accepted: 06/16/2020] [Indexed: 11/21/2022]
Abstract
Crop populations derived from experimental crosses enable the genetic dissection of complex traits and support modern plant breeding. Among these, multi-parent populations now play a central role. By mixing and recombining the genomes of multiple founders, multi-parent populations combine many commonly sought beneficial properties of genetic mapping populations. For example, they have high power and resolution for mapping quantitative trait loci, high genetic diversity and minimal population structure. Many multi-parent populations have been constructed in crop species, and their inbred germplasm and associated phenotypic and genotypic data serve as enduring resources. Their utility has grown from being a tool for mapping quantitative trait loci to a means of providing germplasm for breeding programmes. Genomics approaches, including de novo genome assemblies and gene annotations for the population founders, have allowed the imputation of rich sequence information into the descendant population, expanding the breadth of research and breeding applications of multi-parent populations. Here, we report recent successes from multi-parent populations in crops. We also propose an ideal genotypic, phenotypic and germplasm 'package' that multi-parent populations should feature to optimise their use as powerful community resources for crop research, development and breeding.
Affiliation(s)
- Samer Amer
- University of Reading, Reading, RG6 6AH, UK
- Faculty of Agriculture, Alexandria University, Alexandria, 23714, Egypt
- Alison R Bentley
- The John Bingham Laboratory, NIAB, 93 Lawrence Weaver Road, Cambridge, CB3 0LE, UK
- Jay Biernaskie
- Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK
- Scott A Boden
- School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, 5064, Australia
- Laura E Dixon
- Faculty of Biological Sciences, University of Leeds, Leeds, LS2 9JT, UK
- Carla V Filippi
- Instituto de Agrobiotecnología y Biología Molecular (IABIMO), INTA-CONICET, Nicolas Repetto y Los Reseros s/n, 1686, Hurlingham, Buenos Aires, Argentina
- Nick Fradgley
- The John Bingham Laboratory, NIAB, 93 Lawrence Weaver Road, Cambridge, CB3 0LE, UK
- Keith A Gardner
- The John Bingham Laboratory, NIAB, 93 Lawrence Weaver Road, Cambridge, CB3 0LE, UK
- Ian J Mackay
- SRUC, West Mains Road, Kings Buildings, Edinburgh, EH9 3JG, UK
- Manish Roorkiwal
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Rakesh Kumar Singh
- International Center for Biosaline Agriculture, Academic City, Dubai, United Arab Emirates
- Mahendar Thudi
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Rajeev Kumar Varshney
- Center of Excellence in Genomics and Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Alex Whan
- CSIRO, GPO Box 1700, Canberra, ACT, 2601, Australia
- James Cockram
- The John Bingham Laboratory, NIAB, 93 Lawrence Weaver Road, Cambridge, CB3 0LE, UK
- Richard Mott
- UCL Genetics Institute, Gower Street, London, WC1E 6BT, UK
39
Rodriguez M, Scintu A, Posadinu CM, Xu Y, Nguyen CV, Sun H, Bitocchi E, Bellucci E, Papa R, Fei Z, Giovannoni JJ, Rau D, Attene G. GWAS Based on RNA-Seq SNPs and High-Throughput Phenotyping Combined with Climatic Data Highlights the Reservoir of Valuable Genetic Diversity in Regional Tomato Landraces. Genes (Basel) 2020; 11:E1387. [PMID: 33238469 PMCID: PMC7709041 DOI: 10.3390/genes11111387] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/15/2020] [Revised: 11/19/2020] [Accepted: 11/20/2020] [Indexed: 11/23/2022]
Abstract
Tomato (Solanum lycopersicum L.) is a widely used model plant species for dissecting the genomic bases of complex traits, and thus provides an optimal platform for modern "-omics" studies and genome-guided breeding. Genome-wide association studies (GWAS) have become a preferred approach for screening large diverse populations and many traits. Here, we present GWAS analysis of a collection of 115 landraces and 11 vintage and modern cultivars. A total of 26 conventional descriptors, 40 traits obtained by digital phenotyping, the fruit content of six carotenoids recorded at the early ripening (breaker) and red-ripe stages, and 21 climate-related variables were analyzed in the context of genetic diversity monitored in the 126 accessions. The data obtained from thorough phenotyping and the SNP diversity revealed by sequencing of ripe fruit transcripts of 120 of the tomato accessions were jointly analyzed to determine which genomic regions are implicated in the expressed phenotypic variation. This study reveals that the use of fruit RNA-Seq SNP diversity is effective not only for identification of genomic regions that underlie variation in fruit traits, but also of variation related to additional plant traits and adaptive responses to climate variation. These results allowed validation of our approach because different marker-trait associations mapped to chromosomal regions where other candidate genes for the same traits were previously reported. In addition, previously uncharacterized chromosomal regions were targeted as potentially involved in the expression of variable phenotypes, thus demonstrating that our tomato collection is a precious reservoir of diversity and an excellent tool for gene discovery.
Affiliation(s)
- Monica Rodriguez
- Dipartimento di Agraria, Università degli Studi di Sassari, 07100 Sassari, Italy
- Centro per la Conservazione e Valorizzazione della Biodiversità Vegetale—CBV, Università degli Studi di Sassari, 07041 Alghero, Italy
- Alessandro Scintu
- Dipartimento di Agraria, Università degli Studi di Sassari, 07100 Sassari, Italy
- Chiara M. Posadinu
- Dipartimento di Agraria, Università degli Studi di Sassari, 07100 Sassari, Italy
- Yimin Xu
- Boyce Thompson Institute for Plant Research and U.S. Department of Agriculture—Agriculture Research Service, Ithaca, New York, NY 14853, USA
- Cuong V. Nguyen
- Global Institute for Food Security, University of Saskatchewan, Saskatoon, SK S7N 0W9, Canada
- Honghe Sun
- Boyce Thompson Institute for Plant Research and U.S. Department of Agriculture—Agriculture Research Service, Ithaca, New York, NY 14853, USA
- Elena Bitocchi
- Dipartimento di Scienze Agrarie, Alimentari e Ambientali—D3A, Università Politecnica delle Marche, 60131 Ancona, Italy
- Elisa Bellucci
- Dipartimento di Scienze Agrarie, Alimentari e Ambientali—D3A, Università Politecnica delle Marche, 60131 Ancona, Italy
- Roberto Papa
- Dipartimento di Scienze Agrarie, Alimentari e Ambientali—D3A, Università Politecnica delle Marche, 60131 Ancona, Italy
- Zhangjun Fei
- Boyce Thompson Institute for Plant Research and U.S. Department of Agriculture—Agriculture Research Service, Ithaca, New York, NY 14853, USA
- James J. Giovannoni
- Boyce Thompson Institute for Plant Research and U.S. Department of Agriculture—Agriculture Research Service, Ithaca, New York, NY 14853, USA
- Domenico Rau
- Dipartimento di Agraria, Università degli Studi di Sassari, 07100 Sassari, Italy
- Giovanna Attene
- Dipartimento di Agraria, Università degli Studi di Sassari, 07100 Sassari, Italy
- Centro per la Conservazione e Valorizzazione della Biodiversità Vegetale—CBV, Università degli Studi di Sassari, 07041 Alghero, Italy
40
Liang Z, Qiu Y, Schnable JC. Genome-Phenome Wide Association in Maize and Arabidopsis Identifies a Common Molecular and Evolutionary Signature. Mol Plant 2020; 13:907-922. [PMID: 32171733 DOI: 10.1016/j.molp.2020.03.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/02/2019] [Revised: 01/20/2020] [Accepted: 03/08/2020] [Indexed: 06/10/2023]
Abstract
Linking natural genetic variation to trait variation can help determine the functional roles of different genes. Variations of one or several traits are often assessed separately. High-throughput phenotyping and data mining can capture dozens or hundreds of traits from the same individuals. Here, we test the association between markers within a gene and many traits simultaneously. This genome-phenome wide association study (GPWAS) is both a multi-marker and multi-trait test. Genes identified using GPWAS with 260 phenotypic traits in maize were enriched for genes independently linked to phenotypic variation. Traits associated with classical mutants were consistent with reported phenotypes for mutant alleles. Genes linked to phenomic variation in maize using GPWAS shared molecular, population genetic, and evolutionary features with classical mutants in maize. Genes linked to phenomic variation in Arabidopsis using GPWAS are significantly enriched in genes with known loss-of-function phenotypes. GPWAS may be an effective strategy to identify genes in which loss-of-function alleles produce mutant phenotypes. The shared signatures present in classical mutants and genes identified using GPWAS may be markers for genes with a role in specifying plant phenotypes generally or pleiotropy specifically.
Affiliation(s)
- Zhikai Liang
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, USA; Plant Science Innovation Center, University of Nebraska-Lincoln, Lincoln, NE, USA
- Yumou Qiu
- Department of Statistics, Iowa State University, Ames, IA, USA
- James C Schnable
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, USA; Plant Science Innovation Center, University of Nebraska-Lincoln, Lincoln, NE, USA.
41
Dutta D, Brummett CM, Moser SE, Fritsche LG, Tsodikov A, Lee S, Clauw DJ, Scott LJ. Heritability of the Fibromyalgia Phenotype Varies by Age. Arthritis Rheumatol 2020; 72:815-823. [PMID: 31736264 DOI: 10.1002/art.41171] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/04/2018] [Accepted: 11/14/2019] [Indexed: 12/14/2022]
Abstract
OBJECTIVE: Many studies suggest a strong familial component to fibromyalgia (FM). However, those studies have nearly all been confined to individuals with primary FM, i.e., FM without any other accompanying disorder. The current 2011 and 2016 criteria for diagnosing FM construct a score using a combination of the number of painful body sites and the severity of somatic symptoms (FM score). This study was undertaken to estimate the genetic heritability of the FM score across sex and age groups to identify subgroups of individuals with greater heritability, which may help in the design of future genetic studies. METHODS: We collected data on 26,749 individuals of European ancestry undergoing elective surgery at the University of Michigan (Michigan Genomics Initiative study). We estimated the single-nucleotide polymorphism-based heritability of FM score by age and sex categories using genome-wide association study data and a linear mixed-effects model. RESULTS: Overall, the FM score had an estimated heritability of 13.9% (SE 2.9%) (P = 1.6 × 10⁻⁷). Estimated FM score heritability was highest in individuals ≤50 years of age (23.5%; SE 7.9%) (P = 3.0 × 10⁻⁴) and lowest in individuals >60 years of age (7.5%; SE 8.1%) (P = 0.41). These patterns remained the same when we analyzed FM as a case-control phenotype. Even though women had an ~30% higher average FM score than men across age categories, FM score heritability did not differ significantly by sex. CONCLUSION: Younger individuals appear to have a much stronger genetic component to the FM score than older individuals. Older individuals may be more likely to have what was previously called "secondary FM." Regardless of the cause, these results have implications for future genetic studies of FM and associated conditions.
Affiliation(s)
- Diptavo Dutta
- Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
- Seunggeun Lee
- Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
- Laura J Scott
- University of Michigan School of Public Health, Ann Arbor
42
Elliott LT. Kinship Solutions for Partially Observed Multiphenotype Data. J Comput Biol 2020; 27:1461-1470. [PMID: 32159382 PMCID: PMC7482112 DOI: 10.1089/cmb.2019.0440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Current work for multivariate analysis of phenotypes in genome-wide association studies often requires that genetic similarity matrices be inverted or decomposed. This can be a computational bottleneck when many phenotypes are presented, each with a different missingness pattern. A usual method in this case is to perform decompositions on subsets of the kinship matrix for each phenotype, with each subset corresponding to the set of observed samples for that phenotype. We provide a new method for decomposing these kinship matrices that can reduce the computational complexity by an order of magnitude by propagating low-rank modifications along a tree spanning the phenotypes. We demonstrate that our method provides speed improvements of around 40% under reasonable conditions.
Affiliation(s)
- Lloyd T Elliott
- Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, Canada
43
Dahl A, Nguyen K, Cai N, Gandal MJ, Flint J, Zaitlen N. A Robust Method Uncovers Significant Context-Specific Heritability in Diverse Complex Traits. Am J Hum Genet 2020; 106:71-91. [PMID: 31901249 PMCID: PMC7042488 DOI: 10.1016/j.ajhg.2019.11.015] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 06/20/2019] [Accepted: 11/26/2019] [Indexed: 02/08/2023]
Abstract
Gene-environment interactions (GxE) can be fundamental in applications ranging from functional genomics to precision medicine and are a conjectured source of substantial heritability. However, unbiased methods to profile GxE genome-wide are nascent and, as we show, cannot accommodate general environment variables, modest sample sizes, heterogeneous noise, and binary traits. To address this gap, we propose a simple, unifying mixed model for gene-environment interaction (GxEMM). In simulations and theory, we show that GxEMM can dramatically improve estimates and eliminate false positives when the assumptions of existing methods fail. We apply GxEMM to a range of human and model organism datasets and find broad evidence of context-specific genetic effects, including GxSex, GxAdversity, and GxDisease interactions across thousands of clinical and molecular phenotypes. Overall, GxEMM is broadly applicable for testing and quantifying polygenic interactions, which can be useful for explaining heritability and invaluable for determining biologically relevant environments.
Affiliation(s)
- Andy Dahl
- Department of Neurology, University of California Los Angeles, Los Angeles, CA 90095, USA; Department of Medicine, University of California San Francisco, San Francisco, CA 94158, USA.
- Khiem Nguyen
- Department of Medicine, University of California San Francisco, San Francisco, CA 94158, USA
- Na Cai
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Michael J Gandal
- Department of Psychiatry, Semel Institute, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, Los Angeles, CA 90095, USA
- Noah Zaitlen
- Department of Neurology, University of California Los Angeles, Los Angeles, CA 90095, USA; Department of Medicine, University of California San Francisco, San Francisco, CA 94158, USA.
44
Runcie D, Cheng H. Pitfalls and Remedies for Cross Validation with Multi-trait Genomic Prediction Methods. G3 (Bethesda) 2019; 9:3727-3741. [PMID: 31511297 PMCID: PMC6829121 DOI: 10.1534/g3.119.400598] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 08/02/2019] [Accepted: 09/10/2019] [Indexed: 01/08/2023]
Abstract
Incorporating measurements on correlated traits into genomic prediction models can increase prediction accuracy and selection gain. However, multi-trait genomic prediction models are complex and prone to overfitting which may result in a loss of prediction accuracy relative to single-trait genomic prediction. Cross-validation is considered the gold standard method for selecting and tuning models for genomic prediction in both plant and animal breeding. When used appropriately, cross-validation gives an accurate estimate of the prediction accuracy of a genomic prediction model, and can effectively choose among disparate models based on their expected performance in real data. However, we show that a naive cross-validation strategy applied to the multi-trait prediction problem can be severely biased and lead to sub-optimal choices between single and multi-trait models when secondary traits are used to aid in the prediction of focal traits and these secondary traits are measured on the individuals to be tested. We use simulations to demonstrate the extent of the problem and propose three partial solutions: 1) a parametric solution from selection index theory, 2) a semi-parametric method for correcting the cross-validation estimates of prediction accuracy, and 3) a fully non-parametric method which we call CV2*: validating model predictions against focal trait measurements from genetically related individuals. The current excitement over high-throughput phenotyping suggests that more comprehensive phenotype measurements will be useful for accelerating breeding programs. Using an appropriate cross-validation strategy should more reliably determine if and when combining information across multiple traits is useful.
Affiliation(s)
- Hao Cheng
- Department of Animal Science, University of California Davis, Davis, CA 95616
45
|
Turchin MC, Stephens M. Bayesian multivariate reanalysis of large genetic studies identifies many new associations. PLoS Genet 2019; 15:e1008431. [PMID: 31596850 PMCID: PMC6802844 DOI: 10.1371/journal.pgen.1008431] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/24/2019] [Revised: 10/21/2019] [Accepted: 09/17/2019] [Indexed: 01/08/2023]
Abstract
Genome-wide association studies (GWAS) have now been conducted for hundreds of phenotypes of relevance to human health. Many such GWAS involve multiple closely-related phenotypes collected on the same samples. However, the vast majority of these GWAS have been analyzed using simple univariate analyses, which consider one phenotype at a time. This is despite the fact that, at least in simulation experiments, multivariate analyses have been shown to be more powerful at detecting associations. Here, we conduct multivariate association analyses on 13 different publicly-available GWAS datasets that involve multiple closely-related phenotypes. These data include large studies of anthropometric traits (GIANT), plasma lipid traits (GlobalLipids), and red blood cell traits (HaemgenRBC). Our analyses identify many new associations (433 in total across the 13 studies), many of which replicate when follow-up samples are available. Overall, our results demonstrate that multivariate analyses can help make more effective use of data from both existing and future GWAS.

Genome-wide association studies (GWAS) have become a common and powerful tool for identifying significant correlations between markers of genetic variation and physical traits of interest. Often these studies are conducted by comparing genetic variation against single traits one at a time ('univariate'); however, it has previously been shown that it is possible to increase the power to detect significant associations by comparing genetic variation against multiple traits simultaneously ('multivariate'). Despite this apparent increase in power, researchers still rarely conduct multivariate GWAS, even when studies have multiple traits readily available. Here, we reanalyze 13 previously published GWAS using a multivariate method and find >400 additional associations. Our method makes use of univariate GWAS summary statistics and is available as a software package, thus making it accessible to other researchers interested in conducting the same analyses. We also show, using studies that have multiple releases, that our new associations have high rates of replication. Overall, we argue that multivariate approaches in GWAS should no longer be overlooked and that there is often low-hanging fruit, in the form of new associations, to be gained by running these methods on data already collected.
Affiliation(s)
- Michael C. Turchin
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Matthew Stephens
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Department of Statistics, The University of Chicago, Chicago, Illinois, United States of America
46
|
Alliey-Rodriguez N, Grey TA, Shafee R, Asif H, Lutz O, Bolo NR, Padmanabhan J, Tandon N, Klinger M, Reis K, Spring J, Coppes L, Zeng V, Hegde RR, Hoang DT, Bannai D, Nawaz U, Henson P, Liu S, Gage D, McCarroll S, Bishop JR, Hill S, Reilly JL, Lencer R, Clementz BA, Buckley P, Glahn DC, Meda SA, Narayanan B, Pearlson G, Keshavan MS, Ivleva EI, Tamminga C, Sweeney JA, Curtis D, Badner JA, Keedy S, Rapoport J, Liu C, Gershon ES. NRXN1 is associated with enlargement of the temporal horns of the lateral ventricles in psychosis. Transl Psychiatry 2019; 9:230. [PMID: 31530798 PMCID: PMC6748921 DOI: 10.1038/s41398-019-0564-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 03/21/2019] [Revised: 07/11/2019] [Accepted: 07/30/2019] [Indexed: 12/19/2022]
Abstract
Schizophrenia, schizoaffective, and bipolar disorders share behavioral and phenomenological traits, intermediate phenotypes, and some associated genetic loci with pleiotropic effects. Volumetric abnormalities in brain structures are among the intermediate phenotypes consistently reported to be associated with these disorders. In order to examine the genetic underpinnings of these structural brain modifications, we performed genome-wide association analyses (GWAS) on 60 quantitative structural brain MRI phenotypes in a sample of 777 subjects (483 cases and 294 controls pooled together). Genotyping was performed with the Illumina PsychChip microarray, followed by imputation to the 1000 Genomes multiethnic reference panel. Enlargement of the Temporal Horns of Lateral Ventricles (THLV) is associated with an intronic SNP of the gene NRXN1 (rs12467877, P = 6.76E-10), which accounts for 4.5% of the variance in size. Enlarged THLV is associated with psychosis in this sample, and with reduction of the hippocampus and enlargement of the choroid plexus and caudate. Eight other suggestively significant associations (P < 5.5E-8) were identified with THLV and 5 other brain structures. Although rare deletions of NRXN1 have been previously associated with psychosis, this is the first report of a common SNP variant of NRXN1 associated with enlargement of the THLV in psychosis.
Affiliation(s)
- Ney Alliey-Rodriguez
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA.
- Tamar A. Grey
- Massachusetts Institute of Technology, Cambridge, USA
- Rebecca Shafee
- Harvard Medical School, Department of Genetics, Boston, USA; Stanley Center, Broad Institute of MIT and Harvard, Cambridge, USA
- Huma Asif
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Olivia Lutz
- Harvard Medical School, Department of Psychiatry, Boston, USA
- Nicolas R. Bolo
- Harvard Medical School, Department of Psychiatry, Boston, USA
- Jaya Padmanabhan
- Harvard Medical School, Department of Psychiatry, Boston, USA
- Neeraj Tandon
- Harvard Medical School, Department of Psychiatry, Boston, USA
- Madeline Klinger
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Katherine Reis
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Jonathan Spring
- University of Chicago Laboratory for Advanced Computing, Chicago, USA
- Lucas Coppes
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Victor Zeng
- Harvard University, Cambridge, USA
- Rachal R. Hegde
- Boston University, Boston, USA
- Dung T. Hoang
- Harvard University, Cambridge, USA
- Deepthi Bannai
- Boston University, Boston, USA
- Uzma Nawaz
- Boston University, Boston, USA
- Philip Henson
- Harvard University, Cambridge, USA
- Siyuan Liu
- Child Psychiatry Branch, National Institutes of Mental Health, National Institutes of Health, Bethesda, MD, USA
- Diane Gage
- Broad Institute of MIT and Harvard, Cambridge, USA
- Jeffrey R. Bishop
- University of Minnesota, Department of Experimental and Clinical Pharmacology and Department of Psychiatry, Minneapolis, USA
- Scot Hill
- Rosalind Franklin University, North Chicago, USA
- James L. Reilly
- Northwestern University, Evanston, USA
- Rebekka Lencer
- University of Muenster, Munster, Germany
- Brett A. Clementz
- Department of Psychology, University of Georgia, Athens, Georgia
- Peter Buckley
- Virginia Commonwealth University, Richmond, USA
- David C. Glahn
- Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Shashwath A. Meda
- Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Balaji Narayanan
- Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Godfrey Pearlson
- Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Matcheri S. Keshavan
- Harvard Medical School, Department of Psychiatry, Boston, USA
- Elena I. Ivleva
- University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- Carol Tamminga
- University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- John A. Sweeney
- University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- David Curtis
- University College London and Centre for Psychiatry, Barts and the London School of Medicine and Dentistry, London, UK
- Judith A. Badner
- Rush University Medical Center, Chicago, USA
- Sarah Keedy
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Judith Rapoport
- Child Psychiatry Branch, National Institutes of Mental Health, National Institutes of Health, Bethesda, MD, USA
- Chunyu Liu
- SUNY Upstate Medical University, Binghamton, USA
- Elliot S. Gershon
- University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA; University of Chicago, Department of Human Genetics, Chicago, USA
47
|
Dutta D, Gagliano Taliun SA, Weinstock JS, Zawistowski M, Sidore C, Fritsche LG, Cucca F, Schlessinger D, Abecasis GR, Brummett CM, Lee S. Meta-MultiSKAT: Multiple phenotype meta-analysis for region-based association test. Genet Epidemiol 2019; 43:800-814. [PMID: 31433078 DOI: 10.1002/gepi.22248] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/27/2019] [Accepted: 06/13/2019] [Indexed: 12/17/2022]
Abstract
The power of genetic association analyses can be increased by jointly meta-analyzing multiple correlated phenotypes. Here, we develop a meta-analysis framework, Meta-MultiSKAT, that uses summary statistics to test for association between multiple continuous phenotypes and variants in a region of interest. Our approach models the heterogeneity of effects between studies through a kernel matrix and performs a variance component test for association. Using a genotype kernel, our approach can test for rare variants and the combined effects of both common and rare variants. To achieve robust power, within Meta-MultiSKAT, we developed fast and accurate omnibus tests combining different models of genetic effects, functional genomic annotations, multiple correlated phenotypes, and heterogeneity across studies. In addition, Meta-MultiSKAT accommodates situations where studies do not share exactly the same set of phenotypes or have differing correlation patterns among the phenotypes. Simulation studies confirm that Meta-MultiSKAT can maintain the type-I error rate at the exome-wide level of 2.5 × 10⁻⁶. Further simulations under different models of association show that Meta-MultiSKAT can improve the power of detection from 23% to 38% on average over single phenotype-based meta-analysis approaches. We demonstrate the utility and improved power of Meta-MultiSKAT in the meta-analyses of four white blood cell subtype traits from the Michigan Genomics Initiative (MGI) and SardiNIA studies.
Affiliation(s)
- Diptavo Dutta
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Sarah A Gagliano Taliun
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Joshua S Weinstock
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Matthew Zawistowski
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Carlo Sidore
- Istituto di Ricerca Genetica e Biomedica, Consiglio Nazionale delle Ricerche (CNR), Monserrato, Cagliari, Italy
- Lars G Fritsche
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Francesco Cucca
- Istituto di Ricerca Genetica e Biomedica, Consiglio Nazionale delle Ricerche (CNR), Monserrato, Cagliari, Italy; Dipartimento di Scienze Biomediche, Università degli Studi di Sassari, Sassari, Italy
- David Schlessinger
- Laboratory of Genetics, National Institute on Aging, US National Institutes of Health, Baltimore, Maryland
- Gonçalo R Abecasis
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Chad M Brummett
- Division of Pain Medicine, Department of Anesthesiology, University of Michigan Medical School, Ann Arbor, Michigan; Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor, Michigan
- Seunggeun Lee
- Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan; Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, Michigan
48
|
Large-scale neuroanatomical study uncovers 198 gene associations in mouse brain morphogenesis. Nat Commun 2019; 10:3465. [PMID: 31371714 PMCID: PMC6671969 DOI: 10.1038/s41467-019-11431-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/13/2019] [Accepted: 07/13/2019] [Indexed: 01/03/2023]
Abstract
Brain morphogenesis is an important process contributing to higher-order cognition; however, our knowledge about its biological basis is largely incomplete. Here we analyze 118 neuroanatomical parameters in 1,566 mutant mouse lines and identify 198 genes whose disruptions yield NeuroAnatomical Phenotypes (NAPs), mostly affecting structures implicated in brain connectivity. Groups of functionally similar NAP genes participate in pathways involving the cytoskeleton, the cell cycle and the synapse, display distinct fetal and postnatal brain expression dynamics and, importantly, their disruption can yield convergent phenotypic patterns. 17% of human unique orthologues of mouse NAP genes are known loci for cognitive dysfunction. The remaining 83% constitute a vast pool of genes newly implicated in brain architecture, providing the largest study of mouse NAP genes and pathways. This offers a complementary resource to human genetic studies and predicts that many more genes could be involved in mammalian brain morphogenesis.
49
|
Dahl A, Cai N, Ko A, Laakso M, Pajukanta P, Flint J, Zaitlen N. Reverse GWAS: Using genetics to identify and model phenotypic subtypes. PLoS Genet 2019; 15:e1008009. [PMID: 30951530 PMCID: PMC6469799 DOI: 10.1371/journal.pgen.1008009] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 10/05/2018] [Revised: 04/17/2019] [Accepted: 02/07/2019] [Indexed: 12/16/2022]
Abstract
Recent and classical work has revealed biologically and medically significant subtypes in complex diseases and traits. However, relevant subtypes are often unknown, unmeasured, or actively debated, making automated statistical approaches to subtype definition valuable. We propose reverse GWAS (RGWAS) to identify and validate subtypes using genetics and multiple traits: while GWAS seeks the genetic basis of a given trait, RGWAS seeks to define trait subtypes with distinct genetic bases. Unlike existing approaches relying on off-the-shelf clustering methods, RGWAS uses a novel decomposition, MFMR, to model covariates, binary traits, and population structure. We use extensive simulations to show that modelling these features can be crucial for power and calibration. We validate RGWAS in practice by recovering a recently discovered stress subtype in major depression. We then show the utility of RGWAS by identifying three novel subtypes of metabolic traits. We biologically validate these metabolic subtypes with SNP-level tests and a novel polygenic test: the former recover known metabolic GxE SNPs; the latter suggests subtypes may explain substantial missing heritability. Crucially, statins, which are widely prescribed and theorized to increase diabetes risk, have opposing effects on blood glucose across metabolic subtypes, suggesting the subtypes have potential translational value.
Affiliation(s)
- Andy Dahl
- Department of Medicine, UCSF, San Francisco, California, United States of America
- Na Cai
- Wellcome Sanger Institute, Cambridge, United Kingdom
- European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
- Arthur Ko
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, California, United States of America
- Markku Laakso
- Institute of Clinical Medicine, Internal Medicine, University of Eastern Finland, Kuopio, Finland
- Kuopio University Hospital, Kuopio, Finland
- Päivi Pajukanta
- Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, California, United States of America
- Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, UCLA, Los Angeles, California, United States of America
- Noah Zaitlen
- Department of Medicine, UCSF, San Francisco, California, United States of America
50
|
Onogi A. Comparison of F-tests for Univariate and Multivariate Mixed-Effect Models in Genome-Wide Association Mapping. Front Genet 2019; 10:30. [PMID: 30778369 PMCID: PMC6369166 DOI: 10.3389/fgene.2019.00030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/29/2018] [Accepted: 01/17/2019] [Indexed: 01/24/2023]
Abstract
Genome-wide association mapping (GWA) has been widely applied to a variety of species to identify genomic regions responsible for quantitative traits. The use of multivariate information could enhance the detection power of GWA. Although mixed-effect models are frequently used for GWA, the utility of F-tests for multivariate mixed-effect models is not well-recognized. Thus, we compared the F-tests for univariate and multivariate mixed-effect models with simulations. The superiority of the multivariate F-test over the univariate test varied depending on three parameters: phenotypic correlation between variates (r), relative size of quantitative trait locus effects between variates (ad), and missing proportion of phenotypic records (mprop). Simulation results showed that, when mprop was low, the multivariate F-test outperformed the univariate test as r and ad differed, and as mprop increased, the multivariate F-test outperformed the univariate test as ad increased. These observations were consistent with results of the analytical evaluation of the F-value. When mprop was at the maximum, i.e., when no individual had phenotypic values for multiple variates, as in the case of meta-analysis, the multivariate F-test gained more detection power as ad increased. Although using multivariate information in mixed-effect model contexts did not always ensure more detection power than univariate tests, the multivariate F-test is worth applying when multivariate data are available because it does not show inflation of signals and could lead to new findings.
Affiliation(s)
- Akio Onogi
- Institute of Crop Science, National Agriculture and Food Research Organization, Tsukuba, Japan; Japan Science and Technology Agency PRESTO, Kawaguchi, Japan