1. Zhang S, Zhou Y, Geng P, Lu Q. Functional Neural Networks for High-Dimensional Genetic Data Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:383-393. [PMID: 38507390 PMCID: PMC11301578 DOI: 10.1109/tcbb.2024.3364614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2024]
Abstract
Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as computer vision and speech recognition. Machine learning methods, such as artificial neural networks (ANN), play a central role in modern AI technology. While ANN also holds great promise for human genetic research, the high-dimensional genetic data and complex genetic structure bring tremendous challenges. The vast majority of genetic variants on the genome have small or no effects on diseases, and fitting ANN on a large number of variants without considering the underlying genetic structure (e.g., linkage disequilibrium) could bring a serious overfitting issue. Furthermore, while a single disease phenotype is often studied in a classic genetic study, in emerging research fields (e.g., imaging genetics), researchers need to deal with different types of disease phenotypes. To address these challenges, we propose a functional neural networks (FNN) method. FNN uses a series of basis functions to model high-dimensional genetic data and a variety of phenotype data and further builds a multi-layer functional neural network to capture the complex relationships between genetic variants and disease phenotypes. Through simulations, we demonstrate the advantages of FNN for high-dimensional genetic data analysis in terms of robustness and accuracy. The real data applications also showed that FNN attained higher accuracy than the existing methods.
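The basis-function idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' FNN implementation: the Fourier basis, the function names, and the toy positions and genotype counts are all assumptions made for illustration. The sketch summarizes one individual's genotype profile along a region by a handful of least-squares basis coefficients, which is the kind of dimension reduction that functional methods rely on.

```python
import math


def fourier_basis(t, n_basis):
    """Evaluate the first n_basis Fourier basis functions at t in [0, 1]."""
    values = [1.0]
    k = 1
    while len(values) < n_basis:
        values.append(math.sin(2 * math.pi * k * t))
        if len(values) < n_basis:
            values.append(math.cos(2 * math.pi * k * t))
        k += 1
    return values


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def basis_coefficients(positions, genotypes, n_basis=5):
    """Least-squares coefficients of a genotype profile on a Fourier basis."""
    lo, hi = min(positions), max(positions)
    # Rescale physical positions to [0, 1] before evaluating the basis.
    scaled = [(p - lo) / (hi - lo) for p in positions]
    design = [fourier_basis(t, n_basis) for t in scaled]
    # Normal equations: (B^T B) c = B^T g.
    btb = [[sum(row[i] * row[j] for row in design) for j in range(n_basis)]
           for i in range(n_basis)]
    btg = [sum(row[i] * g for row, g in zip(design, genotypes))
           for i in range(n_basis)]
    return solve_linear_system(btb, btg)


# Hypothetical data: 10 variant positions (bp) and minor-allele counts.
positions = [100, 250, 400, 900, 1300, 1500, 2100, 2600, 3000, 3400]
genotypes = [0, 1, 1, 2, 0, 0, 1, 2, 2, 1]
coefs = basis_coefficients(positions, genotypes, n_basis=5)
print(len(coefs))
```

In this sketch the ten genotype values are replaced by five coefficients; a downstream model would then work with the coefficients rather than the raw variants. The choice and number of basis functions are tuning decisions that the abstract does not specify.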
2. Denault WRP, Romanowska J, Haaland ØA, Lyle R, Taylor J, Xu Z, Lie RT, Gjessing HK, Jugessur A. Wavelet Screening identifies regions highly enriched for differentially methylated loci for orofacial clefts. NAR Genom Bioinform 2021; 3:lqab035. [PMID: 33987535 PMCID: PMC8092375 DOI: 10.1093/nargab/lqab035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2020] [Revised: 04/05/2021] [Accepted: 04/16/2021] [Indexed: 12/04/2022]
Abstract
DNA methylation is the most widely studied epigenetic mark in humans and plays an essential role in normal biological processes as well as in disease development. More focus has recently been placed on understanding functional aspects of methylation, prompting the development of methods to investigate the relationship between heterogeneity in methylation patterns and disease risk. However, most of these methods are limited in that they use simplified models that may rely on arbitrarily chosen parameters, they can only detect differentially methylated regions (DMRs) one at a time, or they are computationally intensive. To address these shortcomings, we present a wavelet-based method called 'Wavelet Screening' (WS) that can perform an epigenome-wide association study (EWAS) of thousands of individuals on a single CPU in only a matter of hours. By detecting multiple DMRs located near each other, WS identifies more complex patterns that can differentiate between different methylation profiles. We performed an extensive set of simulations to demonstrate the robustness and high power of WS, before applying it to a previously published EWAS dataset of orofacial clefts (OFCs). WS identified 82 associated regions containing several known genes and loci for OFCs, while other findings are novel and warrant replication in other OFCs cohorts.
Affiliation(s)
- William R P Denault
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, 0473, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
- Julia Romanowska
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
- Øystein A Haaland
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
- Robert Lyle
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
  - Department of Medical Genetics, Oslo University Hospital, 0450, Oslo, Norway
- Jack A Taylor
  - Epidemiology Branch and Epigenetics and Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences (NIH/NIEHS), 27709, Durham, North Carolina, USA
- Zongli Xu
  - Epidemiology Branch, National Institute of Environmental Health Sciences (NIH/NIEHS), 27709, Durham, North Carolina, USA
- Rolv T Lie
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
- Håkon K Gjessing
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
- Astanand Jugessur
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, 0473, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, 5006, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, 0473, Oslo, Norway
3. Denault WRP, Romanowska J, Helgeland Ø, Jacobsson B, Gjessing HK, Jugessur A. A fast wavelet-based functional association analysis replicates several susceptibility loci for birth weight in a Norwegian population. BMC Genomics 2021; 22:321. [PMID: 33932983 PMCID: PMC8088671 DOI: 10.1186/s12864-021-07582-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2020] [Accepted: 03/26/2021] [Indexed: 11/28/2022]
Abstract
Background: Birth weight (BW) is one of the most widely studied anthropometric traits in humans because of its role in various adult-onset diseases. The number of loci associated with BW has increased dramatically since the advent of whole-genome screening approaches such as genome-wide association studies (GWASes) and meta-analyses of GWASes (GWAMAs). To further contribute to elucidating the genetic architecture of BW, we analyzed a genotyped Norwegian dataset with information on child’s BW (N=9,063) using a slightly modified version of a wavelet-based method by Shim and Stephens (2015) called WaveQTL.
Results: WaveQTL uses wavelet regression for regional testing and offers a more flexible functional modeling framework compared to conventional GWAS methods. To further improve WaveQTL, we added a novel feature termed “zooming strategy” to enhance the detection of associations in typically small regions. The modified WaveQTL replicated five out of the 133 loci previously identified by the largest GWAMA of BW to date by Warrington et al. (2019), even though our sample size was 26 times smaller than that study and 18 times smaller than the second largest GWAMA of BW by Horikoshi et al. (2016). In addition, the modified WaveQTL performed better in regions of high LD between SNPs.
Conclusions: This study is the first adaptation of the original WaveQTL method to the analysis of genome-wide genotypic data. Our results highlight the utility of the modified WaveQTL as a complementary tool for identifying loci that might escape detection by conventional genome-wide screening methods due to power issues. An attractive application of the modified WaveQTL would be to select traits from various public GWAS repositories to investigate whether they might benefit from a second analysis.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07582-6.
Affiliation(s)
- William R P Denault
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, Oslo, Norway
- Julia Romanowska
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, Oslo, Norway
- Øyvind Helgeland
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Bergen, Norway
- Bo Jacobsson
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Obstetrics and Gynecology, Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Håkon K Gjessing
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, Oslo, Norway
- Astanand Jugessur
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
  - Centre for Fertility and Health (CeFH), Norwegian Institute of Public Health, Oslo, Norway
4. Zhang B, Chiu CY, Yuan F, Sang T, Cook RJ, Wilson AF, Bailey-Wilson JE, Chew EY, Xiong M, Fan R. Gene-based analysis of bi-variate survival traits via functional regressions with applications to eye diseases. Genet Epidemiol 2021; 45:455-470. [PMID: 33645812 DOI: 10.1002/gepi.22381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2020] [Revised: 01/15/2021] [Accepted: 02/08/2021] [Indexed: 11/12/2022]
Abstract
Genetic studies of two related survival outcomes of a pleiotropic gene are commonly encountered but statistical models to analyze them are rarely developed. To analyze sequencing data, we propose mixed effect Cox proportional hazard models by functional regressions to perform gene-based joint association analysis of two survival traits motivated by our ongoing real studies. These models extend fixed effect Cox models of univariate survival traits by incorporating variations and correlation of multivariate survival traits into the models. The associations between genetic variants and two survival traits are tested by likelihood ratio test statistics. Extensive simulation studies suggest that type I error rates are well controlled and power performances are stable. The proposed models are applied to analyze bivariate survival traits of left and right eyes in the age-related macular degeneration progression.
Affiliation(s)
- Bingsong Zhang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
- Chi-Yang Chiu
  - Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee, USA
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Fang Yuan
  - Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, People's Republic of China
- Tian Sang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
  - School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
- Richard J Cook
  - Department of Statistics and Actuarial Science, Waterloo, Ontario, Canada
- Alexander F Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Joan E Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
- Emily Y Chew
  - Division of Epidemiology and Clinical Applications, National Eye Institute, NIH, Bethesda, Maryland, USA
- Momiao Xiong
  - Human Genetics Center, University of Texas-Houston, Houston, Texas, USA
- Ruzong Fan
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
5. Denault WRP, Jugessur A. Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis. BMC Bioinformatics 2021; 22:61. [PMID: 33568045 PMCID: PMC7876806 DOI: 10.1186/s12859-021-03979-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/29/2020] [Accepted: 01/27/2021] [Indexed: 11/10/2022]
Abstract
Background: We present here a computational shortcut to improve a powerful wavelet-based method by Shim and Stephens (Ann Appl Stat 9(2):665–686, 2015. 10.1214/14-AOAS776) called WaveQTL that was originally designed to identify DNase I hypersensitivity quantitative trait loci (dsQTL).
Results: WaveQTL relies on permutations to evaluate the significance of an association. We applied a recent method by Zhou and Guan (J Am Stat Assoc 113(523):1362–1371, 2017. 10.1080/01621459.2017.1328361) to boost computational speed, which involves calculating the distribution of Bayes factors and estimating the significance of an association by simulations rather than permutations. We called this simulation-based approach “fast functional wavelet” (FFW), and tested it on a publicly available DNA methylation (DNAm) dataset on colorectal cancer. The simulations confirmed a substantial gain in computational speed compared to the permutation-based approach in WaveQTL. Furthermore, we show that FFW controls the type I error satisfactorily and has good power for detecting differentially methylated regions.
Conclusions: Our approach has broad utility and can be applied to detect associations between different types of functions and phenotypes. As more and more DNAm datasets are being made available through public repositories, an attractive application of FFW would be to re-analyze these data and identify associations that might have been missed by previous efforts. The full R package for FFW is freely available at GitHub https://github.com/william-denault/ffw.
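To make concrete why permutation-based significance testing is costly, here is a minimal, generic permutation test. It is only a sketch under stated assumptions: the test statistic is a simple absolute difference in group means, not the Bayes factor used by WaveQTL or FFW, and the function names and toy methylation values are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Absolute difference in mean value between label 1 and label 0."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Estimate a p-value by shuffling labels and recomputing the statistic."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # The statistic is recomputed on every permutation; this loop is
        # the part that dominates the running time.
        if mean_difference(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical methylation levels for 6 controls (0) and 6 cases (1).
values = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10,
          0.80, 0.82, 0.79, 0.81, 0.85, 0.83]
labels = [0] * 6 + [1] * 6
p = permutation_p_value(values, labels)
print(p)
```

The smallest p-value this procedure can report is 1 / (n_perm + 1), so resolving very small p-values requires very many permutations, each with a full recomputation of the statistic. That cost is what a simulation-based approximation of the null distribution aims to avoid.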
Affiliation(s)
- William R P Denault
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
- Astanand Jugessur
  - Department of Genetics and Bioinformatics, Norwegian Institute of Public Health, Oslo, Norway
  - Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
  - Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
6. Yanchenko AK, Hoff PD. Hierarchical multidimensional scaling for the comparison of musical performance styles. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1391] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
7. Li Y, Wang F, Wu M, Ma S. Integrative functional linear model for genome-wide association studies with multiple traits. Biostatistics 2020; 23:574-590. [PMID: 33040145 DOI: 10.1093/biostatistics/kxaa043] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/20/2019] [Revised: 06/30/2020] [Accepted: 09/12/2020] [Indexed: 11/14/2022]
Abstract
In recent biomedical research, genome-wide association studies (GWAS) have demonstrated great success in investigating the genetic architecture of human diseases. For many complex diseases, multiple correlated traits have been collected. However, most of the existing GWAS are still limited because they analyze each trait separately without considering their correlations and suffer from a lack of sufficient information. Moreover, the high dimensionality of single nucleotide polymorphism (SNP) data still poses tremendous challenges to statistical methods, in both theoretical and practical aspects. In this article, we innovatively propose an integrative functional linear model for GWAS with multiple traits. This study is the first to approximate SNPs as functional objects in a joint model of multiple traits with penalization techniques. It effectively accommodates the high dimensionality of SNPs and correlations among multiple traits to facilitate information borrowing. Our extensive simulation studies demonstrate the satisfactory performance of the proposed method in the identification and estimation of disease-associated genetic variants, compared to four alternatives. The analysis of type 2 diabetes data leads to biologically meaningful findings with good prediction accuracy and selection stability.
Affiliation(s)
- Yang Li
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Fan Wang
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven 06520, USA
8. Jiang Y, Chiu CY, Yan Q, Chen W, Gorin MB, Conley YP, Lakhal-Chaieb ML, Cook RJ, Amos CI, Wilson AF, Bailey-Wilson JE, McMahon FJ, Vazquez AI, Yuan A, Zhong X, Xiong M, Weeks DE, Fan R. Gene-Based Association Testing of Dichotomous Traits With Generalized Functional Linear Mixed Models Using Extended Pedigrees: Applications to Age-Related Macular Degeneration. J Am Stat Assoc 2020; 116:531-545. [PMID: 34321704 PMCID: PMC8315575 DOI: 10.1080/01621459.2020.1799809] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/03/2017] [Revised: 07/09/2020] [Accepted: 07/17/2020] [Indexed: 10/23/2022]
Abstract
Genetics plays a role in age-related macular degeneration (AMD), a common cause of blindness in the elderly. There is a need for powerful methods for carrying out region-based association tests between a dichotomous trait like AMD and genetic variants on family data. Here, we apply our new generalized functional linear mixed models (GFLMM) developed to test for gene-based association in a set of AMD families. Using common and rare variants, we observe significant association with two known AMD genes: CFH and ARMS2. Using rare variants, we find suggestive signals in four genes: ASAH1, CLEC6A, TMEM63C, and SGSM1. Intriguingly, ASAH1 is down-regulated in AMD aqueous humor, and ASAH1 deficiency leads to retinal inflammation and increased vulnerability to oxidative stress. These findings were made possible by our GFLMM which model the effect of a major gene as a fixed mean, the polygenic contributions as a random variation, and the correlation of pedigree members by kinship coefficients. Simulations indicate that the GFLMM likelihood ratio tests (LRTs) accurately control the Type I error rates. The LRTs have similar or higher power than existing retrospective kernel and burden statistics. Our GFLMM-based statistics provide a new tool for conducting family-based genetic studies of complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Affiliation(s)
- Yingda Jiang
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Chi-Yang Chiu
  - Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, TN
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
- Qi Yan
  - Division of Pulmonary Medicine, Allergy and Immunology, Children’s Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, PA
- Wei Chen
  - Division of Pulmonary Medicine, Allergy and Immunology, Children’s Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, PA
- Michael B. Gorin
  - Department of Ophthalmology, David Geffen School of Medicine, UCLA Stein Eye Institute, Los Angeles, CA
- Yvette P. Conley
  - Department of Health Promotion and Development, University of Pittsburgh, Pittsburgh, PA
  - Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Richard J. Cook
  - Department of Statistics and Actuarial Science, Waterloo, ON, Canada
- Alexander F. Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
- Joan E. Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
- Francis J. McMahon
  - Human Genetics Branch and Genetic Basis of Mood and Anxiety Disorders Section, National Institute of Mental Health, NIH, Bethesda, MD
- Ana I. Vazquez
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI
- Ao Yuan
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
- Xiaogang Zhong
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
- Momiao Xiong
  - Human Genetics Center, University of Texas, Houston, TX
- Daniel E. Weeks
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
  - Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Ruzong Fan
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
9. Chiu CY, Zhang B, Wang S, Shao J, Lakhal-Chaieb ML, Cook RJ, Wilson AF, Bailey-Wilson JE, Xiong M, Fan R. Gene-based association analysis of survival traits via functional regression-based mixed effect cox models for related samples. Genet Epidemiol 2019; 43:952-965. [PMID: 31502722 DOI: 10.1002/gepi.22254] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/14/2019] [Revised: 06/26/2019] [Accepted: 07/16/2019] [Indexed: 01/09/2023]
Abstract
The importance to integrate survival analysis into genetics and genomics is widely recognized, but only a small number of statisticians have produced relevant work toward this study direction. For unrelated population data, functional regression (FR) models have been developed to test for association between a quantitative/dichotomous/survival trait and genetic variants in a gene region. In major gene association analysis, these models have higher power than sequence kernel association tests. In this paper, we extend this approach to analyze censored traits for family data or related samples using FR based mixed effect Cox models (FamCoxME). The FamCoxME model effect of major gene as fixed mean via functional data analysis techniques, the local gene or polygene variations or both as random, and the correlation of pedigree members by kinship coefficients or genetic relationship matrix or both. The association between the censored trait and the major gene is tested by likelihood ratio tests (FamCoxME FR LRT). Simulation results indicate that the LRT control the type I error rates accurately/conservatively and have good power levels when both local gene or polygene variations are modeled. The proposed methods were applied to analyze a breast cancer data set from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA). The FamCoxME provides a new tool for gene-based analysis of family-based studies or related samples.
Affiliation(s)
- Chi-Yang Chiu
  - Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
- Bingsong Zhang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Shuqi Wang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Jingyi Shao
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Richard J Cook
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Alexander F Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Joan E Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
- Momiao Xiong
  - Department of Biostatistics, Human Genetics Center, University of Texas-Houston, Houston, Texas
- Ruzong Fan
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
10. Geng P, Tong X, Lu Q. An integrative U method for joint analysis of multi-level omic data. BMC Genet 2019; 20:40. [PMID: 30967125 PMCID: PMC6457037 DOI: 10.1186/s12863-019-0742-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/19/2018] [Accepted: 03/20/2019] [Indexed: 11/30/2022]
Abstract
Background: The advance of high-throughput technologies has made it cost-effective to collect diverse types of omic data in large-scale clinical and biological studies. While the collection of the vast amounts of multi-level omic data from these studies provides a great opportunity for genetic research, the high dimensionality of omic data and complex relationships among multi-level omic data bring tremendous analytic challenges.
Results: To address these challenges, we develop an integrative U (IU) method for the design and analysis of multi-level omic data. While non-parametric methods make less model assumptions and are flexible for analyzing different types of phenotypes and omic data, they have been less developed for association analysis of omic data. The IU method is a nonparametric method that can accommodate various types of omic and phenotype data, and consider interactive relationship among different levels of omic data. Through simulations and a real data application, we compare the IU test with commonly used variance component tests.
Conclusions: Results show that the proposed test attains more robust type I error performance and higher empirical power than variance component tests under various types of phenotypes and different underlying interaction effects.
Electronic supplementary material: The online version of this article (10.1186/s12863-019-0742-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Pei Geng
  - Department of Mathematics, Illinois State University, Normal, IL, 61761, USA
- Xiaoran Tong
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48824, USA
- Qing Lu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48824, USA
11. Chiu CY, Yuan F, Zhang BS, Yuan A, Li X, Fang HB, Lange K, Weeks DE, Wilson AF, Bailey-Wilson JE, Musolf AM, Stambolian D, Lakhal-Chaieb ML, Cook RJ, McMahon FJ, Amos CI, Xiong M, Fan R. Linear mixed models for association analysis of quantitative traits with next-generation sequencing data. Genet Epidemiol 2019; 43:189-206. [PMID: 30537345 PMCID: PMC6375753 DOI: 10.1002/gepi.22177] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/13/2018] [Revised: 08/27/2018] [Accepted: 09/26/2018] [Indexed: 01/01/2023]
Abstract
We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F-statistics and χ² likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F-distributed statistics provide a good control of the type I error rate. The F-test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F-statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error rate well at the nominal levels α = 0.01 and 0.05. For moderate/large samples, the LRT statistics of the FLMMs control the type I error rates well. The LRT statistics of the LMMs can lead to inflated type I error rates. The proposed models are useful in whole genome and whole exome association studies of complex traits.
Affiliation(s)
- Chi-Yang Chiu
  - Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Fang Yuan
  - Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
- Bing-Song Zhang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Ao Yuan
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Xin Li
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Hong-Bin Fang
  - Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
- Kenneth Lange
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California
- Daniel E Weeks
  - Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
  - Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Alexander F Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Joan E Bailey-Wilson
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Anthony M Musolf
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Dwight Stambolian
  - Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
- Richard J Cook
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
| | - Francis J McMahon
- Human Genetics Branch and Genetic Basis of Mood and Anxiety Disorders Section, National Institute of Mental Health, NIH, Bethesda, Maryland
| | | | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas
| | - Ruzong Fan
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
|
12
|
Ma L, Soriano J. Efficient Functional ANOVA Through Wavelet-Domain Markov Groves. J Am Stat Assoc 2018. [DOI: 10.1080/01621459.2017.1286241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Li Ma
- Department of Statistical Science, Duke University, Durham, NC
|
13
|
Abstract
While genome-wide association studies have been very successful in identifying associations of common genetic variants with many different traits, the rarer frequency spectrum of the genome has not yet been comprehensively explored. Technological developments increasingly lift restrictions to access rare genetic variation. Dense reference panels enable improved genotype imputation for rarer variants in studies using DNA microarrays. Moreover, the decreasing cost of next generation sequencing makes whole exome and genome sequencing increasingly affordable for large samples. Large-scale efforts based on sequencing, such as ExAC, 100,000 Genomes, and TopMed, are likely to significantly advance this field. The main challenge in evaluating complex trait associations of rare variants is statistical power. The choice of population should be considered carefully because allele frequencies and linkage disequilibrium structure differ between populations. Genetically isolated populations can have favorable genomic characteristics for the study of rare variants. One strategy to increase power is to assess the combined effect of multiple rare variants within a region, known as aggregate testing. A range of methods have been developed for this. Model performance depends on the genetic architecture of the region of interest.
Affiliation(s)
- Karoline Kuchenbaecker
- Wellcome Trust Sanger Institute, Cambridge, UK
- University College London, London, UK
| | - Emil Vincent Rosenbaum Appel
- Novo Nordisk Foundation Center for Basic Metabolic Research, Section for Metabolic Genetics, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
|
14
|
Jadhav S, Tong X, Lu Q. A functional U-statistic method for association analysis of sequencing data. Genet Epidemiol 2017; 41:636-643. [PMID: 28850771 DOI: 10.1002/gepi.22063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/09/2017] [Revised: 06/06/2017] [Accepted: 07/10/2017] [Indexed: 11/08/2022]
Abstract
Although sequencing studies hold great promise for uncovering novel variants predisposing to human diseases, the high dimensionality of the sequencing data brings tremendous challenges to data analysis. Moreover, for many complex diseases (e.g., psychiatric disorders) multiple related phenotypes are collected. These phenotypes can be different measurements of an underlying disease, or measurements characterizing multiple related diseases for studying common genetic mechanisms. Although jointly analyzing these phenotypes could potentially increase the power of identifying disease-associated genes, the different types of phenotypes pose challenges for association analysis. To address these challenges, we propose a nonparametric method, the functional U-statistic method (FU), for multivariate analysis of sequencing data. It first constructs smooth functions from individuals' sequencing data, and then tests the association of these functions with multiple phenotypes by using a U-statistic. The method provides a general framework for analyzing various types of phenotypes (e.g., binary and continuous phenotypes) with unknown distributions. Fitting the genetic variants within a gene using a smoothing function also allows us to capture complexities of gene structure (e.g., linkage disequilibrium, LD), which could potentially increase the power of association analysis. Through simulations, we compared our method to the multivariate outcome score test (MOST), and found that our test attained better performance than MOST. In a real data application, we applied our method to the sequencing data from the Minnesota Twin Study (MTS) and found potential associations of several nicotine receptor subunit (CHRN) genes, including CHRNB3, with nicotine dependence and/or alcohol dependence.
Affiliation(s)
- Sneha Jadhav
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
| | - Xiaoran Tong
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
| | - Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, United States of America
|
15
|
Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 2016; 25:350-359. [PMID: 28000696 DOI: 10.1038/ejhg.2016.170] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/11/2016] [Revised: 07/26/2016] [Accepted: 09/27/2016] [Indexed: 11/09/2022] Open
Abstract
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F-distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that, in general, multivariate analysis is more advantageous than univariate analysis, and that meta-analysis of multiple studies is more advantageous than analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen for at least two reasons: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data.
|
16
|
Chiu CY, Jung J, Wang Y, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Mills JL, Boehnke M, Xiong M, Fan R. A comparison study of multivariate fixed models and Gene Association with Multiple Traits (GAMuT) for next-generation sequencing. Genet Epidemiol 2016; 41:18-34. [PMID: 27917525 DOI: 10.1002/gepi.22014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/19/2016] [Revised: 09/01/2016] [Accepted: 09/19/2016] [Indexed: 01/23/2023]
Abstract
In this paper, extensive simulations are performed to compare two statistical methods to analyze multiple correlated quantitative phenotypes: (1) approximate F-distributed tests of multivariate functional linear models (MFLM) and additive models of multivariate analysis of variance (MANOVA), and (2) Gene Association with Multiple Traits (GAMuT) for association testing of high-dimensional genotype data. It is shown that approximate F-distributed tests of MFLM and MANOVA have higher power and are more appropriate for major gene association analysis (i.e., scenarios in which some genetic variants have relatively large effects on the phenotypes); GAMuT has higher power and is more appropriate for analyzing polygenic effects (i.e., effects from a large number of genetic variants each of which contributes a small amount to the phenotypes). MFLM and MANOVA are very flexible and can be used to perform association analysis for (i) rare variants, (ii) common variants, and (iii) a combination of rare and common variants. Although GAMuT was designed to analyze rare variants, it can be applied to analyze a combination of rare and common variants and it performs well when (1) the number of genetic variants is large and (2) each variant contributes a small amount to the phenotypes (i.e., polygenes). MFLM and MANOVA are fixed effect models that perform well for major gene association analysis. GAMuT can be viewed as an extension of sequence kernel association tests (SKAT). Both GAMuT and SKAT are more appropriate for analyzing polygenic effects and they perform well not only in the rare variant case, but also in the case of a combination of rare and common variants. Data analyses of European cohorts and the Trinity Students Study are presented to compare the performance of the two methods.
Affiliation(s)
- Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
| | - Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, NIH, Bethesda, MD, USA
| | - Yifan Wang
- Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD, USA
| | - Daniel E Weeks
- Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
| | - Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
| | - James L Mills
- Epidemiology Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
| | - Michael Boehnke
- Department of Biostatistics, School of Public Health, The University of Michigan, Ann Arbor, MI, USA
| | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, TX, USA
| | - Ruzong Fan
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
|
17
|
Svishcheva GR, Belonogova NM, Axenovich TI. Functional linear models for region-based association analysis. RUSS J GENET+ 2016. [DOI: 10.1134/s1022795416100124] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
|
18
|
Jadhav S, Vsevolozhskaya OA, Tong X, Lu Q. The impact of genetic structure on sequencing analysis. BMC Proc 2016; 10:171-174. [PMID: 27980631 PMCID: PMC5133514 DOI: 10.1186/s12919-016-0025-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
Background: Genome-wide association studies have made substantial progress in identifying common variants associated with human diseases. Despite such success, a large portion of heritability remains unexplained. Evolutionary theory and empirical studies suggest that rare mutations could play an important role in human diseases, which motivates comprehensive investigation of rare variants in sequencing studies. To explore the association of rare variants with human diseases, many statistical approaches have been developed with different ways of modeling genetic structure (ie, linkage disequilibrium). Nevertheless, the appropriate strategy to model genetic structure of sequencing data and its effect on association analysis have not been well studied. Methods: We investigate 3 statistical approaches that use 3 different strategies to model the genetic structure of sequencing data. We proceed by comparing a burden test that assumes independence among sequencing variants, a burden test that considers pairwise linkage disequilibrium (LD), and a functional analysis of variance (FANOVA) test that models genetic data through fitting continuous curves on individuals’ genotypes. Results: Through simulations, we find that FANOVA attains better or comparable performance to the 2 burden tests. Overall, the burden test that considers pairwise LD has comparable performance to the burden test that assumes independence between sequencing variants. However, for 1 gene, where the disease-associated variant is located in an LD block, we find that considering pairwise LD could improve the test’s performance. Conclusions: The structure of sequencing variants is complex in nature and its patterns vary across the whole genome. In certain cases (eg, a disease-susceptibility variant is in an LD block), ignoring the genetic structure in the association analysis could result in suboptimal performance. Through this study, we show that a functional-based method is promising for modeling the underlying genetic structure of sequencing data, which could lead to better performance.
|
19
|
Fan R, Chiu CY, Jung J, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Chen Z, Mills JL, Xiong M. A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits. Genet Epidemiol 2016; 40:702-721. [PMID: 27374056 DOI: 10.1002/gepi.21984] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/01/2015] [Revised: 03/08/2016] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
In association studies of complex traits, fixed-effect regression models are usually used to test for association between traits and major gene loci. In recent years, variance-component tests based on mixed models were developed for region-based genetic variant association tests. In the mixed models, the association is tested by a null hypothesis of zero variance via a sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C). Although there are some comparison studies to evaluate the performance of mixed and fixed models, there is no systematic analysis to determine when the mixed models perform better and when the fixed models perform better. Here we evaluated, based on extensive simulations, the performance of the fixed and mixed model statistics, using genetic variants located in 3, 6, 9, 12, and 15 kb simulated regions. We compared the performance of three models: (i) mixed models that lead to SKAT, SKAT-O, and SKAT-C, (ii) traditional fixed-effect additive models, and (iii) fixed-effect functional regression models. To evaluate the type I error rates of the tests of fixed models, we generated genotype data by two methods: (i) using all variants, (ii) using only rare variants. We found that the fixed-effect tests accurately control or have low false positive rates. We performed simulation analyses to compare power for two scenarios: (i) all causal variants are rare, (ii) some causal variants are rare and some are common. Either one or both of the fixed-effect models performed better than or similar to the mixed models except when (1) the region sizes are 12 and 15 kb and (2) effect sizes are small. Therefore, the assumption of mixed models could be satisfied and SKAT/SKAT-O/SKAT-C could perform better if the number of causal variants is large and each causal variant contributes a small amount to the traits (i.e., polygenes). 
In major gene association studies, we argue that the fixed-effect models perform better than or similarly to mixed models in most cases because some variants should have relatively large effects on the traits. In practice, it makes sense to perform analysis by both the fixed and mixed effect models and to make a comparison, and this can be readily done using our R code and the SKAT packages.
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Daniel E Weeks
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, United States of America
| | - Zhen Chen
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - James L Mills
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas, United States of America
|
20
|
Campos-Sánchez R, Cremona MA, Pini A, Chiaromonte F, Makova KD. Integration and Fixation Preferences of Human and Mouse Endogenous Retroviruses Uncovered with Functional Data Analysis. PLoS Comput Biol 2016; 12:e1004956. [PMID: 27309962 PMCID: PMC4911145 DOI: 10.1371/journal.pcbi.1004956] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 01/15/2016] [Accepted: 04/29/2016] [Indexed: 01/24/2023] Open
Abstract
Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs' integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.
Affiliation(s)
- Rebeca Campos-Sánchez
- Genetics Graduate Program, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
| | - Marzia A. Cremona
- MOX—Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
- Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
| | - Alessia Pini
- MOX—Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
| | - Francesca Chiaromonte
- Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
- Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
| | - Kateryna D. Makova
- Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Department of Biology, Penn State University, University Park, Pennsylvania, United States of America
|
21
|
Vsevolozhskaya OA, Zaykin DV, Barondess DA, Tong X, Jadhav S, Lu Q. Uncovering Local Trends in Genetic Effects of Multiple Phenotypes via Functional Linear Models. Genet Epidemiol 2016; 40:210-221. [PMID: 27027515 PMCID: PMC4817279 DOI: 10.1002/gepi.21955] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/07/2015] [Revised: 12/04/2015] [Accepted: 12/14/2015] [Indexed: 12/27/2022]
Abstract
Recent technological advances equipped researchers with capabilities that go beyond traditional genotyping of loci known to be polymorphic in a general population. Genetic sequences of study participants can now be assessed directly. This capability removed technology-driven bias toward scoring predominantly common polymorphisms and let researchers reveal a wealth of rare and sample-specific variants. Although the relative contributions of rare and common polymorphisms to trait variation are being debated, researchers are faced with the need for new statistical tools for simultaneous evaluation of all variants within a region. Several research groups demonstrated flexibility and good statistical power of the functional linear model approach. In this work we extend previous developments to allow inclusion of multiple traits and adjustment for additional covariates. Our functional approach is unique in that it provides a nuanced depiction of effects and interactions for the variables in the model by representing them as curves varying over a genetic region. We demonstrate flexibility and competitive power of our approach by contrasting its performance with commonly used statistical tools and illustrate its potential for discovery and characterization of genetic architecture of complex traits using sequencing data from the Dallas Heart Study.
Affiliation(s)
| | - Dmitri V. Zaykin
- National Institute of Environmental Health Sciences, National Institutes of Health, USA
| | - David A. Barondess
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, USA
| | - Xiaoran Tong
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, USA
| | - Sneha Jadhav
- Department of Statistics, Michigan State University, East Lansing, USA
| | - Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, USA
|
22
|
Fan R, Wang Y, Yan Q, Ding Y, Weeks DE, Lu Z, Ren H, Cook RJ, Xiong M, Swaroop A, Chew EY, Chen W. Gene-Based Association Analysis for Censored Traits Via Fixed Effect Functional Regressions. Genet Epidemiol 2016; 40:133-43. [PMID: 26782979 DOI: 10.1002/gepi.21947] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 07/13/2015] [Revised: 10/13/2015] [Accepted: 11/05/2015] [Indexed: 11/07/2022]
Abstract
Genetic studies of survival outcomes have been proposed and conducted recently, but statistical methods for identifying genetic variants that affect disease progression are rarely developed. Motivated by our ongoing real studies, here we develop Cox proportional hazard models using functional regression (FR) to perform gene-based association analysis of survival traits while adjusting for covariates. The proposed Cox models are fixed effect models where the genetic effects of multiple genetic variants are assumed to be fixed. We introduce likelihood ratio test (LRT) statistics to test for associations between the survival traits and multiple genetic variants in a genetic region. Extensive simulation studies demonstrate that the proposed Cox FR LRT statistics have well-controlled type I error rates. To evaluate power, we compare the Cox FR LRT with the previously developed burden test (BT) in a Cox model and sequence kernel association test (SKAT), which is based on mixed effect Cox models. The Cox FR LRT statistics have power higher than or similar to that of the Cox SKAT LRT except when 50%/50% of causal variants had negative/positive effects and all causal variants are rare. In addition, the Cox FR LRT statistics have higher power than Cox BT LRT. The models and related test statistics can be useful in whole genome and whole exome association studies. An age-related macular degeneration dataset was analyzed as an example.
Affiliation(s)
- Ruzong Fan
- Division of Intramural Population Health Research, Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Yifan Wang
- Division of Intramural Population Health Research, Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Qi Yan
- Division of Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Ying Ding
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Daniel E Weeks
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Zhaohui Lu
- Division of Intramural Population Health Research, Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Haobo Ren
- Regeneron Pharmaceuticals, Inc, Basking Ridge, New Jersey, United States of America
| | - Richard J Cook
- Department of Statistics and Actuarial Science, Waterloo, ON, Canada
| | - Momiao Xiong
- Human Genetics Center, University of Texas, Houston, Texas, United States of America
| | - Anand Swaroop
- Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, NIH, Bethesda, Maryland, United States of America
| | - Emily Y Chew
- Division of Epidemiology and Clinical Applications, National Eye Institute, NIH, Bethesda, Maryland, United States of America
| | - Wei Chen
- Division of Pulmonary Medicine, Allergy and Immunology, Children's Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
23
|
Meta-analysis of Complex Diseases at Gene Level with Generalized Functional Linear Models. Genetics 2015; 202:457-70. [PMID: 26715663 DOI: 10.1534/genetics.115.180869] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 07/18/2015] [Accepted: 12/09/2015] [Indexed: 11/18/2022] Open
Abstract
We developed generalized functional linear models (GFLMs) to perform a meta-analysis of multiple case-control studies to evaluate the relationship of genetic data to dichotomous traits adjusting for covariates. Unlike the previously developed meta-analysis for sequence kernel association tests (MetaSKATs), which are based on mixed-effect models to make the contributions of major gene loci random, GFLMs are fixed models; i.e., genetic effects of multiple genetic variants are fixed. Based on GFLMs, we developed chi-squared-distributed Rao's efficient score test and likelihood-ratio test (LRT) statistics to test for an association between a complex dichotomous trait and multiple genetic variants. We then performed extensive simulations to evaluate the empirical type I error rates and power performance of the proposed tests. The Rao's efficient score test statistics of GFLMs are very conservative and have higher power than MetaSKATs when some causal variants are rare and some are common. When the causal variants are all rare [i.e., minor allele frequencies (MAF) < 0.03], the Rao's efficient score test statistics have similar or slightly lower power than MetaSKATs. The LRT statistics generate accurate type I error rates for homogeneous genetic-effect models and may inflate type I error rates for heterogeneous genetic-effect models owing to the large numbers of degrees of freedom and have similar or slightly higher power than the Rao's efficient score test statistics. GFLMs were applied to analyze genetic data of 22 gene regions of type 2 diabetes data from a meta-analysis of eight European studies and detected significant association for 18 genes (P < 3.10 × 10⁻⁶), tentative association for 2 genes (HHEX and HMGA2; P ≈ 10⁻⁵), and no association for 2 genes, while MetaSKATs detected none. In addition, the traditional additive-effect model detects association at gene HHEX.
GFLMs and related tests can analyze rare or common variants or a combination of the two and can be useful in whole-genome and whole-exome association studies.
|
24
|
Svishcheva GR, Belonogova NM, Axenovich TI. Region-Based Association Test for Familial Data under Functional Linear Models. PLoS One 2015; 10:e0128999. [PMID: 26111046 PMCID: PMC4481467 DOI: 10.1371/journal.pone.0128999] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 10/27/2014] [Accepted: 05/04/2015] [Indexed: 12/22/2022] Open
Abstract
Region-based association analysis is a more powerful tool for gene mapping than testing of individual genetic variants, particularly for rare genetic variants. The most powerful methods for regional mapping are based on the functional data analysis approach, which assumes that the regional genome of an individual may be considered as a continuous stochastic function that contains information about both linkage and linkage disequilibrium. Here, we extend this powerful approach, earlier applied only to independent samples, to samples of related individuals. To this end, we additionally include random polygene effects in the functional linear model used for testing association between quantitative traits and multiple genetic variants in the region. We compare the statistical power of different methods using Genetic Analysis Workshop 17 mini-exome family data and a wide range of simulation scenarios. Our method increases the power of regional association analysis of quantitative traits compared with burden-based and kernel-based methods for the majority of the scenarios. In addition, we estimate the statistical power of our method using regions with a small number of genetic variants, and show that our method retains its advantage over burden-based and kernel-based methods in this case as well. The new method is implemented as the R-function 'famFLM' using two types of basis functions: the B-spline and Fourier bases. We compare the properties of the new method using models that differ from each other in the type of their function basis. The models based on the Fourier basis functions have an advantage in terms of speed and power over the models that use the B-spline basis functions and those that combine B-spline and Fourier basis functions. The 'famFLM' function is distributed under the GPLv3 license and is freely available at http://mga.bionet.nsc.ru/soft/famFLM/.
Affiliation(s)
- Gulnara R. Svishcheva
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
| | - Nadezhda M. Belonogova
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
| | - Tatiana I. Axenovich
- Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| |
|