1
|
Hou T, Shen X, Zhang S, Liang M, Chen L, Lu Q. AIGen: an artificial intelligence software for complex genetic data analysis. Brief Bioinform 2024; 25:bbae566. [PMID: 39550221 PMCID: PMC11568876 DOI: 10.1093/bib/bbae566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2024] [Revised: 09/12/2024] [Accepted: 11/11/2024] [Indexed: 11/18/2024] Open
Abstract
The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has rarely been used in genetic data analysis due to analytical and computational challenges brought by high-dimensional genetic data and an increasing number of samples. To facilitate the use of AI in genetic data analysis, we developed a C++ package, AIGen, based on two newly developed neural networks (i.e. kernel neural networks and functional neural networks) that are capable of modeling complex genotype-phenotype relationships (e.g. interactions) while providing robust performance against high-dimensional genetic data. Moreover, computationally efficient algorithms (e.g. a minimum norm quadratic unbiased estimation approach and batch training) are implemented in the package to accelerate the computation, making them computationally efficient for analyzing large-scale datasets with thousands or even millions of samples. By applying AIGen to the UK Biobank dataset, we demonstrate that it can efficiently analyze large-scale genetic data, attain improved accuracy, and maintain robust performance. Availability: AIGen is developed in C++ and its source code, along with reference libraries, is publicly accessible on GitHub at https://github.com/TingtHou/AIGen.
Collapse
Affiliation(s)
- Tingting Hou
- Department of Experimental Statistics, Louisiana State University, 45 Martin D. Woodin Hall, Baton Rouge, LA 70802, United States
| | - Xiaoxi Shen
- Department of Mathematics, Texas State University, 601 University Drive San Marcos, TX 78666, United States
| | - Shan Zhang
- Department of Biostatistics, University of Florida, 2004 Mowry Road, Gainesville, FL 32611, United States
| | - Muxuan Liang
- Department of Biostatistics, University of Florida, 2004 Mowry Road, Gainesville, FL 32611, United States
| | - Li Chen
- Department of Biostatistics, University of Florida, 2004 Mowry Road, Gainesville, FL 32611, United States
| | - Qing Lu
- Department of Biostatistics, University of Florida, 2004 Mowry Road, Gainesville, FL 32611, United States
| |
Collapse
|
2
|
Zhang S, Zhou Y, Geng P, Lu Q. Functional Neural Networks for High-Dimensional Genetic Data Analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:383-393. [PMID: 38507390 PMCID: PMC11301578 DOI: 10.1109/tcbb.2024.3364614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/22/2024]
Abstract
Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as computer vision and speech recognition. Machine learning methods, such as artificial neural networks (ANN), play a central role in modern AI technology. While ANN also holds great promise for human genetic research, the high-dimensional genetic data and complex genetic structure bring tremendous challenges. The vast majority of genetic variants on the genome have small or no effects on diseases, and fitting ANN on a large number of variants without considering the underlying genetic structure (e.g., linkage disequilibrium) could bring a serious overfitting issue. Furthermore, while a single disease phenotype is often studied in a classic genetic study, in emerging research fields (e.g., imaging genetics), researchers need to deal with different types of disease phenotypes. To address these challenges, we propose a functional neural networks (FNN) method. FNN uses a series of basis functions to model high-dimensional genetic data and a variety of phenotype data and further builds a multi-layer functional neural network to capture the complex relationships between genetic variants and disease phenotypes. Through simulations, we demonstrate the advantages of FNN for high-dimensional genetic data analysis in terms of robustness and accuracy. The real data applications also showed that FNN attained higher accuracy than the existing methods.
Collapse
|
3
|
Craig SJ, Kenney AM, Lin J, Paul IM, Birch LL, Savage JS, Marini ME, Chiaromonte F, Reimherr ML, Makova KD. Constructing a polygenic risk score for childhood obesity using functional data analysis. ECONOMETRICS AND STATISTICS 2023; 25:66-86. [PMID: 36620476 PMCID: PMC9813976 DOI: 10.1016/j.ecosta.2021.10.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
Obesity is a highly heritable condition that affects increasing numbers of adults and, concerningly, of children. However, only a small fraction of its heritability has been attributed to specific genetic variants. These variants are traditionally ascertained from genome-wide association studies (GWAS), which utilize samples with tens or hundreds of thousands of individuals for whom a single summary measurement (e.g., BMI) is collected. An alternative approach is to focus on a smaller, more deeply characterized sample in conjunction with advanced statistical models that leverage longitudinal phenotypes. Novel functional data analysis (FDA) techniques are used to capitalize on longitudinal growth information from a cohort of children between birth and three years of age. In an ultra-high dimensional setting, hundreds of thousands of single nucleotide polymorphisms (SNPs) are screened, and selected SNPs are used to construct two polygenic risk scores (PRS) for childhood obesity using a weighting approach that incorporates the dynamic and joint nature of SNP effects. These scores are significantly higher in children with (vs. without) rapid infant weight gain-a predictor of obesity later in life. Using two independent cohorts, it is shown that the genetic variants identified in very young children are also informative in older children and in adults, consistent with early childhood obesity being predictive of obesity later in life. In contrast, PRSs based on SNPs identified by adult obesity GWAS are not predictive of weight gain in the cohort of young children. This provides an example of a successful application of FDA to GWAS. This application is complemented with simulations establishing that a deeply characterized sample can be just as, if not more, effective than a comparable study with a cross-sectional response. Overall, it is demonstrated that a deep, statistically sophisticated characterization of a longitudinal phenotype can provide increased statistical power to studies with relatively small sample sizes; and shows how FDA approaches can be used as an alternative to the traditional GWAS.
Collapse
Affiliation(s)
- Sarah J.C. Craig
- Department of Biology, Penn State University, University Park
- Center for Medical Genomics, Penn State University, University Park, PA
| | - Ana M. Kenney
- Department of Statistics, Penn State University, University Park, PA
| | - Junli Lin
- Department of Statistics, Penn State University, University Park, PA
| | - Ian M. Paul
- Center for Medical Genomics, Penn State University, University Park, PA
- Department of Pediatrics, Penn State College of Medicine, Hershey, PA
| | - Leann L. Birch
- Department of Foods and Nutrition, University of Georgia, Athens, GA
| | - Jennifer S. Savage
- Department of Nutritional Sciences, Penn State University, University Park, PA
- Center for Childhood Obesity Research, Penn State University, University Park, PA
| | - Michele E. Marini
- Center for Childhood Obesity Research, Penn State University, University Park, PA
| | - Francesca Chiaromonte
- Center for Medical Genomics, Penn State University, University Park, PA
- Department of Statistics, Penn State University, University Park, PA
- EMbeDS, Sant’Anna School of Advanced Studies, Piazza Martiri della Libertà, Pisa, Italy
| | - Matthew L. Reimherr
- Center for Medical Genomics, Penn State University, University Park, PA
- Department of Statistics, Penn State University, University Park, PA
| | - Kateryna D. Makova
- Department of Biology, Penn State University, University Park
- Center for Medical Genomics, Penn State University, University Park, PA
| |
Collapse
|
4
|
Zhang B, Chiu CY, Yuan F, Sang T, Cook RJ, Wilson AF, Bailey-Wilson JE, Chew EY, Xiong M, Fan R. Gene-based analysis of bi-variate survival traits via functional regressions with applications to eye diseases. Genet Epidemiol 2021; 45:455-470. [PMID: 33645812 DOI: 10.1002/gepi.22381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2020] [Revised: 01/15/2021] [Accepted: 02/08/2021] [Indexed: 11/12/2022]
Abstract
Genetic studies of two related survival outcomes of a pleiotropic gene are commonly encountered but statistical models to analyze them are rarely developed. To analyze sequencing data, we propose mixed effect Cox proportional hazard models by functional regressions to perform gene-based joint association analysis of two survival traits motivated by our ongoing real studies. These models extend fixed effect Cox models of univariate survival traits by incorporating variations and correlation of multivariate survival traits into the models. The associations between genetic variants and two survival traits are tested by likelihood ratio test statistics. Extensive simulation studies suggest that type I error rates are well controlled and power performances are stable. The proposed models are applied to analyze bivariate survival traits of left and right eyes in the age-related macular degeneration progression.
Collapse
Affiliation(s)
- Bingsong Zhang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA
| | - Chi-Yang Chiu
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee, USA.,Computational and Statistical Genomics Branch, National Human Genome, Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
| | - Fang Yuan
- Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, People's Republic of China
| | - Tian Sang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA.,School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
| | - Richard J Cook
- Department of Statistics and Actuarial Science, Waterloo, Ontario, Canada
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome, Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome, Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
| | - Emily Y Chew
- Division of Epidemiology and Clinical Applications, National Eye Institute, NIH, Bethesda, Maryland, USA
| | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas, USA
| | - Ruzong Fan
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia, USA.,Computational and Statistical Genomics Branch, National Human Genome, Research Institute, National Institutes of Health (NIH), Baltimore, Maryland, USA
| |
Collapse
|
5
|
Jiang Y, Chiu CY, Yan Q, Chen W, Gorin MB, Conley YP, Lakhal-Chaieb ML, Cook RJ, Amos CI, Wilson AF, Bailey-Wilson JE, McMahon FJ, Vazquez AI, Yuan A, Zhong X, Xiong M, Weeks DE, Fan R. Gene-Based Association Testing of Dichotomous Traits With Generalized Functional Linear Mixed Models Using Extended Pedigrees: Applications to Age-Related Macular Degeneration. J Am Stat Assoc 2020; 116:531-545. [PMID: 34321704 PMCID: PMC8315575 DOI: 10.1080/01621459.2020.1799809] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2017] [Revised: 07/09/2020] [Accepted: 07/17/2020] [Indexed: 10/23/2022]
Abstract
Genetics plays a role in age-related macular degeneration (AMD), a common cause of blindness in the elderly. There is a need for powerful methods for carrying out region-based association tests between a dichotomous trait like AMD and genetic variants on family data. Here, we apply our new generalized functional linear mixed models (GFLMM) developed to test for gene-based association in a set of AMD families. Using common and rare variants, we observe significant association with two known AMD genes: CFH and ARMS2. Using rare variants, we find suggestive signals in four genes: ASAH1, CLEC6A, TMEM63C, and SGSM1. Intriguingly, ASAH1 is down-regulated in AMD aqueous humor, and ASAH1 deficiency leads to retinal inflammation and increased vulnerability to oxidative stress. These findings were made possible by our GFLMM which model the effect of a major gene as a fixed mean, the polygenic contributions as a random variation, and the correlation of pedigree members by kinship coefficients. Simulations indicate that the GFLMM likelihood ratio tests (LRTs) accurately control the Type I error rates. The LRTs have similar or higher power than existing retrospective kernel and burden statistics. Our GFLMM-based statistics provide a new tool for conducting family-based genetic studies of complex diseases. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Collapse
Affiliation(s)
- Yingda Jiang
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
| | - Chi-Yang Chiu
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, TN
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
| | - Qi Yan
- Division of Pulmonary Medicine, Allergy and Immunology, Children’s Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, PA
| | - Wei Chen
- Division of Pulmonary Medicine, Allergy and Immunology, Children’s Hospital of Pittsburgh at The University of Pittsburgh, Pittsburgh, PA
| | - Michael B. Gorin
- Department of Ophthalmology, David Geffen School of Medicine, UCLA Stein Eye Institute, Los Angeles, CA
| | - Yvette P. Conley
- Department of Health Promotion and Development, University of Pittsburgh, Pittsburgh, PA
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
| | | | - Richard J. Cook
- Department of Statistics and Actuarial Science, Waterloo, ON, Canada
| | | | - Alexander F. Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
| | - Joan E. Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
| | - Francis J. McMahon
- Human Genetics Branch and Genetic Basis of Mood and Anxiety Disorders Section, National Institute of Mental Health, NIH, Bethesda, MD
| | - Ana I. Vazquez
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI
| | - Ao Yuan
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
| | - Xiaogang Zhong
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
| | - Momiao Xiong
- Human Genetics Center, University of Texas, Houston, TX
| | - Daniel E. Weeks
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA
| | - Ruzong Fan
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Baltimore, MD
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
| |
Collapse
|
6
|
Chiu CY, Zhang B, Wang S, Shao J, Lakhal-Chaieb ML, Cook RJ, Wilson AF, Bailey-Wilson JE, Xiong M, Fan R. Gene-based association analysis of survival traits via functional regression-based mixed effect cox models for related samples. Genet Epidemiol 2019; 43:952-965. [PMID: 31502722 DOI: 10.1002/gepi.22254] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2019] [Revised: 06/26/2019] [Accepted: 07/16/2019] [Indexed: 01/09/2023]
Abstract
The importance to integrate survival analysis into genetics and genomics is widely recognized, but only a small number of statisticians have produced relevant work toward this study direction. For unrelated population data, functional regression (FR) models have been developed to test for association between a quantitative/dichotomous/survival trait and genetic variants in a gene region. In major gene association analysis, these models have higher power than sequence kernel association tests. In this paper, we extend this approach to analyze censored traits for family data or related samples using FR based mixed effect Cox models (FamCoxME). The FamCoxME model effect of major gene as fixed mean via functional data analysis techniques, the local gene or polygene variations or both as random, and the correlation of pedigree members by kinship coefficients or genetic relationship matrix or both. The association between the censored trait and the major gene is tested by likelihood ratio tests (FamCoxME FR LRT). Simulation results indicate that the LRT control the type I error rates accurately/conservatively and have good power levels when both local gene or polygene variations are modeled. The proposed methods were applied to analyze a breast cancer data set from the Consortium of Investigators of Modifiers of BRCA1 and BRCA2 (CIMBA). The FamCoxME provides a new tool for gene-based analysis of family-based studies or related samples.
Collapse
Affiliation(s)
- Chi-Yang Chiu
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
| | - Bingsong Zhang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Shuqi Wang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Jingyi Shao
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | | | - Richard J Cook
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
| | - Momiao Xiong
- Department of Biostatistics, Human Genetics Center, University of Texas-Houston, Houston, Texas
| | - Ruzong Fan
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| |
Collapse
|
7
|
Chiu CY, Yuan F, Zhang BS, Yuan A, Li X, Fang HB, Lange K, Weeks DE, Wilson AF, Bailey-Wilson JE, Musolf AM, Stambolian D, Lakhal-Chaieb ML, Cook RJ, McMahon FJ, Amos CI, Xiong M, Fan R. Linear mixed models for association analysis of quantitative traits with next-generation sequencing data. Genet Epidemiol 2019; 43:189-206. [PMID: 30537345 PMCID: PMC6375753 DOI: 10.1002/gepi.22177] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2018] [Revised: 08/27/2018] [Accepted: 09/26/2018] [Indexed: 01/01/2023]
Abstract
We develop linear mixed models (LMMs) and functional linear mixed models (FLMMs) for gene-based tests of association between a quantitative trait and genetic variants on pedigrees. The effects of a major gene are modeled as a fixed effect, the contributions of polygenes are modeled as a random effect, and the correlations of pedigree members are modeled via inbreeding/kinship coefficients. F -statistics and χ 2 likelihood ratio test (LRT) statistics based on the LMMs and FLMMs are constructed to test for association. We show empirically that the F -distributed statistics provide a good control of the type I error rate. The F -test statistics of the LMMs have similar or higher power than the FLMMs, kernel-based famSKAT (family-based sequence kernel association test), and burden test famBT (family-based burden test). The F -statistics of the FLMMs perform well when analyzing a combination of rare and common variants. For small samples, the LRT statistics of the FLMMs control the type I error rate well at the nominal levels α = 0.01 and 0.05 . For moderate/large samples, the LRT statistics of the FLMMs control the type I error rates well. The LRT statistics of the LMMs can lead to inflated type I error rates. The proposed models are useful in whole genome and whole exome association studies of complex traits.
Collapse
Affiliation(s)
- Chi-Yang Chiu
- Division of Biostatistics, Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, Tennessee
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
| | - Fang Yuan
- Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
| | - Bing-Song Zhang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Ao Yuan
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Xin Li
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Hong-Bin Fang
- Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, District of Columbia
| | - Kenneth Lange
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California
| | - Daniel E Weeks
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Department of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
| | - Anthony M Musolf
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
| | - Dwight Stambolian
- Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
| | | | - Richard J Cook
- Department of Statistics and Actuarial Science, Waterloo, Ontario, Quebec, Canada
| | - Francis J McMahon
- Human Genetics Branch and Genetic Basis of Mood and Anxiety Disorders Section, University of Waterloo, National Institute of Mental Health, NIH, Bethesda, Maryland
| | | | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas
| | - Ruzong Fan
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health (NIH), Bethesda, Maryland
- Department of Biochemistry and Molecular Biology, School of Basic Medicine, Kunming Medical University, Kunming, Yunnan, China
| |
Collapse
|
8
|
Gordon D, Londono D, Patel P, Kim W, Finch SJ, Heiman GA. An Analytic Solution to the Computation of Power and Sample Size for Genetic Association Studies under a Pleiotropic Mode of Inheritance. Hum Hered 2017; 81:194-209. [PMID: 28315880 DOI: 10.1159/000457135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2016] [Accepted: 01/20/2017] [Indexed: 01/14/2023] Open
Abstract
Our motivation here is to calculate the power of 3 statistical tests used when there are genetic traits that operate under a pleiotropic mode of inheritance and when qualitative phenotypes are defined by use of thresholds for the multiple quantitative phenotypes. Specifically, we formulate a multivariate function that provides the probability that an individual has a vector of specific quantitative trait values conditional on having a risk locus genotype, and we apply thresholds to define qualitative phenotypes (affected, unaffected) and compute penetrances and conditional genotype frequencies based on the multivariate function. We extend the analytic power and minimum-sample-size-necessary (MSSN) formulas for 2 categorical data-based tests (genotype, linear trend test [LTT]) of genetic association to the pleiotropic model. We further compare the MSSN of the genotype test and the LTT with that of a multivariate ANOVA (Pillai). We approximate the MSSN for statistics by linear models using a factorial design and ANOVA. With ANOVA decomposition, we determine which factors most significantly change the power/MSSN for all statistics. Finally, we determine which test statistics have the smallest MSSN. In this work, MSSN calculations are for 2 traits (bivariate distributions) only (for illustrative purposes). We note that the calculations may be extended to address any number of traits. Our key findings are that the genotype test usually has lower MSSN requirements than the LTT. More inclusive thresholds (top/bottom 25% vs. top/bottom 10%) have higher sample size requirements. The Pillai test has a much larger MSSN than both the genotype test and the LTT, as a result of sample selection. With these formulas, researchers can specify how many subjects they must collect to localize genes for pleiotropic phenotypes.
Collapse
Affiliation(s)
- Derek Gordon
- Department of Genetics, The State University of New Jersey, Piscataway, NJ, USA
| | | | | | | | | | | |
Collapse
|
9
|
Meta-analysis of quantitative pleiotropic traits for next-generation sequencing with multivariate functional linear models. Eur J Hum Genet 2016; 25:350-359. [PMID: 28000696 DOI: 10.1038/ejhg.2016.170] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2016] [Revised: 07/26/2016] [Accepted: 09/27/2016] [Indexed: 11/09/2022] Open
Abstract
To analyze next-generation sequencing data, multivariate functional linear models are developed for a meta-analysis of multiple studies to connect genetic variant data to multiple quantitative traits adjusting for covariates. The goal is to take the advantage of both meta-analysis and pleiotropic analysis in order to improve power and to carry out a unified association analysis of multiple studies and multiple traits of complex disorders. Three types of approximate F -distributions based on Pillai-Bartlett trace, Hotelling-Lawley trace, and Wilks's Lambda are introduced to test for association between multiple quantitative traits and multiple genetic variants. Simulation analysis is performed to evaluate false-positive rates and power of the proposed tests. The proposed methods are applied to analyze lipid traits in eight European cohorts. It is shown that it is more advantageous to perform multivariate analysis than univariate analysis in general, and it is more advantageous to perform meta-analysis of multiple studies instead of analyzing the individual studies separately. The proposed models require individual observations. The value of the current paper can be seen at least for two reasons: (a) the proposed methods can be applied to studies that have individual genotype data; (b) the proposed methods can be used as a criterion for future work that uses summary statistics to build test statistics to meta-analyze the data.
Collapse
|
10
|
Chiu CY, Jung J, Wang Y, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Mills JL, Boehnke M, Xiong M, Fan R. A comparison study of multivariate fixed models and Gene Association with Multiple Traits (GAMuT) for next-generation sequencing. Genet Epidemiol 2016; 41:18-34. [PMID: 27917525 DOI: 10.1002/gepi.22014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2016] [Revised: 09/01/2016] [Accepted: 09/19/2016] [Indexed: 01/23/2023]
Abstract
In this paper, extensive simulations are performed to compare two statistical methods to analyze multiple correlated quantitative phenotypes: (1) approximate F-distributed tests of multivariate functional linear models (MFLM) and additive models of multivariate analysis of variance (MANOVA), and (2) Gene Association with Multiple Traits (GAMuT) for association testing of high-dimensional genotype data. It is shown that approximate F-distributed tests of MFLM and MANOVA have higher power and are more appropriate for major gene association analysis (i.e., scenarios in which some genetic variants have relatively large effects on the phenotypes); GAMuT has higher power and is more appropriate for analyzing polygenic effects (i.e., effects from a large number of genetic variants each of which contributes a small amount to the phenotypes). MFLM and MANOVA are very flexible and can be used to perform association analysis for (i) rare variants, (ii) common variants, and (iii) a combination of rare and common variants. Although GAMuT was designed to analyze rare variants, it can be applied to analyze a combination of rare and common variants and it performs well when (1) the number of genetic variants is large and (2) each variant contributes a small amount to the phenotypes (i.e., polygenes). MFLM and MANOVA are fixed effect models that perform well for major gene association analysis. GAMuT can be viewed as an extension of sequence kernel association tests (SKAT). Both GAMuT and SKAT are more appropriate for analyzing polygenic effects and they perform well not only in the rare variant case, but also in the case of a combination of rare and common variants. Data analyses of European cohorts and the Trinity Students Study are presented to compare the performance of the two methods.
Collapse
Affiliation(s)
- Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
| | - Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol, Abuse and Alcoholism, NIH, Bethesda, MD, USA
| | - Yifan Wang
- Center for Drug Evaluation and Research, Food and Drug Administration, Silver Spring, MD, USA
| | - Daniel E Weeks
- Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
| | - Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, USA
| | - James L Mills
- Epidemiology Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
| | - Michael Boehnke
- Department of Biostatistics, School of Public Health, The University of Michigan, Ann Arbor, MI, USA
| | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, TX, USA
| | - Ruzong Fan
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD, USA
| |
Collapse
|
11
|
Fan R, Chiu CY, Jung J, Weeks DE, Wilson AF, Bailey-Wilson JE, Amos CI, Chen Z, Mills JL, Xiong M. A Comparison Study of Fixed and Mixed Effect Models for Gene Level Association Studies of Complex Traits. Genet Epidemiol 2016; 40:702-721. [PMID: 27374056 DOI: 10.1002/gepi.21984] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Revised: 03/08/2016] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
In association studies of complex traits, fixed-effect regression models are usually used to test for association between traits and major gene loci. In recent years, variance-component tests based on mixed models were developed for region-based genetic variant association tests. In the mixed models, the association is tested by a null hypothesis of zero variance via a sequence kernel association test (SKAT), its optimal unified test (SKAT-O), and a combined sum test of rare and common variant effect (SKAT-C). Although there are some comparison studies to evaluate the performance of mixed and fixed models, there is no systematic analysis to determine when the mixed models perform better and when the fixed models perform better. Here we evaluated, based on extensive simulations, the performance of the fixed and mixed model statistics, using genetic variants located in 3, 6, 9, 12, and 15 kb simulated regions. We compared the performance of three models: (i) mixed models that lead to SKAT, SKAT-O, and SKAT-C, (ii) traditional fixed-effect additive models, and (iii) fixed-effect functional regression models. To evaluate the type I error rates of the tests of fixed models, we generated genotype data by two methods: (i) using all variants, (ii) using only rare variants. We found that the fixed-effect tests accurately control or have low false positive rates. We performed simulation analyses to compare power for two scenarios: (i) all causal variants are rare, (ii) some causal variants are rare and some are common. Either one or both of the fixed-effect models performed better than or similar to the mixed models except when (1) the region sizes are 12 and 15 kb and (2) effect sizes are small. Therefore, the assumption of mixed models could be satisfied and SKAT/SKAT-O/SKAT-C could perform better if the number of causal variants is large and each causal variant contributes a small amount to the traits (i.e., polygenes). In major gene association studies, we argue that the fixed-effect models perform better or similarly to mixed models in most cases because some variants should affect the traits relatively large. In practice, it makes sense to perform analysis by both the fixed and mixed effect models and to make a comparison, and this can be readily done using our R codes and the SKAT packages.
Collapse
Affiliation(s)
- Ruzong Fan
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Chi-Yang Chiu
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Jeesun Jung
- Laboratory of Epidemiology and Biometry, National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Daniel E Weeks
- Departments of Human Genetics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America.,Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Alexander F Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Joan E Bailey-Wilson
- Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire, United States of America
| | - Zhen Chen
- Biostatistics and Bioinformatics Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - James L Mills
- Epidemiology Branch, Division of Intramural Population Health Research, Eunice Kennedy Shriver, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Momiao Xiong
- Human Genetics Center, University of Texas-Houston, Houston, Texas, United States of America
| |
Collapse
|