1
St-Pierre J, Oualkacha K. A copula-based set-variant association test for bivariate continuous, binary or mixed phenotypes. Int J Biostat 2023; 19:369-387. [PMID: 36279152 PMCID: PMC10644254 DOI: 10.1515/ijb-2022-0010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 05/26/2022] [Accepted: 08/23/2022] [Indexed: 11/15/2022]
Abstract
In genome-wide association studies (GWAS), researchers often deal with dichotomous and non-normally distributed traits, or a mixture of discrete and continuous traits. However, most current region-based methods rely on multivariate linear mixed models (mvLMMs) and assume a multivariate normal distribution for the phenotypes of interest. Hence, these methods are not applicable to disease or non-normally distributed traits, and there is a need for unified and flexible methods to study the association between a set of (possibly rare) genetic variants and non-normal multivariate phenotypes. Copulas are multivariate distribution functions with uniform margins on the [0, 1] interval, and they provide suitable models for dealing with non-normality of errors in multivariate association studies. We propose a novel unified and flexible copula-based multivariate association test (CBMAT) for discovering association between a genetic region and a bivariate continuous, binary or mixed phenotype. We also derive a data-driven analytic p-value procedure for the proposed region-based score-type test. Through simulation studies, we demonstrate that CBMAT has well-controlled type I error rates and higher power to detect associations than existing methods for discrete and non-normally distributed traits. Finally, we apply CBMAT to detect the association between two genes located on chromosome 11 and several lipid levels measured on 1477 subjects from the ALSPAC study.
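To make the copula idea in this abstract concrete, the following is a minimal sketch of how a Gaussian copula can link a binary trait and a skewed continuous trait. It is not the CBMAT test itself: the copula family, the exponential margin, and the names `sample_mixed_pair`, `rho`, and `prevalence` are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the uniform transform


def sample_mixed_pair(rho, prevalence, rng):
    """Draw one (binary, continuous) phenotype pair joined by a Gaussian copula."""
    # Two standard normals with correlation rho (2x2 Cholesky construction).
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    # Probability integral transform: u1 and u2 are each Uniform(0, 1),
    # and their joint distribution is the Gaussian copula.
    u1 = STD_NORMAL.cdf(z1)
    u2 = min(STD_NORMAL.cdf(z2), 1.0 - 1e-12)  # guard against log(0) below
    # Push each uniform through the inverse CDF of its own (assumed) margin.
    binary = 1 if u1 > 1.0 - prevalence else 0  # Bernoulli(prevalence)
    continuous = -math.log(1.0 - u2)            # Exponential(1), skewed
    return binary, continuous


rng = random.Random(2023)
pairs = [sample_mixed_pair(rho=0.6, prevalence=0.3, rng=rng) for _ in range(5000)]
cases = [y for b, y in pairs if b == 1]
controls = [y for b, y in pairs if b == 0]
case_rate = len(cases) / len(pairs)
mean_cases = sum(cases) / len(cases)
mean_controls = sum(controls) / len(controls)
print(case_rate, mean_cases, mean_controls)
```

The margins are specified separately from the dependence structure, which is what lets a single model cover continuous, binary, or mixed phenotype pairs.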
Affiliation(s)
- Julien St-Pierre
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, Montreal, QC, Canada
2
Deep polygenic neural network for predicting and identifying yield-associated genes in Indonesian rice accessions. Sci Rep 2022; 12:13823. [PMID: 35970979 PMCID: PMC9378700 DOI: 10.1038/s41598-022-16075-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2021] [Accepted: 07/04/2022] [Indexed: 11/12/2022]
Abstract
As the fourth most populous country in the world, Indonesia must increase its annual rice production rate to achieve national food security by 2050. One possible solution comes from the nanoscopic level: a genetic variant called a Single Nucleotide Polymorphism (SNP), which can express significant yield-associated genes. The prior benchmark of this study utilized a statistical genetics model in which neither SNP position information nor an attention mechanism was involved. Hence, we developed a novel deep polygenic neural network, named the NucleoNet model, to address these obstacles. The NucleoNets were constructed from a combination of prominent components that include positional SNP encoding, the context vector, wide models, Elastic Net, and Shannon's entropy loss. This polygenic modeling obtained a Mean Squared Error (MSE) of up to 2.779 with a Symmetric Mean Absolute Percentage Error (SMAPE) of 47.156%, while revealing 15 new important SNPs. Furthermore, the NucleoNets reduced the MSE by up to 32.28% compared to the Ordinary Least Squares (OLS) model. Through the ablation study, we learned that combining the Xavier distribution for weight initialization with the Normal distribution for bias initialization yielded a more varied set of important SNPs across the 12 chromosomes. Our findings confirmed that the NucleoNet model successfully outperformed the OLS model and identified SNPs important to Indonesian rice yields.
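For readers unfamiliar with the two error metrics quoted in this abstract, here is a small sketch of how MSE, SMAPE, and a relative MSE reduction are commonly computed. The abstract does not state which SMAPE variant the authors used, so the mean-of-absolute-values denominator below is an assumption, and the function names and toy inputs are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent, using (|a| + |p|) / 2 as the denominator.

    This is one common convention; other variants omit the division by 2.
    """
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Treat a 0/0 term as zero error rather than dividing by zero.
        total += abs(a - p) / denom if denom else 0.0
    return 100.0 * total / len(actual)


def relative_reduction(baseline, model):
    """Percentage by which `model` lowers an error relative to `baseline`."""
    return 100.0 * (baseline - model) / baseline


# Toy values for illustration only (not data from the study).
observed = [4.0, 5.5, 6.0]
fitted = [4.5, 5.0, 7.0]
print(mse(observed, fitted), smape(observed, fitted))
```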
3
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020. [DOI: 10.1007/s12041-019-1166-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
4
Sha Q, Wang Z, Zhang X, Zhang S. A clustering linear combination approach to jointly analyze multiple phenotypes for GWAS. Bioinformatics 2020; 35:1373-1379. [PMID: 30239574 DOI: 10.1093/bioinformatics/bty810] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/28/2017] [Revised: 08/29/2018] [Accepted: 09/18/2018] [Indexed: 12/16/2022]
Abstract
Summary: There is increasing interest in the joint analysis of multiple phenotypes for genome-wide association studies (GWASs), for the following reasons. First, cohorts usually collect multiple phenotypes, and complex diseases are usually measured by multiple correlated intermediate phenotypes. Second, jointly analyzing multiple phenotypes may increase statistical power for detecting genetic variants associated with complex diseases. Third, there is increasing evidence that pleiotropy is a widespread phenomenon in complex diseases. In this paper, we develop a clustering linear combination (CLC) method to jointly analyze multiple phenotypes for GWASs. In the CLC method, we first cluster individual statistics into positively correlated clusters, then combine the individual statistics linearly within each cluster and combine the between-cluster terms in a quadratic form. CLC is not only robust to different signs of the means of individual statistics, but also reduces the degrees of freedom of the test statistic. We also prove theoretically that if the individual statistics are clustered correctly, CLC is the most powerful test among all tests with certain quadratic forms. Our simulation results show that CLC is either the most powerful test or has similar power to the most powerful test among the tests we compared, and CLC is much more powerful than other tests when effect sizes align with inferred clusters. We also evaluate the performance of CLC through a real case study. Availability and implementation: R code for implementing our method is available at http://www.math.mtu.edu/∼shuzhang/software.html. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qiuying Sha
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
- Zhenchuan Wang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
- Xiao Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
- Shuanglin Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, USA
5
Xue Y, Ding J, Wang J, Zhang S, Pan D. Two-phase SSU and SKAT in genetic association studies. J Genet 2020; 99:9. [PMID: 32089528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
The sum of squared score (SSU) test and the sequence kernel association test (SKAT) are two good alternative tests for genetic association studies with case-control data. Both SSU and SKAT are derived by assuming a dose-response model between the risk of disease and genotypes. However, in practice, the true genetic mode of inheritance is impossible to know. Thus, these two tests might lose power substantially, as shown in simulation results, when the genetic model is misspecified. Here, to make both tests suitable in a broad range of situations, we propose two-phase SSU (tpSSU) and two-phase SKAT (tpSKAT), where the Hardy-Weinberg equilibrium test is adopted to choose the genetic model in the first phase, and the SSU and SKAT are constructed according to the selected genetic model in the second phase. We found that both tpSSU and tpSKAT outperformed the original SSU and SKAT in most of our simulation scenarios. By applying tpSSU and tpSKAT to type 2 diabetes data, we successfully identified some genes that have direct effects on obesity. In addition, we detected the significant chromosomal region 10q21.22 in the GAW16 rheumatoid arthritis dataset, with P < 10⁻⁶. These findings suggest that tpSSU and tpSKAT can be effective in identifying genetic variants for complex diseases in case-control association studies.
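The first phase described in this abstract relies on a Hardy-Weinberg equilibrium (HWE) test. As a rough illustration, the sketch below computes the standard Pearson chi-square HWE statistic (1 degree of freedom) from genotype counts at a biallelic locus. How the authors turn this test into a choice of genetic model is not described in the abstract, so that step is deliberately left out; the function name and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg equilibrium.

    Arguments are counts of the three genotypes (AA, AB, BB).
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    # Genotype counts expected under HWE: n*p^2, 2*n*p*q, n*q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Counts exactly at HWE proportions, then counts with no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

A large statistic signals departure from HWE proportions at the locus, which is the information the two-phase procedure uses before building the second-phase test.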
Affiliation(s)
- Yuan Xue
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People's Republic of China
6
Heller R, Meir A, Chatterjee N. Post-selection estimation and testing following aggregate association tests. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12318] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/29/2022]
Affiliation(s)
- Amit Meir
  - University of Washington, Seattle, USA
7
Guinot F, Szafranski M, Ambroise C, Samson F. Learning the optimal scale for GWAS through hierarchical SNP aggregation. BMC Bioinformatics 2018; 19:459. [PMID: 30497371 PMCID: PMC6267789 DOI: 10.1186/s12859-018-2475-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 09/29/2017] [Accepted: 11/09/2018] [Indexed: 11/16/2022]
Abstract
Background: Genome-Wide Association Studies (GWAS) seek to identify causal genomic variants associated with rare human diseases. The classical statistical approach for detecting these variants is based on univariate hypothesis testing, with healthy individuals being tested against affected individuals at each locus. Given that an individual’s genotype is characterized by up to one million SNPs, this approach lacks precision, since it may yield a large number of false positives that can lead to erroneous conclusions about genetic associations with the disease. One way to improve the detection of true genetic associations is to reduce the number of hypotheses to be tested by grouping SNPs. Results: We propose a dimension-reduction approach which can be applied in the context of GWAS by making use of the haplotype structure of the human genome. We compare our method with standard univariate and group-based approaches on both synthetic and real GWAS data. Conclusion: We show that reducing the dimension of the predictor matrix by aggregating SNPs gives a greater precision in the detection of associations between the phenotype and genomic regions.
Affiliation(s)
- Florent Guinot
  - UMR 8071 LaMME - UEVE, CNRS, ENSIIE, USC INRA, 23 bd de France, Evry, 91000, France
  - BIOptimize, Reims, 51000, France
- Marie Szafranski
  - UMR 8071 LaMME - UEVE, CNRS, ENSIIE, USC INRA, 23 bd de France, Evry, 91000, France
- Christophe Ambroise
  - UMR 8071 LaMME - UEVE, CNRS, ENSIIE, USC INRA, 23 bd de France, Evry, 91000, France
  - UMR MIA-Paris - AgroParisTech, INRA, Université Paris-Saclay, Paris, 75005, France
- Franck Samson
  - UMR 8071 LaMME - UEVE, CNRS, ENSIIE, USC INRA, 23 bd de France, Evry, 91000, France
8
Romanescu RG, Espin-Garcia O, Ma J, Bull SB. Integrating epigenetic, genetic, and phenotypic data to uncover gene-region associations with triglycerides in the GOLDN study. BMC Proc 2018; 12:57. [PMID: 30263054 PMCID: PMC6157034 DOI: 10.1186/s12919-018-0142-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
Background: There has been significant interest in investigating genome-wide and epigenome-wide associations with lipids. Testing at the gene or region level may improve power in such studies. Methods: We analyze chromosome 11 cytosine-phosphate-guanine (CpG) methylation levels and single-nucleotide polymorphism (SNP) genotypes from the original Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, aiming to explore the association between triglyceride levels and genetic/epigenetic factors. We apply region-based tests of association to methylation and genotype data, in turn, which seek to increase power by reducing the dimension of the gene-region variables. We also investigate whether integrating 2 omics data sets (methylation and genotype) into the triglyceride association analysis helps or hinders detection of candidate gene regions. Results: Gene-region testing identified 1 CpG region that had been previously reported in the GOLDN study data and another 2 gene regions that are also associated with triglyceride levels. Testing on the combined genetic and epigenetic data detected the same genes as using epigenetic or genetic data alone. Conclusions: Region-based testing can uncover additional association signals beyond those detected using single-variant testing.
Affiliation(s)
- Razvan G Romanescu
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, 600 University Ave, Toronto, ON M5G 1X5 Canada
- Osvaldo Espin-Garcia
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, 600 University Ave, Toronto, ON M5G 1X5 Canada
  - Dalla Lana School of Public Health, University of Toronto, 155 College St, Toronto, ON M5T 3M7 Canada
- Jin Ma
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, 600 University Ave, Toronto, ON M5G 1X5 Canada
- Shelley B Bull
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, 600 University Ave, Toronto, ON M5G 1X5 Canada
  - Dalla Lana School of Public Health, University of Toronto, 155 College St, Toronto, ON M5T 3M7 Canada
9
Abstract
We identified a child with KLF1-E325K congenital dyserythropoietic anemia type IV who experienced a severe clinical course, fetal anemia, hydrops fetalis, and postnatal transfusion dependence only partially responsive to splenectomy. The child also had complete sex reversal, the cause of which remains undetermined. To gain insights into our patient's severe hematologic phenotype, detailed analyses were performed. Erythrocytes from the patient and parents demonstrated functional abnormalities of the erythrocyte membrane, attributed to variants in the α-spectrin gene. Hypomorphic alleles in SEC23B and YARS2 were also identified. We hypothesize that coinheritance of variants in relevant erythrocyte genes contributes to the clinical course in our patient and other E325K-linked congenital dyserythropoietic anemia IV patients with severe clinical phenotypes.
10
Kim SA, Cho CS, Kim SR, Bull SB, Yoo YJ. A new haplotype block detection method for dense genome sequencing data based on interval graph modeling of clusters of highly correlated SNPs. Bioinformatics 2018; 34:388-397. [PMID: 29028986 PMCID: PMC5860363 DOI: 10.1093/bioinformatics/btx609] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 12/15/2016] [Revised: 09/11/2017] [Accepted: 09/28/2017] [Indexed: 11/13/2022]
Abstract
Motivation: Linkage disequilibrium (LD) block construction is required for research in population genetics and genetic epidemiology, including specification of sets of single nucleotide polymorphisms (SNPs) for analysis of multi-SNP based association and identification of haplotype blocks in high density sequencing data. Existing methods based on a narrow sense definition do not allow intermediate regions of low LD between strongly associated SNP pairs and tend to split high density SNP data into small blocks having high between-block correlation. Results: We present Big-LD, a block partition method based on interval graph modeling of LD bins which are clusters of strong pairwise LD SNPs, not necessarily physically consecutive. Big-LD uses an agglomerative approach that starts by identifying small communities of SNPs, i.e. the SNPs in each LD bin region, and proceeds by merging these communities. We determine the number of blocks using a method to find maximum-weight independent set. Big-LD produces larger LD blocks compared to existing methods such as MATILDE, Haploview, MIG++, or S-MIG++ and the LD blocks better agree with recombination hotspot locations determined by sperm-typing experiments. The observed average runtime of Big-LD for 13 288 240 non-monomorphic SNPs from 1000 Genomes Project autosome data (286 East Asians) is about 5.83 h, which is a significant improvement over the existing methods. Availability and implementation: Source code and documentation are available for download at http://github.com/sunnyeesl/BigLD. Contact: yyoo@snu.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online.
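The abstract mentions finding a maximum-weight independent set on an interval graph. On an interval graph this problem is the classic weighted interval scheduling problem, solvable by dynamic programming in O(n log n). The sketch below shows that textbook algorithm only; it is not the Big-LD implementation, and the interval representation, the weights, and the function name are assumptions made for illustration.

```python
from bisect import bisect_left


def max_weight_independent_intervals(intervals):
    """Pick pairwise non-overlapping closed intervals of maximum total weight.

    `intervals` is a list of (start, end, weight) tuples.
    Returns (best_weight, chosen_intervals).
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by right endpoint
    ends = [iv[1] for iv in ordered]
    best = [0] * (len(ordered) + 1)  # best[i]: optimum over the first i intervals
    prev = []                        # prev[i]: count of earlier compatible intervals
    for i, (start, _end, weight) in enumerate(ordered):
        # Earlier intervals that end strictly before this one starts.
        j = bisect_left(ends, start, 0, i)
        prev.append(j)
        best[i + 1] = max(best[i], best[j] + weight)
    # Walk back through the table to recover one optimal selection.
    chosen = []
    i = len(ordered)
    while i > 0:
        if best[i] == best[i - 1]:
            i -= 1                   # interval i-1 is not needed
        else:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
    chosen.reverse()
    return best[-1], chosen


# Hypothetical candidate blocks as (first position, last position, weight).
blocks = [(1, 4, 3), (3, 6, 5), (5, 9, 4), (8, 10, 2)]
total, picked = max_weight_independent_intervals(blocks)
print(total, picked)
```

Intervals that share an endpoint are treated as overlapping here, which matches the usual interval-graph convention for closed intervals.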
Affiliation(s)
- Sun Ah Kim
  - Department of Statistics, Seoul National University, Seoul, South Korea
- Chang-Sung Cho
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Suh-Ryung Kim
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
- Shelley B Bull
  - Prosserman Centre for Health Research, The Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Yun Joo Yoo
  - Department of Mathematics Education, Seoul National University, Seoul, South Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea
11
Bull SB, Andrulis IL, Paterson AD. Statistical challenges in high-dimensional molecular and genetic epidemiology. CAN J STAT 2017. [DOI: 10.1002/cjs.11342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Shelley B. Bull
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada M5T 3L9
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7
- Irene L. Andrulis
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada M5T 3L9
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A8
- Andrew D. Paterson
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7
  - Genetics and Genome Biology Program, The Hospital for Sick Children, Toronto, Ontario, Canada M5G 0A4