1
Zhang M, Liu Y, Zhou H, Watkins J, Zhou J. A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data. BMC Bioinformatics 2021; 22:348. [PMID: 34174829 PMCID: PMC8236193 DOI: 10.1186/s12859-021-04265-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/01/2020] [Accepted: 06/11/2021] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: Low-depth sequencing allows researchers to increase sample size at the expense of lower accuracy. To incorporate uncertainties while maintaining statistical power, we introduce MCPCA_PopGen to analyze population structure of low-depth sequencing data. RESULTS: The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in calling between heterozygotes and the common homozygotes for loci having a rare allele and is more linear when both variants are common. CONCLUSIONS: We apply MCPCA_PopGen to samples from two indigenous Siberian populations and reveal hidden population structure accurately using only a single chromosome. The MCPCA_PopGen package is available at https://github.com/yiwenstat/MCPCA_PopGen.
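For readers unfamiliar with the objective named in this abstract, the Ky Fan k-norm of a matrix is the sum of its k largest singular values. The sketch below only illustrates that quantity on a covariance matrix of untransformed dosages; it is not the MCPCA_PopGen optimization itself. The toy dosage matrix, the function name `ky_fan_norm`, and the choice k = 2 are assumptions made for illustration.

```python
import numpy as np


def ky_fan_norm(matrix, k):
    """Ky Fan k-norm: the sum of the k largest singular values."""
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(singular_values[:k].sum())


# Hypothetical toy dosage matrix: 6 samples (rows) by 4 loci (columns),
# with entries in [0, 2] standing in for expected allele counts.
rng = np.random.default_rng(0)
dosages = rng.uniform(0.0, 2.0, size=(6, 4))

# Locus-by-locus sample covariance of the (here untransformed) dosages.
covariance = np.cov(dosages, rowvar=False)

# A covariance matrix is symmetric positive semidefinite, so its singular
# values equal its eigenvalues; the Ky Fan 2-norm is therefore the variance
# captured by the first two principal components.
top2 = ky_fan_norm(covariance, k=2)
print(top2)
```

Maximizing this quantity over transformations of the dosages amounts to asking the leading principal components to capture as much variance as possible.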
Affiliation(s)
- Miao Zhang
  - Interdisciplinary Program in Statistics and Data Science, University of Arizona, 617 N. Santa Rita Ave., 85721 Tucson, USA
- Yiwen Liu
  - Department of Mathematics, University of Arizona, 617 N. Santa Rita Ave., 85721 Tucson, USA
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles, 650 Charles E. Young Dr. South, 90095 Los Angeles, USA
- Joseph Watkins
  - Department of Mathematics, University of Arizona, 617 N. Santa Rita Ave., 85721 Tucson, USA
  - Interdisciplinary Program in Statistics and Data Science, University of Arizona, 617 N. Santa Rita Ave., 85721 Tucson, USA
- Jin Zhou
  - Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., 85724 Tucson, USA
  - Interdisciplinary Program in Statistics and Data Science, University of Arizona, 617 N. Santa Rita Ave., 85721 Tucson, USA
  - Department of Medicine, UCLA David Geffen School of Medicine, Los Angeles, CA USA
2
Russo A, Di Gaetano C, Cugliari G, Matullo G. Advances in the Genetics of Hypertension: The Effect of Rare Variants. Int J Mol Sci 2018; 19:E688. [PMID: 29495593 PMCID: PMC5877549 DOI: 10.3390/ijms19030688] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 01/31/2018] [Revised: 02/19/2018] [Accepted: 02/26/2018] [Indexed: 12/22/2022] [Open Access]
Abstract
Worldwide, hypertension still represents a serious health burden with nine million people dying as a consequence of hypertension-related complications. Essential hypertension is a complex trait supported by multifactorial genetic inheritance together with environmental factors. The heritability of blood pressure (BP) is estimated to be 30-50%. A great effort was made to find genetic variants affecting BP levels through Genome-Wide Association Studies (GWAS). This approach relies on the "common disease-common variant" hypothesis and led to the identification of multiple genetic variants which explain, in aggregate, only 2-3% of the genetic variance of hypertension. Part of the missing genetic information could be caused by variants too rare to be detected by GWAS. The use of exome chips and Next-Generation Sequencing facilitated the discovery of causative variants. Here, we report the advances in the detection of novel rare variants, genes, and/or pathways through the most promising approaches, and the recent statistical tests that have emerged to handle rare variants. We also discuss the need to further support rare novel variants with replication studies within larger consortia and with deeper functional studies to better understand how new genes might improve patient care and the stratification of the response to antihypertensive treatments.
Affiliation(s)
- Alessia Russo
  - Department of Medical Sciences, University of Turin, 10126 Turin, Italy.
  - Italian Institute for Genomic Medicine (IIGM, Formerly HuGeF), 10126 Turin, Italy.
- Cornelia Di Gaetano
  - Department of Medical Sciences, University of Turin, 10126 Turin, Italy.
  - Italian Institute for Genomic Medicine (IIGM, Formerly HuGeF), 10126 Turin, Italy.
- Giovanni Cugliari
  - Department of Medical Sciences, University of Turin, 10126 Turin, Italy.
  - Italian Institute for Genomic Medicine (IIGM, Formerly HuGeF), 10126 Turin, Italy.
- Giuseppe Matullo
  - Department of Medical Sciences, University of Turin, 10126 Turin, Italy.
  - Italian Institute for Genomic Medicine (IIGM, Formerly HuGeF), 10126 Turin, Italy.
3
Abstract
Despite thousands of genetic loci identified to date, a large proportion of genetic variation predisposing to complex disease and traits remains unaccounted for. Advances in sequencing technology enable focused explorations on the contribution of low-frequency and rare variants to human traits. Here we review experimental approaches and current knowledge on the contribution of these genetic variants in complex disease and discuss challenges and opportunities for personalised medicine.
Affiliation(s)
- Lorenzo Bomba
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Klaudia Walter
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Nicole Soranzo
  - Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
  - Department of Haematology, University of Cambridge, Hills Rd, Cambridge, CB2 0AH, UK
  - The National Institute for Health Research Blood and Transplant Unit (NIHR BTRU) in Donor Health and Genomics at the University of Cambridge, University of Cambridge, Strangeways Research Laboratory, Wort's Causeway, Cambridge, CB1 8RN, UK
4
Minică CC, Genovese G, Hultman CM, Pool R, Vink JM, Neale MC, Dolan CV, Neale BM. The Weighting is the Hardest Part: On the Behavior of the Likelihood Ratio Test and the Score Test Under a Data-Driven Weighting Scheme in Sequenced Samples. Twin Res Hum Genet 2017; 20:108-118. [PMID: 28238293 PMCID: PMC5357183 DOI: 10.1017/thg.2017.7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/07/2022]
Abstract
Sequence-based association studies are at a critical inflexion point with the increasing availability of exome-sequencing data. A popular test of association is the sequence kernel association test (SKAT). Weights are embedded within SKAT to reflect the hypothesized contribution of the variants to the trait variance. Because the true weights are generally unknown, and so are subject to misspecification, we examined the efficiency of a data-driven weighting scheme. We propose the use of a set of theoretically defensible weighting schemes, of which, we assume, the one that gives the largest test statistic is likely to capture best the allele frequency-functional effect relationship. We show that the use of alternative weights obviates the need to impose arbitrary frequency thresholds. As both the score test and the likelihood ratio test (LRT) may be used in this context, and may differ in power, we characterize the behavior of both tests. The two tests have equal power, if the weights in the set included weights resembling the correct ones. However, if the weights are badly specified, the LRT shows superior power (due to its robustness to misspecification). With this data-driven weighting procedure the LRT detected significant signal in genes located in regions already confirmed as associated with schizophrenia - the PRRC2A (p = 1.020e-06) and the VARS2 (p = 2.383e-06) - in the Swedish schizophrenia case-control cohort of 11,040 individuals with exome-sequencing data. The score test is currently preferred for its computational efficiency and power. Indeed, assuming correct specification, in some circumstances, the score test is the most powerful test. However, LRT has the advantageous properties of being generally more robust and more powerful under weight misspecification. This is an important result given that, arguably, misspecified models are likely to be the rule rather than the exception in weighting-based approaches.
Affiliation(s)
- Camelia C. Minică
  - Department of Biological Psychology, Vrije Universiteit, Amsterdam 1081 BT, The Netherlands
  - The EMGO Institute for Health and Care Research, Amsterdam 1081 BT, The Netherlands
- Giulio Genovese
  - The Stanley Center for Psychiatric Research, Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
  - The Program in Medical and Population Genetics, Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
  - Department of Genetics, Harvard Medical School, Cambridge, MA 02115, USA
- Christina M. Hultman
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm SE-171 77, Sweden
- René Pool
  - Department of Biological Psychology, Vrije Universiteit, Amsterdam 1081 BT, The Netherlands
  - The EMGO Institute for Health and Care Research, Amsterdam 1081 BT, The Netherlands
- Jacqueline M. Vink
  - Behavioural Science Institute, Radboud University, Nijmegen, The Netherlands
- Michael C. Neale
  - Department of Biological Psychology, Vrije Universiteit, Amsterdam 1081 BT, The Netherlands
  - Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, USA
- Conor V. Dolan
  - Department of Biological Psychology, Vrije Universiteit, Amsterdam 1081 BT, The Netherlands
  - The EMGO Institute for Health and Care Research, Amsterdam 1081 BT, The Netherlands
- Benjamin M. Neale
  - The Stanley Center for Psychiatric Research, Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
  - The Program in Medical and Population Genetics, Broad Institute of the Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
  - The Analytical and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
5
Jeng XJ, Daye ZJ, Lu W, Tzeng JY. Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level. PLoS Comput Biol 2016; 12:e1004993. [PMID: 27355347 PMCID: PMC4927097 DOI: 10.1371/journal.pcbi.1004993] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/20/2015] [Accepted: 05/21/2016] [Indexed: 11/24/2022] [Open Access]
Abstract
Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research, as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors on diseases. Due to the extreme rarity of mutations and high dimensionality, significances of causal variants cannot easily stand out from those of noncausal ones. Consequently, standard false-positive control procedures, such as the Bonferroni and false discovery rate (FDR), are often impractical to apply, as a majority of the causal variants can only be identified along with a few but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure that can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and quality of significance rankings. Extensive simulation studies across a plethora of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas the Bonferroni and FDR are exceedingly over-conservative for rare variants association studies. In the analyses of the CoLaus dataset, AFNC has identified individual variants most responsible for gene-level significances. Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information.
Next-generation sequencing technologies have allowed genetic association studies of complex traits at the single base-pair resolution, where most genetic variants have extremely low mutation frequencies. These rare variants have been the focus of modern statistical-computational genomics due to their potential to explain missing disease heritability. The identification of individual rare variants associated with diseases can provide new biological insights and enable the precise delineation of disease mechanisms. However, due to the extreme rarity of mutations and large numbers of variants, significances of causative variants tend to be mixed inseparably with a few noncausative ones, and standard multiple testing procedures controlling for false positives fail to provide a meaningful way to include a large proportion of the causative variants. To address the challenge of detecting weak biological signals, we propose a novel statistical procedure, based on false-negative control, to provide a practical approach for variant inclusion in large-scale sequencing studies. By determining those variants that can be confidently dispatched as noncausative, the proposed procedure offers an objective selection of a modest number of potentially causative variants at the single-locus level. Results can be further prioritized or used to infer disease-associated genes with annotation information.
Affiliation(s)
- Xinge Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Zhongyin John Daye
  - Epidemiology and Biostatistics, University of Arizona, Tucson, Arizona, United States of America
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
6
Khurana E, Fu Y, Chakravarty D, Demichelis F, Rubin MA, Gerstein M. Role of non-coding sequence variants in cancer. Nat Rev Genet 2016; 17:93-108. [PMID: 26781813 DOI: 10.1038/nrg.2015.17] [Citation(s) in RCA: 319] [Impact Index Per Article: 39.9] [Indexed: 02/07/2023]
Abstract
Patients with cancer carry somatic sequence variants in their tumour in addition to the germline variants in their inherited genome. Although variants in protein-coding regions have received the most attention, numerous studies have noted the importance of non-coding variants in cancer. Moreover, the overwhelming majority of variants, both somatic and germline, occur in non-coding portions of the genome. We review the current understanding of non-coding variants in cancer, including the great diversity of the mutation types--from single nucleotide variants to large genomic rearrangements--and the wide range of mechanisms by which they affect gene expression to promote tumorigenesis, such as disrupting transcription factor-binding sites or functions of non-coding RNAs. We highlight specific case studies of somatic and germline variants, and discuss how non-coding variants can be interpreted on a large-scale through computational and experimental methods.
Affiliation(s)
- Ekta Khurana
  - Meyer Cancer Center, Weill Cornell Medical College, New York, New York 10065, USA
  - Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA
  - Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York 10021, USA
  - Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York 10065, USA
- Yao Fu
  - Bina Technologies, Roche Sequencing, Redwood City, California 94065, USA
- Dimple Chakravarty
  - Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York 10065, USA
- Francesca Demichelis
  - Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA
  - Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York 10021, USA
  - Centre for Integrative Biology, University of Trento, 38123 Trento, Italy
- Mark A Rubin
  - Meyer Cancer Center, Weill Cornell Medical College, New York, New York 10065, USA
  - Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York 10065, USA
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
  - Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
7
Zeng P, Zhao Y, Liu J, Liu L, Zhang L, Wang T, Huang S, Chen F. Likelihood ratio tests in rare variant detection for continuous phenotypes. Ann Hum Genet 2015; 78:320-32. [PMID: 25117149 DOI: 10.1111/ahg.12071] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 12/27/2013] [Accepted: 04/22/2014] [Indexed: 12/30/2022]
Abstract
It is believed that rare variants play an important role in human phenotypes; however, the detection of rare variants is extremely challenging due to their very low minor allele frequency. In this paper, the likelihood ratio test (LRT) and restricted likelihood ratio test (ReLRT) are proposed to test the association of rare variants based on the linear mixed effects model, where a group of rare variants are treated as random effects. Like the sequence kernel association test (SKAT), a state-of-the-art method for rare variant detection, LRT and ReLRT can effectively overcome the problem of directionality of effect inherent in the burden test in practice. By taking full advantage of the spectral decomposition, exact finite sample null distributions for LRT and ReLRT are obtained by simulation. We perform extensive numerical studies to evaluate the performance of LRT and ReLRT, and compare them to the burden test, SKAT and SKAT-O. The simulations have shown that LRT and ReLRT can correctly control the type I error, and the controls are robust to the weights chosen and the number of rare variants under study. LRT and ReLRT behave similarly to the burden test when all the causal rare variants share the same direction of effect, and outperform SKAT across various situations. When both positive and negative effects exist, LRT and ReLRT suffer only small power reductions compared to the other two competing methods; in this case, an additional finding from our simulations is that SKAT-O is no longer the optimal test, and its power is even lower than that of SKAT. The exome sequencing SNP data from Genetic Analysis Workshop 17 were employed to illustrate the proposed methods, and interesting results are described.
Affiliation(s)
- Ping Zeng
  - Department of Epidemiology and Biostatistics, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu, 211166, P. R. China
  - Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical College, Xuzhou, Jiangsu, 221004, P. R. China
8
Porth I, El-Kassaby YA. Using Populus as a lignocellulosic feedstock for bioethanol. Biotechnol J 2015; 10:510-24. [PMID: 25676392 DOI: 10.1002/biot.201400194] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 08/01/2014] [Revised: 11/11/2014] [Accepted: 12/30/2014] [Indexed: 11/10/2022]
Abstract
Populus species along with species from the sister genus Salix will provide valuable feedstock resources for advanced second-generation biofuels. Their inherent fast growth characteristics can particularly be exploited for short rotation management, a time and energy saving cultivation alternative for lignocellulosic feedstock supply. Salicaceae possess inherent cell wall characteristics with favorable cellulose to lignin ratios for utilization as bioethanol crop. We review economically important traits relevant for intensively managed biofuel crop plantations, genomic and phenotypic resources available for Populus, breeding strategies for forest trees dedicated to bioenergy provision, and bioprocesses and downstream applications related to opportunities using Salicaceae as a renewable resource. Challenges need to be resolved for every single step of the conversion process chain, i.e., starting from tree domestication for improved performance as a bioenergy crop, bioconversion process, policy development for land use changes associated with advanced biofuels, and harvest and supply logistics associated with industrial-scale biorefinery plants using Populus as feedstock. Significant hurdles towards cost and energy efficiency, environmental friendliness, and yield maximization with regards to biomass pretreatment, saccharification, and fermentation of celluloses and the sustainability of biorefineries as a whole still need to be overcome.
Affiliation(s)
- Ilga Porth
  - Forest and Conservation Sciences, University of British Columbia, Vancouver, Canada.
9
Lee W, Lee D, Pawitan Y. Likelihood ratio and score burden tests for detecting disease-associated rare variants. Stat Appl Genet Mol Biol 2015; 14:481-95. [DOI: 10.1515/sagmb-2014-0089] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
This paper presents two simple rare variant (RV) burden tests based on the likelihood ratio test (LRT) and score statistics. LRT is one of the commonly used tests in practical data analysis, and we show here that there is no reason to ignore it in testing RV associations. With the Bartlett correction, we have numerically shown that the LRT-based test can have a reliable distribution. Our simulation study indicates that if the non-null variants are as common as the null variants, then the LRT and score statistics have comparable performance to the C-alpha test, and if the former are rarer than the null variants, then they outperform the C-alpha test.
10
He L, Pitkäniemi J, Sarin AP, Salomaa V, Sillanpää MJ, Ripatti S. Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data. Genet Epidemiol 2014; 39:89-100. [PMID: 25395270 DOI: 10.1002/gepi.21871] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/09/2014] [Revised: 09/18/2014] [Accepted: 10/03/2014] [Indexed: 11/08/2022]
Abstract
Next-generation sequencing (NGS) has led to the study of rare genetic variants, which possibly explain the missing heritability for complex diseases. Most existing methods for rare variant (RV) association detection do not account for the common presence of sequencing errors in NGS data. The errors can largely affect the power and perturb the accuracy of association tests due to rare observations of minor alleles. We developed a hierarchical Bayesian approach to estimate the association between RVs and complex diseases. Our integrated framework combines the misclassification probability with shrinkage-based Bayesian variable selection. It allows for flexibility in handling neutral and protective RVs with measurement error, and is robust enough for detecting causal RVs with a wide spectrum of minor allele frequency (MAF). Imputation uncertainty and MAF are incorporated into the integrated framework to achieve the optimal statistical power. We demonstrate that sequencing error does significantly affect the findings, and our proposed model can take advantage of it to improve statistical power in both simulated and real data. We further show that our model outperforms existing methods, such as sequence kernel association test (SKAT). Finally, we illustrate the behavior of the proposed method using a Finnish low-density lipoprotein cholesterol study, and show that it identifies an RV known as FH North Karelia in LDLR gene with three carriers in 1,155 individuals, which is missed by both SKAT and Granvil.
Affiliation(s)
- Liang He
  - Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
11
Mallaney C, Sung YJ. Rare variant analysis of blood pressure phenotypes in the Genetic Analysis Workshop 18 whole genome sequencing data using sequence kernel association test. BMC Proc 2014; 8:S10. [PMID: 25519353 PMCID: PMC4143707 DOI: 10.1186/1753-6561-8-s1-s10] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] [Open Access]
Abstract
The sequence kernel association test (SKAT) has become one of the most commonly used nonburden tests for analyzing rare variants. Performance of burden tests depends on the weighting of rare and common variants when collapsing them in a genomic region. Using the systolic and diastolic blood pressure phenotypes of 142 unrelated individuals in the Genetic Analysis Workshop 18 data, we investigated whether performance of SKAT also depends on the weighting scheme. We analyzed the entire sequencing data for all 200 replications using 3 weighting schemes: equal weighting, Madsen-Browning weighting, and SKAT default linear weighting. We considered two options: all single-nucleotide polymorphisms (SNPs) and only low-frequency SNPs. The SKAT default weighting scheme (which heavily downweights common variants) performed better for the genes in which causal SNPs are mostly rare. This SKAT default weighting scheme behaved similarly to other weighting schemes after eliminating all common SNPs. In contrast, the equal weighting scheme performed the best for MAP4 and FLT3, both of which included a common variant with a large effect. However, SKAT with all 3 weighting schemes performed poorly. Overall power across all causal genes was about 0.05, which was almost identical to the type I error rate. This poor performance is partly due to a small sample size because of the need to analyze only unrelated individuals. Because half of the causal SNPs were not found in the annotation file based on the 1000 Genomes Project, we suspect that performance was also affected by our use of incomplete annotation information.
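The three weighting schemes compared in this abstract can be sketched as functions of the minor allele frequency (MAF). This is an illustrative approximation, not the authors' code. The Madsen-Browning weight is simplified here to 1/sqrt(p(1-p)), omitting the sample-size factor and the frequency estimation details of the original method, and the SKAT default is assumed to be the Beta(1, 25) density evaluated at the MAF.

```python
import math


def equal_weight(maf):
    """Every variant contributes equally, regardless of frequency."""
    return 1.0


def madsen_browning_weight(maf):
    """Simplified Madsen-Browning-style weight, proportional to 1/sqrt(p(1-p))."""
    return 1.0 / math.sqrt(maf * (1.0 - maf))


def beta_density_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density at the MAF (assumed SKAT default: a=1, b=25).

    Requires 0 < maf < 1.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_density = (log_norm
                   + (a - 1.0) * math.log(maf)
                   + (b - 1.0) * math.log(1.0 - maf))
    return math.exp(log_density)


# Compare how each scheme treats a rare, a low-frequency and a common variant.
for maf in (0.005, 0.05, 0.30):
    print(maf,
          equal_weight(maf),
          round(madsen_browning_weight(maf), 3),
          round(beta_density_weight(maf), 3))
```

Under these assumptions the Beta(1, 25) weight falls off much faster with frequency than the Madsen-Browning-style weight, which fits the abstract's remark that the SKAT default heavily downweights common variants.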
Affiliation(s)
- Cates Mallaney
  - Division of Biostatistics, Washington University in St. Louis, School of Medicine, St. Louis, MO 63110, USA
- Yun Ju Sung
  - Division of Biostatistics, Washington University in St. Louis, School of Medicine, St. Louis, MO 63110, USA
12
Abstract
Advances in next-generation sequencing technology have made it possible to comprehensively interrogate the entire spectrum of genomic variations including rare variants. They may help capture the remaining genetic heritability which has not been fully explained by previous genome-wide association studies. Here we performed a gene-based genome-wide scan to identify hypertension susceptibility loci in an analysis of a whole genome sequencing cohort of 103 unrelated individuals. We found that collapsing singletons may boost signals for associating rare variants and identified SETX as statistically significant at a genome-wide gene-based threshold (p < 5.0 × 10^-6). The function of SETX in hypertension may be worthy of further investigation.
Affiliation(s)
- Wei Wang
  - Department of Computer Science, New Jersey Institute of Technology, University Heights Newark, New Jersey 07102, USA
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, University Heights Newark, New Jersey 07102, USA
13
Derkach A, Lawless JF, Sun L. Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results. Stat Sci 2014. [DOI: 10.1214/13-sts456] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Indexed: 11/19/2022]
14
Derkach A, Chiang T, Gong J, Addis L, Dobbins S, Tomlinson I, Houlston R, Pal DK, Strug LJ. Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic. Bioinformatics 2014; 30:2179-88. [PMID: 24733292 DOI: 10.1093/bioinformatics/btu196] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Sufficiently powered case-control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, these studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that substitutes genotype calls by their expected values given observed sequence data. RESULTS: We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. We also demonstrate, using simulated and real NGS data, that the RVS method controls Type I error and has comparable power to the 'gold standard' analysis with the true underlying genotypes for both common and rare variants. AVAILABILITY AND IMPLEMENTATION: An RVS R script and instructions can be found at strug.research.sickkids.ca, and at https://github.com/strug-lab/RVS. CONTACT: lisa.strug@utoronto.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
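The central substitution described in this abstract, replacing a hard genotype call with its expected value given the sequence data, can be illustrated as follows. This is a minimal sketch rather than the RVS implementation. It assumes per-genotype likelihoods P(reads | G = g) for g = 0, 1, 2 copies of the minor allele are already available, and it uses a Hardy-Weinberg prior built from an assumed minor allele frequency. The function names are hypothetical.

```python
def genotype_posteriors(likelihoods, maf):
    """Posterior P(G = g | reads) for g = 0, 1, 2 minor-allele copies.

    likelihoods: (L0, L1, L2), where Lg = P(reads | G = g).
    maf: assumed minor allele frequency for the Hardy-Weinberg prior.
    """
    prior = ((1.0 - maf) ** 2, 2.0 * maf * (1.0 - maf), maf ** 2)
    joint = [lik * pri for lik, pri in zip(likelihoods, prior)]
    total = sum(joint)
    return [value / total for value in joint]


def expected_genotype(likelihoods, maf):
    """E[G | reads]: a dosage in [0, 2] that replaces a hard genotype call."""
    posteriors = genotype_posteriors(likelihoods, maf)
    return sum(g * p for g, p in enumerate(posteriors))


# Hypothetical low-depth site: the reads mildly favour a heterozygote.
dosage = expected_genotype((0.30, 0.60, 0.10), maf=0.10)
print(dosage)
```

When the reads carry no information (equal likelihoods), the expected genotype reduces to the prior mean 2 × MAF, so uncertain calls shrink toward the population frequency instead of being forced to 0, 1 or 2.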
Affiliation(s)
- Andriy Derkach, Theodore Chiang, Jiafen Gong, Laura Addis, Sara Dobbins, Ian Tomlinson, Richard Houlston, Deb K Pal, Lisa J Strug
- Department of Statistical Science, University of Toronto, Toronto, ON, Canada, Program in Child Health Evaluative Sciences, the Hospital for Sick Children Research Institute, Toronto, ON, Canada, Department of Clinical Neuroscience, Institute of Psychiatry, King's College London, London, Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, Molecular and Population Genetics and NIHR Comprehensive Biomedical Research Centre, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK, Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
|
15
|
Cook K, Benitez A, Fu C, Tintle N. Evaluating the impact of genotype errors on rare variant tests of association. Front Genet 2014; 5:62. [PMID: 24744770 PMCID: PMC3978329 DOI: 10.3389/fgene.2014.00062] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 10/02/2013] [Accepted: 03/11/2014] [Indexed: 01/23/2023] Open
Abstract
The new class of rare variant tests has usually been evaluated assuming perfect genotype information. In reality, rare variant genotypes may be incorrect, and so rare variant tests should be robust to imperfect data. Errors and uncertainty in SNP genotyping are already known to dramatically impact statistical power for single marker tests on common variants and, in some cases, inflate the type I error rate. Recent results show that uncertainty in genotype calls derived from sequencing reads is dependent on several factors, including read depth, calling algorithm, number of alleles present in the sample, and the frequency at which an allele segregates in the population. We have recently proposed a general framework for the evaluation and investigation of rare variant tests of association, classifying most rare variant tests into one of two broad categories (length or joint tests). We use this framework to relate factors affecting genotype uncertainty to the power and type I error rate of rare variant tests. We find that non-differential genotype errors (an error process that occurs independent of phenotype) decrease power, with larger decreases for extremely rare variants, and for the common homozygote to heterozygote error. Differential genotype errors (an error process that is associated with phenotype status) lead to inflated type I error rates, which are more likely to occur at sites with more common homozygote to heterozygote errors than vice versa. Finally, our work suggests that certain rare variant tests and study designs may be more robust to the inclusion of genotype errors. Further work is needed to directly integrate genotype calling algorithm decisions, study costs and test statistic choices to provide comprehensive design and analysis advice which appropriately accounts for the impact of genotype errors.
Affiliation(s)
- Kaitlyn Cook
- Department of Mathematics, Carleton College, Northfield, MN, USA
- Alejandra Benitez
- Department of Applied Mathematics, Brown University, Providence, RI, USA
- Casey Fu
- Department of Mathematics, Massachusetts Institute of Technology, Boston, MA, USA
- Nathan Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College, Sioux Center, IA, USA
|
16
|
Zhao Z, Wang W, Wei Z. An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. Ann Appl Stat 2013. [DOI: 10.1214/13-aoas660] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
|
17
|
|
18
|
Cardinale CJ, Wei Z, Panossian S, Wang F, Kim CE, Mentch FD, Chiavacci RM, Kachelries KE, Pandey R, Grant SFA, Baldassano RN, Hakonarson H. Targeted resequencing identifies defective variants of decoy receptor 3 in pediatric-onset inflammatory bowel disease. Genes Immun 2013; 14:447-52. [DOI: 10.1038/gene.2013.43] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 06/11/2013] [Accepted: 07/19/2013] [Indexed: 12/14/2022]
|
19
|
Nurminen R, Lehtonen R, Auvinen A, Tammela TLJ, Wahlfors T, Schleutker J. Fine mapping of 11q13.5 identifies regions associated with prostate cancer and prostate cancer death. Eur J Cancer 2013; 49:3335-43. [PMID: 23830236 DOI: 10.1016/j.ejca.2013.06.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/09/2013] [Revised: 05/27/2013] [Accepted: 06/03/2013] [Indexed: 01/07/2023]
Abstract
BACKGROUND Chromosomal region 11q13-14 associates with prostate cancer (PrCa). Previously, we identified a rare intronic mutation on EMSY (11q13.5) that increases the risk of aggressive PrCa and associates with familial PrCa. Here, we further study the genetic structure and variants of the PrCa susceptibility region 11q13.5. METHODS This study included 2716 unselected hospital-based PrCa cases, 1318 cases of a screening trial and 908 controls of Finnish origin. We imputed single nucleotide polymorphisms (SNPs) and structural variants from the 1000 Genomes Project and validated the associations of the variants in two PrCa patient sets by genotyping. Genetic structure was studied with haplotype analysis. RESULTS Two independent regions at 11q13.5 were associated with PrCa risk. The most significant association was at EMSY (rs10899221, odds ratio (OR) 1.29-1.40, P=3.5 × 10^-4 to 0.002) near the previously identified mutation. Correlated intronic SNPs rs10899221 and rs72944758 formed with other EMSY variants common and rare haplotypes that were associated with increased risk (P=4.0 × 10^-4) and decreased risk (P=0.01) of PrCa, respectively. The other associated region was intergenic. Among the six validated variants, rs12277366 was significant in both patient sets (OR 1.15-1.17, P=0.01). Haplotypes associated with an increased risk (P=0.02) and a decreased risk (P=0.02) were identified. In addition, the intergenic region was strongly associated with PrCa death, with the most significant association at rs12277366 (OR=0.72, P=4.8 × 10^-5). CONCLUSIONS These findings indicate that 11q13.5 contributes to PrCa predisposition with complex genetic structure and is associated with PrCa death.
Affiliation(s)
- Riikka Nurminen
- Institute of Biomedical Technology/BioMediTech and Prostate Cancer Research Center, University of Tampere and Fimlab Laboratories, Biokatu 8, FI-33014 Tampere, Finland
|
20
|
Wu G, Zhi D. Pathway-based approaches for sequencing-based genome-wide association studies. Genet Epidemiol 2013; 37:478-94. [PMID: 23650134 DOI: 10.1002/gepi.21728] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 12/29/2012] [Revised: 03/04/2013] [Accepted: 03/29/2013] [Indexed: 01/07/2023]
Abstract
For analyzing complex trait association with sequencing data, most current studies test aggregated effects of variants in a gene or genomic region. Although gene-based tests have insufficient power even for moderately sized samples, pathway-based analyses combine information across multiple genes in biological pathways and may offer additional insight. However, most existing pathway association methods were originally designed for genome-wide association studies, and have not been comprehensively evaluated for sequencing data. Moreover, region-based rare variant association methods, although potentially applicable to pathway-based analysis by extending their region definition to gene sets, have never been rigorously tested. In the context of exome-based studies, we use simulated and real datasets to evaluate pathway-based association tests. Our simulation strategy adopts a genome-wide genetic model that distributes total genetic effects hierarchically into pathways, genes, and individual variants, allowing the evaluation of pathway-based methods with realistic quantifiable assumptions on the underlying genetic architectures. The results show that, although no single pathway-based association method offers superior performance in all simulated scenarios, a modification of the Gene Set Enrichment Analysis approach using statistics from single-marker tests without gene-level collapsing (weighted Kolmogorov-Smirnov [WKS]-Variant method) is consistently powerful. Interestingly, directly applying rare variant association tests (e.g., sequence kernel association test) to pathway analysis offers similar power, but the results are sensitive to assumptions of genetic architecture. We applied pathway association analysis to an exome-sequencing dataset of chronic obstructive pulmonary disease, and found that the WKS-Variant method confirms associated genes previously published.
Affiliation(s)
- Guodong Wu
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA
|
21
|
Oualkacha K, Dastani Z, Li R, Cingolani PE, Spector TD, Hammond CJ, Richards JB, Ciampi A, Greenwood CMT. Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness. Genet Epidemiol 2013; 37:366-76. [PMID: 23529756 DOI: 10.1002/gepi.21725] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 04/18/2012] [Revised: 02/20/2013] [Accepted: 02/25/2013] [Indexed: 12/27/2022]
Abstract
Recent progress in sequencing technologies makes it possible to identify rare and unique variants that may be associated with complex traits. However, the results of such efforts depend crucially on the use of efficient statistical methods and study designs. Although family-based designs might enrich a data set for familial rare disease variants, most existing rare variant association approaches assume independence of all individuals. We introduce here a framework for association testing of rare variants in family-based designs. This framework is an adaptation of the sequence kernel association test (SKAT) which allows us to control for family structure. Our adjusted SKAT (ASKAT) combines the SKAT approach and the factored spectrally transformed linear mixed models (FaST-LMM) algorithm to capture family effects based on an LMM incorporating the realized proportion of the genome that is identical by descent between pairs of individuals, and using restricted maximum likelihood methods for estimation. In simulation studies, we evaluated type I error and power of this proposed method and we showed that, regardless of the level of the trait heritability, our approach has good control of type I error and good power. Since our approach uses FaST-LMM to calculate variance components for the proposed mixed model, ASKAT is reasonably fast and can analyze hundreds of thousands of markers. Data from the UK twins consortium are presented to illustrate the ASKAT methodology.
Affiliation(s)
- Karim Oualkacha
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
|
22
|
Moore CB, Wallace JR, Frase AT, Pendergrass SA, Ritchie MD. Using BioBin to explore rare variant population stratification. Pacific Symposium on Biocomputing 2013:332-43. [PMID: 23424138 PMCID: PMC3638724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Rare variants (RVs) will likely explain additional heritability of many common complex diseases; however, the natural frequencies of rare variation across and between human populations are largely unknown. We have developed a powerful, flexible collapsing method called BioBin that utilizes prior biological knowledge using multiple publicly available database sources to direct analyses. Variants can be collapsed according to functional regions, evolutionarily conserved regions, regulatory regions, genes, and/or pathways without the need for external files. We conducted an extensive comparison of rare variant burden differences (MAF < 0.03) between two ancestry groups from 1000 Genomes Project data, Yoruba (YRI) and European descent (CEU) individuals. We found that 56.86% of gene bins, 72.73% of intergenic bins, 69.45% of pathway bins, 32.36% of ORegAnno annotated bins, and 9.10% of evolutionarily conserved regions (shared with primates) have statistically significant differences in RV burden. Ongoing efforts include examining additional regional characteristics using regulatory regions and protein binding domains. Our results show interesting variant differences between two ancestral populations and demonstrate that population stratification is a pervasive concern for sequence analyses.
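The collapsing step described in this abstract can be sketched generically. The code below is not BioBin and does not reproduce its annotation sources or statistics. The function name `rare_variant_burden`, the tuple layout of the input, and the example bin names are hypothetical. Only the MAF < 0.03 threshold comes from the abstract.

```python
from collections import defaultdict

def rare_variant_burden(variants, maf_threshold=0.03):
    """Sum minor-allele counts of rare variants within each bin.

    variants: iterable of (bin_name, maf, minor_allele_count) tuples,
              where bin_name is a gene, pathway, or region label
              (hypothetical input layout).
    maf_threshold: variants with MAF below this value count as rare.
    """
    burden = defaultdict(int)
    for bin_name, maf, count in variants:
        # Only rare variants contribute to the bin's burden
        if maf < maf_threshold:
            burden[bin_name] += count
    return dict(burden)

# Hypothetical example: two bins, one common variant filtered out
example = [
    ("GENE_A", 0.01, 3),
    ("GENE_A", 0.20, 50),   # common variant, excluded
    ("GENE_A", 0.02, 2),
    ("GENE_B", 0.029, 1),
]
burden_by_bin = rare_variant_burden(example)
```

Computing such per-bin burdens separately for each ancestry group would give the quantities whose differences the abstract reports. The significance test used for that comparison is not specified here and is left out of the sketch.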
Affiliation(s)
- Carrie B. Moore
- Center for Human Genetics Research, Vanderbilt University, 519 Light Hall, Nashville, TN 37232, USA
- John R. Wallace
- Center for Systems Genomics, Pennsylvania State University, 512 Wartik Laboratory, University Park, PA 16802, USA
- Alex T. Frase
- Center for Systems Genomics, Pennsylvania State University, 512 Wartik Laboratory, University Park, PA 16802, USA
- Sarah A. Pendergrass
- Center for Systems Genomics, Pennsylvania State University, 512 Wartik Laboratory, University Park, PA 16802, USA
- Marylyn D. Ritchie
- Center for Systems Genomics, Pennsylvania State University, 512 Wartik Laboratory, University Park, PA 16802, USA
|