1
Iñiguez-Muñoz S, Llinàs-Arias P, Ensenyat-Mendez M, Bedoya-López AF, Orozco JIJ, Cortés J, Roy A, Forsberg-Nilsson K, DiNome ML, Marzese DM. Hidden secrets of the cancer genome: unlocking the impact of non-coding mutations in gene regulatory elements. Cell Mol Life Sci 2024; 81:274. [PMID: 38902506 PMCID: PMC11335195 DOI: 10.1007/s00018-024-05314-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 12/07/2023] [Accepted: 06/06/2024] [Indexed: 06/22/2024]
Abstract
Discoveries in the field of genomics have revealed that non-coding genomic regions are not merely "junk DNA", but rather comprise critical elements involved in gene expression. These gene regulatory elements (GREs) include enhancers, insulators, silencers, and gene promoters. Notably, new evidence shows how mutations within these regions substantially influence gene expression programs, especially in the context of cancer. Advances in high-throughput sequencing technologies have accelerated the identification of somatic and germline single nucleotide mutations in non-coding genomic regions. This review provides an overview of somatic and germline non-coding single nucleotide alterations affecting transcription factor binding sites in GREs, specifically involved in cancer biology. It also summarizes the technologies available for exploring GREs and the challenges associated with studying and characterizing non-coding single nucleotide mutations. Understanding the role of GRE alterations in cancer is essential for improving diagnostic and prognostic capabilities in the precision medicine era, leading to enhanced patient-centered clinical outcomes.
Affiliation(s)
- Sandra Iñiguez-Muñoz
- Cancer Epigenetics Laboratory at the Cancer Cell Biology Group, Institut d'Investigació Sanitària Illes Balears (IdISBa), Palma, Spain
- Pere Llinàs-Arias
- Cancer Epigenetics Laboratory at the Cancer Cell Biology Group, Institut d'Investigació Sanitària Illes Balears (IdISBa), Palma, Spain
- Miquel Ensenyat-Mendez
- Cancer Epigenetics Laboratory at the Cancer Cell Biology Group, Institut d'Investigació Sanitària Illes Balears (IdISBa), Palma, Spain
- Andrés F Bedoya-López
- Cancer Epigenetics Laboratory at the Cancer Cell Biology Group, Institut d'Investigació Sanitària Illes Balears (IdISBa), Palma, Spain
- Javier I J Orozco
- Saint John's Cancer Institute, Providence Saint John's Health Center, Santa Monica, CA, USA
- Javier Cortés
- International Breast Cancer Center (IBCC), Pangaea Oncology, Quiron Group, 08017, Barcelona, Spain
- Medica Scientia Innovation Research SL (MEDSIR), 08018, Barcelona, Spain
- Faculty of Biomedical and Health Sciences, Department of Medicine, Universidad Europea de Madrid, 28670, Madrid, Spain
- Ananya Roy
- Department of Immunology, Genetics and Pathology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Karin Forsberg-Nilsson
- Department of Immunology, Genetics and Pathology and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- University of Nottingham Biodiscovery Institute, Nottingham, UK
- Maggie L DiNome
- Department of Surgery, Duke University School of Medicine, Durham, NC, USA
- Diego M Marzese
- Cancer Epigenetics Laboratory at the Cancer Cell Biology Group, Institut d'Investigació Sanitària Illes Balears (IdISBa), Palma, Spain.
- Department of Surgery, Duke University School of Medicine, Durham, NC, USA.
2
Eren KK, Çınar E, Karakurt HU, Özgür A. Improving the filtering of false positive single nucleotide variations by combining genomic features with quality metrics. Bioinformatics 2023; 39:btad694. [PMID: 38019945 PMCID: PMC10692869 DOI: 10.1093/bioinformatics/btad694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 10/16/2023] [Accepted: 11/28/2023] [Indexed: 12/01/2023]
Abstract
MOTIVATION Technical errors in sequencing or bioinformatics steps and difficulties in alignment at some genomic sites result in false positive (FP) variants. Filtering based on quality metrics is a common method for detecting FP variants, but setting thresholds to reduce FP rates may reduce the number of true positive variants by overlooking the more complex relationships between features. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and with the feature interpretability property to provide insights into model results.
RESULTS We propose a random forest-based model that utilizes genomic features to improve identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, recently introduced FP detection systems. We applied cost-sensitive training to avoid errors in misclassification of true variants and developed a model that provides a robust mechanism against misclassification of true variants while increasing the prediction rate of FP variants. This model can be easily re-trained when factors such as experimental protocols might alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model's predictions.
AVAILABILITY AND IMPLEMENTATION The software implementation can be found at https://github.com/ideateknoloji/FPDetect.
Affiliation(s)
- Kazım Kıvanç Eren
- Department of Computer Engineering, Kocaeli University, Kocaeli 41000, Turkey
- Esra Çınar
- R&D Department, Idea Technology Solutions LLC., Istanbul 34396, Turkey
- Hamza U Karakurt
- R&D Department, Idea Technology Solutions LLC., Istanbul 34396, Turkey
- Department of Bioengineering, Gebze Technical University, Kocaeli 41400, Turkey
- Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, Istanbul 34342, Turkey
3
Bykova M, Hou Y, Eng C, Cheng F. Quantitative trait locus (xQTL) approaches identify risk genes and drug targets from human non-coding genomes. Hum Mol Genet 2022; 31:R105-R113. [PMID: 36018824 PMCID: PMC9989738 DOI: 10.1093/hmg/ddac208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Revised: 08/18/2022] [Accepted: 08/19/2022] [Indexed: 11/13/2022]
Abstract
Advances and reduction of costs in various sequencing technologies allow for a closer look at variations present in the non-coding regions of the human genome. Correlating non-coding variants with large-scale multi-omic data holds the promise not only of a better understanding of likely causal connections between non-coding DNA and expression of traits but also identifying potential disease-modifying medicines. Genome-phenome association studies have created large datasets of DNA variants that are associated with multiple traits or diseases, such as Alzheimer's disease; yet, the functional consequences of variants, in particular of non-coding variants, remain largely unknown. Recent advances in functional genomics and computational approaches have led to the identification of potential roles of DNA variants, such as various quantitative trait locus (xQTL) techniques. Multi-omic assays and analytic approaches toward xQTL have identified links between genetic loci and human transcriptomic, epigenomic, proteomic and metabolomic data. In this review, we first discuss the recent development of xQTL from multi-omic findings. We then highlight multimodal analysis of xQTL and genetic data for identification of risk genes and drug targets using Alzheimer's disease as an example. We finally discuss challenges and future research directions (e.g. artificial intelligence) for annotation of non-coding variants in complex diseases.
Affiliation(s)
- Marina Bykova
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Yuan Hou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Charis Eng
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
- Feixiong Cheng
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
4
Zhao J, Zhu Y, Boerwinkle E, Xiong M. Pathway analysis with next-generation sequencing data. Eur J Hum Genet 2014; 23:507-15. [PMID: 24986826 DOI: 10.1038/ejhg.2014.121] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/14/2013] [Revised: 03/29/2014] [Accepted: 04/26/2014] [Indexed: 12/21/2022]
Abstract
Although pathway analysis methods have been developed and successfully applied to association studies of common variants, the statistical methods for pathway-based association analysis of rare variants have not been well developed. Many investigators observed highly inflated false-positive rates and low power in pathway-based tests of association of rare variants. The inflated false-positive rates and low true-positive rates of the current methods are mainly due to their lack of ability to account for gametic phase disequilibrium. To overcome these serious limitations, we develop a novel statistic that is based on the smoothed functional principal component analysis (SFPCA) for pathway association tests with next-generation sequencing data. The developed statistic has the ability to capture position-level variant information and account for gametic phase disequilibrium. By intensive simulations, we demonstrate that the SFPCA-based statistic for testing pathway association with either rare or common or both rare and common variants has the correct type 1 error rates. Also the power of the SFPCA-based statistic and 22 additional existing statistics are evaluated. We found that the SFPCA-based statistic has a much higher power than other existing statistics in all the scenarios considered. To further evaluate its performance, the SFPCA-based statistic is applied to pathway analysis of exome sequencing data in the early-onset myocardial infarction (EOMI) project. We identify three pathways significantly associated with EOMI after the Bonferroni correction. In addition, our preliminary results show that the SFPCA-based statistic has much smaller P-values to identify pathway association than other existing methods.
Affiliation(s)
- Jinying Zhao
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
- Yun Zhu
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
- Eric Boerwinkle
- Human Genetics Center, Division of Biostatistics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Momiao Xiong
- Human Genetics Center, Division of Biostatistics, University of Texas Health Science Center at Houston, Houston, TX, USA
5
Zakharov S, Teoh GHK, Salim A, Thalamuthu A. A method to incorporate prior information into score test for genetic association studies. BMC Bioinformatics 2014; 15:24. [PMID: 24450486 PMCID: PMC3904928 DOI: 10.1186/1471-2105-15-24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/03/2012] [Accepted: 01/17/2014] [Indexed: 12/13/2022]
Abstract
Background The interest of the scientific community in investigating the impact of rare variants on complex traits has stimulated the development of novel statistical methodologies for association studies. The fact that many of the recently proposed methods for association studies suffer from low power to identify a genetic association motivates the incorporation of prior knowledge into statistical tests.
Results In this article we propose a methodology to incorporate prior information into the region-based score test. Within our framework prior information is used to partition variants within a region into several groups, following which asymptotically independent group statistics are constructed and then combined into a global test statistic. Under the null hypothesis the distribution of our test statistic has lower degrees of freedom compared with those of the region-based score statistic. Theoretical power comparison, population genetics simulations and results from analysis of the GAW17 sequencing data set suggest that under some scenarios our method may perform as well as or outperform the score test and other competing methods.
Conclusions An approach which uses prior information to improve the power of the region-based score test is proposed. Theoretical power comparison, population genetics simulations and the results of GAW17 data analysis showed that for some scenarios power of our method is on the level with or higher than those of the score test and other methods.
Affiliation(s)
- Sergii Zakharov
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street, #02-01 Genome, Singapore 138672, Singapore.
6
Durtschi J, Margraf RL, Coonrod EM, Mallempati KC, Voelkerding KV. VarBin, a novel method for classifying true and false positive variants in NGS data. BMC Bioinformatics 2013; 14 Suppl 13:S2. [PMID: 24266885 PMCID: PMC3849648 DOI: 10.1186/1471-2105-14-s13-s2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
Abstract
Background Variant discovery for rare genetic diseases using Illumina genome or exome sequencing involves screening of up to millions of variants to find only the one or few causative variant(s). Sequencing or alignment errors create "false positive" variants, which are often retained in the variant screening process. Methods to remove false positive variants often retain many false positive variants. This report presents VarBin, a method to prioritize variants based on a false positive variant likelihood prediction.
Methods VarBin uses the Genome Analysis Toolkit variant calling software to calculate the variant-to-wild type genotype likelihood ratio at each variant change and position divided by read depth. The resulting Phred-scaled, likelihood-ratio by depth (PLRD) was used to segregate variants into 4 Bins with Bin 1 variants most likely true and Bin 4 most likely false positive. PLRD values were calculated for a proband of interest and 41 additional Illumina HiSeq, exome and whole genome samples (proband's family or unrelated samples). At variant sites without apparent sequencing or alignment error, wild type/non-variant calls cluster near -3 PLRD and variant calls typically cluster above 10 PLRD. Sites with systematic variant calling problems (evident by variant quality scores and biases as well as displayed on the IGV viewer) tend to have higher and more variable wild type/non-variant PLRD values. Depending on the separation of a proband's variant PLRD value from the cluster of wild type/non-variant PLRD values for background samples at the same variant change and position, the VarBin method's classification is assigned to each proband variant (Bin 1 to Bin 4).
Results To assess VarBin performance, Sanger sequencing was performed on 98 variants in the proband and background samples. True variants were confirmed in 97% of Bin 1 variants, 30% of Bin 2, and 0% of Bin 3/Bin 4.
Conclusions These data indicate that VarBin correctly classifies the majority of true variants as Bin 1 and Bin 3/4 contained only false positive variants. The "uncertain" Bin 2 contained both true and false positive variants. Future work will further differentiate the variants in Bin 2.
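The PLRD calculation described in the Methods can be sketched as follows. This is a minimal illustration, not the VarBin implementation: it assumes the two genotype likelihoods are already available as plain probabilities (in practice they would come from the variant caller), and the function names `plrd` and `separation`, as well as the use of the median to summarize the background cluster, are assumptions made for this example. The abstract does not state the cutoffs that map separation to Bins 1 to 4, so no binning is attempted.

```python
import math
from statistics import median


def plrd(lik_variant, lik_wildtype, depth):
    """Phred-scaled variant-to-wild-type likelihood ratio, divided by read depth."""
    if depth <= 0:
        raise ValueError("read depth must be positive")
    if lik_variant <= 0 or lik_wildtype <= 0:
        raise ValueError("likelihoods must be positive")
    # Phred scaling: 10 * log10 of the likelihood ratio.
    return 10.0 * math.log10(lik_variant / lik_wildtype) / depth


def separation(proband_plrd, background_plrds):
    """Distance of the proband PLRD from the background cluster.

    Assumption: the cluster of wild-type/non-variant calls at the same
    site is summarized by its median.
    """
    return proband_plrd - median(background_plrds)


# Hypothetical site: strong variant evidence in the proband at depth 20,
# background samples clustering near -3 PLRD as described in the abstract.
proband = plrd(1e-2, 1e-32, 20)
gap = separation(proband, [-3.2, -2.9, -3.0])
```

A large positive gap corresponds to the clean-site pattern described above, where variant calls sit well above the wild-type cluster.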
7
Liu K, Fast S, Zawistowski M, Tintle NL. A geometric framework for evaluating rare variant tests of association. Genet Epidemiol 2013; 37:345-57. [PMID: 23526307 DOI: 10.1002/gepi.21722] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 09/24/2012] [Revised: 02/12/2013] [Accepted: 02/13/2013] [Indexed: 11/08/2022]
Abstract
The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to noncausal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
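The two geometric quantities the framework compares, vector length and the angle between vectors, can be computed as in the sketch below. It only illustrates the representation: the allele-count vectors are made-up data, the function names are assumptions for this example, and no actual test statistic or p-value from the paper is reproduced.

```python
import math


def length(v):
    """Euclidean length of a vector of rare allele counts (one entry per site)."""
    return math.sqrt(sum(x * x for x in v))


def angle(u, v):
    """Angle in radians between two allele-count vectors."""
    lu, lv = length(u), length(v)
    if lu == 0 or lv == 0:
        raise ValueError("angle is undefined for a zero vector")
    cos = sum(a * b for a, b in zip(u, v)) / (lu * lv)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos)))


# Hypothetical rare allele counts at three variant sites.
cases = [3, 0, 4]
controls = [0, 5, 0]

length_difference = length(cases) - length(controls)
angle_between = angle(cases, controls)
```

In this toy data the two vectors have equal length but are orthogonal, so only a test sensitive to the angle could separate cases from controls, which is the distinction between length tests and joint tests drawn in the abstract.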
Affiliation(s)
- Keli Liu
- Department of Statistics, Harvard University, Cambridge, MA, USA
8
Assessing the impact of differential genotyping errors on rare variant tests of association. PLoS One 2013; 8:e56626. [PMID: 23472072 PMCID: PMC3589406 DOI: 10.1371/journal.pone.0056626] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 09/26/2012] [Accepted: 01/15/2013] [Indexed: 11/19/2022]
Abstract
Genotyping errors are well-known to impact the power and type I error rate in single marker tests of association. Genotyping errors that happen according to the same process in cases and controls are known as non-differential genotyping errors, whereas genotyping errors that occur with different processes in the cases and controls are known as differential genotype errors. For single marker tests, non-differential genotyping errors reduce power, while differential genotyping errors increase the type I error rate. However, little is known about the behavior of the new generation of rare variant tests of association in the presence of genotyping errors. In this manuscript we use a comprehensive simulation study to explore the effects of numerous factors on the type I error rate of rare variant tests of association in the presence of differential genotyping error. We find that increased sample size, decreased minor allele frequency, and an increased number of single nucleotide variants (SNVs) included in the test all increase the type I error rate in the presence of differential genotyping errors. We also find that the greater the relative difference in case-control genotyping error rates the larger the type I error rate. Lastly, as is the case for single marker tests, genotyping errors classifying the common homozygote as the heterozygote inflate the type I error rate significantly more than errors classifying the heterozygote as the common homozygote. In general, our findings are in line with results from single marker tests. To ensure that type I error inflation does not occur when analyzing next-generation sequencing data careful consideration of study design (e.g. use of randomization), caution in meta-analysis and using publicly available controls, and the use of standard quality control metrics is critical.
9
Ferguson J, Wheeler W, Fu Y, Prokunina-Olsson L, Zhao H, Sampson J. Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation. Eur J Hum Genet 2012; 21:680-6. [PMID: 23092956 DOI: 10.1038/ejhg.2012.220] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
With recent advances in sequencing, genotyping arrays, and imputation, GWAS now aim to identify associations with rare and uncommon genetic variants. Here, we describe and evaluate a class of statistics, generalized score statistics (GSS), that can test for an association between a group of genetic variants and a phenotype. GSS are a simple weighted sum of single-variant statistics and their cross-products. We show that the majority of statistics currently used to detect associations with rare variants are equivalent to choosing a specific set of weights within this framework. We then evaluate the power of various weighting schemes as a function of variant characteristics, such as MAF, the proportion associated with the phenotype, and the direction of effect. Ultimately, we find that two classical tests are robust and powerful, but details are provided as to when other GSS may perform favorably. The software package CRaVe is available at our website (http://dceg.cancer.gov/bb/tools/crave).
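The description of GSS as a weighted sum of single-variant statistics and their cross-products corresponds to a quadratic form in the vector of single-variant statistics. The sketch below shows that form under the assumption that each variant contributes one score `z[i]` and that the weights are given as a square matrix `w`; the names and the two weight choices are illustrative and are not taken from the CRaVe package.

```python
def gss(z, w):
    """Weighted sum of squared single-variant statistics and their cross-products.

    z: list of single-variant statistics, one per variant.
    w: square weight matrix; w[i][i] weights z[i]**2 and w[i][j] weights z[i]*z[j].
    """
    n = len(z)
    return sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))


z = [1.0, -2.0, 0.5]  # hypothetical single-variant statistics

# Identity weights drop all cross-products: a sum of squared statistics,
# which does not depend on the direction of each effect.
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# All-ones weights give the square of the summed statistics, so effects
# in opposite directions cancel.
ones = [[1.0] * 3 for _ in range(3)]

sum_of_squares = gss(z, identity)
squared_sum = gss(z, ones)
```

The gap between the two values (5.25 versus 0.25 here) shows how the choice of weights governs sensitivity to the direction of effect, one of the variant characteristics the abstract evaluates.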
Affiliation(s)
- John Ferguson
- Division of Biostatistics, Yale School of Public Health, New Haven, CT, USA
10
Sun YV, Sung YJ, Tintle N, Ziegler A. Identification of genetic association of multiple rare variants using collapsing methods. Genet Epidemiol 2012; 35 Suppl 1:S101-6. [PMID: 22128049 DOI: 10.1002/gepi.20658] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
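A collapsing step of the kind described above, combining multiple rare variants within a gene into one per-individual summary, might look like the following sketch. The genotype coding (0, 1, or 2 minor alleles), the 1% frequency cutoff, and the function name `collapse` are assumptions for illustration; the individual GAW17 contributions used a variety of collapsing rules.

```python
def collapse(genotypes, mafs, maf_threshold=0.01):
    """Collapse the rare variants of one gene into per-individual summaries.

    genotypes: one row per individual, one minor-allele count per variant.
    mafs: minor allele frequency of each variant.
    Returns (counts, carriers): the number of rare minor alleles each
    individual carries, and a 0/1 indicator of carrying at least one.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    counts = [sum(row[j] for j in rare) for row in genotypes]
    carriers = [1 if c > 0 else 0 for c in counts]
    return counts, carriers


# Hypothetical gene with three variants; only the first and third are rare.
genotypes = [
    [0, 1, 2],
    [0, 2, 0],
    [1, 0, 0],
]
mafs = [0.005, 0.2, 0.008]

counts, carriers = collapse(genotypes, mafs)
```

Either summary can then be tested against the phenotype with a standard single-predictor test, which is how collapsing sidesteps the sparsity of individual rare alleles.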
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA.
11
Melton PE, Pankratz N. Joint analyses of disease and correlated quantitative phenotypes using next-generation sequencing data. Genet Epidemiol 2012; 35 Suppl 1:S67-73. [PMID: 22128062 DOI: 10.1002/gepi.20653] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
The joint analysis of multiple disease phenotypes aims to increase statistical power and potentially identify pleiotropic genes involved in the biological development of common chronic diseases. As next-generation sequencing data become more common, it will be important to consider ways to maximize the ability to detect rare variants within the human genome. The two exome sequence data sets provided for analysis at Genetic Analysis Workshop 17 (GAW17) offered three quantitative phenotypes related to disease status in 200 simulated replicates for both families and unrelated individuals. Participants in Group 10 addressed the challenges and potential uses of next-generation sequencing data to identify causal variants through a broad range of statistical methods. These methods included investigating multiple phenotypes either through data reduction or joint methods, using family or unrelated individuals, and reducing the dimensionality inherent in these data. Most of the research teams regarded the use of multiple phenotypes as a means of increasing analytical power and as a way to clarify the biology of complex disease. Three major observations were gleaned from these Group 10 contributions. First, family and unrelated case-control samples are suited to finding different types of variants. In addition, collapsing either phenotypes or genotypes can reduce the dimensionality of the data and alleviate some of the problems of multiple testing. Finally, we were able to demonstrate in certain cases that performing a joint analysis of disease status and a quantitative trait can improve statistical power.
Affiliation(s)
- Phillip E Melton
- Department of Genetics, Texas Biomedical Research Institute, San Antonio, Texas, USA
12
Kazma R, Bailey JN. Population-based and family-based designs to analyze rare variants in complex diseases. Genet Epidemiol 2012; 35 Suppl 1:S41-7. [PMID: 22128057 DOI: 10.1002/gepi.20648] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/09/2022]
Abstract
Genotyping of rare variants on a large scale is now possible using next-generation sequencing. Sample selection is a crucial step in designing the genetic study of a complex disease, and knowledge of the efficiency and limitations of population-based and family-based designs can help researchers make the appropriate choice. The nine contributions to Group 5 of Genetic Analysis Workshop 17 evaluate population-based and family-based designs by comparing the results obtained with various methods applied to the mini-exome simulations. These simulations consisted of 200 replicates composed of unrelated individuals and eight extended pedigrees with genotypes and various phenotypes. The methods tested for association with a population-based and/or a family-based design, tested for linkage with a family-based design, or estimated heritability. We summarize the strengths and weaknesses of both designs. Although population-based designs seem more suitable for detecting the effect of multiple rare variants, family-based designs can potentially enrich the sample in rare variants, for which the effect would be concealed at the population level. However, as of today, the main limitation is still the high cost of next-generation sequencing.
Affiliation(s)
- Rémi Kazma
- Department of Epidemiology and Biostatistics and Institute for Human Genetics, University of California, San Francisco, CA 94143-3110, USA.
13
Abstract
We propose a two-stage design for the analysis of sequence variants in which a proportion of genes that show some evidence of association are identified initially and then followed up in an independent data set. We compare two different approaches. In both approaches the same summary measure (total number of minor alleles) is used for each gene in the initial analysis. In the first (simple) approach the same summary measure is used in the analysis of the independent data set. In the second (alternative) approach a more specific hypothesis is formed for the second stage; the summary measure used is the count of minor alleles in only those variants that in the initial data showed the same direction of association as was seen overall. We applied the methods to the simulated quantitative traits of Genetic Analysis Workshop 17, blind to the simulation model, and then evaluated their performance once the underlying model was known. Performance was similar for most genes, but the simple strategy considerably out-performed the alternative strategy for one gene, where most of the effect was due to very rare variants; this suggests that the alternative approach would not be advisable when the effect is seen in very rare variants. Further simulations are needed to investigate the potential superior power of the alternative method when some variants within a gene have opposing effects. Overall, the power to detect associations was low; this was also true when using a more powerful joint analysis that combined the two stages of the study.
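The difference between the two second-stage summary measures can be sketched as below. The abstract does not say how the direction of association was measured, so this example assumes the sign of the sample covariance between allele counts and the quantitative trait; that choice, the function names, and the toy data are all assumptions made for illustration.

```python
def _sign(x):
    return (x > 0) - (x < 0)


def _cross_product(x, y):
    """Sum of centered cross-products; its sign equals the sign of the covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def concordant_variants(genotypes, trait):
    """Stage 1: indices of variants whose direction matches the gene-level direction."""
    totals = [sum(row) for row in genotypes]  # total minor alleles per individual
    overall = _sign(_cross_product(totals, trait))
    n_variants = len(genotypes[0])
    return [
        j for j in range(n_variants)
        if _sign(_cross_product([row[j] for row in genotypes], trait)) == overall
    ]


def summary(genotypes, variants=None):
    """Stage 2 summary: minor-allele count over all variants (simple approach)
    or over a selected subset (alternative approach)."""
    if variants is None:
        variants = range(len(genotypes[0]))
    return [sum(row[j] for j in variants) for row in genotypes]


# Hypothetical stage 1 data: 4 individuals, 3 variants, one quantitative trait.
stage1 = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
trait1 = [2.0, 3.0, 0.5, 0.0]
keep = concordant_variants(stage1, trait1)

# Hypothetical independent stage 2 data for the same gene.
stage2 = [[1, 0, 2], [0, 1, 1]]
simple = summary(stage2)
alternative = summary(stage2, keep)
```

With very rare variants, the per-variant direction in stage 1 rests on only a few carriers, which is consistent with the abstract's observation that the alternative approach performed worse when the effect came from very rare variants.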
14
Dering C, Ziegler A, König IR, Hemmelmann C. Comparison of collapsing methods for the statistical analysis of rare variants. BMC Proc 2011; 5 Suppl 9:S115. [PMID: 22373249 PMCID: PMC3287839 DOI: 10.1186/1753-6561-5-s9-s115] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/26/2022]
Abstract
Novel technologies allow sequencing of whole genomes and are considered as an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
Affiliation(s)
- Carmen Dering
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Str. 1, 23562 Lübeck, Germany.
|
15
|
Bueno Filho JS, Morota G, Tran Q, Maenner MJ, Vera-Cala LM, Engelman CD, Meyers KJ. Analysis of human mini-exome sequencing data from Genetic Analysis Workshop 17 using a Bayesian hierarchical mixture model. BMC Proc 2011; 5 Suppl 9:S93. [PMID: 22373180 PMCID: PMC3287935 DOI: 10.1186/1753-6561-5-s9-s93] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022] Open
Abstract
Next-generation sequencing technologies are rapidly changing the field of genetic epidemiology and enabling exploration of the full allele frequency spectrum underlying complex diseases. Although sequencing technologies have shifted our focus toward rare genetic variants, statistical methods traditionally used in genetic association studies are inadequate for estimating effects of low minor allele frequency variants. For our study we use the Genetic Analysis Workshop 17 data from 697 unrelated individuals (genotypes for 24,487 autosomal variants from 3,205 genes). We apply a Bayesian hierarchical mixture model to identify genes associated with a simulated binary phenotype using a transformed genotype design matrix weighted by allele frequencies. A Metropolis-Hastings algorithm is used to jointly sample each indicator variable and additive genetic effect pair from its conditional posterior distribution, and remaining parameters are sampled by Gibbs sampling. This method identified 58 genes with a posterior probability greater than 0.8 for being associated with the phenotype. One of these 58 genes, PIK3C2B, was correctly identified as being associated with affected status based on the simulation process. This project demonstrates the utility of Bayesian hierarchical mixture models using a transformed genotype matrix to detect genes containing rare and common variants associated with a binary phenotype.
Affiliation(s)
- Julio S Bueno Filho
- Department of Dairy Science, University of Wisconsin-Madison, 444 Animal Science Building, 1675 Observatory Drive, Madison, WI 53706-1284, USA; Departamento de Ciências Exatas, Universidade Federal de Lavras, PO Box 3037, Lavras, MG 37200-000, Brazil
- Gota Morota
- Department of Dairy Science, University of Wisconsin-Madison, 444 Animal Science Building, 1675 Observatory Drive, Madison, WI 53706-1284, USA
- Quoc Tran
- Department of Statistics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53706, USA
- Matthew J Maenner
- Department of Population Health Sciences, University of Wisconsin-Madison, 707 WARF Building, 610 North Walnut Street, Madison, WI 53726, USA
- Lina M Vera-Cala
- Department of Population Health Sciences, University of Wisconsin-Madison, 707 WARF Building, 610 North Walnut Street, Madison, WI 53726, USA; Departamento de Salud Pública, Universidad Industrial de Santander, Carrera 32 #29-31 Piso 3, Bucaramanga, Santander 680002, Colombia
- Corinne D Engelman
- Department of Population Health Sciences, University of Wisconsin-Madison, 707 WARF Building, 610 North Walnut Street, Madison, WI 53726, USA
- Kristin J Meyers
- Department of Population Health Sciences, University of Wisconsin-Madison, 707 WARF Building, 610 North Walnut Street, Madison, WI 53726, USA
|
16
|
Abstract
The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Analysis Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
Affiliation(s)
- Jingyuan Zhao
- Human Genetics, Genome Institute of Singapore, 60 Biopolis Street 02-01, Singapore 138672.
|
17
|
Sung YJ, Rice TK, Rao DC. Application of collapsing methods for continuous traits to the Genetic Analysis Workshop 17 exome sequence data. BMC Proc 2011; 5 Suppl 9:S121. [PMID: 22373425 PMCID: PMC3287846 DOI: 10.1186/1753-6561-5-s9-s121] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Genetic Analysis Workshop 17 used real sequence data from the 1000 Genomes Project and simulated phenotypes influenced by a large number of rare variants. Our aim is to evaluate the performance of various collapsing methods that were developed for analysis of multiple rare variants. We apply collapsing methods to continuous phenotypes Q1 and Q2 for all 200 replicates of the unrelated individuals data. Within each gene, we collapse (1) all SNPs, (2) all SNPs with minor allele frequency (MAF) < 0.05, and (3) nonsynonymous SNPs with MAF < 0.05. We consider two tests when collapsing variants: using the proportion of variants and using the presence/absence of any variant. We also compare our results to a single-marker analysis using PLINK. For phenotype Q1, the proportion test for collapsing rare nonsynonymous SNPs often performed the best. Two genes (FLT1 and KDR) had statistically significant results. A single-marker analysis using PLINK also provided statistically significant results for some SNPs within these two genes. For phenotype Q2, collapsing rare nonsynonymous SNPs performed the best, with almost no difference between proportion and presence tests. However, neither collapsing methods nor a single-marker analysis provided statistically significant results at the true genes for Q2. We also found that a large number of noncausal genes had high correlations with causal genes for Q1 and Q2, which may account for inflated false positives.
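The two collapsed summaries compared in this abstract (proportion and presence/absence) can be illustrated with a short Python sketch. It is not the authors' implementation. The function names and the genotype layout (one row per individual, minor-allele counts 0/1/2 per SNP) are hypothetical, and "proportion" is assumed here to mean the share of rare sites in the gene at which an individual carries at least one minor allele.

```python
def minor_allele_frequency(genotypes, snp):
    """MAF of one SNP from diploid minor-allele counts (0, 1 or 2)."""
    n_individuals = len(genotypes)
    return sum(row[snp] for row in genotypes) / (2 * n_individuals)


def collapse_gene(genotypes, maf_threshold=0.05):
    """Collapse the rare SNPs of one gene into two per-individual scores.

    Returns (rare, proportion, presence) where
      rare       : indices of SNPs with MAF below the threshold
      proportion : share of rare sites carrying a minor allele (assumed
                   reading of the "proportion of variants" test)
      presence   : 1 if any rare site carries a minor allele, else 0
    """
    n_snps = len(genotypes[0])
    rare = [j for j in range(n_snps)
            if minor_allele_frequency(genotypes, j) < maf_threshold]
    proportion, presence = [], []
    for row in genotypes:
        carried = sum(1 for j in rare if row[j] > 0)
        proportion.append(carried / len(rare) if rare else 0.0)
        presence.append(1 if carried > 0 else 0)
    return rare, proportion, presence
```

Either score can then be used as a single predictor of the continuous phenotype, for example in a linear regression. Restricting the input columns to nonsynonymous SNPs before calling `collapse_gene` would correspond to the third collapsing scenario listed in the abstract.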
Affiliation(s)
- Yun Ju Sung
- Division of Biostatistics, Washington University School of Medicine, 660 S. Euclid Ave., St. Louis, MO 63110, USA.
|
18
|
Petersen A, Sitarik A, Luedtke A, Powers S, Bekmetjev A, Tintle NL. Evaluating methods for combining rare variant data in pathway-based tests of genetic association. BMC Proc 2011; 5 Suppl 9:S48. [PMID: 22373429 PMCID: PMC3287885 DOI: 10.1186/1753-6561-5-s9-s48] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022] Open
Abstract
Analyzing sets of genes in genome-wide association studies is a relatively new approach that aims to capitalize on biological knowledge about the interactions of genes in biological pathways. This approach, called pathway analysis or gene set analysis, has not yet been applied to the analysis of rare variants. Applying pathway analysis to rare variants offers two competing approaches. In the first approach rare variant statistics are used to generate p-values for each gene (e.g., combined multivariate collapsing [CMC] or weighted-sum [WS]) and the gene-level p-values are combined using standard pathway analysis methods (e.g., gene set enrichment analysis or Fisher’s combined probability method). In the second approach, rare variant methods (e.g., CMC and WS) are applied directly to sets of single-nucleotide polymorphisms (SNPs) representing all SNPs within genes in a pathway. In this paper we use simulated phenotype and real next-generation sequencing data from Genetic Analysis Workshop 17 to analyze sets of rare variants using these two competing approaches. The initial results suggest substantial differences in the methods, with Fisher’s combined probability method and the direct application of the WS method yielding the best power. Evidence suggests that the WS method works well in most situations, although Fisher’s method was more likely to be optimal when the number of causal SNPs in the set was low but the risk of the causal SNPs was high.
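Fisher's combined probability method, named in this abstract as one way to pool gene-level p-values into a pathway-level test, can be sketched as follows. The statistic is X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom when the k p-values are independent and each null hypothesis holds. The function name is hypothetical, and the independence assumption may not hold for correlated genes, so this is an illustration only.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Each p-value must lie in (0, 1].
    For an even number of degrees of freedom 2k, the chi-square upper
    tail has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!,
    so no external statistics library is needed.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    term = 1.0    # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i    # (x/2)^i / i!
        total += term
    return statistic, math.exp(-half) * total
```

In the first approach described above, the inputs would be the per-gene p-values from a rare-variant test such as CMC or WS for all genes in one pathway.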
Affiliation(s)
- Ashley Petersen
- Departments of Mathematics, Computer Science, and Statistics, St. Olaf College, 1520 St. Olaf Avenue, Northfield, MN 55057, USA
- Alexandra Sitarik
- Department of Mathematics, Wittenberg University, 200 West Ward Street, Springfield, OH 45501, USA
- Alexander Luedtke
- Division of Applied Mathematics, Brown University, 151 Thayer Street, Providence, RI 02912, USA
- Scott Powers
- Department of Statistics and Operations Research, University of North Carolina, 318 Hanes Hall, CB 3260, Chapel Hill, NC 27599-3260, USA
- Airat Bekmetjev
- Department of Mathematics, Statistics and Computer Science, Dordt College, 498 4th Ave. NE, Sioux Center, IA 51250, USA
- Nathan L Tintle
- Department of Mathematics, Statistics and Computer Science, Dordt College, 498 4th Ave. NE, Sioux Center, IA 51250, USA
|
19
|
Scholz M, Kirsten H. Comparison of scoring methods for the detection of causal genes with or without rare variants. BMC Proc 2011; 5 Suppl 9:S49. [PMID: 22373454 PMCID: PMC3287886 DOI: 10.1186/1753-6561-5-s9-s49] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 04/07/2023] Open
Abstract
Rare causal variants are believed to significantly contribute to the genetic basis of common diseases or quantitative traits. Appropriate statistical methods are required to discover the highest possible number of disease-relevant variants in a genome-wide screening study. The publicly available Genetic Analysis Workshop 17 data set consists of 697 individuals and 24,487 genetic variants. It includes a simulated complex disease model with intermediate quantitative phenotypes. We compare four gene-wise scoring methods with respect to ranking of causal genes under variable allele frequency thresholds for collapsing of rare variants and considering whether or not rare variants were included. We also compare causal genes for which the ranks differ clearly between scoring methods regarding such characteristics as number and strength of causal variants. We corroborated our findings with additional simulations. We found that the maximum statistics method was superior in assigning high ranks to genes with a single strong causal variant. Hotelling’s T² test was superior for genes with several independent causal variants. This was consistent for all phenotypes and was confirmed by single-gene analyses and additional simulations. The multivariate analysis performed similarly to Hotelling’s T² test. The least absolute shrinkage and selection operator (LASSO) analysis was widely comparable with the maximum statistics method. We conclude that the maximum statistics method is a superior alternative to Hotelling’s T² test if one expects only one independent causal variant per gene with a dominating effect. Such a variant could also be a supermarker derived by collapsing rare variants. Because the true nature of the genetic effect is unknown for real data, both methods need to be taken into consideration.
Affiliation(s)
- Markus Scholz
- Institute for Medical Informatics, Statistics, and Epidemiology (IMISE), Universität Leipzig, Härtelstrasse 16-18, 04107 Leipzig, Germany.
|
20
|
Liu T, Thalamuthu A. Identity by descent and association analysis of dichotomous traits based on large pedigrees. BMC Proc 2011; 5 Suppl 9:S31. [PMID: 22373483 PMCID: PMC3287867 DOI: 10.1186/1753-6561-5-s9-s31] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022] Open
Abstract
The goal of our analysis was to map functional loci that contribute to the case-control status of a trait of interest, using large pedigrees. We used logistic regression fitted with the generalized estimation equation to test associations between a dichotomous phenotype and all genotyped common and rare single-nucleotide polymorphisms. In addition to the association study, we also developed and applied a simple and fast identical-by-descent-based test to identify loci that were shared among affected individuals more often than expected by chance. Among the top significant loci, we assessed the statistical power and the false discovery rate of both methods. We also demonstrated that family-based studies, compared with the standard population-based association studies, have great value and clear advantages for the discovery of multiple rare causal variants.
Affiliation(s)
- Tian Liu
- Human Genetics Group, Genome Institute of Singapore, 60 Biopolis Street #02-01, Singapore 138672.
|
21
|
Jiang R, Dong J. Detecting rare functional variants using a wavelet-based test on quantitative and qualitative traits. BMC Proc 2011; 5 Suppl 9:S70. [PMID: 22373061 PMCID: PMC3287910 DOI: 10.1186/1753-6561-5-s9-s70] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
We conducted a genome-wide association study on the Genetic Analysis Workshop 17 simulated unrelated individuals data using a multilocus score test based on wavelet transformation that we proposed recently. Wavelet transformation is an advanced smoothing technique, whereas the currently popular collapsing methods are the simplest way to smooth multilocus genotypes. The wavelet-based test suppresses noise from the data more effectively, which results in lower type I error rates. We chose a level-dependent threshold for the wavelet-based test to suppress the optimal amount of noise according to the data. We propose several remedies to reduce the inflated type I error rate: using a window of fixed size rather than a gene; using the Bonferroni correction rather than comparing to the maxima of test values for multiple testing corrections; and removing the influence of other factors by using residuals for the association test. A wavelet-based test can detect multiple rare functional variants. Type I error rates can be controlled using the wavelet-based test combined with the mentioned remedies.
Affiliation(s)
- Renfang Jiang
- Department of Mathematical Sciences, Michigan Technological University, Fisher Hall, Room 319, 1400 Townsend Drive, Houghton, MI 49931-1295, USA.
|
22
|
Fardo DW, Druen AR, Liu J, Mirea L, Infante-Rivard C, Breheny P. Exploration and comparison of methods for combining population- and family-based genetic association using the Genetic Analysis Workshop 17 mini-exome. BMC Proc 2011; 5 Suppl 9:S28. [PMID: 22373349 PMCID: PMC3287863 DOI: 10.1186/1753-6561-5-s9-s28] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022] Open
Abstract
We examine the performance of various methods for combining family- and population-based genetic association data. Several approaches have been proposed for situations in which information is collected from both a subset of unrelated subjects and a subset of family members. Analyzing these samples separately is known to be inefficient, and it is important to determine the scenarios for which differing methods perform well. Others have investigated this question; however, no extensive simulations have been conducted, nor have these methods been applied to mini-exome-style data such as that provided by Genetic Analysis Workshop 17. We quantify the empirical power and false-positive rates for three existing methods applied to the Genetic Analysis Workshop 17 mini-exome data and compare relative performance. We use knowledge of the underlying data simulation model to make these assessments.
Affiliation(s)
- David W Fardo
- Department of Biostatistics, University of Kentucky College of Public Health, 121 Washington Avenue, Lexington, KY 40536, USA.
|
23
|
Thalamuthu A, Zhao J, Keong GTH, Kondragunta V, Mukhopadhyay I. Association tests for rare and common variants based on genotypic and phenotypic measures of similarity between individuals. BMC Proc 2011; 5 Suppl 9:S89. [PMID: 22373048 PMCID: PMC3287930 DOI: 10.1186/1753-6561-5-s9-s89] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Genome-wide association studies have helped us identify thousands of common variants associated with several widespread complex diseases. However, for most traits, these variants account for only a small fraction of phenotypic variance or heritability. Next-generation sequencing technologies are being used to identify additional rare variants hypothesized to have higher effect sizes than the already identified common variants, and to contribute significantly to the fraction of heritability that is still unexplained. Several pooling strategies have been proposed to test the joint association of multiple rare variants, because testing them individually may not be optimal. Within a gene or genomic region, if there are both rare and common variants, testing their joint association may be desirable to determine their synergistic effects. We propose new methods to test the joint association of several rare and common variants with binary and quantitative traits. Our association test for quantitative traits is based on genotypic and phenotypic measures of similarity between pairs of individuals. For the binary trait or case-control samples, we recently proposed an association test based on the genotypic similarity between individuals. Here, we develop a modified version of this test for rare variants. Our tests can be used for samples taken from multiple subpopulations. The power of our test statistics for case-control samples and quantitative traits was evaluated using the GAW17 simulated data sets. Type I error rates for the proposed tests are well controlled. Our tests are able to identify some of the important causal genes in the GAW17 simulated data sets.
Affiliation(s)
- Anbupalam Thalamuthu
- Human Genetics, 60 Biopolis Street 02-01, Genome Institute of Singapore, Singapore 138672.
|
24
|
Yang W, Gu CC. Enrichment analysis of genetic association in genes and pathways by aggregating signals from both rare and common variants. BMC Proc 2011; 5 Suppl 9:S52. [PMID: 22373052 PMCID: PMC3287890 DOI: 10.1186/1753-6561-5-s9-s52] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022] Open
Abstract
New high-throughput sequencing technologies have brought forth opportunities for unbiased analysis of thousands of rare genomic variants in genome-wide association studies of complex diseases. Because it is hard to detect single rare variants with appreciable effect sizes at the population level, existing methods mostly aggregate effects of multiple markers by collapsing the rare variants in genes (or genomic regions). We hypothesize that a higher level of aggregation can further improve association signal strength. Using the Genetic Analysis Workshop 17 simulated data, we test a two-step strategy that first applies a collapsing method in a gene-level analysis and then aggregates the gene-level test results by performing an enrichment analysis in gene sets. We find that the gene set approach which combines signals across multiple genes outperforms testing individual genes separately and that the power of the gene set enrichment test is further improved by proper adjustment of statistics to account for gene-wise differences.
Affiliation(s)
- Wei Yang
- Division of Biostatistics, Washington University School of Medicine, Box 8067, 660 South Euclid Avenue, St. Louis, MO 63110, USA.
|
25
|
Culverhouse RC, Hinrichs AL, Suarez BK. Stratify or adjust? Dealing with multiple populations when evaluating rare variants. BMC Proc 2011; 5 Suppl 9:S101. [PMID: 22373399 PMCID: PMC3287824 DOI: 10.1186/1753-6561-5-s9-s101] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
The unrelated individuals sample from Genetic Analysis Workshop 17 consists of a small number of subjects from eight population samples and genetic data composed mostly of rare variants. We compare two simple approaches to collapsing rare variants within genes for their utility in identifying genes that affect phenotype. We also compare results from stratified analyses to those from a pooled analysis that uses ethnicity as a covariate. We found that the two collapsing approaches were similarly effective in identifying genes that contain causative variants in these data. However, including population as a covariate was not an effective substitute for analyzing the subpopulations separately when only one subpopulation contained a rare variant linked to the phenotype.
Affiliation(s)
- Robert C Culverhouse
- Department of Medicine, Washington University School of Medicine, 660 South Euclid Avenue, Saint Louis, MO 63110, USA.
|
26
|
Lamina C. Digging into the extremes: a useful approach for the analysis of rare variants with continuous traits? BMC Proc 2011; 5 Suppl 9:S105. [PMID: 22373517 PMCID: PMC3287828 DOI: 10.1186/1753-6561-5-s9-s105] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 01/12/2023] Open
Abstract
The common disease/rare variant hypothesis predicts that rare variants with large effects will have a strong impact on corresponding phenotypes. Therefore it is assumed that rare functional variants are enriched in the extremes of the phenotype distribution. In this analysis of the Genetic Analysis Workshop 17 data set, my aim is to detect genes with rare variants that are associated with quantitative traits using two general approaches: analyzing the association with the complete distribution of values by means of linear regression and using statistical tests based on the tails of the distribution (bottom 10% of values versus top 10%). Three methods are used for this extreme phenotype approach: Fisher’s exact test, weighted-sum method, and beta method. Rare variants were collapsed on the gene level. Linear regression including all values provided the highest power to detect rare variants. Of the three methods used in the extreme phenotype approach, the beta method performed best. Furthermore, the sample size was enriched in this approach by adding additional samples with extreme phenotype values. Doubling the sample size using this approach, which corresponds to only 40% of sample size of the original continuous trait, yielded a comparable or even higher power than linear regression. If samples are selected primarily for sequencing, enriching the analysis by gathering a greater proportion of individuals with extreme values in the phenotype of interest rather than in the general population leads to a higher power to detect rare variants compared to analyzing a population-based sample with equivalent sample size.
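The extreme-phenotype comparison described in this abstract (bottom 10% versus top 10% of trait values, tested with Fisher's exact test) can be sketched in Python. This is an illustration under assumptions, not the author's code: the function names are hypothetical, `carriers` is assumed to be a 0/1 indicator of carrying any collapsed rare variant in the gene, and the two-sided p-value uses the common convention of summing all tables no more probable than the observed one.

```python
import math


def tail_table(phenotypes, carriers, fraction=0.10):
    """Build a 2x2 table from the top and bottom tails of the trait.

    Returns (a, b, c, d): carriers and non-carriers in the top tail,
    then carriers and non-carriers in the bottom tail.
    """
    order = sorted(range(len(phenotypes)), key=lambda i: phenotypes[i])
    k = max(1, math.floor(len(order) * fraction + 1e-9))
    bottom, top = order[:k], order[-k:]
    a = sum(carriers[i] for i in top)
    c = sum(carriers[i] for i in bottom)
    return a, k - a, c, k - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return (math.comb(row1, x) * math.comb(row2, col1 - x)
                / math.comb(n, col1))

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))
```

A gene would be tested by passing its table to the test, for example `fisher_exact_two_sided(*tail_table(trait, carriers))`. The linear regression on the full trait distribution, which the abstract reports as more powerful at equal sample size, is not shown here.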
Affiliation(s)
- Claudia Lamina
- Division of Genetic Epidemiology, Department of Medical Genetics, Molecular, and Clinical Pharmacology, Innsbruck Medical University, Schöpfstrasse 41, 6020 Innsbruck, Austria.
|
27
|
Tintle N, Aschard H, Hu I, Nock N, Wang H, Pugh E. Inflated type I error rates when using aggregation methods to analyze rare variants in the 1000 Genomes Project exon sequencing data in unrelated individuals: summary results from Group 7 at Genetic Analysis Workshop 17. Genet Epidemiol 2011; 35 Suppl 1:S56-60. [PMID: 22128060 PMCID: PMC3249221 DOI: 10.1002/gepi.20650] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 01/24/2023]
Abstract
As part of Genetic Analysis Workshop 17 (GAW17), our group considered the application of novel and standard approaches to the analysis of genotype-phenotype association in next-generation sequencing data. Our group identified a major issue in the analysis of the GAW17 next-generation sequencing data: type I error and false-positive report probability rates higher than those expected based on empirical type I error levels (as high as 90%). Two main causes emerged: population stratification and long-range correlation (gametic phase disequilibrium) between rare variants. Population stratification was expected because of the diverse sample. Correlation between rare variants was attributable to both random causes (e.g., nearly 10,000 of 25,000 markers were private variants, and the sample size was small [n = 697]) and nonrandom causes (more correlation was observed than was expected by random chance). Principal components analysis was used to control for population structure and helped to minimize type I errors, but this was at the expense of identifying fewer causal variants. A novel multiple regression approach showed promise to handle correlation between markers. Further work is needed, first, to identify best practices for the control of type I errors in the analysis of sequencing data and then to explore and compare the many promising new aggregating approaches for identifying markers associated with disease phenotypes.
Affiliation(s)
- Nathan Tintle
- Department of Mathematics, Statistics, and Computer Science, Dordt College, Sioux Center, IA 51250, USA.
|