1
Lin P, Gao J, He W, Nie L, Schauer JJ, Yang S, Xu Y, Zhang Y. Estimation of commercial cooking emissions in real-world operation: Particulate and gaseous emission factors, activity influencing and modelling. Environmental Pollution (Barking, Essex: 1987) 2021; 289:117847. [PMID: 34388553 DOI: 10.1016/j.envpol.2021.117847] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/08/2021] [Revised: 07/02/2021] [Accepted: 07/24/2021] [Indexed: 06/13/2023]
Abstract
Measurements of real-world cooking emission factors (CEFs) have rarely been reported in recent studies, yet accurate CEF estimates are urgently needed to build cooking emission inventories and to implement control measures. In this study, we collected cooking emission aerosols from real-world commercial cooking operations in Beijing, China. Two particulate (PM2.5, OC) and two gaseous (NMHC, OVOCs) CEF species were examined against the activity conditions that influence them: cuisine type, control technology, operation scale (represented by the number of cook stoves), exhaust air volume, location and operation period. Measured NMHC emission factors (Non-barbecue: 8.19 ± 9.06 g/h and Barbecue: 35.48 ± 11.98 g/h) were about twice the PM2.5 emission factors (Non-barbecue: 4.88 ± 3.43 g/h and Barbecue: 15.48 ± 7.22 g/h). T-tests showed significantly higher CEFs for barbecue than for non-barbecue cuisines, for both particulate and gaseous species. Control technology decreased PM2.5 CEFs by 50 % on average while increasing OC particulate CEFs by 50 %. Control equipment had no significant effect on NMHC and OVOC exhaust concentrations. CEF variation with cook stove number and exhaust air volume also reflected the combined effect of operation scale, cuisine type and control technology. The relationships between the activity factors and CEFs were then estimated with a hierarchical multiple regression model. The R² of this model for PM2.5 CEFs was 0.80 (6.17 × 10⁻⁹), with standardized regression coefficients for cuisine type, location, sampling period, control technology, cook stove number (N) and N² of 5.18 (0.02), 5.33 (0.02), 1.93 (0.19), 9.29 (4.18 × 10⁻⁶), 9.10 (1.71 × 10⁻³) and -1.18 (2.43 × 10⁻³), respectively.
In perspective, our study provides ways to better estimate CEFs under real operating conditions and potentially highlights the greater importance of cooking emissions for air quality and human health.
Affiliation(s)
- Pengchuan Lin
- State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, 100049, China
- Jian Gao
- State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Wanqing He
- Beijing Key Laboratory of Urban Atmospheric Volatile Organic Compounds Pollution Control and Application, Beijing Municipal Research Institute of Environmental Protection, Beijing, 100037, China
- Lei Nie
- Beijing Key Laboratory of Urban Atmospheric Volatile Organic Compounds Pollution Control and Application, Beijing Municipal Research Institute of Environmental Protection, Beijing, 100037, China
- James J Schauer
- Environmental Chemistry and Technology Program, University of Wisconsin-Madison, Madison, WI, 53706, USA; Wisconsin State Laboratory of Hygiene, University of Wisconsin-Madison, Madison, WI, 53718, USA
- Shujian Yang
- College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, 100049, China
- Yisheng Xu
- State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Yuanxun Zhang
- College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, 100049, China; CAS Center for Excellence in Regional Atmospheric Environment, Chinese Academy of Sciences, Xiamen, 361021, China; Yanshan Earth Critical Zone and Surface Fluxes Research Station, University of Chinese Academy of Sciences, Beijing, 101408, China
2
Bendová B, Piálek J, Ďureje Ľ, Schmiedová L, Čížková D, Martin JF, Kreisinger J. How being synanthropic affects the gut bacteriome and mycobiome: comparison of two mouse species with contrasting ecologies. BMC Microbiol 2020; 20:194. [PMID: 32631223 PMCID: PMC7336484 DOI: 10.1186/s12866-020-01859-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/03/2019] [Accepted: 06/16/2020] [Indexed: 02/08/2023]
Abstract
Background: The vertebrate gastrointestinal tract is colonised by microbiota that have a major effect on the host’s health, physiology and phenotype. Once introduced into captivity, however, the gut microbial composition of free-living individuals can change dramatically. At present, little is known about gut microbial changes associated with adaptation to a synanthropic lifestyle in commensal species, compared with their non-commensal counterparts. Here, we compare the taxonomic composition and diversity of bacterial and fungal communities across three gut sections in the synanthropic house mouse (Mus musculus) and the closely related non-synanthropic mound-building mouse (Mus spicilegus). Results: Using Illumina sequencing of bacterial 16S rRNA amplicons, we found higher bacterial diversity in M. spicilegus and detected 11 bacterial operational taxonomic units with significantly different proportions. Notably, Oscillospira, which is typically more abundant in lean or outdoor pasturing animals, was more abundant in non-commensal M. spicilegus. ITS2-based barcoding revealed low diversity and high uniformity of gut fungi in both species, with the genus Kazachstania clearly dominant. Conclusions: Though differences in gut bacteria observed in the two species can be associated with their close association with humans, changes due to a move from commensalism to captivity would appear to have caused larger shifts in microbiota.
Affiliation(s)
- Barbora Bendová
- Department of Zoology, Faculty of Science, Charles University, Prague, Czech Republic; Studenec Research Facility, Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
- Jaroslav Piálek
- Studenec Research Facility, Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
- Ľudovít Ďureje
- Studenec Research Facility, Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
- Lucie Schmiedová
- Department of Zoology, Faculty of Science, Charles University, Prague, Czech Republic; Studenec Research Facility, Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
- Dagmar Čížková
- Studenec Research Facility, Institute of Vertebrate Biology, Czech Academy of Sciences, Brno, Czech Republic
- Jean-Francois Martin
- CBGP, Montpellier SupAgro, INRA, CIRAD, IRD, Univ Montpellier, Montferrier-sur-Lez, France
- Jakub Kreisinger
- Department of Zoology, Faculty of Science, Charles University, Prague, Czech Republic
3
Wen J, Ford CT, Janies D, Shi X. A parallelized strategy for epistasis analysis based on Empirical Bayesian Elastic Net models. Bioinformatics 2020; 36:3803-3810. [PMID: 32227194 DOI: 10.1093/bioinformatics/btaa216] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/19/2019] [Revised: 03/05/2020] [Accepted: 03/26/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION Epistasis reflects the distortion of a particular trait or phenotype resulting from the combinatorial effect of two or more genes or genetic variants. Epistasis is an important genetic foundation underlying quantitative traits in many organisms as well as in complex human diseases. However, there are two major barriers to identifying epistasis using large genomic datasets. The first is that epistasis analysis induces over-fitting of an over-saturated model, given the high dimensionality of a genomic dataset. Therefore, the problem of identifying epistasis demands efficient statistical methods. The second barrier is the intensive computing time that epistasis analysis requires, even when the appropriate model and data are specified. RESULTS In this study, we combine statistical and computational techniques to scale up epistasis analysis using Empirical Bayesian Elastic Net (EBEN) models. Specifically, we first apply a matrix manipulation strategy for pre-computing the correlation matrix and a pre-filter to narrow down the search space for epistasis analysis. We then develop a parallelized approach to further accelerate the modeling process. Our experiments on synthetic and empirical genomic data demonstrate that our parallelized methods offer a tens-of-fold speedup over the classical EBEN method, which runs sequentially. We applied our parallelized approach to a yeast dataset and were able to identify both main and epistatic effects of genetic variants associated with traits such as fitness. AVAILABILITY AND IMPLEMENTATION The software is available at github.com/shilab/parEBEN.
Affiliation(s)
- Jia Wen
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA
- Colby T Ford
- Department of Bioinformatics and Genomics, College of Computing and Informatics; School of Data Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Daniel Janies
- Department of Bioinformatics and Genomics, College of Computing and Informatics
- Xinghua Shi
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
4
Marceau West R, Lu W, Rotroff DM, Kuenemann MA, Chang SM, Wu MC, Wagner MJ, Buse JB, Motsinger-Reif AA, Fourches D, Tzeng JY. Identifying individual risk rare variants using protein structure guided local tests (POINT). PLoS Comput Biol 2019; 15:e1006722. [PMID: 30779729 PMCID: PMC6396946 DOI: 10.1371/journal.pcbi.1006722] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/28/2018] [Revised: 03/01/2019] [Accepted: 12/17/2018] [Indexed: 01/08/2023]
Abstract
Rare variants are of increasing interest to genetic association studies because of their etiological contributions to human complex diseases. Due to the rarity of the mutant events, rare variants are routinely analyzed on an aggregate level. While aggregation analyses improve the detection of global-level signal, they are not able to pinpoint causal variants within a variant set. To perform inference on a localized level, additional information, e.g., biological annotation, is often needed to boost the information content of a rare variant. Following the observation that important variants are likely to cluster together on functional domains, we propose a protein structure guided local test (POINT) to provide variant-specific association information using structure-guided aggregation of signal. Constructed under a kernel machine framework, POINT performs local association testing by borrowing information from neighboring variants in the 3-dimensional protein space in a data-adaptive fashion. Besides merely providing a list of promising variants, POINT assigns each variant a p-value to permit variant ranking and prioritization. We assess the selection performance of POINT using simulations and illustrate how it can be used to prioritize individual rare variants in PCSK9, ANGPTL4 and CETP in the Action to Control Cardiovascular Risk in Diabetes (ACCORD) clinical trial data.
Affiliation(s)
- Rachel Marceau West
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Daniel M. Rotroff
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
- Melaine A. Kuenemann
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Sheng-Mao Chang
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Michael C. Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Michael J. Wagner
- Center for Pharmacogenomics and Individualized Therapy, University of North Carolina, Chapel Hill, North Carolina, United States of America
- John B. Buse
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, North Carolina, United States of America
- Alison A. Motsinger-Reif
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Denis Fourches
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Chemistry, North Carolina State University, Raleigh, North Carolina, United States of America
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
5
Novel Methods for Family-Based Genetic Studies. Methods Mol Biol 2018. [PMID: 29876895 DOI: 10.1007/978-1-4939-7868-7_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
The recent development of microarray and sequencing technology allows identification of disease susceptibility genes. Although genome-wide association studies (GWAS) have successfully identified many genetic markers related to human diseases, traditional statistical methods are not powerful enough to detect rare genetic markers. Rare genetic markers are therefore usually grouped together and tested at the set level. One such method is the sequence kernel association test (SKAT), which is commonly used in rare genetic marker analysis. In recent publications, SKAT has been extended to family-based rare variant analysis. Here, I present three published statistical approaches for family-based rare variant analysis of: 1. continuous traits, 2. binary traits, and 3. multiple correlated traits.
6
Wang Z, Hall B, Xu J, Shi X. A Sparse Learning Framework for Joint Effect Analysis of Copy Number Variants. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1013-1027. [PMID: 28991724 DOI: 10.1109/tcbb.2015.2462332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Copy number variants (CNVs), including large deletions and duplications, represent an unbalanced change of DNA segments. Abundant in human genomes, CNVs contribute to a large proportion of human genetic diversity, with impact on many human phenotypes. Although recent advances in genetic studies have shed light on the impact of individual CNVs on different traits, the analysis of the joint effect of multiple interacting CNVs lags behind in many respects. A primary reason is that the large number of CNV combinations and interactions in the human genome makes such joint analysis computationally challenging. To address this challenge, we developed a novel framework that combines sparse learning with biological networks to identify interacting CNVs with a joint effect on particular traits. We showed that our approach performs well in identifying CNVs with a joint phenotypic effect using simulated data. Applied to a real human genomic dataset from the 1000 Genomes Project, our approach identified multiple CNVs that collectively contribute to population differentiation. We found a set of CNVs that have a joint effect in different populations and affect gene expression differently in distinct populations. These results provide a collection of CNVs that likely have downstream biomedical implications in individuals from diverse population backgrounds.
7
Pereira M, Thompson JR, Weichenberger CX, Thomas DC, Minelli C. Inclusion of biological knowledge in a Bayesian shrinkage model for joint estimation of SNP effects. Genet Epidemiol 2017; 41:320-331. [PMID: 28393391 DOI: 10.1002/gepi.22038] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 09/02/2016] [Revised: 12/18/2016] [Accepted: 12/26/2016] [Indexed: 01/04/2023]
Abstract
With the aim of improving detection of novel single-nucleotide polymorphisms (SNPs) in genetic association studies, we propose a method of including prior biological information in a Bayesian shrinkage model that jointly estimates SNP effects. We assume that the SNP effects follow a normal distribution centered at zero with variance controlled by a shrinkage hyperparameter. We use biological information to define the amount of shrinkage applied on the SNP effects distribution, so that the effects of SNPs with more biological support are less shrunk toward zero, thus being more likely detected. The performance of the method was tested in a simulation study (1,000 datasets, 500 subjects with ∼200 SNPs in 10 linkage disequilibrium (LD) blocks) using a continuous and a binary outcome. It was further tested in an empirical example on body mass index (continuous) and overweight (binary) in a dataset of 1,829 subjects and 2,614 SNPs from 30 blocks. Biological knowledge was retrieved using the bioinformatics tool Dintor, which queried various databases. The joint Bayesian model with inclusion of prior information outperformed the standard analysis: in the simulation study, the mean ranking of the true LD block was 2.8 for the Bayesian model versus 3.6 for the standard analysis of individual SNPs; in the empirical example, the mean ranking of the six true blocks was 8.5 versus 9.3 in the standard analysis. These results suggest that our method is more powerful than the standard analysis. We expect its performance to improve further as more biological information about SNPs becomes available.
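The shrinkage idea in this abstract can be illustrated with a deliberately simplified sketch. It assumes a single SNP, a known noise variance and a normal prior centred at zero, which gives the textbook conjugate posterior mean; it is not the authors' joint multi-SNP model, and the function name, the toy data and the prior variances are all hypothetical. A larger prior variance stands in for "more biological support" and shrinks the estimate less.

```python
def posterior_mean_effect(x, y, prior_var, noise_var=1.0):
    """Posterior mean of beta in y = beta * x + noise, with beta ~ N(0, prior_var).

    Conjugate normal result with known noise variance:
    E[beta | data] = sum(x*y) / (sum(x*x) + noise_var / prior_var).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + noise_var / prior_var)


# Hypothetical toy data: centred genotype dosages and a centred phenotype.
x = [-1.0, 0.0, 1.0, -1.0, 1.0, 0.0]
y = [-0.9, 0.1, 1.2, -1.1, 0.8, -0.1]

# An infinite prior variance means no shrinkage (ordinary least squares).
ols = posterior_mean_effect(x, y, prior_var=float("inf"))
# Assumed mapping: strong biological support -> wide prior, weak -> narrow prior.
strong_support = posterior_mean_effect(x, y, prior_var=1.0)
weak_support = posterior_mean_effect(x, y, prior_var=0.1)

print(ols, strong_support, weak_support)
```

With the same data, the SNP given the narrow prior is pulled much closer to zero than the one given the wide prior, which is the mechanism the abstract relies on to favour biologically supported SNPs.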
Affiliation(s)
- Miguel Pereira
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
- John R Thompson
- Department of Health Sciences, University of Leicester, Leicester, United Kingdom
- Christian X Weichenberger
- Center for Biomedicine, European Academy of Bolzano/Bozen (EURAC), Bolzano, Italy, Affiliated to the University of Lübeck, Lübeck, Germany
- Duncan C Thomas
- Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Cosetta Minelli
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
8
Abstract
In human genome research, genetic association studies of rare variants have been widely conducted since the advent of high-throughput DNA sequencing platforms. However, detection of outcome-related rare variants remains a statistically challenging problem because observed genetic mutations are extremely rare. Recently, a power set-based statistical selection procedure has been proposed to locate both risk and protective rare variants within outcome-related genes or genetic regions. Although it can select individual rare variants, the procedure cannot measure the certainty of the selected variants. In this article, we propose a selection probability for individual rare variants, where selection frequencies of rare variants are computed by bootstrap resampling. It can therefore quantify the certainty of both selected and unselected rare variants. A new selection approach using a threshold on the selection probability is also introduced and compared with existing selection procedures in extensive simulation studies and real sequencing data analysis. We demonstrate that the proposed approach outperforms the existing methods in terms of selection power.
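The bootstrap selection probability described here can be sketched as follows. This is a minimal illustration under stated assumptions: `select_variants` is a hypothetical stand-in selector (a simple carrier-rate difference between cases and controls), not the power set-based procedure from the article, and the toy data and threshold are invented for the example.

```python
import random


def select_variants(genotypes, outcomes, min_diff=0.3):
    """Hypothetical selector: keep variants whose carrier rate differs
    between cases and controls by at least min_diff."""
    cases = [g for g, o in zip(genotypes, outcomes) if o == 1]
    controls = [g for g, o in zip(genotypes, outcomes) if o == 0]
    if not cases or not controls:
        return set()  # a resample without both groups selects nothing
    selected = set()
    for j in range(len(genotypes[0])):
        case_rate = sum(g[j] for g in cases) / len(cases)
        control_rate = sum(g[j] for g in controls) / len(controls)
        if abs(case_rate - control_rate) >= min_diff:
            selected.add(j)
    return selected


def selection_probability(genotypes, outcomes, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each variant is selected."""
    rng = random.Random(seed)
    n = len(outcomes)
    counts = [0] * len(genotypes[0])
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample subjects with replacement
        boot_g = [genotypes[i] for i in idx]
        boot_o = [outcomes[i] for i in idx]
        for j in select_variants(boot_g, boot_o):
            counts[j] += 1
    return [c / n_boot for c in counts]


# Toy data: 10 cases then 10 controls; rows are subjects, columns are variants.
# Variant 0 is carried by 6 cases only, variant 1 by one case and one control,
# variant 2 by nobody.
genotypes = [[1, 0, 0]] * 5 + [[1, 1, 0]] + [[0, 0, 0]] * 4 \
    + [[0, 1, 0]] + [[0, 0, 0]] * 9
outcomes = [1] * 10 + [0] * 10

probs = selection_probability(genotypes, outcomes)
print(probs)
```

A threshold on `probs` (for example, keeping variants selected in most resamples) then gives a selection rule that also reports how certain each inclusion or exclusion is.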
Affiliation(s)
- Gira Lee
- Department of Statistics, Pusan National University, Busan, Korea
- Hokeun Sun
- Department of Statistics, Pusan National University, Busan, Korea
9
Abstract
With the advance of sequencing technologies, it has become routine practice to test for association between a quantitative trait and a set of rare variants (RVs). While a number of RV association tests have been proposed, there is a dearth of studies on the robustness of RV association testing for non-normally distributed traits, e.g., due to skewness, which is ubiquitous in cohort studies. By extensive simulations, we demonstrate that commonly used RV tests, including the sequence kernel association test (SKAT) and optimal unified SKAT (SKAT-O), are not robust to heavy-tailed or right-skewed trait distributions, showing inflated type I error rates; in contrast, the adaptive sum of powered score (aSPU) test is much more robust. Here we further propose a robust version of the aSPU test, called aSPUr. We conduct extensive simulations to evaluate the power of the tests, finding that for a larger number of RVs, aSPU is often more powerful than SKAT and SKAT-O, owing to its high data-adaptivity. We also compare different tests by conducting association analysis of triglyceride levels using the NHLBI ESP whole-exome sequencing data. The QQ plots for SKAT and SKAT-O were severely inflated (λ = 1.89 and 1.78, respectively), while those for aSPU and aSPUr behaved normally. Given the relatively high robustness to outliers and the high power of the aSPU test, we recommend using it as a complement to SKAT and SKAT-O. If there is evidence of an inflated type I error rate from the aSPU test, we would recommend the use of the more robust, but less powerful, aSPUr test.
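For readers unfamiliar with the λ values quoted for the QQ plots, the sketch below shows how such a value is commonly computed, assuming λ denotes the usual genomic inflation factor (the abstract does not define it). Each p-value is converted to a 1-degree-of-freedom chi-square statistic, and the median statistic is divided by the theoretical median of that distribution, about 0.4549. The function name and the toy p-values are illustrative only.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor from p-values strictly between 0 and 1.

    A two-sided p-value p corresponds to z = Phi^{-1}(1 - p/2), and z**2
    follows a chi-square distribution with 1 df under the null.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1_MEDIAN


# Evenly spaced p-values mimic a well-calibrated test (lambda close to 1).
n = 1001
calibrated = [(i - 0.5) / n for i in range(1, n + 1)]
# Squaring makes every p-value smaller, mimicking an inflated test.
inflated = [p * p for p in calibrated]

print(genomic_inflation(calibrated), genomic_inflation(inflated))
```

Values well above 1, such as those reported for SKAT and SKAT-O, indicate that the test produces small p-values more often than the null distribution allows.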
10
Abstract
Background: Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of sequence data with rare variants in a cost-effective way. Yet the statistical aspect of testing disease association of rare variants is quite challenging, as the typical assumptions fail to hold owing to low minor allele frequency (<0.5 % or 1 %). Methods: I present a Bayesian variable selection approach to detect associations with both rare and common genetic variants for quantitative traits simultaneously. In my model, I frame the problem of identifying disease-associated variants as a problem of variable selection in a sparse space, that is, how best to model the relationship between phenotypes and a set of genetic variants. By constructing a risk index score for a group of rare variants, my method can effectively consider all variants in a multivariate model. I also use a within-chain permutation to generate the empirical thresholds to detect true-positive variants. Results: I apply my method to study the association between increases in baseline systolic and diastolic blood pressure (SBP and DBP, respectively) and genetic variants in the data from Genetic Analysis Workshop 19 unrelated samples. I identify several rare and common variants in the gene MAP4 that are potentially associated with SBP and DBP. Conclusions: The application shows that my method is powerful in identifying disease-associated variants even when they are extremely rare.
11
Yan Q, Weeks DE, Tiwari HK, Yi N, Zhang K, Gao G, Lin WY, Lou XY, Chen W, Liu N. Rare-Variant Kernel Machine Test for Longitudinal Data from Population and Family Samples. Hum Hered 2016; 80:126-38. [PMID: 27161037 DOI: 10.1159/000445057] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/26/2015] [Accepted: 02/24/2016] [Indexed: 01/12/2023]
Abstract
OBJECTIVE The kernel machine (KM) test reportedly performs well in the set-based association test of rare variants. Many studies have been conducted to measure phenotypes at multiple time points, but the standard KM methodology has only been available for phenotypes at a single time point. In addition, family-based designs have been widely used in genetic association studies; therefore, the data analysis method used must appropriately handle familial relatedness. A rare-variant test does not currently exist for longitudinal data from family samples. Therefore, in this paper, we aim to introduce an association test for rare variants, which includes multiple longitudinal phenotype measurements for either population or family samples. METHODS This approach uses KM regression based on the linear mixed model framework and is applicable to longitudinal data from either population (L-KM) or family samples (LF-KM). RESULTS In our population-based simulation studies, L-KM has good control of Type I error rate and increased power in all the scenarios we considered compared with other competing methods. Conversely, in the family-based simulation studies, we found an inflated Type I error rate when L-KM was applied directly to the family samples, whereas LF-KM retained the desired Type I error rate and had the best power performance overall. Finally, we illustrate the utility of our proposed LF-KM approach by analyzing data from an association study between rare variants and blood pressure from the Genetic Analysis Workshop 18 (GAW18). CONCLUSION We propose a method for rare-variant association testing in population and family samples using phenotypes measured at multiple time points for each subject. The proposed method has the best power performance compared to competing approaches in our simulation study.
Affiliation(s)
- Qi Yan
- Division of Pulmonary Medicine, Allergy and Immunology, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, Pittsburgh, Pa., USA
12
Stell L, Sabatti C. Genetic Variant Selection: Learning Across Traits and Sites. Genetics 2016; 202:439-55. [PMID: 26680660 PMCID: PMC4788227 DOI: 10.1534/genetics.115.184572] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/06/2015] [Accepted: 11/30/2015] [Indexed: 11/18/2022]
Abstract
We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for the joint effects of multiple genes, and adopting a Bayesian approach leads to posterior probabilities that coherently incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variable site by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and by reanalyzing a data set of sequencing variants.
Affiliation(s)
- Laurel Stell
- Department of Health Research and Policy, Stanford University, Stanford, California 94305
- Chiara Sabatti
- Department of Health Research and Policy, Stanford University, Stanford, California 94305; Department of Statistics, Stanford University, Stanford, California 94305
13
Associating Multivariate Quantitative Phenotypes with Genetic Variants in Family Samples with a Novel Kernel Machine Regression Method. Genetics 2015; 201:1329-39. [PMID: 26482791 DOI: 10.1534/genetics.115.178590] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/27/2015] [Accepted: 10/04/2015] [Indexed: 11/18/2022]
Abstract
The recent development of sequencing technology allows identification of association between the whole spectrum of genetic variants and complex diseases. Over the past few years, a number of association tests for rare variants have been developed. Jointly testing for association between genetic variants and multiple correlated phenotypes may increase the power to detect causal genes in family-based studies, but familial correlation needs to be appropriately handled to avoid an inflated type I error rate. Here we propose a novel approach for multivariate family data using kernel machine regression (denoted as MF-KM) that is based on a linear mixed-model framework and can be applied to a large range of studies with different types of traits. In our simulation studies, the usual kernel machine test has inflated type I error rates when applied directly to familial data, while our proposed MF-KM method preserves the expected type I error rates. Moreover, the MF-KM method has increased power compared to methods that either analyze each phenotype separately while considering family structure or use only unrelated founders from the families. Finally, we illustrate our proposed methodology by analyzing whole-genome genotyping data from a lung function study.
14
Park L, Kim JH. A novel approach for identifying causal models of complex diseases from family data. Genetics 2015; 199:1007-16. [PMID: 25701286 PMCID: PMC4391573 DOI: 10.1534/genetics.114.174102] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/31/2014] [Accepted: 02/16/2015] [Indexed: 02/01/2023]
Abstract
Causal models including genetic factors are important for understanding the presentation mechanisms of complex diseases. Familial aggregation and segregation analyses based on polygenic threshold models have been the primary approach to fitting genetic models to the family data of complex diseases. In the current study, we propose an advanced approach to obtaining appropriate causal models for complex diseases, based on the sufficient component cause (SCC) model and involving combinations of traditional genetics principles. The probabilities for the entire population, i.e., normal-normal, normal-disease, and disease-disease, were considered for each model for the appropriate handling of common complex diseases. The causal model in the current study included the genetic effects from single genes involving epistasis, complementary gene interactions, gene-environment interactions, and environmental effects. Bayesian inference using a Markov chain Monte Carlo (MCMC) algorithm was used to assess the proportions of each component for a given population lifetime incidence. This approach is flexible, allowing both common and rare variants within a gene and across multiple genes. An application to schizophrenia data confirmed the complexity of the causal factors. An analysis of diabetes data demonstrated that environmental factors and gene-environment interactions are the main causal factors for type II diabetes. The proposed method is effective and useful for identifying causal models, which can accelerate the development of efficient strategies for identifying causal factors of complex diseases.
Affiliation(s)
- Leeyoung Park
- Natural Science Research Institute, Yonsei University, Seoul, Korea 120-749
- Ju H Kim
- Seoul National University Biomedical Informatics (SNUBI), Seoul National University College of Medicine, Seoul 110-799, Korea; Systems Biomedical Informatics National Core Research Center (SBI-NCRC), Seoul National University College of Medicine, Seoul 110-799, Korea
|
15
|
Yan Q, Tiwari HK, Yi N, Gao G, Zhang K, Lin WY, Lou XY, Cui X, Liu N. A Sequence Kernel Association Test for Dichotomous Traits in Family Samples under a Generalized Linear Mixed Model. Hum Hered 2015; 79:60-8. [PMID: 25791389 DOI: 10.1159/000375409] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 08/28/2014] [Accepted: 01/21/2015] [Indexed: 01/15/2023]
Abstract
OBJECTIVE: The existing methods for identifying multiple rare variants underlying complex diseases in family samples are underpowered. Therefore, we aim to develop a new set-based method for an association study of dichotomous traits in family samples. METHODS: We introduce a framework for testing the association of genetic variants with diseases in family samples based on a generalized linear mixed model. Our proposed method is based on a kernel machine regression and can be viewed as an extension of the sequence kernel association test (SKAT and famSKAT) for application to family data with dichotomous traits (F-SKAT). RESULTS: Our simulation studies show that the original SKAT has inflated type I error rates when applied directly to family data. By contrast, our proposed F-SKAT has the correct type I error rate. Furthermore, in all of the considered scenarios, F-SKAT, which uses all family data, has higher power than both SKAT, which uses only unrelated individuals from the family data, and another method, which uses all family data. CONCLUSION: We propose a set-based association test that can be used to analyze family data with dichotomous phenotypes while handling genetic variants with the same or opposite directions of effects as well as any types of family relationships.
Affiliation(s)
- Qi Yan
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Ala., USA
|
16
|
Lin WY. Adaptive combination of P-values for family-based association testing with sequence data. PLoS One 2014; 9:e115971. [PMID: 25541952 PMCID: PMC4277421 DOI: 10.1371/journal.pone.0115971] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 06/19/2014] [Accepted: 12/01/2014] [Indexed: 12/24/2022]
Abstract
Family-based study design will play a key role in identifying rare causal variants, because rare causal variants can be enriched in families with multiple affected subjects. Furthermore, different from population-based studies, family studies are robust to bias induced by population substructure. It is well known that rare causal variants are difficult to detect from single-locus tests. Therefore, burden tests and non-burden tests have been developed, by combining signals of multiple variants in a chromosomal region or a functional unit. This inevitably incorporates some neutral variants into the test statistics, which can dilute the power of statistical methods. To guard against the noise caused by neutral variants, we here propose an 'adaptive combination of P-values method' (abbreviated as 'ADA'). This method combines per-site P-values of variants that are more likely to be causal. Variants with large P-values (which are more likely to be neutral variants) are discarded from the combined statistic. In addition to performing extensive simulation studies, we applied these tests to the Genetic Analysis Workshop 17 data sets, where real sequence data were generated according to the 1000 Genomes Project. Compared with some existing methods, ADA is more robust to the inclusion of neutral variants. This is a merit especially when dichotomous traits are analyzed. However, there are some limitations for ADA. First, it is more computationally intensive. Second, pedigree structures and founders' sequence data are required for the permutation procedure. Third, unrelated controls cannot be included. We here show that, for family-based studies, the application of ADA is limited to dichotomous trait analyses with full pedigree information.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
17
|
He L, Pitkäniemi J, Sarin AP, Salomaa V, Sillanpää MJ, Ripatti S. Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data. Genet Epidemiol 2014; 39:89-100. [PMID: 25395270 DOI: 10.1002/gepi.21871] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/09/2014] [Revised: 09/18/2014] [Accepted: 10/03/2014] [Indexed: 11/08/2022]
Abstract
Next-generation sequencing (NGS) has led to the study of rare genetic variants, which possibly explain the missing heritability for complex diseases. Most existing methods for rare variant (RV) association detection do not account for the common presence of sequencing errors in NGS data. The errors can largely affect the power and perturb the accuracy of association tests due to rare observations of minor alleles. We developed a hierarchical Bayesian approach to estimate the association between RVs and complex diseases. Our integrated framework combines the misclassification probability with shrinkage-based Bayesian variable selection. It allows for flexibility in handling neutral and protective RVs with measurement error, and is robust enough for detecting causal RVs with a wide spectrum of minor allele frequency (MAF). Imputation uncertainty and MAF are incorporated into the integrated framework to achieve the optimal statistical power. We demonstrate that sequencing error does significantly affect the findings, and our proposed model can take advantage of it to improve statistical power in both simulated and real data. We further show that our model outperforms existing methods, such as sequence kernel association test (SKAT). Finally, we illustrate the behavior of the proposed method using a Finnish low-density lipoprotein cholesterol study, and show that it identifies an RV known as FH North Karelia in LDLR gene with three carriers in 1,155 individuals, which is missed by both SKAT and Granvil.
Affiliation(s)
- Liang He
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
|
18
|
Peng B. Reproducible simulations of realistic samples for next-generation sequencing studies using Variant Simulation Tools. Genet Epidemiol 2014; 39:45-52. [PMID: 25395236 DOI: 10.1002/gepi.21867] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/16/2014] [Revised: 09/14/2014] [Accepted: 09/26/2014] [Indexed: 12/31/2022]
Abstract
Computer simulations have been widely used to validate and evaluate the power of statistical methods for genetic epidemiological studies. Although a large number of simulation methods and software packages have been developed for genome-wide association studies, methodological and bioinformatics challenges have limited their applications in simulating datasets for whole-genome and whole-exome sequencing studies. With the development of more sophisticated statistical methods that make fuller use of available data and our knowledge of the human genome, there is a pressing need for genetic simulators that capture more features of empirical data (e.g., multiallele variants, indels, use of the Variant Call Format) and the human genome (e.g., functional annotations of genetic variants). This article introduces Variant Simulation Tools (VST), a module of Variant Tools for the simulation of genetic variants for sequencing-based genetic epidemiological studies. Although multiple simulation engines are provided, the core of VST is a novel forward-time simulation engine that simulates real nucleotide sequences of the human genome using DNA mutation models, fine-scale recombination maps, and a selection model based on amino acid changes of translated protein sequences. The design of VST allows users to easily create and distribute simulation methods and simulated datasets for a variety of applications and encourages fair comparison between statistical methods through the use of existing or reproduced simulated datasets.
Affiliation(s)
- Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
|
19
|
Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics 2014; 198:497-508. [PMID: 25104515 PMCID: PMC4196608 DOI: 10.1534/genetics.114.167908] [Citation(s) in RCA: 263] [Impact Index Per Article: 26.3] [Received: 05/29/2014] [Accepted: 07/18/2014] [Indexed: 12/22/2022]
Abstract
Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that under reasonable assumptions will contain all of the true causal variants with a high confidence level (e.g., 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides 20-50% improvement in our ability to identify the causal variants compared to the existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an expression QTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus. CAVIAR is publicly available online at http://genetics.cs.ucla.edu/caviar/.
Affiliation(s)
- Farhad Hormozdiari
- Department of Computer Science, University of California, Los Angeles, California 90095
- Emrah Kostem
- Department of Computer Science, University of California, Los Angeles, California 90095
- Eun Yong Kang
- Department of Computer Science, University of California, Los Angeles, California 90095
- Bogdan Pasaniuc
- Department of Human Genetics, University of California, Los Angeles, California 90095; Department of Pathology and Laboratory Medicine, University of California, Los Angeles, California 90095
- Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, California 90095; Department of Human Genetics, University of California, Los Angeles, California 90095
|
20
|
Abstract
Genome-wide association studies (GWASs) have become the focus of the statistical analysis of complex traits in humans, successfully shedding light on several aspects of genetic architecture and biological aetiology. Single-nucleotide polymorphisms (SNPs) are usually modelled as having additive, cumulative and independent effects on the phenotype. Although evidently a useful approach, it is often argued that this is not a realistic biological model and that epistasis (that is, the statistical interaction between SNPs) should be included. The purpose of this Review is to summarize recent directions in methodology for detecting epistasis and to discuss evidence of the role of epistasis in human complex trait variation. We also discuss the relevance of epistasis in the context of GWASs and potential hazards in the interpretation of statistical interaction terms.
|
21
|
Cao S, Qin H, Deng HW, Wang YP. A unified sparse representation for sequence variant identification for complex traits. Genet Epidemiol 2014; 38:671-9. [PMID: 25195875 DOI: 10.1002/gepi.21849] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 03/14/2014] [Revised: 06/23/2014] [Accepted: 07/16/2014] [Indexed: 12/25/2022]
Abstract
Joint adjustment of cryptic relatedness and population structure is necessary to reduce bias in DNA sequence analysis; however, existing sparse regression methods model these two confounders separately. Incorporating prior biological information has great potential to enhance statistical power, but such information is often overlooked in existing sparse regression models. We developed a unified sparse regression (USR) to incorporate prior information and jointly adjust for cryptic relatedness, population structure, and other environmental covariates. Our USR models cryptic relatedness as a random effect and population structure as a fixed effect, and utilizes weighted penalties to incorporate prior knowledge. As demonstrated by extensive simulations, our USR algorithm can discover more true causal variants and maintain a lower false discovery rate than several commonly used feature selection methods. It can handle both rare and common variants simultaneously. Applying our USR algorithm to DNA sequence data of Mexican Americans from GAW18, we replicated three hypertension pathways, demonstrating its effectiveness in identifying susceptibility genetic variants.
Affiliation(s)
- Shaolong Cao
- Department of Biomedical Engineering, Tulane University, New Orleans, Louisiana, United States of America; Center for Bioinformatics and Genomics, Tulane University, New Orleans, Louisiana, United States of America
|
22
|
Hu P, Paterson AD. Dynamic pathway analysis of genes associated with blood pressure using whole genome sequence data. BMC Proc 2014; 8:S106. [PMID: 25519360 PMCID: PMC4143637 DOI: 10.1186/1753-6561-8-s1-s106] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Groups of genes assigned to a pathway, also called a module, have similar functions. Finding such modules, and the topology of the changes of the modules over time, is a fundamental problem in understanding the mechanisms of complex diseases. Here we investigated an approach that categorized variants into rare or common and used a hierarchical model to jointly estimate the group effects of the variants in a pathway for identifying enriched pathways over time using whole genome sequencing data and blood pressure data. Our results suggest that the method can identify potentially biologically meaningful genes in modules associated with blood pressure over time.
Affiliation(s)
- Pingzhao Hu
- The Centre for Applied Genomics, The Hospital for Sick Children, 686 Bay Street, Toronto, ON, M5G 0A4, Canada; Department of Biochemistry and Medical Genetics and George and Fay Yee Centre for Healthcare Innovation, University of Manitoba, 745 Bannatyne Avenue, Winnipeg, MB, R3E 0W3, Canada
- Andrew D Paterson
- The Centre for Applied Genomics, The Hospital for Sick Children, 686 Bay Street, Toronto, ON, M5G 0A4, Canada; Program in Genetics and Genome Biology, The Hospital for Sick Children, 686 Bay Street, Toronto, ON, M5G 0A4, Canada; Dalla Lana School of Public Health, University of Toronto, Health Sciences Building, 155 College St, Toronto, ON, M5T 3M7, Canada
|
23
|
Wang M, Lin S. FamLBL: detecting rare haplotype disease association based on common SNPs using case-parent triads. Bioinformatics 2014; 30:2611-8. [PMID: 24849576 DOI: 10.1093/bioinformatics/btu347] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 01/27/2023]
Abstract
MOTIVATION: In recent years, there has been an increasing interest in using common single-nucleotide polymorphisms (SNPs) amassed in genome-wide association studies to investigate rare haplotype effects on complex diseases. Evidence has suggested that rare haplotypes may tag rare causal single-nucleotide variants, making SNP-based rare haplotype analysis not only cost effective, but also more valuable for detecting causal variants. Although a number of methods for detecting rare haplotype association have been proposed in recent years, they are population based and thus susceptible to population stratification. RESULTS: We propose family-triad-based logistic Bayesian Lasso (famLBL) for estimating effects of haplotypes on complex diseases using SNP data. By choosing an appropriate prior distribution, effect sizes of unassociated haplotypes can be shrunk toward zero, allowing for more precise estimation of associated haplotypes, especially those that are rare, thereby achieving greater detection power. We evaluate famLBL using simulation to gauge its type I error and power. Comparison with its population counterpart, LBL, highlights famLBL's robustness in the presence of population substructure. Further investigation comparing famLBL with the Family-Based Association Test (FBAT) reveals its advantage for detecting rare haplotype association. AVAILABILITY AND IMPLEMENTATION: famLBL is implemented as an R-package available at http://www.stat.osu.edu/∼statgen/SOFTWARE/LBL/.
Affiliation(s)
- Meng Wang
- Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
- Shili Lin
- Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
|
24
|
Yan Q, Tiwari HK, Yi N, Lin WY, Gao G, Lou XY, Cui X, Liu N. Kernel-machine testing coupled with a rank-truncation method for genetic pathway analysis. Genet Epidemiol 2014; 38:447-56. [PMID: 24849109 DOI: 10.1002/gepi.21813] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 10/31/2013] [Revised: 04/09/2014] [Accepted: 04/10/2014] [Indexed: 01/09/2023]
Abstract
Traditional genome-wide association studies (GWASs) usually focus on single-marker analysis, which only assesses marginal effects. Pathway analysis, on the other hand, considers the pathway-gene-marker hierarchical structure and therefore provides additional insights into the genetic architecture underlying complex diseases. Recently, a number of methods for pathway analysis have been proposed to assess the significance of a biological pathway from a collection of single-nucleotide polymorphisms. In this study, we propose a novel approach for pathway analysis that assesses the effects of genes using the sequence kernel association test and the effects of pathways using an extended adaptive rank truncated product statistic. It has been increasingly recognized that complex diseases are caused by both common and rare variants. We propose a new weighting scheme for genetic variants across the whole allelic frequency spectrum to be analyzed together without any form of frequency cutoff for defining rare variants. The proposed approach is flexible. It is applicable to both binary and continuous traits, and incorporating covariates is easy. Furthermore, it can be readily applied to GWAS data, exome-sequencing data, and deep resequencing data. We evaluate the new approach on data simulated under comprehensive scenarios and show that it has the highest power in most of the scenarios while maintaining the correct type I error rate. We also apply our proposed methodology to data from a study of the association between bipolar disorder and candidate pathways from the Wellcome Trust Case Control Consortium (WTCCC) to show its utility.
Affiliation(s)
- Qi Yan
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
|
25
|
Abstract
This article focuses on conducting global testing for association between a binary trait and a set of rare variants (RVs), although its application can be much broader to other types of traits, common variants (CVs), and gene set or pathway analysis. We show that many of the existing tests have deteriorating performance in the presence of many nonassociated RVs: their power can dramatically drop as the proportion of nonassociated RVs in the group to be tested increases. We propose a class of so-called sum of powered score (SPU) tests, each of which is based on the score vector from a general regression model and hence can deal with different types of traits and adjust for covariates, e.g., principal components accounting for population stratification. The SPU tests generalize the sum test, a representative burden test based on pooling or collapsing genotypes of RVs, and a sum of squared score (SSU) test that is closely related to several other powerful variance component tests; a previous study (Basu and Pan 2011) has demonstrated good performance of one, but not both, of the Sum and SSU tests in many situations. The SPU tests are versatile in the sense that one of them is often powerful, although its identity varies with the unknown true association parameters. We propose an adaptive SPU (aSPU) test to approximate the most powerful SPU test for a given scenario, consequently maintaining high power and being highly adaptive across various scenarios. We conducted extensive simulations to show superior performance of the aSPU test over several state-of-the-art association tests in the presence of many nonassociated RVs. Finally, we applied the SPU and aSPU tests to the GAW17 mini-exome sequence data to compare their practical performance with some existing tests, demonstrating their potential usefulness.
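As a rough illustration of the SPU family described in this abstract, the sketch below computes a score vector and its sum of powered scores, so that power 1 gives a sum-type statistic and power 2 gives a sum-of-squared-scores statistic. It is not the authors' implementation: the intercept-only null model, the function names (`score_vector`, `spu`, `spu_permutation_pvalue`) and the use of trait permutation for the p-value are all assumptions made for the example, and the adaptive selection across powers used by aSPU is not shown.

```python
import numpy as np


def score_vector(genotypes, trait):
    """Score U_j = sum_i (y_i - mean(y)) * x_ij under an intercept-only null model (assumed)."""
    x = np.asarray(genotypes, dtype=float)   # shape (n_subjects, n_variants)
    y = np.asarray(trait, dtype=float)       # shape (n_subjects,)
    return x.T @ (y - y.mean())


def spu(scores, gamma):
    """Sum of powered scores for a positive integer power gamma."""
    return float(np.sum(np.asarray(scores, dtype=float) ** int(gamma)))


def spu_permutation_pvalue(genotypes, trait, gamma, n_perm=999, seed=0):
    """Permutation p-value for |SPU(gamma)|, obtained by shuffling the trait (assumed scheme)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(trait, dtype=float)
    observed = abs(spu(score_vector(genotypes, y), gamma))
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(y)
        if abs(spu(score_vector(genotypes, permuted), gamma)) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

With odd powers, scores of opposite sign cancel, which is the behaviour expected of a burden-type test; with even powers they accumulate, as in a variance-component-type test.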
|
26
|
Sun H, Wang S. A power set-based statistical selection procedure to locate susceptible rare variants associated with complex traits with sequencing data. Bioinformatics 2014; 30:2317-23. [PMID: 24755303 DOI: 10.1093/bioinformatics/btu207] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 02/05/2023]
Abstract
MOTIVATION: Existing association methods for rare variants from sequencing data have focused on aggregating variants in a gene or a genetic region because analysing individual rare variants is underpowered. However, these existing rare variant detection methods are not able to identify which of the rare variants in a gene or a genetic region are associated with the complex diseases or traits. Once phenotypic associations of a gene or a genetic region are identified, the natural next step in the association study with sequencing data is to locate the susceptible rare variants within the gene or the genetic region. RESULTS: In this article, we propose a power set-based statistical selection procedure that is able to identify the locations of the potentially susceptible rare variants within a disease-related gene or a genetic region. The selection performance of the proposed selection procedure was evaluated through simulation studies, where we demonstrated the feasibility and superior power over several comparable existing methods. In particular, the proposed method is able to handle the mixed effects when both risk and protective variants are present in a gene or a genetic region. The proposed selection procedure was also applied to the sequence data on the ANGPTL gene family from the Dallas Heart Study to identify potentially susceptible rare variants within the trait-related genes. AVAILABILITY AND IMPLEMENTATION: An R package 'rvsel' can be downloaded from http://www.columbia.edu/∼sw2206/ and http://statsun.pusan.ac.kr.
Affiliation(s)
- Hokeun Sun
- Department of Statistics, Pusan National University, Pusan 609-735, Korea and Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
- Shuang Wang
- Department of Statistics, Pusan National University, Pusan 609-735, Korea and Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA
|
27
|
Lin WY. Association testing of clustered rare causal variants in case-control studies. PLoS One 2014; 9:e94337. [PMID: 24736372 PMCID: PMC3988195 DOI: 10.1371/journal.pone.0094337] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 01/29/2014] [Accepted: 03/12/2014] [Indexed: 11/18/2022]
Abstract
Biological evidence suggests that multiple causal variants in a gene may cluster physically. Variants within the same protein functional domain or gene regulatory element would locate in close proximity on the DNA sequence. However, spatial information of variants is usually not used in current rare variant association analyses. We here propose a clustering method (abbreviated as "CLUSTER"), which is extended from the adaptive combination of P-values. Our method combines the association signals of variants that are more likely to be causal. Furthermore, the statistic incorporates the spatial information of variants. With extensive simulations, we show that our method outperforms several commonly-used methods in many scenarios. To demonstrate its use in real data analyses, we also apply this CLUSTER test to the Dallas Heart Study data. CLUSTER is among the best methods when the effects of causal variants are all in the same direction. As variants located in close proximity are more likely to have similar impact on disease risk, CLUSTER is recommended for association testing of clustered rare causal variants in case-control studies.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
28
|
He L, Sillanpää MJ, Ripatti S, Pitkäniemi J. Bayesian Latent Variable Collapsing Model for Detecting Rare Variant Interaction Effect in Twin Study. Genet Epidemiol 2014; 38:310-24. [DOI: 10.1002/gepi.21804] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/13/2013] [Revised: 02/28/2014] [Accepted: 02/28/2014] [Indexed: 12/12/2022]
Affiliation(s)
- Liang He
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Mikko J. Sillanpää
- Department of Mathematical Sciences; University of Oulu; Oulu Finland
- Department of Biology and Biocenter Oulu; University of Oulu; Oulu Finland
- Samuli Ripatti
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Institute for Molecular Medicine Finland FIMM; University of Helsinki; Finland
- Human Genetics; Wellcome Trust Sanger Institute; United Kingdom
- Janne Pitkäniemi
- Department of Public Health; Hjelt Institute; University of Helsinki; Finland
- Finnish Cancer Registry; Institute for Statistical and Epidemiological Cancer Research; Helsinki Finland
|
29
|
Lu M, Lee HS, Hadley D, Huang JZ, Qian X. Supervised categorical principal component analysis for genome-wide association analyses. BMC Genomics 2014; 15 Suppl 1:S10. [PMID: 24564304 PMCID: PMC4046680 DOI: 10.1186/1471-2164-15-s1-s10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/22/2023]
Abstract
To better understand the unexplained heritability of complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, have recently become popular because they enable evaluation of the joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). Past approaches using PCA mostly make inherent assumptions about the genotype data and/or the risk effect model, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes. We derive a general Supervised Categorical Principal Component Analysis (SCPCA), which explicitly models categorical SNP data without imposing any risk effect model assumption. We have evaluated the efficacy of SCPCA by comparison with a traditional Supervised PCA (SPCA) and a previously developed Supervised Logistic Principal Component Analysis (SLPCA), based on both genotype data simulated by HAPGEN2 and the genotype data of Crohn's Disease (CD) from the Wellcome Trust Case Control Consortium (WTCCC). Our preliminary results have demonstrated the superiority of SCPCA over both SPCA and SLPCA due to its modeling explicitly designed for categorical SNP data as well as its flexibility on the risk effect model assumption.
|
30
|
Yi N, Xu S, Lou XY, Mallick H. Multiple comparisons in genetic association studies: a hierarchical modeling approach. Stat Appl Genet Mol Biol 2014; 13:35-48. [PMID: 24259248 PMCID: PMC5003626 DOI: 10.1515/sagmb-2012-0040] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 01/23/2023]
Abstract
Multiple comparisons or multiple testing has been viewed as a thorny issue in genetic association studies aiming to detect disease-associated genetic variants from a large number of genotyped variants. We alleviate the problem of multiple comparisons by proposing a hierarchical modeling approach that is fundamentally different from the existing methods. The proposed hierarchical models simultaneously fit as many variables as possible and shrink unimportant effects towards zero. Thus, the hierarchical models yield more efficient estimates of parameters than the traditional methods that analyze genetic variants separately, and also coherently address the multiple comparisons problem due to largely reducing the effective number of genetic effects and the number of statistically "significant" effects. We develop a method for computing the effective number of genetic effects in hierarchical generalized linear models, and propose a new adjustment for multiple comparisons, the hierarchical Bonferroni correction, based on the effective number of genetic effects. Our approach not only increases the power to detect disease-associated variants but also controls the Type I error. We illustrate and evaluate our method with real and simulated data sets from genetic association studies. The method has been implemented in our freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Affiliation(s)
- Nengjun Yi
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294
- Shizhong Xu
- Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Xiang-Yang Lou
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294
- Himel Mallick
- Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294
|
31
|
Rare variant association testing by adaptive combination of P-values. PLoS One 2014; 9:e85728. [PMID: 24454922 PMCID: PMC3893264 DOI: 10.1371/journal.pone.0085728] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 08/23/2013] [Accepted: 12/02/2013] [Indexed: 01/21/2023] Open
Abstract
With the development of next-generation sequencing technology, there is a great demand for powerful statistical methods to detect rare variants (minor allele frequencies (MAFs)<1%) associated with diseases. Testing for each variant site individually is known to be underpowered, and therefore many methods have been proposed to test for the association of a group of variants with phenotypes, by pooling signals of the variants in a chromosomal region. However, this pooling strategy inevitably leads to the inclusion of a large proportion of neutral variants, which may compromise the power of association tests. To address this issue, we extend the -MidP method (Cheung et al., 2012, Genet Epidemiol 36: 675–685) and propose an approach (named ‘adaptive combination of P-values for rare variant association testing’, abbreviated as ‘ADA’) that adaptively combines per-site P-values with the weights based on MAFs. Before combining P-values, we first imposed a truncation threshold upon the per-site P-values, to guard against the noise caused by the inclusion of neutral variants. This ADA method is shown to outperform popular burden tests and non-burden tests under many scenarios. ADA is recommended for next-generation sequencing data analysis where many neutral variants may be included in a functional region.
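The general idea of truncating per-site P-values and combining the survivors with MAF-based weights can be sketched in Python as follows. This is only an illustration under stated assumptions: the Fisher-type form of the statistic, the weight 1/sqrt(MAF(1-MAF)), the default threshold, and the function name are all hypothetical choices, and the sketch does not reproduce the adaptive part of ADA or how significance is assessed.

```python
import math


def truncated_weighted_statistic(p_values, mafs, truncation=0.10):
    """Sum of w_i * -ln(p_i) over sites whose P-value is <= truncation.

    The weight w_i = 1 / sqrt(maf_i * (1 - maf_i)) is an assumed choice
    that up-weights rarer variants; it is not taken from the paper.
    """
    total = 0.0
    for p, maf in zip(p_values, mafs):
        if not 0.0 < p <= 1.0:
            raise ValueError("P-values must lie in (0, 1]")
        if not 0.0 < maf < 1.0:
            raise ValueError("MAFs must lie in (0, 1)")
        if p <= truncation:  # sites above the threshold are treated as noise
            weight = 1.0 / math.sqrt(maf * (1.0 - maf))
            total += weight * -math.log(p)
    return total


# Hypothetical region with two rare sites; only the first passes truncation.
stat = truncated_weighted_statistic([0.01, 0.5], [0.005, 0.005])
```

Sites with large P-values contribute nothing, which is how truncation guards against neutral variants diluting the pooled signal.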
|
32
|
Smith S, Hay EH, Farhat N, Rekaya R. Genome wide association studies in presence of misclassified binary responses. BMC Genet 2013; 14:124. [PMID: 24369108 PMCID: PMC3879434 DOI: 10.1186/1471-2156-14-124] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 05/06/2013] [Accepted: 12/17/2013] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND Misclassification has been shown to have a high prevalence in binary responses in both livestock and human populations. Leaving these errors uncorrected before analyses will have a negative impact on the overall goal of genome-wide association studies (GWAS) including reducing predictive power. A liability threshold model that contemplates misclassification was developed to assess the effects of mis-diagnostic errors on GWAS. Four simulated scenarios of case-control datasets were generated. Each dataset consisted of 2000 individuals and was analyzed with varying odds ratios of the influential SNPs and misclassification rates of 5% and 10%. RESULTS Analyses of binary responses subject to misclassification resulted in underestimation of influential SNPs and failed to estimate the true magnitude and direction of the effects. Once the misclassification algorithm was applied there was a 12% to 29% increase in accuracy, and a substantial reduction in bias. The proposed method was able to capture the majority of the most significant SNPs that were not identified in the analysis of the misclassified data. In fact, in one of the simulation scenarios, 33% of the influential SNPs were not identified using the misclassified data, compared with the analysis using the data without misclassification. However, using the proposed method, only 13% were not identified. Furthermore, the proposed method was able to identify with high probability a large portion of the truly misclassified observations. CONCLUSIONS The proposed model provides a statistical tool to correct or at least attenuate the negative effects of misclassified binary responses in GWAS. Across different levels of misclassification probability as well as odds ratios of significant SNPs, the model proved to be robust. 
In fact, SNP effects, and misclassification probability were accurately estimated and the truly misclassified observations were identified with high probabilities compared to non-misclassified responses. This study was limited to situations where the misclassification probability was assumed to be the same in cases and controls which is not always the case based on real human disease data. Thus, it is of interest to evaluate the performance of the proposed model in that situation which is the current focus of our research.
Affiliation(s)
- Romdhane Rekaya
- Department of Animal and Dairy Science, The University of Georgia, Athens, GA, USA.
|
33
|
Byrnes AE, Wu MC, Wright FA, Li M, Li Y. The value of statistical or bioinformatics annotation for rare variant association with quantitative trait. Genet Epidemiol 2013; 37:666-74. [PMID: 23836599 PMCID: PMC4083762 DOI: 10.1002/gepi.21747] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 03/25/2013] [Revised: 05/20/2013] [Accepted: 06/03/2013] [Indexed: 11/06/2022]
Abstract
In the past few years, a plethora of methods for rare variant association with phenotype have been proposed. These methods aggregate information from multiple rare variants across genomic region(s), but there is little consensus as to which method is most effective. The weighting scheme adopted when aggregating information across variants is one of the primary determinants of effectiveness. Here we present a systematic evaluation of multiple weighting schemes through a series of simulations intended to mimic large sequencing studies of a quantitative trait. We evaluate existing phenotype-independent and phenotype-dependent methods, as well as weights estimated by penalized regression approaches including Lasso, Elastic Net, and SCAD. We find that the difference in power between phenotype-dependent schemes is negligible when high-quality functional annotations are available. When functional annotations are unavailable or incomplete, all methods suffer from power loss; however, the variable selection methods outperform the others at the cost of increased computational time. Therefore, in the absence of good annotation, we recommend variable selection methods (which can be viewed as "statistical annotation") on top of regions implicated by a phenotype-independent weighting scheme. Further, once a region is implicated, variable selection can help to identify potential causal single nucleotide polymorphisms for biological validation. These findings are supported by an analysis of a high coverage targeted sequencing study of 1,898 individuals.
Affiliation(s)
- Andrea E. Byrnes
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Michael C. Wu
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Fred A. Wright
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
- Yun Li
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
- Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599
|
34
|
|
35
|
Iwata H, Hayashi T, Terakami S, Takada N, Saito T, Yamamoto T. Genomic prediction of trait segregation in a progeny population: a case study of Japanese pear (Pyrus pyrifolia). BMC Genet 2013; 14:81. [PMID: 24028660 PMCID: PMC3847345 DOI: 10.1186/1471-2156-14-81] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 02/15/2013] [Accepted: 09/05/2013] [Indexed: 12/17/2022] Open
Abstract
Background In cross breeding, it is important to choose a good parental combination that has high probability of generating offspring with desired characteristics. This study examines a method for predicting the segregation of target traits in a progeny population based on genome-wide markers and phenotype data of parental cultivars. Results The proposed method combines segregation simulation and Bayesian modeling for genomic selection. Marker segregation in a progeny population was simulated based on parental genotypes. Posterior marker effects sampled via Markov Chain Monte Carlo were used to predict the segregation pattern of target traits. The posterior distribution of the proportion of progenies that fulfill selection criteria was calculated and used for determining a promising cross and the necessary size of the progeny population. We applied the proposed method to Japanese pear (Pyrus pyrifolia Nakai) data to demonstrate the method and to show how it works in the selection of a promising cross. Verification using an actual breeding population suggests that the segregation of target traits can be predicted with reasonable accuracy, especially in a highly heritable trait. The uncertainty in predictions was reflected on the posterior distribution of the proportion of progenies that fulfill selection criteria. A simulation study based on the real marker data of Japanese pear cultivars also suggests the potential of the method. Conclusions The proposed method is useful to provide objective and quantitative criteria for choosing a parental combination and the breeding population size.
Affiliation(s)
- Hiroyoshi Iwata
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo, 113-8657, Tokyo, Japan.
|
36
|
Genetic association analysis of 30 genes related to obesity in a European American population. Int J Obes (Lond) 2013; 38:724-9. [PMID: 23900445 PMCID: PMC3909018 DOI: 10.1038/ijo.2013.140] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.6] [Received: 10/07/2012] [Revised: 07/11/2013] [Accepted: 07/20/2013] [Indexed: 12/18/2022]
Abstract
OBJECTIVE Obesity, which is frequently associated with diabetes, hypertension and cardiovascular diseases, is primarily the result of a net excess of caloric intake over energy expenditure. Human obesity is highly heritable, but the specific genes mediating susceptibility in non-syndromic obesity remain unclear. We tested candidate genes in pathways related to food intake and energy expenditure for association with body mass index (BMI). METHODS We reanalyzed 355 common genetic variants of 30 candidate genes in seven molecular pathways related to obesity in 1982 unrelated European Americans from the New York Cancer Project. Data were analyzed by using a Bayesian hierarchical generalized linear model. The BMIs were log-transformed and then adjusted for covariates, including age, age², gender and diabetes status. The single-nucleotide polymorphisms (SNPs) were modeled as additive effects. RESULTS With the stipulated adjustments, nine SNPs in eight genes were significantly associated with BMI: ghrelin (GHRL; rs35683), agouti-related peptide (AGRP; rs5030980), carboxypeptidase E (CPE; rs1946816 and rs4481204), glucagon-like peptide-1 receptor (GLP1R; rs2268641), serotonin receptors (HTR2A; rs912127), neuropeptide Y receptor (NPY5R; Y5R1c52), suppressor of cytokine signaling 3 (SOCS3; rs4969170) and signal transducer and activator of transcription 3 (STAT3; rs4796793). We also found a gender-by-SNP interaction (rs1745837 in HTR2A), which indicated that variants in the gene HTR2A had a stronger association with BMI in males. In addition, NPY1R was detected as having a significant gene effect even though none of the SNPs in this gene was significant. CONCLUSION Variations in genes AGRP, CPE, GHRL, GLP1R, HTR2A, NPY1R, NPY5R, SOCS3 and STAT3 showed modest associations with BMI in European Americans. The pathways in which these genes participate regulate energy intake, and thus these associations are mechanistically plausible in this context.
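The variable coding described in METHODS (log-transformed BMI, covariates including age and age squared, additive SNP coding) can be sketched in Python as below. The function names, the 0/1 coding of gender and diabetes status, and the choice to put covariates and the SNP in a single design row are assumptions made for illustration; the Bayesian hierarchical model itself is not shown.

```python
import math


def log_bmi(bmi: float) -> float:
    """Natural-log transform of BMI, used as the response variable."""
    if bmi <= 0:
        raise ValueError("BMI must be positive")
    return math.log(bmi)


def design_row(age, is_male, has_diabetes, minor_allele_count):
    """One design-matrix row: intercept, age, age^2, gender, diabetes, SNP.

    Additive coding means the SNP enters as its minor-allele count (0, 1, 2),
    so each extra copy shifts the expected log-BMI by the same amount.
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    return [
        1.0,
        float(age),
        float(age) ** 2,
        1.0 if is_male else 0.0,       # assumed 0/1 coding
        1.0 if has_diabetes else 0.0,  # assumed 0/1 coding
        float(minor_allele_count),
    ]


# Hypothetical individual: 40-year-old male, no diabetes, two minor alleles.
row = design_row(40, True, False, 2)
response = log_bmi(25.0)
```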
|
37
|
Liang F, Xiong M. Bayesian detection of causal rare variants under posterior consistency. PLoS One 2013; 8:e69633. [PMID: 23922764 PMCID: PMC3724943 DOI: 10.1371/journal.pone.0069633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 03/15/2013] [Accepted: 06/12/2013] [Indexed: 12/17/2022] Open
Abstract
Identification of causal rare variants that are associated with complex traits poses a central challenge for genome-wide association studies. However, most current research focuses only on testing the global association, that is, whether the rare variants in a given genomic region are collectively associated with the trait. Although some recent work, e.g., the Bayesian risk index method, has tried to address this problem, it is unclear whether the causal rare variants can be consistently identified in the small-n-large-P situation. We develop a new Bayesian method, the so-called Bayesian Rare Variant Detector (BRVD), to tackle this problem. The new method simultaneously addresses two issues: (i) (Global association test) Are any of the variants associated with the disease, and (ii) (Causal variant detection) Which variants, if any, are driving the association. The BRVD ensures that the causal rare variants are consistently identified in the small-n-large-P situation by imposing appropriate prior distributions on the model and model-specific parameters. The numerical results indicate that the BRVD is more powerful for testing the global association than the existing methods, such as the combined multivariate and collapsing test, weighted sum statistic test, RARECOVER, sequence kernel association test, and Bayesian risk index, and also more powerful for identification of causal rare variants than the Bayesian risk index method. The BRVD has also been successfully applied to the Early-Onset Myocardial Infarction (EOMI) Exome Sequence Data. It identified a few causal rare variants that have been verified in the literature.
Affiliation(s)
- Faming Liang
- Department of Statistics, Texas A&M University, College Station, Texas, United States of America.
|
38
|
Ayers KL, Cordell HJ. Identification of grouped rare and common variants via penalized logistic regression. Genet Epidemiol 2013; 37:592-602. [PMID: 23836590 PMCID: PMC3842118 DOI: 10.1002/gepi.21746] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 12/20/2012] [Revised: 05/24/2013] [Accepted: 05/24/2013] [Indexed: 11/09/2022]
Abstract
In spite of the success of genome-wide association studies in finding many common variants associated with disease, these variants seem to explain only a small proportion of the estimated heritability. Data collection has turned toward exome and whole genome sequencing, but it is well known that single marker methods frequently used for common variants have low power to detect rare variants associated with disease, even with very large sample sizes. In response, a variety of methods have been developed that attempt to cluster rare variants so that they may gather strength from one another under the premise that there may be multiple causal variants within a gene. Most of these methods group variants by gene or proximity, and test one gene or marker window at a time. We propose a penalized regression method (PeRC) that analyzes all genes at once, allowing grouping of all (rare and common) variants within a gene, along with subgrouping of the rare variants, thus borrowing strength from both rare and common variants within the same gene. The method can incorporate either a burden-based weighting of the rare variants or one in which the weights are data driven. In simulations, our method performs favorably when compared to many previously proposed approaches, including its predecessor, the sparse group lasso [Friedman et al., 2010].
Affiliation(s)
- Kristin L Ayers
- Institute of Genetic Medicine, Newcastle University, Newcastle upon Tyne NE1 3BZ, United Kingdom.
|
39
|
Curtis D. Approaches to the detection of recessive effects using next generation sequencing data from outbred populations. Adv Appl Bioinform Chem 2013; 6:29-35. [PMID: 23807854 PMCID: PMC3685401 DOI: 10.2147/aabc.s44332] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
Conventional methods to analyze genome-wide association studies and whole exome or whole genome sequencing studies would be prone to overlook variants which might exert a recessive effect on risk of disease, either as homozygotes or compound heterozygotes. It is plausible that such effects may be common even in outbred populations. An approach is described which is based on identifying a set of variants in a gene as being potentially of interest and then testing whether there is an excess of cases who are either homozygotes or complex heterozygotes for these variants. Methods based on departure from Hardy–Weinberg equilibrium are more powerful than those which compare cases to controls. However, linkage disequilibrium between variants can be difficult to deal with if phase is unknown. A simple approach for discarding variants apparently in strong linkage disequilibrium with others is proposed. The procedure is simple and quick to apply so can be used in the context of whole genome or exome sequencing studies and is implemented in the SCOREASSOC program.
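The notion of a departure from Hardy-Weinberg equilibrium among cases can be illustrated with a small Python sketch for a single biallelic variant. It compares the observed number of minor-allele homozygotes with the number expected from the allele frequency in the same sample. The function name and counts are hypothetical; the approach in the abstract works on sets of variants and also counts compound heterozygotes, and none of this reflects the SCOREASSOC code.

```python
def homozygote_excess(n_het: int, n_hom: int, n_total: int) -> float:
    """Observed minus HWE-expected count of minor-allele homozygotes.

    n_het   : heterozygous carriers in the sample
    n_hom   : minor-allele homozygotes in the sample
    n_total : all genotyped individuals in the sample
    """
    if n_total <= 0 or n_het < 0 or n_hom < 0 or n_het + n_hom > n_total:
        raise ValueError("inconsistent genotype counts")
    # Minor allele frequency estimated from the same sample.
    q = (n_het + 2 * n_hom) / (2 * n_total)
    expected_hom = n_total * q * q  # Hardy-Weinberg expectation n * q^2
    return n_hom - expected_hom


# Hypothetical case sample of 100 people: same allele frequency (0.1),
# but the second sample carries more homozygotes than HWE predicts.
in_equilibrium = homozygote_excess(n_het=18, n_hom=1, n_total=100)
recessive_like = homozygote_excess(n_het=10, n_hom=5, n_total=100)
```

A clearly positive excess in cases is the kind of signal a recessive effect would be expected to produce.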
Affiliation(s)
- David Curtis
- Centre for Psychiatry, Barts and the London School of Medicine and Dentistry, London, UK
|
40
|
Lin WY, Yi N, Lou XY, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype kernel association test as a powerful method to identify chromosomal regions harboring uncommon causal variants. Genet Epidemiol 2013; 37:560-70. [PMID: 23740760 DOI: 10.1002/gepi.21740] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 12/13/2012] [Revised: 05/01/2013] [Accepted: 05/06/2013] [Indexed: 01/09/2023]
Abstract
For most complex diseases, the fraction of heritability that can be explained by the variants discovered from genome-wide association studies is minor. Although the so-called "rare variants" (minor allele frequency [MAF] < 1%) have attracted increasing attention, they are unlikely to account for much of the "missing heritability" because very few people may carry these rare variants. The genetic variants that are likely to fill in the "missing heritability" include uncommon causal variants (MAF < 5%), which are generally untyped in association studies using tagging single-nucleotide polymorphisms (SNPs) or commercial SNP arrays. Developing powerful statistical methods can help to identify chromosomal regions harboring uncommon causal variants, while bypassing the genome-wide or exome-wide next-generation sequencing. In this work, we propose a haplotype kernel association test (HKAT) that is equivalent to testing the variance component of random effects for distinct haplotypes. With an appropriate weighting scheme given to haplotypes, we can further enhance the ability of HKAT to detect uncommon causal variants. With scenarios simulated according to the population genetics theory, HKAT is shown to be a powerful method for detecting chromosomal regions harboring uncommon causal variants.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|
41
|
Feng JY, Zhang J, Zhang WJ, Wang SB, Han SF, Zhang YM. An efficient hierarchical generalized linear mixed model for mapping QTL of ordinal traits in crop cultivars. PLoS One 2013; 8:e59541. [PMID: 23593144 PMCID: PMC3614919 DOI: 10.1371/journal.pone.0059541] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/19/2012] [Accepted: 02/15/2013] [Indexed: 11/18/2022] Open
Abstract
Many important phenotypic traits in plants are ordinal. However, relatively little is known about methodologies for ordinal trait association studies. In this study, we proposed a hierarchical generalized linear mixed model for mapping quantitative trait loci (QTL) of ordinal traits in crop cultivars. In this model, all the main-effect QTL and QTL-by-environment interactions were treated as random, while population mean, environmental effect and population structure were fixed. For parameter estimation, the pseudo-data normal approximation of the likelihood function and an empirical Bayes approach were adopted. A series of Monte Carlo simulation experiments was performed to confirm the reliability of the new method. The results showed that the new method works well, with satisfactory statistical power and precision. The new method was also used to dissect the genetic basis of soybean alkaline-salt tolerance in 257 soybean cultivars obtained, by stratified random sampling, from 6 geographic ecotypes in China. As a result, 6 main-effect QTL and 3 QTL-by-environment interactions were identified.
Affiliation(s)
- Jian-Ying Feng
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Jin Zhang
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Wen-Jie Zhang
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Shi-Bo Wang
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Shi-Feng Han
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Yuan-Ming Zhang
- Section on Statistical Genomics, State Key Laboratory of Crop Genetics and Germplasm Enhancement, Department of Crop Genetics and Breeding, Nanjing Agricultural University, Nanjing, Jiangsu, China
|
42
|
Chen YC, Carter H, Parla J, Kramer M, Goes FS, Pirooznia M, Zandi PP, McCombie WR, Potash JB, Karchin R. A hybrid likelihood model for sequence-based disease association studies. PLoS Genet 2013; 9:e1003224. [PMID: 23358228 PMCID: PMC3554549 DOI: 10.1371/journal.pgen.1003224] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 07/11/2012] [Accepted: 11/21/2012] [Indexed: 11/18/2022] Open
Abstract
In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values at the 0.05 level, of which one MAPK signaling pathway (KEGG) reaches trend-level significance after correcting for multiple testing. Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. 
One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
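Burden-style collapsing, as mentioned in this abstract, can be sketched in a few lines of Python: the rare variants in a gene are reduced to one score per individual. This is a generic illustration only. The 1% MAF cutoff, the function name, and the toy genotypes are assumptions, and the sketch is not the hybrid likelihood model proposed in the paper.

```python
def collapse_rare_variants(genotypes, mafs, maf_cutoff=0.01):
    """Per-individual count of minor alleles at rare variants in one gene.

    genotypes  : list of rows, one per individual, each holding the
                 minor-allele count (0, 1, 2) at every variant in the gene
    mafs       : minor allele frequency of each variant, in the same order
    maf_cutoff : variants with MAF below this value count as rare (assumed)
    """
    rare_columns = [j for j, maf in enumerate(mafs) if maf < maf_cutoff]
    burdens = []
    for row in genotypes:
        if len(row) != len(mafs):
            raise ValueError("each genotype row needs one entry per variant")
        burdens.append(sum(row[j] for j in rare_columns))
    return burdens


# Hypothetical gene with three variants; the second one is common and ignored.
genotypes = [
    [0, 1, 2],
    [1, 0, 0],
    [0, 0, 1],
]
burden = collapse_rare_variants(genotypes, mafs=[0.005, 0.2, 0.008])
```

The resulting per-individual scores can then be compared between cases and controls in place of many underpowered single-variant tests.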
Affiliation(s)
- Yun-Ching Chen
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Hannah Carter
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
- Jennifer Parla
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Melissa Kramer
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Fernando S. Goes
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- Mehdi Pirooznia
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- Peter P. Zandi
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America
- W. Richard McCombie
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- James B. Potash
- Department of Psychiatry, University of Iowa, Iowa City, Iowa, United States of America
- Rachel Karchin
- Department of Biomedical Engineering and Institute for Computational Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America
|
43
|
Mao X, Li Y, Liu Y, Lange L, Li M. Testing genetic association with rare variants in admixed populations. Genet Epidemiol 2013; 37:38-47. [PMID: 23032398 PMCID: PMC3524352 DOI: 10.1002/gepi.21687] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/29/2012] [Revised: 08/23/2012] [Accepted: 09/07/2012] [Indexed: 11/07/2022]
Abstract
Recent studies suggest that rare variants play an important role in the etiology of many traits. Although a number of methods have been developed for genetic association analysis of rare variants, they all assume a relatively homogeneous population under study. Such an assumption may not be valid for samples collected from admixed populations such as African Americans and Hispanic Americans as there is a great extent of local variation in ancestry in these populations. To ensure valid and more powerful rare variant association tests performed in admixed populations, we have developed a local ancestry-based weighted dosage test, which is able to take into account local ancestry of rare alleles, uncertainties in rare variant imputation when imputed data are included, and the direction of effect that rare variants exert on phenotypic outcome. We used simulated sequence data to show that our proposed test has controlled type I error rates, whereas naïve application of existing rare variants tests and tests that adjust for global ancestry lead to inflated type I error rates. We showed that our test has higher power than tests without proper adjustment of ancestry. We also applied the proposed method to a candidate gene study on low-density lipoprotein cholesterol. Our results suggest that it is important to appropriately control for potential population stratification induced by local ancestry difference in the analysis of rare variants in admixed populations.
Affiliation(s)
- Xianyun Mao
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
- Yun Li
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina
- Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina
- Yichuan Liu
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
- Leslie Lange
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina
- Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
|
44
|
Abstract
This unit provides an overview of the design and analysis of population-based case-control studies of genetic risk factors for complex disease. Considerations specific to genetic studies are emphasized. The unit reviews basic study designs differentiating case-control studies from others, presents different genetic association strategies (candidate gene, genome-wide association, and high-throughput sequencing), introduces basic methods of statistical analysis for case-control data and approaches to combining case-control studies, and discusses measures of association and impact. Admixed populations, controlling for confounding (including population stratification), consideration of multiple loci and environmental risk factors, and complementary analyses of haplotypes, genes, and pathways are briefly discussed. Readers are referred to basic texts on epidemiology for more details on general conduct of case-control studies.
Affiliation(s)
- Dana B Hancock
- Research Triangle Institute International, Research Triangle Park, North Carolina, USA
|
45
|
Abstract
It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits, and that the cumulative effects of multiple rare variants can explain a significant proportion of trait variance. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous traits and incorporates covariates to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption that all rare variant effects act in the same direction. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.
Affiliation(s)
- Haimao Zhan
  - Department of Botany and Plant Sciences, University of California Riverside, Riverside, California, United States of America
- Shizhong Xu
  - Department of Botany and Plant Sciences, University of California Riverside, Riverside, California, United States of America
|
46
|
Lin WY, Yi N, Zhi D, Zhang K, Gao G, Tiwari HK, Liu N. Haplotype-based methods for detecting uncommon causal variants with common SNPs. Genet Epidemiol 2012; 36:572-82. [PMID: 22706849 DOI: 10.1002/gepi.21650] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 01/29/2012] [Revised: 04/19/2012] [Accepted: 05/09/2012] [Indexed: 01/01/2023]
Abstract
Detecting uncommon causal variants (minor allele frequency [MAF] < 5%) is difficult with commercial single-nucleotide polymorphism (SNP) arrays that are designed to capture common variants (MAF > 5%). Haplotypes can provide insights into underlying linkage disequilibrium (LD) structure and can tag uncommon variants that are not well tagged by common variants. In this work, we propose a wei-SIMc-matching test that inversely weights haplotype similarities with the estimated standard deviation of haplotype counts to boost the power of similarity-based approaches for detecting uncommon causal variants. We then compare the power of the wei-SIMc-matching test with that of several popular haplotype-based tests, including four other similarity-based tests, a global score test for haplotypes (global), a test based on the maximum score statistic over all haplotypes (max), and two newly proposed haplotype-based tests for rare variant detection. With systematic simulations under a wide range of LD patterns, the results show that wei-SIMc-matching and global are the two most powerful tests. Among these two tests, wei-SIMc-matching has reliable asymptotic P-values, whereas global needs permutations to obtain reliable P-values when the frequencies of some haplotype categories are low or when the trait is skewed. Therefore, we recommend wei-SIMc-matching for detecting uncommon causal variants with surrounding common SNPs, in light of its power and computational feasibility.
Affiliation(s)
- Wan-Yu Lin
- Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
|