1
Biziaev T, Kopciuk K, Chekouo T. Using prior-data conflict to tune Bayesian regularized regression models. Statistics and Computing 2025; 35:53. [PMID: 39990592 PMCID: PMC11842445 DOI: 10.1007/s11222-025-10582-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Accepted: 02/03/2025] [Indexed: 02/25/2025]
Abstract
In high-dimensional regression models, variable selection becomes challenging from both a computational and a theoretical perspective. Bayesian regularized regression via shrinkage priors, such as the Laplace or spike-and-slab prior, is an effective method for variable selection in p > n scenarios, provided the shrinkage priors are configured adequately. We propose an empirical Bayes configuration using checks for prior-data conflict: tests that assess whether the prior and the data provide conflicting information about the parameters. We apply our proposed method to the Bayesian LASSO and spike-and-slab shrinkage priors in the linear regression model and assess the variable selection performance of our prior configurations through a high-dimensional simulation study. Additionally, we apply our method to proteomic data collected from patients admitted to the Albany Medical Center in Albany, NY, in April 2020 with COVID-like respiratory issues. Simulation results suggest our proposed configurations may outperform competing models when the true regression effects are small.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-025-10582-1.
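For orientation, the two shrinkage priors named in the abstract are commonly written as below. This is only a sketch of the textbook forms (a conditional Laplace prior, as usually associated with the Bayesian LASSO, and a point-mass spike with a normal slab). The symbols \lambda, \sigma^2, \gamma_j, \tau^2, and \theta are assumed notation, and the paper's own parameterization may differ.

```latex
% Bayesian LASSO: conditional Laplace (double exponential) prior on the coefficients
\pi(\beta \mid \sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sigma}
    \exp\!\left( -\frac{\lambda \, |\beta_j|}{\sigma} \right)

% Spike-and-slab: point mass at zero (spike) mixed with a normal slab
\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, \delta_0 + \gamma_j \, \mathcal{N}(0, \tau^2),
\qquad \gamma_j \sim \mathrm{Bernoulli}(\theta)
```

In both forms the tuning quantities (\lambda, or \tau^2 and \theta) govern how strongly coefficients are pulled toward zero, which is the kind of prior configuration the abstract proposes to set using prior-data conflict checks.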
Affiliation(s)
- Timofei Biziaev
  - Department of Mathematics and Statistics, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4 Canada
- Karen Kopciuk
  - Department of Mathematics and Statistics, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4 Canada
  - Cancer Epidemiology and Prevention Research, Cancer Care Alberta, Alberta Health Services, 3395 Hospital Drive N.W., Calgary, AB T2N 5G2 Canada
  - Department of Oncology, Community Health Sciences, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4 Canada
- Thierry Chekouo
  - Department of Mathematics and Statistics, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4 Canada
  - Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55455 USA
2
Ollier A, Mozgunov P. On Inclusion of Covariates in Model Based Dose Finding Clinical Trial Designs. Stat Med 2025; 44:e10337. [PMID: 39853785 PMCID: PMC11758501 DOI: 10.1002/sim.10337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 12/24/2024] [Accepted: 12/27/2024] [Indexed: 01/26/2025]
Abstract
There is a growing number of Phase I dose-finding studies that use a model-based approach, such as the CRM or the EWOC method, to estimate the dose-toxicity relationship. It is common to assume that all patients will have a similar toxicity risk at a given dose, regardless of their individual characteristics. In many trials, however, some patient covariates (e.g., a concomitant drug assigned by a clinician) might affect the dose-toxicity relationship. In this work, motivated by a real trial, we evaluate the impact of including (or omitting) patient covariates on the individual target dose recommendations and on patient safety in a Phase I model-based dose-finding study. We investigated several variable penalisation criteria and found that, for continuous and binary covariates, omitting a prognostic covariate leads to a drastically low proportion of correct selections and an increase in overdosing. At the same time, including a covariate can lead to good operating characteristics in all scenarios but can sometimes slightly decrease the proportion of good selections and increase overdosing. To tackle this, we propose to use a Bayesian LASSO Bayesian Logistic Regression Model (BLRM) and a Spike-and-Slab BLRM. We found that the BLRM coupled with the Bayesian LASSO and the BLRM with Spike-and-Slab are, on average, better suited to handling variable inclusion.
Affiliation(s)
- Adrien Ollier
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Pavel Mozgunov
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
3
Lowe MX, Kettner H, Jolly DRP, Carhart-Harris RL, Jackson H. Long-term benefits to psychological health and well-being after ceremonial use of Ayahuasca in Middle Eastern and North African immigrants and refugees. Front Psychiatry 2024; 15:1279887. [PMID: 38666090 PMCID: PMC11044680 DOI: 10.3389/fpsyt.2024.1279887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 03/20/2024] [Indexed: 04/28/2024]
Abstract
Background: Refugees and immigrants can experience complex stressors from the process of immigration that can have lasting and severe long-term mental health consequences. Experiences after ayahuasca ingestion are shown to produce positive effects on psychological wellbeing and mental health, including anecdotal reports of improved symptoms of trauma and related disorders. However, data on the longitudinal health impact of naturalistic ayahuasca use in Middle Eastern and North African (MENA) immigrant and refugee populations is limited.
Aims: The current longitudinal online survey study was conducted to gather prospective data on ceremonial ayahuasca use in a group (N = 15) of primarily female MENA immigrants and refugees and to provide further insight into the patterns and outcomes surrounding that use. The study sought to assess self-reported changes in physical and mental health, well-being, and psychological functioning, examine relationships between aspects of individual mindset (e.g., psychedelic preparedness) prior to ayahuasca use and observed outcomes during (e.g., subjective drug effects) and afterwards (i.e., persisting effects), characterize risks and negative experiences, and describe trauma exposure and personal history.
Results/Outcomes: Our findings revealed ceremonial use of ayahuasca is associated with significant improvements in mental health, well-being, and psychological functioning, including reductions in depression, anxiety, and shame, and increases in cognitive reappraisal and self-compassion. Most participants reported no lasting adverse effects and experienced notable positive behavioral changes persisting months after ingestion.
Conclusion/Interpretation: While preliminary, results suggest naturalistic ayahuasca use might hold therapeutic potential for MENA populations exposed to trauma prior to and during the process of migration.
Affiliation(s)
- Hannes Kettner
  - Psychedelics Division, Neuroscape, University of California, San Francisco, San Francisco, CA, United States
  - Centre for Psychedelic Research, Imperial College London, London, United Kingdom
- Robin L. Carhart-Harris
  - Psychedelics Division, Neuroscape, University of California, San Francisco, San Francisco, CA, United States
  - Centre for Psychedelic Research, Imperial College London, London, United Kingdom
4
Liu Z, Turkmen AS, Lin S. Bayesian LASSO for population stratification correction in rare haplotype association studies. Stat Appl Genet Mol Biol 2024; 23:sagmb-2022-0034. [PMID: 38235525 PMCID: PMC10794901 DOI: 10.1515/sagmb-2022-0034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Accepted: 12/19/2023] [Indexed: 01/19/2024]
Abstract
Population stratification (PS) is one major source of confounding in both single nucleotide polymorphism (SNP) and haplotype association studies. To address PS, principal component regression (PCR) and linear mixed model (LMM) are the current standards for SNP associations, which are also commonly borrowed for haplotype studies. However, the underfitting and overfitting problems introduced by PCR and LMM, respectively, have yet to be addressed. Furthermore, there have been only a few theoretical approaches proposed to address PS specifically for haplotypes. In this paper, we propose a new method under the Bayesian LASSO framework, QBLstrat, to account for PS in identifying rare and common haplotypes associated with a continuous trait of interest. QBLstrat utilizes a large number of principal components (PCs) with appropriate priors to sufficiently correct for PS, while shrinking the estimates of unassociated haplotypes and PCs. We compare the performance of QBLstrat with the Bayesian counterparts of PCR and LMM and a current method, haplo.stats. Extensive simulation studies and real data analyses show that QBLstrat is superior in controlling false positives while maintaining competitive power for identifying true positives under PS.
Affiliation(s)
- Zilu Liu
  - Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
5
Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genet 2023; 19:e1011057. [PMID: 38039339 PMCID: PMC10718418 DOI: 10.1371/journal.pgen.1011057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 12/13/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023]
Abstract
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Affiliation(s)
- Han Chen
  - Human Genetics Center, Department of Epidemiology, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Ardalan Naseri
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Degui Zhi
  - Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
6
Liang X, Sun H. Weighted Selection Probability to Prioritize Susceptible Rare Variants in Multi-Phenotype Association Studies with Application to a Soybean Genetic Data Set. J Comput Biol 2023; 30:1075-1088. [PMID: 37871292 DOI: 10.1089/cmb.2022.0487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Rare variant association studies with multiple traits or diseases have drawn a lot of attention, since association signals of rare variants can be boosted if more than one phenotype outcome is associated with the same rare variants. Most of the existing statistical methods to identify rare variants associated with multiple phenotypes are based on a group test, where pre-specified genetic regions are tested one at a time. However, these methods are not designed to locate susceptible rare variants within the genetic region. In this article, we propose new statistical methods to prioritize rare variants within a genetic region when a group test for the genetic region identifies a statistical association with multiple phenotypes. The proposed approach computes the weighted selection probability (WSP) of individual rare variants and ranks them from largest to smallest according to their WSP. In simulation studies, we demonstrated that the proposed method outperforms other statistical methods in terms of true positive selection when multiple phenotypes are correlated with each other. We also applied it to our soybean single nucleotide polymorphism (SNP) data with 13 highly correlated amino acids, where we identified some potentially susceptible rare variants in chromosome 19.
Affiliation(s)
- Xianglong Liang
  - Department of Statistics, Pusan National University, Busan, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, Korea
7
MacTavish R, Bixby H, Cavanaugh A, Agyei-Mensah S, Bawah A, Owusu G, Ezzati M, Arku R, Robinson B, Schmidt AM, Baumgartner J. Identifying deprived "slum" neighbourhoods in the Greater Accra Metropolitan Area of Ghana using census and remote sensing data. World Development 2023; 167:106253. [PMID: 37767357 PMCID: PMC7615130 DOI: 10.1016/j.worlddev.2023.106253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/29/2023]
Abstract
Background: Identifying urban deprived areas, including slums, can facilitate more targeted planning and development policies in cities to reduce socio-economic and health inequities, but methods to identify them are often ad-hoc, resource intensive, and cannot keep pace with rapidly urbanizing communities.
Objectives: We apply a spatial modelling approach to identify census enumeration areas (EAs) in the Greater Accra Metropolitan Area (GAMA) of Ghana with a high probability of being a deprived area using publicly available census and remote sensing data.
Methods: We obtained United Nations (UN) supported field mapping data that identified deprived "slum" areas in Accra's urban core, data on housing and population conditions from the most recent census, and remotely sensed data on environmental conditions in the GAMA. We first fitted a Bayesian logistic regression model on the data in Accra's urban core (n = 2,414 EAs) that estimated the relationship between housing, population, and environmental predictors and being a deprived area according to the UN's deprived area assessment. Using these relationships, we predicted the probability of being a deprived area for each of the 4,615 urban EAs in GAMA.
Results: 899 (19%) of the 4,615 urban EAs in GAMA, with an estimated 745,714 residents (22% of its urban population), had a high predicted probability (≥80%) of being a deprived area. These deprived EAs were dispersed across GAMA and relatively heterogeneous in their housing and environmental conditions, but shared some common features including a higher population density, lower elevation and vegetation abundance, and less access to indoor piped water and sanitation.
Conclusion: Our approach using ubiquitously available administrative and satellite data can be used to identify deprived neighbourhoods where interventions are warranted to improve living conditions, and track progress in achieving the Sustainable Development Goals aiming to reduce the population living in unsafe or vulnerable human settlements.
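The prediction step described in the abstract (fit a logistic model on labelled areas, then flag areas whose predicted probability is at least 80%) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the posterior draws, predictor values, area names, and function names are all hypothetical, and the spatial structure of the published model is ignored.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(draws, x):
    """Average the logistic prediction over posterior draws.

    Each draw is (intercept, coefficients); x holds one area's predictors.
    """
    probs = []
    for intercept, coefs in draws:
        z = intercept + sum(b * v for b, v in zip(coefs, x))
        probs.append(sigmoid(z))
    return sum(probs) / len(probs)


def flag_deprived(draws, areas, threshold=0.80):
    """Flag areas whose posterior mean probability meets the threshold."""
    return {
        name: predicted_probability(draws, x) >= threshold
        for name, x in areas.items()
    }


# Hypothetical posterior draws (intercept, [coef_1, coef_2]); not real estimates.
draws = [(-1.0, [2.0, -1.5]), (-0.8, [2.2, -1.4]), (-1.2, [1.9, -1.6])]

# Hypothetical standardized predictors for two enumeration areas.
areas = {"EA_A": [2.5, -0.5], "EA_B": [0.0, 1.0]}

flags = flag_deprived(draws, areas)
print(flags)
```

Averaging over draws rather than plugging in a single point estimate is one common way to carry posterior uncertainty into the predicted probability. Whether the paper summarizes its predictions this way is an assumption here.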
Affiliation(s)
- Robert MacTavish
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Canada
  - Institute for Health and Social Policy, McGill University, Montreal, Canada
- Honor Bixby
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Canada
  - Institute for Health and Social Policy, McGill University, Montreal, Canada
  - Institute of Public Health and Wellbeing, University of Essex, Colchester, England
- Samuel Agyei-Mensah
  - Department of Geography and Resource Development, University of Ghana, Accra, Ghana
- Ayaga Bawah
  - Department of Geography and Resource Development, University of Ghana, Accra, Ghana
- George Owusu
  - Department of Geography and Resource Development, University of Ghana, Accra, Ghana
  - Institute of Statistical, Social and Economic Research, University of Ghana, Accra, Ghana
- Majid Ezzati
  - Faculty of Medicine, School of Public Health, Imperial College, London, England
- Raphael Arku
  - Institute for Global Health, University of Massachusetts Amherst, Amherst, United States
  - Department of Environmental Health Sciences, University of Massachusetts Amherst, Amherst, United States
- Brian Robinson
  - Department of Geography, McGill University, Montreal, Canada
- Alexandra M. Schmidt
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Canada
- Jill Baumgartner
  - Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Montreal, Canada
  - Institute for Health and Social Policy, McGill University, Montreal, Canada
8
Sajal IH, Biswas S. Bivariate quantitative Bayesian LASSO for detecting association of rare haplotypes with two correlated continuous phenotypes. Front Genet 2023; 14:1104727. [PMID: 36968609 PMCID: PMC10033866 DOI: 10.3389/fgene.2023.1104727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 02/21/2023] [Indexed: 03/12/2023]
Abstract
In genetic association studies, the multivariate analysis of correlated phenotypes offers statistical and biological advantages compared to analyzing one phenotype at a time. The joint analysis utilizes additional information contained in the correlation and avoids multiple testing. It also provides an opportunity to investigate and understand shared genetic mechanisms of multiple phenotypes. Bivariate logistic Bayesian LASSO (LBL) was proposed earlier to detect rare haplotypes associated with two binary phenotypes or one binary and one continuous phenotype jointly. There is currently no haplotype association test available that can handle multiple continuous phenotypes. In this study, by employing the framework of bivariate LBL, we propose bivariate quantitative Bayesian LASSO (QBL) to detect rare haplotypes associated with two continuous phenotypes. Bivariate QBL removes unassociated haplotypes by regularizing the regression coefficients and utilizing a latent variable to model correlation between two phenotypes. We carry out extensive simulations to investigate the performance of bivariate QBL and compare it with that of a standard (univariate) haplotype association test, Haplo.score (applied twice to two phenotypes individually). Bivariate QBL performs better than Haplo.score in all simulations with varying degrees of power gain. We analyze Genetic Analysis Workshop 19 exome sequencing data on systolic and diastolic blood pressures and detect several rare haplotypes associated with the two phenotypes.
Affiliation(s)
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, United States
9
Seffernick AE, Mrózek K, Nicolet D, Stone RM, Eisfeld AK, Byrd JC, Archer KJ. High-dimensional genomic feature selection with the ordered stereotype logit model. Brief Bioinform 2022; 23:bbac414. [PMID: 36184192 PMCID: PMC9677495 DOI: 10.1093/bib/bbac414] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/26/2022] [Revised: 08/18/2022] [Accepted: 08/27/2022] [Indexed: 12/30/2022]
Abstract
For many high-dimensional genomic and epigenomic datasets, the outcome of interest is ordinal. While these ordinal outcomes are often thought of as the observed cutpoints of some latent continuous variable, some ordinal outcomes are truly discrete and are comprised of the subjective combination of several factors. The nonlinear stereotype logistic model, which does not assume proportional odds, was developed for these 'assessed' ordinal variables. It has previously been extended to the frequentist high-dimensional feature selection setting, but the Bayesian framework provides some distinct advantages in terms of simultaneous uncertainty quantification and variable selection. Here, we review the stereotype model and Bayesian variable selection methods and demonstrate how to combine them to select genomic features associated with discrete ordinal outcomes. We compared the Bayesian and frequentist methods in terms of variable selection performance. We additionally applied the Bayesian stereotype method to an acute myeloid leukemia RNA-sequencing dataset to further demonstrate its variable selection abilities by identifying features associated with the European LeukemiaNet prognostic risk score.
Affiliation(s)
- Anna Eames Seffernick
  - Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Krzysztof Mrózek
  - Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
  - The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
- Deedra Nicolet
  - Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
  - The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
  - Alliance Statistics and Data Management Center, The Ohio State University Comprehensive Cancer Center, Columbus, OH, USA
- Richard M Stone
  - Dana Farber/Partners Cancer Care, Harvard University, Boston, MA, USA
- Ann-Kathrin Eisfeld
  - Clara D. Bloomfield Center for Leukemia Outcomes Research, The Ohio State University, Columbus, OH, USA
  - The Ohio State Comprehensive Cancer Center, Columbus, OH, USA
- John C Byrd
  - Department of Internal Medicine, University of Cincinnati, Cincinnati, OH, USA
- Kellie J Archer
  - Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
10
Yamaguchi Y, Yoshida S, Misumi T, Maruo K. Multiple imputation for longitudinal data using Bayesian lasso imputation model. Stat Med 2022; 41:1042-1058. [PMID: 35064581 DOI: 10.1002/sim.9315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2021] [Revised: 11/03/2021] [Accepted: 12/21/2021] [Indexed: 11/11/2022]
Abstract
Multiple imputation is a promising approach to handling missing data and is widely used in the analysis of longitudinal clinical studies. A key consideration in implementing multiple imputation is to obtain accurate imputed values by specifying an imputation model that incorporates auxiliary variables potentially associated with the missing variables. The use of informative auxiliary variables is known to make the missing at random assumption more plausible and to help reduce the uncertainty of the imputations; however, in many cases it is not straightforward to pre-specify them. We propose a data-driven specification of the imputation model using the Bayesian lasso in the context of longitudinal clinical studies, and develop a built-in function for the Bayesian lasso imputation model that is performed within the framework of multiple imputation using chained equations. A simulation study suggested that the Bayesian lasso imputation model worked well in a variety of longitudinal study settings, providing unbiased treatment effect estimates with well-controlled type I error rates and coverage probabilities of the confidence interval; in contrast, ignoring the informative auxiliary variables led to serious bias and inflation of the type I error rate. Moreover, the Bayesian lasso imputation model offered higher statistical power than conventional imputation methods. In our simulation study, the gains in statistical power were remarkable when the sample size was small relative to the number of auxiliary variables. An illustration through a real example also suggested that the Bayesian lasso imputation model could give smaller standard errors of the treatment effect estimate.
Affiliation(s)
- Satoshi Yoshida
  - Data Science, Development, Astellas Pharma Inc., Tokyo, Japan
- Toshihiro Misumi
  - Department of Biostatistics, School of Medicine, Yokohama City University, Yokohama, Japan
- Kazushi Maruo
  - Department of Biostatistics, Faculty of Medicine, University of Tsukuba, Tsukuba, Japan
11
Zhang Y, Archer KJ. Bayesian variable selection for high-dimensional data with an ordinal response: identifying genes associated with prognostic risk group in acute myeloid leukemia. BMC Bioinformatics 2021; 22:539. [PMID: 34727888 PMCID: PMC8565083 DOI: 10.1186/s12859-021-04432-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/01/2021] [Accepted: 10/04/2021] [Indexed: 12/18/2022]
Abstract
Background: Acute myeloid leukemia (AML) is a heterogeneous cancer of the blood, though specific recurring cytogenetic abnormalities in AML are strongly associated with attaining complete response after induction chemotherapy, remission duration, and survival. Therefore recurring cytogenetic abnormalities have been used to segregate patients into favorable, intermediate, and adverse prognostic risk groups. However, it is unclear how expression of genes is associated with these prognostic risk groups. We postulate that expression of genes monotonically associated with these prognostic risk groups may yield important insights into leukemogenesis. Therefore, in this paper we propose penalized Bayesian ordinal response models to predict prognostic risk group using gene expression data. We consider a double exponential prior, a spike-and-slab normal prior, a spike-and-slab double exponential prior, and a regression-based approach with variable inclusion indicators for modeling our high-dimensional ordinal response, prognostic risk group, and identify genes through hypothesis tests using Bayes factor.
Results: Gene expression was ascertained using Affymetrix HG-U133Plus2.0 GeneChips for 97 favorable, 259 intermediate, and 97 adverse risk AML patients. When applying our penalized Bayesian ordinal response models, genes identified for model inclusion were consistent among the four different models. Additionally, the genes included in the models were biologically plausible, as most have been previously associated with either AML or other types of cancer.
Conclusion: These findings demonstrate that our proposed penalized Bayesian ordinal response models are useful for performing variable selection for high-dimensional genomic data and have the potential to identify genes relevantly associated with an ordinal phenotype.
Affiliation(s)
- Kellie J Archer
  - Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
12
Yang H, Xiong W, Zhang X, Wang K, Tian M. Penalized homophily latent space models for directed scale-free networks. PLoS One 2021; 16:e0253873. [PMID: 34339437 PMCID: PMC8328337 DOI: 10.1371/journal.pone.0253873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2021] [Accepted: 06/14/2021] [Indexed: 11/28/2022]
Abstract
Online social networks like Twitter and Facebook are among the most popular sites on the Internet. Most online social networks exhibit specific features, including reciprocity, transitivity, and degree heterogeneity. Such networks are called scale-free networks and have drawn considerable research attention. The aim of this paper is to develop a novel methodology for directed network embedding within the latent space model (LSM) framework. It is known that the link probability between two individuals may increase as their features become more similar, a property referred to as homophily. To this end, penalized pair-specific attributes, acting as a distance measure, are introduced to provide a more powerful interpretation and improve link prediction accuracy; the resulting models are named penalized homophily latent space models (PHLSM). The proposed models also account for the in-degree heterogeneity of directed scale-free networks by embedding popularity scales. We also introduce LASSO-based PHLSM to produce an accurate and sparse model for high-dimensional covariates. We make Bayesian inference using MCMC algorithms. The finite sample performance of the proposed models is evaluated on three benchmark simulation datasets and two real data examples. Our methods are competitive and interpretable, and they outperform existing approaches for fitting directed networks.
Affiliation(s)
- Hanxuan Yang
  - School of Statistics, University of International Business and Economics, Beijing, China
- Wei Xiong
  - School of Statistics, University of International Business and Economics, Beijing, China
- Xueliang Zhang
  - Department of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, China
- Kai Wang
  - Department of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, China
- Maozai Tian
  - Department of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, China
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
13
Impairments of Photoreceptor Outer Segments Renewal and Phototransduction Due to a Peripherin Rare Haplotype Variant: Insights from Molecular Modeling. Int J Mol Sci 2021; 22:ijms22073484. [PMID: 33801777 PMCID: PMC8036374 DOI: 10.3390/ijms22073484] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/12/2021] [Revised: 03/23/2021] [Accepted: 03/25/2021] [Indexed: 12/14/2022]
Abstract
BACKGROUND Retinitis pigmentosa punctata albescens (RPA) is a particular form of retinitis pigmentosa characterized by childhood-onset night blindness and areas of peripheral retinal atrophy. We investigated the genetic cause of RPA in a family consisting of two affected Egyptian brothers with healthy consanguineous parents. METHODS Mutational analysis of four RPA causative genes was performed by Sanger sequencing on both probands, and detected variants were subsequently genotyped in their parents. The variants found were then characterized in depth, statistically and in silico, to determine their possible effects and association with RPA. RESULTS Both brothers carry three missense PRPH2 variants in a homozygous condition (c.910C > A, c.929G > A, and c.1013A > C) and two promoter variants, in the RHO (c.-26A > G) and RLBP1 (c.-70G > A) genes, respectively. Haplotype analyses highlighted a rare PRPH2 haplotype variant (GAG) that may alter PRPH2 binding with melanoregulin and other outer segment proteins, followed by photoreceptor outer segment instability. Furthermore, an altered balance of transcription factor binding sites, due to the presence of the RHO and RLBP1 promoter variants, might cause a comprehensive downregulation of both genes, possibly altering the visual-related pathway shared with PRPH2. CONCLUSIONS Despite several limitations, the study might be a relevant step towards detection of novel scenarios in RPA etiopathogenesis.
|
14
|
Yuan X, Biswas S. Detecting rare haplotype association with two correlated phenotypes of binary and continuous types. Stat Med 2021; 40:1877-1900. [PMID: 33438281 DOI: 10.1002/sim.8877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2020] [Revised: 11/18/2020] [Accepted: 12/25/2020] [Indexed: 11/10/2022]
Abstract
Multiple correlated traits/phenotypes are often collected in genetic association studies and they may share a common genetic mechanism. Joint analysis of correlated phenotypes has well-known advantages over one-at-a-time analysis, including gain in power and better understanding of genetic etiology. However, when the phenotypes are of discordant types such as binary and continuous, the joint modeling is more challenging. Another research area of current interest is discovery of rare genetic variants. Currently there is no method available for detecting association of rare (or common) haplotypes with multiple discordant phenotypes jointly. Our goal is to fill this gap specifically for two discordant phenotypes. We consider a rare haplotype association method for a binary phenotype, logistic Bayesian LASSO (univariate LBL), and its extension for two correlated binary phenotypes (bivariate LBL-2B). Under this framework, we propose a haplotype association test with binary and continuous phenotypes jointly (bivariate LBL-BC). Specifically, we use a latent variable to induce correlation between the two phenotypes. We carry out extensive simulations to investigate bivariate LBL-BC and compare it with univariate LBL and bivariate LBL-2B. In most settings, bivariate LBL-BC performs the best. In only two situations does bivariate LBL-BC have similar performance: when the two phenotypes are (1) weakly or not correlated and the target haplotype affects the binary phenotype only, and (2) strongly positively correlated and the target haplotype affects both phenotypes in the positive direction. Finally, we apply the method to a data set on lung cancer and nicotine dependence and detect several haplotypes, including a rare one.
Affiliation(s)
- Xiaochen Yuan
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
|
15
|
Zhang Y, Archer KJ. Bayesian penalized cumulative logit model for high-dimensional data with an ordinal response. Stat Med 2020; 40:1453-1481. [PMID: 33336826 DOI: 10.1002/sim.8851] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/13/2019] [Revised: 11/23/2020] [Accepted: 11/23/2020] [Indexed: 01/15/2023]
Abstract
Many previous studies have identified associations between gene expression, measured using high-throughput genomic platforms, and quantitative or dichotomous traits. However, we note that health outcome and disease status measurements frequently appear on an ordinal scale, that is, the outcome is categorical but has inherent ordering. Identification of important genes may be useful for developing novel diagnostic and prognostic tools to predict or classify stage of disease. Gene expression data are usually high-dimensional, meaning that the number of genes is much larger than the sample size or number of patients. Herein we describe some existing frequentist methods for modeling an ordinal response in a high-dimensional predictor space. Following Tibshirani (1996), who described the LASSO estimate as the Bayesian posterior mode when the regression coefficients have independent Laplace priors, we propose a new approach for high-dimensional data with an ordinal response that is rooted in the Bayesian paradigm. We show that our proposed Bayesian approach outperforms existing frequentist methods through simulation studies. We then compare the performance of frequentist and Bayesian approaches using a study evaluating progression to hepatocellular carcinoma in hepatitis C infected patients.
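The correspondence attributed to Tibshirani (1996) in this abstract can be written out in a few lines. As a sketch only, assume a generic log-likelihood $\ell(\beta)$ (for a cumulative logit model this would be the ordinal log-likelihood) and independent Laplace priors with a common rate $\lambda$; both symbols are notation introduced here for illustration, not taken from the paper.

```latex
\begin{align*}
p(\beta_j \mid \lambda) &= \frac{\lambda}{2}\exp\left(-\lambda\lvert\beta_j\rvert\right), \qquad j = 1,\dots,p,\\
\log p(\beta \mid y, \lambda) &= \ell(\beta) - \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert + \text{const},\\
\hat{\beta}_{\text{mode}} &= \arg\max_{\beta}\Bigl\{\ell(\beta) - \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\Bigr\}.
\end{align*}
```

The last line is an L1-penalized likelihood criterion, so with $\lambda$ held fixed the posterior mode coincides with a LASSO-type estimate.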
Affiliation(s)
- Yiran Zhang
  - College of Public Health, The Ohio State University, Columbus, Ohio, USA
- Kellie J Archer
  - College of Public Health, The Ohio State University, Columbus, Ohio, USA
|
16
|
Zhou X, Wang M, Lin S. Detecting rare haplotypes associated with complex diseases using both population and family data: Combined logistic Bayesian Lasso. Stat Methods Med Res 2020; 29:3340-3350. [DOI: 10.1177/0962280220927728] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/15/2022]
Abstract
Haplotype-based association methods have been developed to understand the genetic architecture of complex diseases. Compared to single-variant-based methods, haplotype methods are thought to be more biologically relevant, since there are typically multiple non-independent genetic variants involved in complex diseases, and the use of haplotypes implicitly accounts for non-independence caused by linkage disequilibrium. In recent years, with the focus moving from common to rare variants, haplotype-based methods have also evolved accordingly to uncover the roles of rare haplotypes. One particular approach is regularization-based, with the use of Bayesian least absolute shrinkage and selection operator (Lasso) as an example. This type of methods has been developed for either case-control population data (the logistic Bayesian Lasso (LBL)) or family data (family-triad-based logistic Bayesian Lasso (famLBL)). In some situations, both family data and case-control data are available; therefore, it would be a waste of resources if only one of them could be analyzed. To make full usage of available data to increase power, we propose a unified approach that can combine both case-control and family data (combined logistic Bayesian Lasso (cLBL)). Through simulations, we characterized the performance of cLBL and showed the advantage of cLBL over existing methods. We further applied cLBL to the Framingham Heart Study data to demonstrate its utility in real data applications.
Affiliation(s)
- Xiaofei Zhou
  - Department of Statistics, The Ohio State University, Columbus, OH, USA
- Meng Wang
  - Battelle Center for Mathematical Medicine, Nationwide Children’s Hospital, Columbus, OH, USA
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, OH, USA
|
17
|
Gupta S, Lee REC, Faeder JR. Parallel Tempering with Lasso for model reduction in systems biology. PLoS Comput Biol 2020; 16:e1007669. [PMID: 32150537 PMCID: PMC7082068 DOI: 10.1371/journal.pcbi.1007669] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/24/2019] [Revised: 03/19/2020] [Accepted: 01/20/2020] [Indexed: 01/08/2023] Open
Abstract
Systems Biology models reveal relationships between signaling inputs and observable molecular or cellular behaviors. The complexity of these models, however, often obscures key elements that regulate emergent properties. We use a Bayesian model reduction approach that combines Parallel Tempering with Lasso regularization to identify minimal subsets of reactions in a signaling network that are sufficient to reproduce experimentally observed data. The Bayesian approach finds distinct reduced models that fit data equivalently. A variant of this approach that uses Lasso to perform selection at the level of reaction modules is applied to the NF-κB signaling network to test the necessity of feedback loops for responses to pulsatile and continuous pathway stimulation. Taken together, our results demonstrate that Bayesian parameter estimation combined with regularization can isolate and reveal core motifs sufficient to explain data from complex signaling systems.
Cells respond to diverse environmental cues using complex networks of interacting proteins and other biomolecules. Mathematical and computational models have become invaluable tools to understand these networks and make informed predictions to rationally perturb cell behavior. However, the complexity of detailed models that try to capture all known biochemical elements of signaling networks often makes it difficult to determine the key regulatory elements that are responsible for specific cell behaviors. Here, we present a Bayesian computational approach, PTLasso, to automatically extract minimal subsets of detailed models that are sufficient to explain experimental data. The method simultaneously calibrates and reduces models, and the Bayesian approach samples globally, allowing us to find alternate mechanistic explanations for the data if present. We demonstrate the method on both synthetic and real biological data and show that PTLasso is an effective method to isolate distinct parts of a larger signaling model that are sufficient for specific data.
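To make the combination of Parallel Tempering and Lasso regularization concrete, the sketch below shows the two generic ingredients: an L1-penalized energy (negative log posterior) and the standard replica-exchange swap rule. It is a minimal illustration, not the PTLasso implementation; the function names, the Gaussian error model with scale `sigma`, the penalty weight `lam`, and the choice to temper the whole posterior are all assumptions made here.

```python
import math
import random


def lasso_energy(residuals, params, sigma=1.0, lam=1.0):
    """Negative log posterior up to a constant (assumed form):
    Gaussian misfit term plus an L1 (Laplace-prior) penalty on the parameters."""
    sse = sum(r * r for r in residuals)
    return sse / (2.0 * sigma ** 2) + lam * sum(abs(t) for t in params)


def swap_log_ratio(beta_i, beta_j, energy_i, energy_j):
    """Log Metropolis ratio for exchanging the states of two chains that
    target exp(-beta * energy) at inverse temperatures beta_i and beta_j."""
    return (beta_i - beta_j) * (energy_i - energy_j)


def accept_swap(beta_i, beta_j, energy_i, energy_j, rng=random):
    """Accept the exchange with probability min(1, exp(log ratio))."""
    log_ratio = swap_log_ratio(beta_i, beta_j, energy_i, energy_j)
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)
```

In a full sampler each chain would also take ordinary Metropolis steps at its own temperature; the hot chains explore broadly, and swaps let the cold chain reach separate sparse solutions, which is the behaviour the abstract describes as finding alternate reduced models.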
Affiliation(s)
- Sanjana Gupta
  - Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Robin E C Lee
  - Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- James R Faeder
  - Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
|
18
|
Yuan X, Biswas S. Bivariate logistic Bayesian LASSO for detecting rare haplotype association with two correlated phenotypes. Genet Epidemiol 2019; 43:996-1017. [PMID: 31544985 DOI: 10.1002/gepi.22258] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/12/2019] [Revised: 07/31/2019] [Accepted: 08/09/2019] [Indexed: 11/08/2022]
Abstract
In genetic association studies, joint modeling of related traits/phenotypes can utilize the correlation between them and thereby provide more power and uncover additional information about genetic etiology. Moreover, detecting rare genetic variants is of current scientific interest as a key to missing heritability. Logistic Bayesian LASSO (LBL) has been proposed recently to detect rare haplotype variants using case-control data, that is, a single binary phenotype. As there is currently no haplotype association method that can handle multiple binary phenotypes, we extend LBL to fill this gap. We develop a bivariate model by using a latent variable to induce correlation between the two outcomes. We carry out extensive simulations to investigate the bivariate LBL and compare it with the univariate LBL. The bivariate LBL performs better than or similarly to the univariate LBL in most settings. It has the highest gain in power when a haplotype is associated with both traits and it affects at least one trait in a direction opposite to the direction of the correlation between the traits. We analyze two data sets, Genetic Analysis Workshop 19 sequence data on systolic and diastolic blood pressures and a genome-wide association data set on lung cancer and smoking, and detect several associated rare haplotypes.
Affiliation(s)
- Xiaochen Yuan
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas
|
19
|
Luo S, Chen Z. Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1609972] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
Affiliation(s)
- Shan Luo
  - Department of Statistics, Shanghai Jiao Tong University, Shanghai, China
- Zehua Chen
  - Department of Statistics & Applied Probability, National University of Singapore, Singapore
|
20
|
Papachristou C, Biswas S. Comparison of haplotype-based tests for detecting gene-environment interactions with rare variants. Brief Bioinform 2019; 21:851-862. [PMID: 31329820 DOI: 10.1093/bib/bbz031] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/16/2018] [Revised: 02/06/2019] [Accepted: 02/28/2019] [Indexed: 11/13/2022] Open
Abstract
Dissecting the genetic mechanism underlying a complex disease hinges on discovering gene-environment interactions (GXE). However, detecting GXE is a challenging problem especially when the genetic variants under study are rare. Haplotype-based tests have several advantages over the so-called collapsing tests for detecting rare variants as highlighted in recent literature. Thus, it is of practical interest to compare haplotype-based tests for detecting GXE including the recent ones developed specifically for rare haplotypes. We compare the following methods: haplo.glm, hapassoc, HapReg, Bayesian hierarchical generalized linear model (BhGLM) and logistic Bayesian LASSO (LBL). We simulate data under different types of association scenarios and levels of gene-environment dependence. We find that when the type I error rates are controlled to be the same for all methods, LBL is the most powerful method for detecting GXE. We applied the methods to a lung cancer data set, in particular, in region 15q25.1 as it has been suggested in the literature that it interacts with smoking to affect the lung cancer susceptibility and that it is associated with smoking behavior. LBL and BhGLM were able to detect a rare haplotype-smoking interaction in this region. We also analyzed the sequence data from the Dallas Heart Study, a population-based multi-ethnic study. Specifically, we considered haplotype blocks in the gene ANGPTL4 for association with trait serum triglyceride and used ethnicity as a covariate. Only LBL found interactions of haplotypes with race (Hispanic). Thus, in general, LBL seems to be the best method for detecting GXE among the ones we studied here. Nonetheless, it requires the most computation time.
Affiliation(s)
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
|
21
|
Datta AS, Lin S, Biswas S. A Family-Based Rare Haplotype Association Method for Quantitative Traits. Hum Hered 2019; 83:175-195. [PMID: 30799419 DOI: 10.1159/000493543] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 03/05/2018] [Accepted: 09/07/2018] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND The variants identified in genome-wide association studies account for only a small fraction of disease heritability. A key to this "missing heritability" is believed to be rare variants. Specifically, we focus on rare haplotype variant (rHTV). The existing methods for detecting rHTV are mostly population-based, and as such, are susceptible to population stratification and admixture, leading to an inflated false-positive rate. Family-based methods are more robust in this respect. METHODS We propose a method for detecting rHTVs associated with quantitative traits called family-based quantitative Bayesian LASSO (famQBL). FamQBL can analyze any type of pedigree and is based on a mixed model framework. We regularize the haplotype effects using Bayesian LASSO and estimate the posterior distributions using Markov chain Monte Carlo methods. RESULTS We conduct simulation studies, including analyses of Genetic Analysis Workshop 18 simulated data, to study the properties of famQBL and compare with a standard family-based haplotype association test implemented in FBAT (family-based association test) software. We find famQBL to be more powerful than FBAT with well-controlled false-positive rates. We also apply famQBL to the Framingham Heart Study data and detect an rHTV associated with diastolic blood pressure. CONCLUSION FamQBL can help uncover rHTVs associated with quantitative traits.
Affiliation(s)
- Ananda S Datta
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, Ohio, USA
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, USA
|
22
|
Zhou X, Wang M, Zhang H, Stewart WCL, Lin S. Logistic Bayesian LASSO for detecting association combining family and case-control data. BMC Proc 2018; 12:54. [PMID: 30263052 PMCID: PMC6156907 DOI: 10.1186/s12919-018-0139-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/25/2022] Open
Abstract
Because of the limited information from the GAW20 samples when only case-control or trio data are considered, we propose eLBL, an extension of the Logistic Bayesian LASSO (least absolute shrinkage and selection operator) methodology so that both types of data can be analyzed jointly in the hope of obtaining an increased statistical power, especially for detecting association between rare haplotypes and complex diseases. The methodology is further extended to account for familial correlation among the case-control individuals and the trios. A 2-step analysis strategy was taken to first perform a genome-wise single single-nucleotide polymorphism (SNP) search using the Monte Carlo pedigree disequilibrium test (MCPDT) to determine interesting regions for the Adult Treatment Panel (ATP) binary trait. Then eLBL was applied to haplotype blocks covering the flagged SNPs in Step 1. Several significantly associated haplotypes were identified; most are in blocks contained in protein coding genes that appear to be relevant for metabolic syndrome. The results are further substantiated with a Type I error study and by an additional analysis using the triglyceride measurements directly as a quantitative trait.
Affiliation(s)
- Xiaofei Zhou
  - Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210 USA
- Meng Wang
  - Battelle Center for Mathematical Medicine, Nationwide Children's Hospital Research Institute, 700 Childrens Drive, Columbus, OH 43205 USA
- Han Zhang
  - Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210 USA
- William C L Stewart
  - Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210 USA
  - Battelle Center for Mathematical Medicine, Nationwide Children's Hospital Research Institute, 700 Childrens Drive, Columbus, OH 43205 USA
- Shili Lin
  - Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210 USA
|
23
|
Wang X, Boekstegers F, Brinster R. Methods and results from the genome-wide association group at GAW20. BMC Genet 2018; 19:79. [PMID: 30255814 PMCID: PMC6157187 DOI: 10.1186/s12863-018-0649-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND This paper summarizes the contributions from the Genome-wide Association Study group (GWAS group) of the GAW20. The GWAS group contributions focused on topics such as association tests, phenotype imputation, and application of empirical kinships. The goals of the GWAS group contributions were varied. A real or a simulated data set based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study was employed by different methods. Different outcomes and covariates were considered, and quality control procedures varied throughout the contributions. RESULTS The consideration of heritability and family structure played a major role in some contributions. The inclusion of family information and adaptive weights based on data were found to improve power in genome-wide association studies. It was proven that gene-level approaches are more powerful than single-marker analysis. Other contributions focused on the comparison between pedigree-based kinship and empirical kinship matrices, and investigated similar results in heritability estimation, association mapping, and genomic prediction. A new approach for linkage mapping of triglyceride levels was able to identify a novel linkage signal. CONCLUSIONS This summary paper reports on promising statistical approaches and findings of the members of the GWAS group applied on real and simulated data which encompass the current topics of epigenetic and pharmacogenomics.
Affiliation(s)
- Xuexia Wang
  - University of North Texas, GAB 459, 1155 Union Circle #311430, Denton, TX 76203 USA
- Felix Boekstegers
  - Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 130.3, 69120 Heidelberg, Germany
- Regina Brinster
  - Institute of Medical Biometry and Informatics, University of Heidelberg, Im Neuenheimer Feld 130.3, 69120 Heidelberg, Germany
|
24
|
Zhang Y, Hofmann JN, Purdue MP, Lin S, Biswas S. Logistic Bayesian LASSO for genetic association analysis of data from complex sampling designs. J Hum Genet 2017; 62:819-829. [PMID: 28424482 PMCID: PMC5572548 DOI: 10.1038/jhg.2017.43] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/11/2016] [Revised: 03/21/2017] [Accepted: 03/22/2017] [Indexed: 01/20/2023]
Abstract
Detecting gene-environment interactions with rare variants is critical in dissecting the etiology of common diseases. Interactions with rare haplotype variants (rHTVs) are of particular interest. At the same time, complex sampling designs, such as stratified random sampling, are becoming increasingly popular for designing case-control studies, especially for recruiting controls. The US Kidney Cancer Study (KCS) is an example, wherein all available cases were included while the controls at each site were randomly selected from the population by frequency matching with cases based on age, sex and race. There is currently no rHTV association method that can account for such a complex sampling design. To fill this gap, we consider logistic Bayesian LASSO (LBL), an existing rHTV approach for case-control data, and show that its model can easily accommodate the complex sampling design. We study two extensions that include stratifying variables either as main effects only or with additional modeling of their interactions with haplotypes. We conduct extensive simulation studies to compare the complex sampling methods with the original LBL methods. We find that, when there is no interaction between haplotype and stratifying variables, both extensions perform well while the original LBL methods lead to inflated type I error rates. However, when such an interaction exists, it is necessary to include the interaction effect in the model to control the type I error rate. Finally, we analyze the KCS data and find a significant interaction between (current) smoking and a specific rHTV in the N-acetyltransferase 2 gene.
Affiliation(s)
- Yuan Zhang
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
- Jonathan N Hofmann
  - Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Mark P Purdue
  - Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, OH, USA
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
|
25
|
Datta AS, Zhang Y, Zhang L, Biswas S. Association of rare haplotypes on ULK4 and MAP4 genes with hypertension. BMC Proc 2016; 10:363-369. [PMID: 27980663 PMCID: PMC5133474 DOI: 10.1186/s12919-016-0057-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 01/11/2023] Open
Abstract
Several variants on the ULK4 and MAP4 genes on chromosome 3 have previously been implicated in hypertension. As a natural follow-up step, we explore association of haplotypes in those genes. We consider the Genetic Analysis Workshop 19 real data on unrelated individuals and analyze haplotype blocks of 5 single-nucleotide polymorphisms through a sliding window approach. We apply 4 haplotype association methods (haplo.score, haplo.glm, hapassoc, and logistic Bayesian LASSO (LBL)) and, for comparison, the sequence kernel association test (SKAT) and its variants. We find several rare haplotype blocks to be associated. To get an idea about the false-positive proportions, we also analyzed the data after permuting the case-control status of individuals. We found that LBL, unlike the other methods, maintains low false-positive rates in the presence of rare haplotypes. Thus, we conclude that the haplotypes found to be associated by LBL are more likely to be true positives. SKAT and its variants did not find significance on either gene.
Affiliation(s)
- Ananda S. Datta
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX USA
- Yuan Zhang
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX USA
- Lei Zhang
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX USA
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX USA
|
26
|
Zhang Y, Lin S, Biswas S. Detecting rare and common haplotype-environment interaction under uncertainty of gene-environment independence assumption. Biometrics 2016; 73:344-355. [PMID: 27478935 DOI: 10.1111/biom.12567] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 07/01/2015] [Revised: 05/01/2016] [Accepted: 06/01/2016] [Indexed: 11/28/2022]
Abstract
Finding rare variants and gene-environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer: there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior, rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I), by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.
Affiliation(s)
- Yuan Zhang
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas 75080, U.S.A
- Shili Lin
  - Department of Statistics, The Ohio State University, Columbus, Ohio 43210, U.S.A
- Swati Biswas
  - Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas 75080, U.S.A
|
27
|
Datta AS, Biswas S. Comparison of haplotype-based statistical tests for disease association with rare and common variants. Brief Bioinform 2015; 17:657-71. [PMID: 26338417 DOI: 10.1093/bib/bbv072] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 05/26/2015] [Indexed: 01/26/2023] Open
Abstract
Recent literature has highlighted the advantages of haplotype association methods for detecting rare variants associated with common diseases. As several new haplotype association methods have been proposed in the past few years, a comparison of new and standard methods is important and timely for guidance to practitioners. We consider nine methods: Haplo.score, Haplo.glm, Hapassoc, Bayesian hierarchical Generalized Linear Model (BhGLM), Logistic Bayesian LASSO (LBL), regularized GLM (rGLM), Haplotype Kernel Association Test, wei-SIMc-matching and Weighted Haplotype and Imputation-based Tests. These can be divided into two types, individual haplotype-specific tests and global tests, depending on whether there is just one overall test for a haplotype region (global) or there is an individual test for each haplotype in the region. Haplo.score is the only method that tests for both; Haplo.glm, Hapassoc, BhGLM and LBL are individual haplotype-specific, while the rest are global tests. For comparison, we also apply a popular collapsing method, Sequence Kernel Association Test (SKAT), and its two variants, SKAT-O (Optimal) and SKAT-C (Combined). We carry out an extensive comparison on our simulated data sets as well as on the Genetic Analysis Workshop (GAW) 18 simulated data. Further, we apply the methods to GAW18 real hypertension data and Dallas Heart Study sequence data. We find that LBL, Haplo.score (global test) and rGLM perform well over the scenarios considered here. Also, haplotype methods are more powerful (albeit more computationally intensive) than SKAT and its variants in scenarios where multiple causal variants act interactively to produce haplotype effects.
28
Kullback-Leibler divergence for detection of rare haplotype common disease association. Eur J Hum Genet 2015; 23:1558-65. [PMID: 25735482 DOI: 10.1038/ejhg.2015.25] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/23/2014] [Revised: 11/16/2014] [Accepted: 01/28/2015] [Indexed: 12/12/2022]
Abstract
Rare haplotypes may tag rare causal variants of common diseases; hence, detection of such rare haplotypes may also contribute to our understanding of complex disease etiology. Because rare haplotypes frequently result from common single-nucleotide polymorphisms (SNPs), focusing on rare haplotypes is much more economical compared with using rare single-nucleotide variants (SNVs) from sequencing, as SNPs are available and 'free' from already amassed genome-wide studies. Further, associated haplotypes may shed light on the underlying disease causal mechanism, a feat unmatched by SNV-based collapsing methods. In recent years, data mining approaches have been adapted to detect rare haplotype association. However, as they rely on an assumed underlying disease model and require the specification of a null haplotype, results can be erroneous if such assumptions are violated. In this paper, we present a haplotype association method based on Kullback-Leibler divergence (hapKL) for case-control samples. The idea is to compare haplotype frequencies for the cases versus the controls by computing symmetrical divergence measures. An important property of such measures is that both the frequencies and logarithms of the frequencies contribute in parallel, thus balancing the contributions from rare and common, and accommodating both deleterious and protective, haplotypes. A simulation study under various scenarios shows that hapKL has well-controlled type I error rates and good power compared with existing data mining methods. Application of hapKL to age-related macular degeneration (AMD) shows a strong association of the complement factor H (CFH) gene with AMD, identifying several individual rare haplotypes with strong signals.
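As a rough illustration of the comparison this abstract describes, the sketch below computes one common symmetric Kullback-Leibler measure (the Jeffreys divergence) between case and control haplotype frequencies. The abstract does not give the exact hapKL statistic, so the function name `symmetric_kl`, the smoothing constant `eps`, and the example frequencies are all assumptions made for illustration.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Jeffreys divergence KL(p||q) + KL(q||p) between two frequency vectors.

    `eps` is a hypothetical floor that guards against log(0) when a
    haplotype is absent from one group.
    """
    if len(p) != len(q):
        raise ValueError("frequency vectors must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, eps)
        qi = max(qi, eps)
        # Each term combines the frequency difference with the log ratio.
        total += (pi - qi) * math.log(pi / qi)
    return total


# Hypothetical haplotype frequencies (one common, two rare haplotypes).
cases = [0.85, 0.05, 0.10]
controls = [0.90, 0.08, 0.02]
print(symmetric_kl(cases, controls))
```

Each summand is `(p_i - q_i) * log(p_i / q_i)`, so a rare haplotype with a small absolute frequency difference can still contribute noticeably through its log ratio, and the sign of the difference (deleterious or protective) does not matter.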
29
Li Y, Nan B, Zhu J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 2015; 71:354-63. [PMID: 25732839 DOI: 10.1111/biom.12292] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Received: 08/01/2013] [Revised: 12/01/2014] [Accepted: 01/01/2015] [Indexed: 11/27/2022]
Abstract
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
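To make the two levels of sparsity concrete, here is a minimal sketch of a sparse group lasso penalty on a coefficient matrix: an L1 term that can zero individual coefficients plus a group-wise L2 term that can zero whole groups. The abstract does not state the paper's exact penalty or group weights, so the function name, the unweighted form, and the representation of groups as lists of (row, column) index pairs are assumptions.

```python
import math


def sparse_group_lasso_penalty(B, groups, lam_l1, lam_group):
    """Unweighted sparse group lasso penalty for a coefficient matrix B.

    B      : list of rows (predictors x responses)
    groups : list of groups, each a list of (row, col) index pairs;
             groups may overlap or nest because they are plain index sets
    """
    # Element-wise L1 term: encourages zeros within groups.
    l1 = sum(abs(b) for row in B for b in row)
    # Group-wise L2 term: encourages entire groups to vanish.
    group = sum(math.sqrt(sum(B[r][c] ** 2 for r, c in g)) for g in groups)
    return lam_l1 * l1 + lam_group * group


# Hypothetical 2 x 2 coefficient matrix with one active and one inactive group.
B = [[3.0, 4.0], [0.0, 0.0]]
groups = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
print(sparse_group_lasso_penalty(B, groups, lam_l1=1.0, lam_group=1.0))
```

Because each group is just a set of index pairs, the same function covers overlapping or nested structures, although fitting a model under such a penalty requires an optimizer that this sketch does not provide.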
Affiliation(s)
- Yanming Li
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Bin Nan
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Ji Zhu
- Department of Statistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
30
Zhang Y, Biswas S. An Improved Version of Logistic Bayesian LASSO for Detecting Rare Haplotype-Environment Interactions with Application to Lung Cancer. Cancer Inform 2015; 14:11-6. [PMID: 25733797 PMCID: PMC4332044 DOI: 10.4137/cin.s17290] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/18/2014] [Revised: 12/23/2014] [Accepted: 12/25/2014] [Indexed: 11/25/2022]
Abstract
The importance of haplotype association and gene–environment interactions (GxE) in the context of rare variants has been underlined in voluminous literature. Recently, software based on logistic Bayesian LASSO (LBL) was proposed for detecting GxE, where G is a rare (or common) haplotype variant (rHTV); it is called LBL-GxE. However, it required relatively long computation time and could handle only one environmental covariate with two levels. Here we propose an improved version of LBL-GxE, which is not only computationally faster but can also handle multiple covariates, each with multiple levels. We also discuss details of the software, including input, output, and some options. We apply LBL-GxE to a lung cancer dataset and find a rare haplotype with protective effect for current smokers. Our results indicate that LBL-GxE, especially with the improvements proposed here, is a useful and computationally viable tool for investigating rare haplotype interactions.
Affiliation(s)
- Yuan Zhang
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
- Swati Biswas
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
31
Wang M, Lin S. Detecting associations of rare variants with common diseases: collapsing or haplotyping? Brief Bioinform 2015; 16:759-68. [PMID: 25596401 DOI: 10.1093/bib/bbu050] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 09/26/2014] [Indexed: 01/11/2023]
Abstract
In recent years, a myriad of new statistical methods have been proposed for detecting associations of rare single-nucleotide variants (SNVs) with common diseases. These methods can be generally classified as 'collapsing' or 'haplotyping' based. The former is the predominant class, composed of most of the rare variant association methods proposed to date. However, recent works have suggested that haplotyping-based methods may offer advantages and can even be more powerful than collapsing methods in certain situations. In this article, we review and compare collapsing- versus haplotyping-based methods/software in terms of both power and type I error. For collapsing methods, we consider three approaches: Combined Multivariate and Collapsing, Sequence Kernel Association Test and Family-Based Association Test (FBAT): the first two are population based and are among the most popular; the last test is family based, a modification from the popular FBAT to accommodate rare SNVs. For haplotyping-based methods, we include Logistic Bayesian Lasso (LBL) for population data and family-based LBL (famLBL) for family (trio) data. These two methods are selected, as they can be used to test association for specific rare and common haplotypes. Our results show that haplotype methods can be more powerful than collapsing methods if there are interacting SNVs leading to larger haplotype effects. Even if only common SNVs are genotyped, haplotype methods can still detect specific rare haplotypes that tag rare causal SNVs. As expected, family-based methods are robust, whereas population-based methods are susceptible, to population substructure. However, the population-based haplotype approach appears to have smaller inflation of type I error than its collapsing counterparts.
32
Kawano S, Hoshina I, Shimamura K, Konishi S. PREDICTIVE MODEL SELECTION CRITERIA FOR BAYESIAN LASSO REGRESSION. JOURNAL JAPANESE SOCIETY OF COMPUTATIONAL STATISTICS 2015. [DOI: 10.5183/jjscs.1501001_220] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Affiliation(s)
- Shuichi Kawano
- Graduate School of Information Systems, The University of Electro-Communications
- Ibuki Hoshina
- Department of Mathematics, Graduate School of Science and Engineering, Chuo University
- Kaito Shimamura
- Department of Mathematics, Faculty of Science and Engineering, Chuo University
- Sadanori Konishi
- Department of Mathematics, Faculty of Science and Engineering, Chuo University
33
Satten GA, Biswas S, Papachristou C, Turkmen A, König IR. Population-based association and gene by environment interactions in Genetic Analysis Workshop 18. Genet Epidemiol 2014; 38 Suppl 1:S49-56. [PMID: 25112188 DOI: 10.1002/gepi.21825] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
In the past decade, genome-wide association studies have been successful in identifying genetic loci that play a role in many complex diseases. Despite this, it has become clear that for many traits, investigation of single common variants does not give a complete picture of the genetic contribution to the phenotype. Therefore a number of new approaches are currently being investigated to further the search for susceptibility loci or regions. We summarize the contributions to Genetic Analysis Workshop 18 (GAW18) that concern this search using methods for population-based association analysis. Many of the members of our GAW18 working group made use of data types that have only recently become available through the use of next-generation sequencing technologies, with many focusing on the investigation of rare variants instead of or in combination with common variants. Some contributors used a haplotype-based approach, which to date has been used relatively infrequently but may become more important for analyzing rare variant association data. Others analyzed gene-gene or gene-environment interactions, where novel statistical approaches were needed to make the best use of the available information without requiring an excessive computational burden. GAW18 provided participants with the chance to make use of state-of-the-art data, statistical techniques, and technology. We report here some of the experiences and conclusions that were reached by workshop participants who analyzed the GAW18 data as a population-based association study.
Affiliation(s)
- Glen A Satten
- Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
34
Biswas S, Papachristou C. Evaluation of logistic Bayesian LASSO for identifying association with rare haplotypes. BMC Proc 2014; 8:S54. [PMID: 25519334 PMCID: PMC4144467 DOI: 10.1186/1753-6561-8-s1-s54] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
Abstract
It has been hypothesized that rare variants may hold the key to unraveling the genetic transmission mechanism of many common complex traits. Currently, there is a dearth of statistical methods that are powerful enough to detect association with rare haplotypes. One of the recently proposed methods is logistic Bayesian LASSO for case-control data. By penalizing the regression coefficients through appropriate priors, logistic Bayesian LASSO weeds out the unassociated haplotypes, making it possible for the associated rare haplotypes to be detected with higher powers. We used the Genetic Analysis Workshop 18 simulated data to evaluate the behavior of logistic Bayesian LASSO in terms of its power and type I error under a complex disease model. We obtained knowledge of the simulation model, including the locations of the functional variants, and we chose to focus on two genomic regions in the MAP4 gene on chromosome 3. The sample size was 142 individuals and there were 200 replicates. Despite the small sample size, logistic Bayesian LASSO showed high power to detect two haplotypes containing functional variants in these regions while maintaining low type I errors. At the same time, a commonly used approach for haplotype association implemented in the software hapassoc failed to converge because of the presence of rare haplotypes. Thus, we conclude that logistic Bayesian LASSO can play an important role in the search for rare haplotypes.
Affiliation(s)
- Swati Biswas
- Department of Mathematical Sciences, FO 35, University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA
- Charalampos Papachristou
- Department of Mathematics, Physics, and Statistics, University of the Sciences in Philadelphia, 600 South 43rd Street, Philadelphia, PA 19104, USA
35
Xia S, Lin S. Detecting longitudinal effects of haplotypes and smoking on hypertension using B-splines and Bayesian LASSO. BMC Proc 2014; 8:S85. [PMID: 25519413 PMCID: PMC4143712 DOI: 10.1186/1753-6561-8-s1-s85] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 01/11/2023]
Abstract
The behavior of a gene can be dynamic; thus, if longitudinal data are available, it is important that we study the dynamic effects of genes on a trait over time. The effect of a haplotype can be expressed by time-varying coefficients. In this paper, we use the natural cubic B-spline to express these coefficients that capture the trends of the effects of haplotypes, some of which may be rare, over time; that is, at different ages. More specifically, to capture disease-associated common and rare haplotypes and environmental factors for data from unrelated individuals, we developed a method of time-varying coefficients that uses the logistic Bayesian LASSO methodology and B-spline by setting proper prior distributions. Haplotype and environmental effect coefficients are obtained by using Markov chain Monte Carlo methods. We applied the method to analyze the MAP4 gene on chromosome 3 and have identified several haplotypes that are associated with hypertension with varying effect sizes in the range of 55 to 85 years of age.
Affiliation(s)
- Shuang Xia
- Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210-1247, USA
- Shili Lin
- Department of Statistics, The Ohio State University, 1958 Neil Avenue, Columbus, OH 43210-1247, USA
36
Wang M, Lin S. FamLBL: detecting rare haplotype disease association based on common SNPs using case-parent triads. Bioinformatics 2014; 30:2611-8. [PMID: 24849576 DOI: 10.1093/bioinformatics/btu347] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/27/2023]
Abstract
MOTIVATION: In recent years, there has been an increasing interest in using common single-nucleotide polymorphisms (SNPs) amassed in genome-wide association studies to investigate rare haplotype effects on complex diseases. Evidence has suggested that rare haplotypes may tag rare causal single-nucleotide variants, making SNP-based rare haplotype analysis not only cost effective, but also more valuable for detecting causal variants. Although a number of methods for detecting rare haplotype association have been proposed in recent years, they are population based and thus susceptible to population stratification. RESULTS: We propose family-triad-based logistic Bayesian Lasso (famLBL) for estimating effects of haplotypes on complex diseases using SNP data. By choosing an appropriate prior distribution, effect sizes of unassociated haplotypes can be shrunk toward zero, allowing for more precise estimation of associated haplotypes, especially those that are rare, thereby achieving greater detection power. We evaluate famLBL using simulation to gauge its type I error and power. Comparison with its population counterpart, LBL, highlights famLBL's robustness property in the presence of population substructure. Further investigation by comparing famLBL with Family-Based Association Test (FBAT) reveals its advantage for detecting rare haplotype association. AVAILABILITY AND IMPLEMENTATION: famLBL is implemented as an R-package available at http://www.stat.osu.edu/~statgen/SOFTWARE/LBL/.
Affiliation(s)
- Meng Wang
- Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
- Shili Lin
- Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
37
Biswas S, Xia S, Lin S. Detecting rare haplotype-environment interaction with logistic Bayesian LASSO. Genet Epidemiol 2013; 38:31-41. [PMID: 24272913 DOI: 10.1002/gepi.21773] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 07/03/2013] [Revised: 09/13/2013] [Accepted: 10/15/2013] [Indexed: 11/09/2022]
Abstract
Two important contributors to missing heritability are believed to be rare variants and gene-environment interaction (GXE). Thus, detecting GXE where G is a rare haplotype variant (rHTV) is a pressing problem. Haplotype analysis is usually the natural second step to follow up on a genomic region that is implicated to be associated through single nucleotide variants (SNV) analysis. Further, rHTV can tag associated rare SNV and provide greater power to detect them than popular collapsing methods. Recently we proposed Logistic Bayesian LASSO (LBL) for detecting rHTV association with case-control data. LBL shrinks the unassociated (especially common) haplotypes toward zero so that an associated rHTV can be identified with greater power. Here, we incorporate environmental factors and their interactions with haplotypes in LBL. As LBL is based on retrospective likelihood, this extension is not trivial. We model the joint distribution of haplotypes and covariates given the case-control status. We apply the approach (LBL-GXE) to the Michigan, Mayo, AREDS, Pennsylvania Cohort Study on Age-related Macular Degeneration (AMD). LBL-GXE detects interaction of a specific rHTV in CFH gene with smoking. To the best of our knowledge, this is the first time in the AMD literature that an interaction of smoking with a specific (rather than pooled) rHTV has been implicated. We also carry out simulations and find that LBL-GXE has reasonably good powers for detecting interactions with rHTV while keeping the type I error rates well controlled. Thus, we conclude that LBL-GXE is a useful tool for uncovering missing heritability.
Affiliation(s)
- Swati Biswas
- Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, United States of America
38
Xu C, Ladouceur M, Dastani Z, Richards JB, Ciampi A, Greenwood CMT. Multiple regression methods show great potential for rare variant association tests. PLoS One 2012; 7:e41694. [PMID: 22916111 PMCID: PMC3420665 DOI: 10.1371/journal.pone.0041694] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 03/16/2012] [Accepted: 06/25/2012] [Indexed: 01/08/2023]
Abstract
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals: first, identifying which genes or genomic regions are associated, and second, discriminating associated variants from background noise within each region. Over the last few years, many new methods have been developed that associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data, namely ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO, are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.
Affiliation(s)
- ChangJiang Xu
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada