1
Chen W, Liu Y, Zhang S, Jiang Z, Wang T, Huang S, Zeng P. Transfer Learning Prediction of Early Exposures and Genetic Risk Score on Adult Obesity in Two Minority Cohorts. Prevention Science: The Official Journal of the Society for Prevention Research 2025; 26:234-245. [PMID: 39913075 DOI: 10.1007/s11121-025-01781-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/17/2025] [Indexed: 02/07/2025]
Abstract
Because genetic architecture differs across ancestries, a genetic risk score (GRS) constructed in the European population generally ports poorly to underrepresented non-European populations, even though substantial genetic similarity exists across diverse ancestral groups. We here explore how well early exposures and GRS predict body mass index (BMI) when genetic similarity knowledge acquired from Europeans is transferred to non-Europeans. We present a linear mixed prediction model for BMI in three distinct UK Biobank cohorts under the transfer learning framework, where we consider Asians (n = 7487) and Africans (n = 7533) as target samples and Europeans (n = 280,575) as informative auxiliary samples. Besides environmental and behavioral exposures, we incorporate multiple BMI-related variants, from which the GRS is constructed via transfer machine learning techniques informed by genetic similarity shared across target and auxiliary samples. Adding the GRS improved BMI prediction over the model with traditional risk factors alone in the Asian and African cohorts, with accuracy gains of approximately 3.6% and 0.7%, respectively. After borrowing genetic similarity from Europeans via transfer learning, the R² increased to 0.270 for Asians and 0.302 for Africans, improvements of 21.1% and 7.5%, respectively, over the early exposure-only models. We also provide evidence for the well-known conclusion that a GRS constructed in the European population performs poorly in non-Europeans. Prediction accuracy in racial minority or underrepresented populations is greatly improved by transfer learning that leverages genetic similarity shared with informative auxiliary populations.
Affiliation(s)
- Wenying Chen
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Yuxin Liu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Shuo Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Zhou Jiang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ting Wang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Shuiping Huang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Jiangsu Engineering Research Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Jiangsu Engineering Research Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
2
Sun X, Shin Y, Lafata JE, Raudenbush SW. Variability in Causal Effects and Noncompliance in a Multisite Trial: A Bivariate Hierarchical Generalized Random Coefficients Model for a Binary Outcome. Stat Med 2024; 43:5353-5365. [PMID: 39410741 PMCID: PMC11586915 DOI: 10.1002/sim.10229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Revised: 08/26/2024] [Accepted: 09/11/2024] [Indexed: 11/26/2024]
Abstract
Within each of 170 physicians, patients were randomized to access e-assist, an online program that aimed to increase colorectal cancer screening (CRCS), or control. Compliance was partial: 78.34% of the experimental patients accessed e-assist, while no controls were given access. Of interest are the average causal effect of assignment to treatment and the complier average causal effect, as well as the variation of these causal effects across physicians. Each physician generates probabilities of screening for experimental compliers (experimental patients who accessed e-assist), control compliers (controls who would have accessed e-assist had they been assigned to e-assist), and never takers (patients who would have avoided e-assist no matter what). Estimating physician-specific probabilities jointly over physicians poses novel challenges. We address these challenges by maximum likelihood, factoring a "complete-data likelihood" uniquely into the conditional distribution of screening and partially observed compliance given random effects and the distribution of random effects. We marginalize this likelihood using adaptive Gauss-Hermite quadrature. The approach is doubly iterative in that the conditional distribution defies analytic evaluation. Because the small sample size per physician constrains estimability of multiple random effects, we reduce their dimensionality using a shared random effects model having a factor analytic structure. We assess estimators and recommend sample sizes to produce reasonably accurate and precise estimates by simulation, and analyze data from a trial of a CRCS intervention.
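The marginalization step mentioned in the abstract can be illustrated with a much simpler model than the authors'. The sketch below is an assumption-laden toy: a single random intercept per cluster, a logistic link, and plain (non-adaptive) Gauss-Hermite quadrature. The function name `marginal_likelihood` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def marginal_likelihood(y, sigma, beta0, n_nodes=20):
    """Marginal likelihood of one cluster's binary outcomes y under a toy
    random-intercept logistic model with b ~ N(0, sigma^2).

    Uses plain Gauss-Hermite quadrature (not the adaptive variant).
    """
    # Nodes and weights for integrals of the form  int exp(-x^2) f(x) dx.
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables b = sqrt(2) * sigma * x maps the N(0, sigma^2)
    # density onto the exp(-x^2) weight function, up to a factor 1/sqrt(pi).
    b = np.sqrt(2.0) * sigma * nodes
    p = 1.0 / (1.0 + np.exp(-(beta0 + b)))  # P(y = 1 | b) at each node
    y = np.asarray(y)
    # Conditional likelihood of the whole cluster at each quadrature node.
    cond = np.prod(np.where(y[:, None] == 1, p[None, :], 1.0 - p[None, :]), axis=0)
    return float(np.sum(weights * cond) / np.sqrt(np.pi))
```

Adaptive quadrature additionally recenters and rescales the nodes around the mode of each cluster's integrand, which matters when clusters are small or the random-effect variance is large; that refinement is omitted here.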
Affiliation(s)
- Xinxin Sun
  - Virginia Commonwealth University, Richmond, Virginia, USA
  - U.S. Food & Drug Administration, Silver Spring, Maryland, USA
- Yongyun Shin
  - Virginia Commonwealth University, Richmond, Virginia, USA
3
Lages M. A hierarchical signal detection model with unequal variance for binary responses. Psychon Bull Rev 2024; 31:2534-2557. [PMID: 38806791 PMCID: PMC11680650 DOI: 10.3758/s13423-024-02504-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 03/17/2024] [Indexed: 05/30/2024]
Abstract
Gaussian signal detection models with equal variance are commonly used in simple yes-no detection and discrimination tasks whereas more flexible models with unequal variance require additional information. Here, a hierarchical Bayesian model with equal variance is extended to an unequal-variance model by exploiting variability of hit and false-alarm rates in a random sample of participants. This hierarchical model is investigated analytically, in simulations and in applications to existing data sets. The results suggest that signal variance and other parameters can be accurately estimated if plausible assumptions are met. It is concluded that the model provides a promising alternative to the ubiquitous equal-variance model for binary data.
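For orientation, the equal-variance baseline that the abstract starts from has a closed-form solution: one hit rate and one false-alarm rate yield exactly two parameters. The sketch below shows the textbook formulas only; the function name `equal_variance_sdt` is a hypothetical label, and the hierarchical unequal-variance extension described in the abstract is not reproduced.

```python
from statistics import NormalDist

def equal_variance_sdt(hit_rate, fa_rate):
    """Sensitivity d' and criterion c under the equal-variance Gaussian model.

    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion
```

An unequal-variance model adds a third parameter (the signal standard deviation), which a single pair of rates cannot identify. This is consistent with the abstract's remark that such models need additional information, here taken from variability across participants.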
Affiliation(s)
- Martin Lages
  - School of Psychology and Neuroscience, University of Glasgow, 62 Hillhead Street, Glasgow G12 8QQ, Glasgow, UK
4
Zhu Y, Chen W, Zhu K, Liu Y, Huang S, Zeng P. Polygenic prediction for underrepresented populations through transfer learning by utilizing genetic similarity shared with European populations. Brief Bioinform 2024; 26:bbaf048. [PMID: 39905953 PMCID: PMC11794457 DOI: 10.1093/bib/bbaf048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2024] [Revised: 01/10/2025] [Accepted: 01/21/2025] [Indexed: 02/06/2025] Open
Abstract
Because current genome-wide association studies are primarily conducted in individuals of European ancestry and information disparities exist among different populations, polygenic scores derived from Europeans exhibit poor transferability. Borrowing the idea of transfer learning, which enables the utilization of knowledge acquired from auxiliary samples to enhance learning capability in target samples, we propose transPGS, a novel polygenic score method for genetic prediction in underrepresented populations that leverages genetic similarity shared between the European and non-European populations while accounting for trans-ethnic differences in linkage disequilibrium (LD) and effect sizes. We demonstrate the usefulness and robustness of transPGS in improving prediction accuracy via individual-level and summary-level simulations and apply it to seven continuous phenotypes and three diseases in the African, Chinese, and East Asian populations of the UK Biobank and Genetic Epidemiology Research Study on Adult Health and Aging cohorts. We further reveal that distinct LD and minor allele frequency patterns across ancestral groups are responsible for the unsatisfactory portability of PGS.
Affiliation(s)
- Yiyang Zhu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Wenying Chen
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Kexuan Zhu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Yuxin Liu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Shuiping Huang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Jiangsu Engineering Research Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Jiangsu Engineering Research Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
5
Zhao J, Ma X, Shi L, Wang Z. Robust Bilinear Probabilistic PCA Using a Matrix Variate t Distribution. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:10683-10697. [PMID: 35533172 DOI: 10.1109/tnnls.2022.3170797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The bilinear probabilistic principal component analysis (BPPCA) was introduced recently as a model-based dimension reduction technique on matrix data. However, BPPCA is based on the Gaussian assumption and hence is vulnerable to potential outlying matrix-valued observations. In this article, we present a new robust extension of BPPCA, called BPPCA using a matrix variate t distribution (tBPPCA), that is built upon a matrix variate t distribution. Like the multivariate t, this distribution offers an additional robustness tuning parameter, which can downweight outliers. By introducing a Gamma distributed latent weight variable, this distribution can be represented hierarchically. With this representation, two efficient accelerated expectation-maximization (EM)-like algorithms for parameter estimation are developed. Experiments on a number of synthetic and real datasets are conducted to understand tBPPCA and compare it with several closely related competitors, including its vector-based counterpart. The results reveal that tBPPCA is generally more robust and accurate in the presence of outliers. Moreover, the expected latent weights under tBPPCA can be effectively used for outlier detection, which is much more reliable than under its vector-based counterpart due to its better robustness.
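The downweighting mechanism can be illustrated in the simpler vector case. For a p-dimensional multivariate t with ν degrees of freedom, the Gamma scale-mixture representation gives the expected latent weight (ν + p) / (ν + δ²), where δ² is the squared Mahalanobis distance. The sketch below covers only this vector case, not the matrix-variate formula of the paper, and the function name `t_latent_weight` is hypothetical.

```python
import numpy as np

def t_latent_weight(x, mu, cov, nu):
    """Expected latent weight E[tau | x] for a multivariate t observation.

    Vector-case illustration: weights near or above 1 indicate typical points,
    weights near 0 indicate points the model treats as outliers.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    diff = x - mu
    # Squared Mahalanobis distance, computed without forming cov^{-1}.
    delta2 = float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))
    return (nu + x.size) / (nu + delta2)
```

In an EM fit these weights multiply each observation's contribution to the mean and covariance updates, so a distant point has little influence on the estimates. Thresholding the weights is one simple way to flag candidate outliers.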
6
Berg P, Popescu G. Baldur: Bayesian Hierarchical Modeling for Label-Free Proteomics with Gamma Regressing Mean-Variance Trends. Mol Cell Proteomics 2023; 22:100658. [PMID: 37806340 PMCID: PMC10687340 DOI: 10.1016/j.mcpro.2023.100658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 09/20/2023] [Accepted: 10/04/2023] [Indexed: 10/10/2023] Open
Abstract
Label-free proteomics is a fast-growing methodology to infer abundances in mass spectrometry proteomics. Extensive research has focused on spectral quantification and peptide identification. However, research toward modeling and understanding quantitative proteomics data is scarce. Here we propose a Bayesian hierarchical decision model (Baldur) to test for differences in means between conditions for proteins, peptides, and post-translational modifications. We developed a Bayesian regression model to characterize local mean-variance trends in data, to estimate measurement uncertainty and hyperparameters for the decision model. A key contribution is the development of a new gamma regression model that describes the mean-variance dependency as a mixture of a common and a latent trend, allowing for localized trend estimates. We then evaluate the performance of Baldur, limma-trend, and t test on six benchmark datasets: five total proteomics and one post-translational modification dataset. We find that Baldur drastically improves the decision in noisier post-translational modification data over limma-trend and t test. In addition, we see significant improvements using Baldur over the other methods in the total proteomics datasets. Finally, we analyzed Baldur's performance when increasing the number of replicates and found that the method always increases precision with sample size, while showing robust control of the false positive rate. We conclude that our model vastly improves over popular data analysis methods (limma-trend and t test) in several spike-in datasets by achieving a high true positive detection rate, while greatly reducing the false-positive rate.
Affiliation(s)
- Philip Berg
  - Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, Mississippi State, Mississippi, USA; Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, Mississippi, USA
- George Popescu
  - Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, Mississippi State, Mississippi, USA; Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Mississippi State, Mississippi, USA
7
Lu H, Zhang S, Jiang Z, Zeng P. Leveraging trans-ethnic genetic risk scores to improve association power for complex traits in underrepresented populations. Brief Bioinform 2023:bbad232. [PMID: 37332016 DOI: 10.1093/bib/bbad232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 05/06/2023] [Accepted: 06/04/2023] [Indexed: 06/20/2023] Open
Abstract
Trans-ethnic genome-wide association studies have revealed that many loci identified in European populations can be reproducible in non-European populations, indicating widespread trans-ethnic genetic similarity. However, how to leverage such shared information more efficiently in association analysis is less investigated for traits in underrepresented populations. We here propose a statistical framework, trans-ethnic genetic risk score informed gene-based association mixed model (GAMM), by hierarchically modeling single-nucleotide polymorphism effects in the target population as a function of effects of the same trait in well-studied populations. GAMM powerfully integrates genetic similarity across distinct ancestral groups to enhance power in understudied populations, as confirmed by extensive simulations. We illustrate the usefulness of GAMM via the application to 13 blood cell traits (i.e. basophil count, eosinophil count, hematocrit, hemoglobin concentration, lymphocyte count, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, mean corpuscular volume, monocyte count, neutrophil count, platelet count, red blood cell count and total white blood cell count) in Africans of the UK Biobank (n = 3204) while utilizing genetic overlap shared in Europeans (n = 746 667) and East Asians (n = 162 255). We discovered multiple new associated genes, which had otherwise been missed by existing methods, and revealed that the trans-ethnic information indirectly contributed much to the phenotypic variance. Overall, GAMM represents a flexible and powerful statistical framework of association analysis for complex traits in underrepresented populations by integrating trans-ethnic genetic similarity across well-studied populations, and helps attenuate health inequities in current genetics research for people of minority populations.
Affiliation(s)
- Haojie Lu
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Shuo Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Zhou Jiang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, Jiangsu 221004, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, Jiangsu, 221004, China
8
Saâdaoui F. Randomized extrapolation for accelerating EM-type fixed-point algorithms. J Multivariate Anal 2023. [DOI: 10.1016/j.jmva.2023.105188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
9
An algorithm for searching optimal variance component estimators in linear mixed models. J Stat Plan Inference 2023. [DOI: 10.1016/j.jspi.2023.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2023]
10
Ketz AC, Storm DJ, Barker RE, Apa AD, Oliva‐Aviles C, Walsh DP. Assimilating ecological theory with empiricism: Using constrained generalized additive models to enhance survival analyses. Methods Ecol Evol 2023. [DOI: 10.1111/2041-210x.14057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2023]
Affiliation(s)
- Alison C. Ketz
  - Wisconsin Cooperative Research Unit, Department of Forest and Wildlife Ecology, University of Wisconsin, Madison, Wisconsin, USA
- Daniel J. Storm
  - Wisconsin Department of Natural Resources, Rhinelander, Wisconsin, USA
- Rachel E. Barker
  - Department of Forest and Wildlife Ecology, University of Wisconsin, Madison, Wisconsin, USA
- Daniel P. Walsh
  - U.S. Geological Survey, Montana Cooperative Wildlife Research Unit, Missoula, Montana, USA
11
Wanduku D. The multilevel hierarchical data EM-algorithm. Applications to discrete-time Markov chain epidemic models. Heliyon 2022; 8:e12622. [PMID: 36643325 PMCID: PMC9834773 DOI: 10.1016/j.heliyon.2022.e12622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2021] [Revised: 06/21/2022] [Accepted: 12/16/2022] [Indexed: 12/24/2022] Open
Abstract
The theory of the multilevel hierarchical data expectation-maximization (EM) algorithm is introduced via discrete-time Markov chain (DTMC) epidemic models. A general model for multilevel hierarchical discrete data is derived. The observed sample Y in the system is stochastic incomplete data, and the missing data Z exhibit a multilevel hierarchical data structure. The EM algorithm to find ML estimates for parameters in the stochastic system is derived. Applications of the EM algorithm are exhibited in two DTMC models to find ML estimates of the system parameters. Numerical results are given for influenza epidemics in the state of Georgia (GA), USA.
12
Donoghoe MW, Marschner IC. Parameter expansion for fitting regression models with non-negativity constraints. Commun Stat-Simul C 2022. [DOI: 10.1080/03610918.2022.2154791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Affiliation(s)
- Mark W. Donoghoe
  - Stats Central, Mark Wainwright Analytical Centre, UNSW, Sydney, Australia
- Ian C. Marschner
  - NHMRC Clinical Trials Centre, University of Sydney, Sydney, Australia
13
Ma X, Zhao J, Wang Y, Shang C, Jiang F. Robust factored principal component analysis for matrix-valued outlier accommodation and detection. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
14
Shao Z, Wang T, Qiao J, Zhang Y, Huang S, Zeng P. A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies. BMC Bioinformatics 2022; 23:359. [PMID: 36042399 PMCID: PMC9429742 DOI: 10.1186/s12859-022-04897-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/27/2022] [Accepted: 08/22/2022] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND: Multilocus analysis on a set of single nucleotide polymorphisms (SNPs) pre-assigned within a gene constitutes a valuable complement to single-marker analysis by aggregating data on complex traits in a biologically meaningful way. However, despite the existence of a wide variety of SNP-set methods, few comprehensive comparison studies have been previously performed to evaluate the effectiveness of these methods. RESULTS: We herein sought to fill this knowledge gap by conducting a comprehensive empirical comparison of 22 commonly used summary-statistics based SNP-set methods. We showed that only seven methods could effectively control the type I error, and that these well-calibrated approaches had varying power performance under the simulation scenarios. Overall, we confirmed that the burden test was generally underpowered and score-based variance component tests (e.g., sequence kernel association test) were much more powerful under the polygenic genetic architecture in both common and rare variant association analyses. We further revealed that two linkage-disequilibrium-free P value combination methods (e.g., harmonic mean P value method and aggregated Cauchy association test) behaved very well under the sparse genetic architecture in simulations and real-data applications to common and rare variant association analyses as well as in expression quantitative trait loci weighted integrative analysis. We also assessed the scalability of these approaches by recording computational time and found that all these methods scale to biobank-scale data, although some might be relatively slow. CONCLUSION: We hope that our findings can offer important guidance on how to choose appropriate multilocus association analysis methods in the post-GWAS era. All the SNP-set methods are implemented in the R package called MCA, which is freely available at https://github.com/biostatpzeng/.
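The two LD-free combination rules named in the abstract have simple closed forms, sketched below from their standard published definitions. This is not code from the MCA package; the function names `acat` and `harmonic_mean_p` and the default equal weights are assumptions made for illustration.

```python
import math

def acat(pvalues, weights=None):
    """Aggregated Cauchy association test: combine p-values via a weighted
    sum of Cauchy-transformed values, then map back to a p-value."""
    n = len(pvalues)
    w = weights if weights is not None else [1.0 / n] * n
    total = sum(w)
    # Each p-value becomes a standard Cauchy variate under the null.
    t = sum(wi * math.tan((0.5 - p) * math.pi) for wi, p in zip(w, pvalues)) / total
    return 0.5 - math.atan(t) / math.pi

def harmonic_mean_p(pvalues, weights=None):
    """Raw weighted harmonic mean of p-values (no calibration applied)."""
    n = len(pvalues)
    w = weights if weights is not None else [1.0 / n] * n
    return sum(w) / sum(wi / p for wi, p in zip(w, pvalues))
```

Both statistics are dominated by the smallest p-values, which fits the abstract's finding that they perform well under sparse architectures. The raw harmonic mean is only an approximate p-value; the published method applies a further calibration that is not shown here. For extremely small inputs the tangent transform can lose precision, so production implementations typically switch to an approximation in that regime.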
Affiliation(s)
- Zhonghe Shao
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ting Wang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Jiahao Qiao
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Yuchen Zhang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Shuiping Huang
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Ping Zeng
  - Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
  - Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
15
Boehm FJ, Zhou X. Statistical methods for Mendelian randomization in genome-wide association studies: A review. Comput Struct Biotechnol J 2022; 20:2338-2351. [PMID: 35615025 PMCID: PMC9123217 DOI: 10.1016/j.csbj.2022.05.015] [Citation(s) in RCA: 154] [Impact Index Per Article: 51.3] [Received: 03/15/2022] [Revised: 05/08/2022] [Accepted: 05/09/2022] [Indexed: 11/15/2022] Open
Abstract
Genome-wide association studies have yielded thousands of associations for many common diseases and disease-related complex traits. The identified associations made it possible to identify the causal risk factors underlying diseases and investigate the causal relationships among complex traits through Mendelian randomization. Mendelian randomization is a form of instrumental variable analysis that uses SNP associations from genome-wide association studies as instruments to study and uncover causal relationships between complex traits. By leveraging SNP genotypes as instrumental variables, or proxies, for the exposure complex trait, investigators can tease out causal effects from observational data, provided that necessary assumptions are satisfied. We discuss below the development of Mendelian randomization methods in parallel with the growth of genome-wide association studies. We argue that the recent availability of GWAS summary statistics for diverse complex traits has motivated new Mendelian randomization methods with relaxed causality assumptions and that this area continues to offer opportunities for robust biological discoveries.
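The instrumental-variable idea in this abstract can be made concrete with the fixed-effect inverse-variance weighted (IVW) estimator, a standard summary-statistics estimator that the abstract itself does not name. The sketch assumes independent instruments and valid instrumental-variable assumptions; the function name `ivw_estimate` and its argument names are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    beta_exposure: SNP-exposure effect estimates
    beta_outcome:  SNP-outcome effect estimates
    se_outcome:    standard errors of the SNP-outcome estimates
    Returns (estimate, standard_error).
    """
    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, (1.0 / den) ** 0.5
```

If every SNP-outcome effect equals a constant multiple of its SNP-exposure effect, the estimator recovers that constant exactly. Pleiotropy or correlated instruments break this, which is the kind of violation that methods with relaxed assumptions are designed to handle.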
Affiliation(s)
- Frederick J. Boehm
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
  - Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
16
Wang A, Liu W, Liu Z. A two-sample robust Bayesian Mendelian Randomization method accounting for linkage disequilibrium and idiosyncratic pleiotropy with applications to the COVID-19 outcomes. Genet Epidemiol 2022; 46:159-169. [PMID: 35192729 PMCID: PMC9648496 DOI: 10.1002/gepi.22445] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/21/2021] [Revised: 11/03/2021] [Accepted: 01/20/2022] [Indexed: 01/02/2023]
Abstract
Mendelian randomization (MR) is a statistical method exploiting genetic variants as instrumental variables to estimate the causal effect of modifiable risk factors on an outcome of interest. Despite the wide use of popular two-sample MR methods based on genome-wide association study summary-level data, those methods can suffer from power loss and/or biased inference when the chosen genetic variants are in linkage disequilibrium (LD) and also have relatively large direct effects on the outcome whose distribution may be heavy-tailed, a phenomenon commonly referred to as idiosyncratic pleiotropy. To resolve those two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized t-distribution to model such direct effects in a probabilistic model framework which can also incorporate the LD structure explicitly. The generalized t-distribution can be represented as a Gaussian scaled mixture so that our model parameters can be estimated by expectation-maximization (EM)-type algorithms. We compute the standard errors by calibrating the evidence lower bound using the likelihood ratio test. Through extensive simulation studies, we show that our RBMR has robust performance compared with other competing methods. We further apply our RBMR method to two benchmark data sets and find that RBMR has smaller bias and standard errors. Using our proposed RBMR method, we find that coronary artery disease is associated with increased risk of critically ill coronavirus disease 2019. We also develop a user-friendly R package RBMR (https://github.com/AnqiWang2021/RBMR) for public use.
Affiliation(s)
- Anqi Wang
  - Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
- Wei Liu
  - Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
- Zhonghua Liu
  - Department of Statistics and Actuarial Science, University of Hong Kong, Hong Kong SAR, China
17
Qi X, Zhou S, Plummer M. On Bayesian modeling of censored data in JAGS. BMC Bioinformatics 2022; 23:102. [PMID: 35321656 PMCID: PMC8944154 DOI: 10.1186/s12859-021-04496-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/15/2021] [Accepted: 11/19/2021] [Indexed: 11/20/2022] Open
Abstract
Background: Just Another Gibbs Sampler (JAGS) is a convenient tool to draw posterior samples using Markov Chain Monte Carlo for Bayesian modeling. However, the built-in function dinterval() for censored data misspecifies the default computation of the deviance function, which limits likelihood-based Bayesian model comparison. Results: To establish an automatic approach to specifying the correct deviance function in JAGS, we propose a simple and generic alternative modeling strategy for the analysis of censored outcomes. The two illustrative examples demonstrate that the alternative strategy not only properly draws posterior samples in JAGS, but also automatically delivers the correct deviance for model assessment. In the survival data application, our proposed method provides the correct value of mean deviance based on the exact likelihood function. In the drug safety data application, the deviance information criterion and penalized expected deviance for seven Bayesian models of censored data are simultaneously computed by our proposed approach and compared to examine the model performance. Conclusions: We propose an effective strategy to model censored data in the Bayesian modeling framework in JAGS with the correct deviance specification, which can simplify the calculation of popular Kullback–Leibler based measures for model selection. The proposed approach applies to a broad spectrum of censored data types, such as survival data, and facilitates different censored Bayesian model structures.
Affiliation(s)
- Xinyue Qi
- The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Shouhao Zhou
- Pennsylvania State University, Hershey, PA, USA
|
18
|
Wang T, Qiao J, Zhang S, Wei Y, Zeng P. Simultaneous test and estimation of total genetic effect in eQTL integrative analysis through mixed models. Brief Bioinform 2022; 23:6535679. [PMID: 35212359 DOI: 10.1093/bib/bbac038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/28/2021] [Revised: 01/22/2022] [Accepted: 02/07/2021] [Indexed: 11/14/2022]
Abstract
Integration of expression quantitative trait loci (eQTL) into genome-wide association studies (GWASs) is a promising way to reveal the functional roles of associated single-nucleotide polymorphisms (SNPs) in complex phenotypes and has become an active research field in the post-GWAS era. However, how to efficiently incorporate eQTL mapping studies into GWAS for prioritization of causal genes remains elusive. We herein proposed a novel method, termed Mixed transcriptome-wide association studies (TWAS) and mediated Variance estimation (MTV), that models the effects of cis-SNPs of a gene as a function of eQTL. MTV formulates the integrative method and TWAS within a unified framework via mixed models and therefore includes many prior methods/tests as special cases. We further justified MTV from two other statistical perspectives: mediation analysis and two-stage Mendelian randomization. Relative to existing methods, MTV offers several notable features: the handling of direct effects of cis-SNPs on phenotypes, a powerful likelihood ratio test for assessing the joint effects of cis-SNPs and genetically regulated gene expression (GReX), two useful quantities that measure the relative genetic contributions of GReX and cis-SNPs to phenotypic variance, and a computationally efficient parameter-expanded expectation-maximization algorithm. With extensive simulations, we found that MTV correctly controlled the type I error in the joint evaluation of the total genetic effect and was more powerful than existing methods in discovering true association signals across various scenarios. We finally applied MTV to 41 complex traits/diseases available from three GWASs and discovered many new associated genes that had otherwise been missed by existing methods. We also revealed that a small but substantial fraction of phenotypic variation was mediated by GReX. Overall, MTV constructs a robust and realistic modeling foundation for integrative omics analysis and has the advantage of offering more attractive biological interpretations of GWAS results.
Affiliation(s)
- Ting Wang
- Department of Biostatistics at Xuzhou Medical University, China
- Jiahao Qiao
- Department of Biostatistics at Xuzhou Medical University, China
- Shuo Zhang
- Department of Biostatistics at Xuzhou Medical University, China
- Yongyue Wei
- Department of Biostatistics at Nanjing Medical University, China
- Ping Zeng
- Department of Biostatistics, Center for Medical Statistics and Data Analysis and Key Laboratory of Human Genetics and Environmental Medicine at Xuzhou Medical University, China
|
19
|
Yang M. A Bayesian analysis of the incomplete block crossover design. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2021.1966463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Mingan Yang
- Division of Biostatistics and Epidemiology, School of Public Health, San Diego State University, San Diego, California, USA
|
20
|
Tan LSL. Efficient Data Augmentation Techniques for Some Classes of State Space Models. Stat Sci 2022. [DOI: 10.1214/22-sts867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Linda S. L. Tan
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Singapore
|
21
|
Mirniam AS, Nematollahi AR. On accelerating the EM-based algorithms for the VAR(1) models with multivariate generalized scaled t-distributed innovations. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1994608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- A. S. Mirniam
- Department of Statistics, Shiraz University, Shiraz, Iran
|
22
|
Yang Y, Yeung KF, Liu J. CoMM-S4: A Collaborative Mixed Model Using Summary-Level eQTL and GWAS Datasets in Transcriptome-Wide Association Studies. Front Genet 2021; 12:704538. [PMID: 34616426 PMCID: PMC8488198 DOI: 10.3389/fgene.2021.704538] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/03/2021] [Accepted: 09/03/2021] [Indexed: 11/13/2022]
Abstract
Motivation: Genome-wide association studies (GWAS) have achieved remarkable success in identifying SNP-trait associations in the last decade. However, it is challenging to identify the mechanisms that connect genetic variants with complex traits, as the majority of GWAS associations are in non-coding regions. Methods that integrate genomic and transcriptomic data allow us to investigate how genetic variants may affect a trait through their effect on gene expression. These include CoMM and CoMM-S2, likelihood-ratio-based methods that integrate GWAS and eQTL studies to assess expression-trait association. However, their reliance on individual-level eQTL data renders them inapplicable when only summary-level eQTL results, such as those from large-scale eQTL analyses, are available. Results: We develop an efficient probabilistic model, CoMM-S4, to explore the expression-trait association using summary-level eQTL and GWAS datasets. Compared with CoMM-S2, which uses individual-level eQTL data, CoMM-S4 requires only summary-level eQTL data. To test expression-trait association, an efficient variational Bayesian EM algorithm and a likelihood ratio test were constructed. We applied CoMM-S4 to both simulated and real data. The simulation results demonstrate that CoMM-S4 can perform as well as CoMM-S2 and S-PrediXcan, and analyses using GWAS summary statistics from Biobank Japan and eQTL summary statistics from eQTLGen and GTEx suggest novel susceptibility loci for cardiovascular diseases and osteoporosis. Availability and implementation: The developed R package is available at https://github.com/gordonliu810822/CoMM.
Affiliation(s)
- Yi Yang
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
- Kar-Fu Yeung
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
|
23
|
Donelli N, Peluso S, Mira A. A Bayesian semiparametric vector Multiplicative Error Model. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|
24
|
Estimating the Variance of Estimator of the Latent Factor Linear Mixed Model Using Supplemented Expectation-Maximization Algorithm. Symmetry (Basel) 2021. [DOI: 10.3390/sym13071286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
This paper deals with symmetrical data that can be modelled using the Gaussian distribution, such as linear mixed models for longitudinal data. The latent factor linear mixed model (LFLMM) is a method generally used for analysing changes in high-dimensional longitudinal data. The model estimates are usually based on the expectation-maximization (EM) algorithm, but unfortunately the algorithm does not produce the standard errors of the regression coefficients, which hampers testing procedures. To fill this gap, the Supplemented EM (SEM) algorithm for the case of fixed variables is proposed in this paper. The computational aspects of the SEM algorithm have been investigated by means of simulation. We also calculate the variance matrix of beta using the second moment as a benchmark to compare with the asymptotic variance matrix of beta from SEM. Both the second moment and SEM produce symmetrical results, and the variance estimates of beta become smaller as the number of subjects in the simulation increases. In addition, the practical usefulness of this work is illustrated using real data on political attitudes and behaviour in Flanders, Belgium.
|
25
|
Inference in skew generalized t-link models for clustered binary outcome via a parameter-expanded EM algorithm. PLoS One 2021; 16:e0249604. [PMID: 33822818 PMCID: PMC8028747 DOI: 10.1371/journal.pone.0249604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2020] [Accepted: 03/19/2021] [Indexed: 11/19/2022]
Abstract
The binary generalized linear mixed model (GLMM) is the most common method used by researchers to analyze clustered binary data in the biological and social sciences. The traditional approach to GLMMs causes substantial bias in estimates because of the fixed shape imposed by the logistic and normal distribution assumptions, thereby resulting in wrong and misleading decisions. This study brings forward an approach governed by skew generalized t distributions, which belong to a class of potentially skewed and heavy-tailed distributions. Interestingly, both the traditional logistic and probit mixed models, as well as other available methods, can be utilized within the skew generalized t-link model (SGTLM) framework. We have taken advantage of the Expectation-Maximization algorithm accelerated via parameter expansion for model fitting. We evaluated the performance of this approach to GLMMs through a simulation experiment by varying sample size and data distribution. Our findings indicated that the proposed methodology outperforms competing approaches in estimating population parameters and predicting random effects when the traditional link and normality assumptions are violated. In addition, empirical standard errors and information criteria proved useful for detecting spurious skewness and avoiding complex models for probit data. An application to respiratory infection data points to the superiority of the SGTLM, which turns out to be the most adequate model. In the future, studies should focus on integrating the demonstrated flexibility into other generalized linear mixed models to enhance robust modeling.
|
26
|
MacDonald IL. Is EM really necessary here? Examples where it seems simpler not to use EM. ASTA ADVANCES IN STATISTICAL ANALYSIS 2021. [DOI: 10.1007/s10182-021-00392-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
|
27
|
Linearly preconditioned nonlinear conjugate gradient acceleration of the PX-EM algorithm. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2020.107056] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
|
28
|
Jauch M, Hoff PD, Dunson DB. Monte Carlo Simulation on the Stiefel Manifold via Polar Expansion. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1859382] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Michael Jauch
- Center for Applied Mathematics, Cornell University, Ithaca, NY
- Peter D. Hoff
- Department of Statistical Science, Duke University, Durham, NC
- David B. Dunson
- Department of Statistical Science, Duke University, Durham, NC
|
29
|
Xie Y, Shan N, Zhao H, Hou L. Transcriptome wide association studies: general framework and methods. QUANTITATIVE BIOLOGY 2021. [DOI: 10.15302/j-qb-020-0228] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
|
30
|
Comparison of Recent Acceleration Techniques for the EM Algorithm in One- and Two-Parameter Logistic IRT Models. PSYCH 2020. [DOI: 10.3390/psych2040018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]
Abstract
The expectation–maximization (EM) algorithm is an important numerical method for maximum likelihood estimation in incomplete data problems. However, convergence of the EM algorithm can be slow, and for this reason, many EM acceleration techniques have been proposed. After a review of acceleration techniques in a unified notation with illustrations, three recently proposed EM acceleration techniques are compared in detail: quasi-Newton methods (QN), “squared” iterative methods (SQUAREM), and parabolic EM (PEM). These acceleration techniques are applied to marginal maximum likelihood estimation with the EM algorithm in one- and two-parameter logistic item response theory (IRT) models for binary data, and their performance is compared. QN and SQUAREM methods accelerate convergence of the EM algorithm for the two-parameter logistic model significantly in high-dimensional data problems. Compared to the standard EM, all three methods reduce the number of iterations, but increase the number of total marginal log-likelihood evaluations per iteration. Efficient approximations of the marginal log-likelihood are hence an important part of implementation.
|
31
|
Saul LK. An EM Algorithm for Capsule Regression. Neural Comput 2020; 33:194-226. [PMID: 33080167 DOI: 10.1162/neco_a_01336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
We investigate a latent variable model for multinomial classification inspired by recent capsule architectures for visual object recognition (Sabour, Frosst, & Hinton, 2017). Capsule architectures use vectors of hidden unit activities to encode the pose of visual objects in an image, and they use the lengths of these vectors to encode the probabilities that objects are present. Probabilities from different capsules can also be propagated through deep multilayer networks to model the part-whole relationships of more complex objects. Notwithstanding the promise of these networks, there still remains much to understand about capsules as primitive computing elements in their own right. In this letter, we study the problem of capsule regression, a higher-dimensional analog of logistic, probit, and softmax regression in which class probabilities are derived from vectors of competing magnitude. To start, we propose a simple capsule architecture for multinomial classification: the architecture has one capsule per class, and each capsule uses a weight matrix to compute the vector of hidden unit activities for patterns it seeks to recognize. Next, we show how to model these hidden unit activities as latent variables, and we use a squashing nonlinearity to convert their magnitudes as vectors into normalized probabilities for multinomial classification. When different capsules compete to recognize the same pattern, the squashing nonlinearity induces non-Gaussian terms in the posterior distribution over their latent variables. Nevertheless, we show that exact inference remains tractable and use an expectation-maximization procedure to derive least-squares updates for each capsule's weight matrix. We also present experimental results to demonstrate how these ideas work in practice.
Affiliation(s)
- Lawrence K Saul
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0404
|
32
|
Ming J, Wang T, Yang C. LPM: a latent probit model to characterize the relationship among complex traits using summary statistics from multiple GWASs and functional annotations. Bioinformatics 2020; 36:2506-2514. [PMID: 31860024 DOI: 10.1093/bioinformatics/btz947] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/21/2019] [Revised: 12/13/2019] [Accepted: 12/18/2019] [Indexed: 12/21/2022]
Abstract
MOTIVATION: Much effort has been made toward understanding the genetic architecture of complex traits and diseases. In the past decade, fruitful GWAS findings have highlighted the important role of regulatory variants and pervasive pleiotropy. Because of the accumulation of GWAS data on a wide range of phenotypes and high-quality functional annotations in different cell types, it is timely to develop a statistical framework to explore the genetic architecture of human complex traits by integrating rich data resources. RESULTS: In this study, we propose a unified statistical approach that aims to characterize the relationships among complex traits and to prioritize risk variants by leveraging regulatory information collected in functional annotations. Specifically, we consider a latent probit model (LPM) to integrate summary-level GWAS data and functional annotations. The developed computational framework not only makes LPM scalable to hundreds of annotations and phenotypes but also ensures its statistically guaranteed accuracy. Through comprehensive simulation studies, we evaluated LPM's performance and compared it with related methods. Then, we applied it to analyze 44 GWASs with 9 genic category annotations and 127 cell-type specific functional annotations. The results demonstrate the benefits of LPM and provide insights into the genetic architecture of complex traits. AVAILABILITY AND IMPLEMENTATION: The LPM package, all simulation codes and real datasets in this study are available at https://github.com/mingjingsi/LPM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jingsi Ming
- Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
- Can Yang
- Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong SAR, China
|
33
|
Yang Y, Shi X, Jiao Y, Huang J, Chen M, Zhou X, Sun L, Lin X, Yang C, Liu J. CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies. Bioinformatics 2020; 36:2009-2016. [PMID: 31755899 DOI: 10.1093/bioinformatics/btz880] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/16/2019] [Revised: 09/25/2019] [Accepted: 11/21/2019] [Indexed: 12/23/2022]
Abstract
MOTIVATION: Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) was proposed to jointly interrogate the genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required. RESULTS: In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data. AVAILABILITY AND IMPLEMENTATION: The implementation of CoMM-S2 is included in the CoMM package, which can be downloaded from https://github.com/gordonliu810822/CoMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yi Yang
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China; Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore
- Xingjie Shi
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore; Department of Statistics, Nanjing University of Finance and Economics, Nanjing 210046, China
- Yuling Jiao
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China
- Jian Huang
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242, USA
- Min Chen
- Academy of Mathematics and Systems Science, The Chinese Academy of Sciences, Beijing 100190, China
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Lei Sun
- Cardiovascular and Metabolic Disorders Program, Duke-NUS Medical School, 169857, Singapore
- Xinyi Lin
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore; Singapore Clinical Research Institute, 138669, Singapore; Singapore Institute for Clinical Sciences, A*STAR, 117609, Singapore
- Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong 999077, China
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services & Systems Research, Duke-NUS Medical School, 169857, Singapore
|
34
|
Taavoni M, Arashi M, Wang WL, Lin TI. Multivariate t semiparametric mixed-effects model for longitudinal data with multiple characteristics. J STAT COMPUT SIM 2020. [DOI: 10.1080/00949655.2020.1812608] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/10/2023]
Affiliation(s)
- M. Taavoni
- Department of Statistics, Faculty of Mathematical Sciences, Shahrood University of Technology, Shahrood, Iran
- M. Arashi
- Department of Statistics, Faculty of Mathematical Sciences, Shahrood University of Technology, Shahrood, Iran
- Wan-Lun Wang
- Department of Statistics, Graduate Institute of Statistics and Actuarial Science, Feng Chia University, Taichung, Taiwan
- Tsung-I Lin
- Institute of Statistics, National Chung Hsing University, Taichung, Taiwan
- Department of Public Health, China Medical University, Taichung, Taiwan
|
35
|
The statistical practice of the GTEx Project: from single to multiple tissues. QUANTITATIVE BIOLOGY 2020. [DOI: 10.1007/s40484-020-0210-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
36
|
Yang C, Wan X, Lin X, Chen M, Zhou X, Liu J. CoMM: a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information. Bioinformatics 2020; 35:1644-1652. [PMID: 30295737 DOI: 10.1093/bioinformatics/bty865] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 06/24/2018] [Revised: 09/15/2018] [Accepted: 10/05/2018] [Indexed: 12/12/2022]
Abstract
MOTIVATION: Genome-wide association studies (GWASs) have been successful in identifying many genetic variants associated with complex traits. However, the mechanistic links between these variants and complex traits remain elusive. A scientific hypothesis is that genetic variants influence complex traits at the organismal level by affecting cellular traits, such as regulating gene expression and altering protein abundance. Although earlier works have already presented some scientific insights about this hypothesis and their findings are very promising, statistical methods that effectively harness multilayered data (e.g. genetic variants, cellular traits and organismal traits) on a large scale for functional and mechanistic exploration are in high demand. RESULTS: In this study, we propose a collaborative mixed model (CoMM) to investigate the mechanistic role of associated variants in complex traits. The key idea is built upon the emerging scientific evidence that genetic effects at the cellular level are much stronger than those at the organismal level. Briefly, CoMM combines two models: the first model relates gene expression to genotype, and the second model relates phenotype to the gene expression predicted by the first model. The two models are fitted jointly in CoMM, so that the uncertainty in predicting gene expression is fully accounted for. To demonstrate the advantages of CoMM over existing methods, we conducted extensive simulation studies and also applied CoMM to analyze 25 traits in the NFBC1966 and Genetic Epidemiology Research on Aging (GERA) studies by integrating transcriptome information from the Genetic European Variation in Health and Disease (GEUVADIS) Project. The results indicate that by leveraging regulatory information, CoMM can effectively improve the power of prioritizing risk variants. Regarding computational efficiency, CoMM can complete the analysis of the NFBC1966 and GERA datasets in 2 and 18 min, respectively. AVAILABILITY AND IMPLEMENTATION: The developed R package is available at https://github.com/gordonliu810822/CoMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
- Xiang Wan
- Shenzhen Research Institute of Big Data, Shenzhen, China
- Xinyi Lin
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore
- Mengjie Chen
- Department of Medicine, University of Chicago, Chicago, IL, USA
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Jin Liu
- Centre for Quantitative Medicine, Program in Health Services and Systems Research, Duke-NUS Medical School, Singapore
|
37
|
Cheng Q, Yang Y, Shi X, Yeung KF, Yang C, Peng H, Liu J. MR-LDP: a two-sample Mendelian randomization for GWAS summary statistics accounting for linkage disequilibrium and horizontal pleiotropy. NAR Genom Bioinform 2020; 2:lqaa028. [PMID: 33575584 PMCID: PMC7671398 DOI: 10.1093/nargab/lqaa028] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 10/17/2019] [Revised: 02/27/2020] [Accepted: 04/14/2020] [Indexed: 12/12/2022]
Abstract
The proliferation of genome-wide association studies (GWAS) has prompted the use of two-sample Mendelian randomization (MR) with genetic variants as instrumental variables (IVs) for drawing reliable causal relationships between health risk factors and disease outcomes. However, the unique features of GWAS demand that MR methods account for both linkage disequilibrium (LD) and ubiquitously existing horizontal pleiotropy among complex traits, which is the phenomenon wherein a variant affects the outcome through mechanisms other than exclusively through the exposure. Therefore, statistical methods that fail to consider LD and horizontal pleiotropy can lead to biased estimates and false-positive causal relationships. To overcome these limitations, we proposed a probabilistic model for MR analysis (MR-LDP) that identifies the causal effects between risk factors and disease outcomes using GWAS summary statistics in the presence of LD while properly accounting for horizontal pleiotropy among genetic variants, and we developed a computationally efficient algorithm for causal inference. We then conducted comprehensive simulation studies to demonstrate the advantages of MR-LDP over existing methods. Moreover, we used two real exposure-outcome pairs to validate the results from MR-LDP against alternative methods, showing that our method is more efficient in using all instrumental variants in LD. By further applying MR-LDP to lipid traits and body mass index (BMI) as risk factors for complex diseases, we identified multiple pairs of significant causal relationships, including a protective effect of high-density lipoprotein cholesterol on peripheral vascular disease and a positive causal effect of BMI on hemorrhoids.
Affiliation(s)
- Qing Cheng
- Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Yi Yang
- Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Xingjie Shi
- Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore; Department of Statistics, Nanjing University of Finance and Economics, Nanjing, 210023, China
- Kar-Fu Yeung
- Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Can Yang
- Department of Mathematics, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Heng Peng
- Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong
- Jin Liu
- Centre for Quantitative Medicine, Health Services & Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
|
38
|
Cai M, Chen LS, Liu J, Yang C. IGREX for quantifying the impact of genetically regulated expression on phenotypes. NAR Genom Bioinform 2020; 2:lqaa010. [PMID: 32118202 PMCID: PMC7034630 DOI: 10.1093/nargab/lqaa010] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/14/2019] [Revised: 01/08/2020] [Accepted: 02/05/2020] [Indexed: 12/20/2022]
Abstract
By leveraging existing GWAS and eQTL resources, transcriptome-wide association studies (TWAS) have achieved many successes in identifying trait-associations of genetically regulated expression (GREX) levels. TWAS analysis relies on the shared GREX variation across GWAS and the reference eQTL data, which depends on the cellular conditions of the eQTL data. Considering the increasing availability of eQTL data from different conditions and the often unknown trait-relevant cell/tissue-types, we propose a method and tool, IGREX, for precisely quantifying the proportion of phenotypic variation attributed to the GREX component. IGREX takes as input a reference eQTL panel and individual-level or summary-level GWAS data. Using eQTL data of 48 tissue types from the GTEx project as a reference panel, we evaluated the tissue-specific IGREX impact on a wide spectrum of phenotypes. We observed strong GREX effects on immune-related protein biomarkers. By incorporating trans-eQTLs and analyzing genetically regulated alternative splicing events, we evaluated new potential directions for TWAS analysis.
Affiliation(s)
- Mingxuan Cai
- Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
| | - Lin S Chen
- Department of Public Health Sciences, The University of Chicago, IL 60637, USA
| | - Jin Liu
- Center for Quantitative Medicine, Duke-NUS Medical School, 169856, Singapore
| | - Can Yang
- Department of Mathematics, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
|
39
|
Tak H, You K, Ghosh SK, Su B, Kelly J. Data transforming augmentation for heteroscedastic models. J Comput Graph Stat 2020. [DOI: 10.1080/10618600.2019.1704295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/25/2022]
Affiliation(s)
- Hyungsuk Tak
- Departments of Statistics, Astronomy and Astrophysics, Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA
| | - Kisung You
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN
| | - Sujit K. Ghosh
- Department of Statistics, North Carolina State University, Raleigh, NC
| | - Bingyue Su
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN
|
40
|
Tian GL, Liu Y, Tang ML, Li T. A novel MM algorithm and the mode-sharing method in Bayesian computation for the analysis of general incomplete categorical data. Comput Stat Data Anal 2019. [DOI: 10.1016/j.csda.2019.04.012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
|
41
|
Yeung KF, Yang Y, Yang C, Liu J. CoMM: A Collaborative Mixed Model That Integrates GWAS and eQTL Data Sets to Investigate the Genetic Architecture of Complex Traits. Bioinform Biol Insights 2019; 13:1177932219881435. [PMID: 31662603 PMCID: PMC6792274 DOI: 10.1177/1177932219881435] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/13/2019] [Accepted: 09/18/2019] [Indexed: 12/22/2022] Open
Abstract
Genome-wide association study (GWAS) analyses have identified thousands of associations between genetic variants and complex traits. However, it is still a challenge to uncover the mechanisms underlying the association. With the growing availability of transcriptome data sets, it has become possible to perform statistical analyses targeted at identifying influential genes whose expression levels correlate with the phenotype. Methods such as PrediXcan and transcriptome-wide association study (TWAS) use the transcriptome data set to fit a predictive model for gene expression, with genetic variants as covariates. The gene expression levels for the GWAS data set are then 'imputed' using the prediction model, and the imputed expression levels are tested for their association with the phenotype. These methods fail to account for the uncertainty in the GWAS imputation step, and we propose a collaborative mixed model (CoMM) that addresses this limitation by jointly modelling the multiple analysis steps. We illustrate CoMM's ability to identify relevant genes in the Northern Finland Birth Cohort 1966 data set and extend the model to handle the more widely available GWAS summary statistics.
Affiliation(s)
- Kar-Fu Yeung
- Centre for Quantitative Medicine, Programme in Health Services and System Research, Duke-NUS Medical School, Singapore
| | - Yi Yang
- Centre for Quantitative Medicine, Programme in Health Services and System Research, Duke-NUS Medical School, Singapore
| | - Can Yang
- Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China
| | - Jin Liu
- Centre for Quantitative Medicine, Programme in Health Services and System Research, Duke-NUS Medical School, Singapore
|
42
|
Kim W, Kwak SH, Won S. Heritability estimation of dichotomous phenotypes using a liability threshold model on ascertained family-based samples. Genet Epidemiol 2019; 43:761-775. [PMID: 31298783 DOI: 10.1002/gepi.22244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2018] [Revised: 05/07/2019] [Accepted: 05/30/2019] [Indexed: 11/05/2022]
Abstract
Numerous methods for estimating heritability have been proposed; however, unlike quantitative phenotypes, heritability estimation for dichotomous phenotypes is computationally and statistically complex, and the use of heritability is infrequent. In this study, we developed a statistical method to estimate heritability of dichotomous phenotypes using a liability threshold model in the context of ascertained family-based samples. This model assumes that dichotomous phenotypes are determined by unobserved latent variables that are normally distributed and can be applied to general pedigree data. The proposed methods were applied to simulated data and Korean type-2 diabetes family-based samples, and the accuracy of the estimates provided by the experimental methods was compared with that of the established methods.
Affiliation(s)
- Wonji Kim
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
| | - Soo Heon Kwak
- Department of Internal Medicine, Seoul National University College of Medicine, Seoul, Korea
| | - Sungho Won
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea; Department of Public Health Sciences, Seoul National University, Seoul, Korea; Institute of Health and Environment, Seoul National University, Seoul, Korea
|
43
|
Gilmour AR. Average information residual maximum likelihood in practice. J Anim Breed Genet 2019; 136:262-272. [PMID: 31247685 DOI: 10.1111/jbg.12398] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/17/2018] [Revised: 04/01/2019] [Accepted: 04/02/2019] [Indexed: 11/29/2022]
Abstract
Gilmour, Thompson, and Cullis (Biometrics, 1995, 51, 1440) presented the average information residual maximum likelihood (REML) algorithm for efficient variance parameter estimation in the linear mixed model. That paper dealt specifically with traditional variance component models, but the algorithm was quickly applied to more general models and implemented in several REML packages including ASReml (Gilmour et al., Biometrics, 2015, 51, 1440). This paper outlines the theory with respect to these more general models and describes the main issues encountered in fitting these models and how they have been addressed in the ASReml software. The issues covered are the basic steps in the implementation of the algorithm, keeping parameters within the parameter space, maximizing sparsity, avoiding issues associated with unstructured variance matrices by using the factor-analytic structure, handling singularities in marker-based relationship matrices, and current work.
|
44
|
Henderson NC, Varadhan R. Damped Anderson Acceleration With Restarts and Monotonicity Control for Accelerating EM and EM-like Algorithms. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1594835] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 10/27/2022]
Affiliation(s)
| | - Ravi Varadhan
- Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD
|
45
|
|
46
|
Ning C, Wang D, Zhou L, Wei J, Liu Y, Kang H, Zhang S, Zhou X, Xu S, Liu JF. Efficient multivariate analysis algorithms for longitudinal genome-wide association studies. Bioinformatics 2019; 35:4879-4885. [DOI: 10.1093/bioinformatics/btz304] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/08/2018] [Revised: 04/16/2019] [Accepted: 04/25/2019] [Indexed: 11/14/2022] Open
Abstract
Motivation
Current dynamic phenotyping systems introduce time as an extra dimension to genome-wide association studies (GWAS), which helps to explore the mechanism of dynamic genetic control for complex longitudinal traits. However, existing methods for longitudinal GWAS either ignore the covariance among observations at different time points or encounter computational efficiency issues.
Results
We herein developed efficient genome-wide multivariate association algorithms for longitudinal data. In contrast to existing univariate linear mixed model analyses, the proposed method has improved statistical power for association detection and improved computational speed. In addition, the new method can analyze unbalanced longitudinal data with thousands of individuals and more than ten thousand records within a few hours. The corresponding time for balanced longitudinal data is just a few minutes.
Availability and implementation
A software package to implement the efficient algorithm named GMA (https://github.com/chaoning/GMA) is available freely for interested users in relevant fields.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chao Ning
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Dan Wang
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Lei Zhou
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Julong Wei
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Yuanxin Liu
- School of English, Beijing International Studies University, Beijing, China
| | - Huimin Kang
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Shengli Zhang
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Shizhong Xu
- Department of Botany and Plant Science, University of California, Riverside, CA, USA
| | - Jian-Feng Liu
- National Engineering Laboratory for Animal Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, College of Animal Science and Technology, China Agricultural University, Beijing, China
|
47
|
Gu C, Gutman R. Development of a common patient assessment scale across the continuum of care: A nested multiple imputation approach. Ann Appl Stat 2019. [DOI: 10.1214/18-aoas1202] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
48
|
Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Comput Stat Data Anal 2019. [DOI: 10.1016/j.csda.2018.08.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/24/2022]
|
49
|
Zhang H, Liu D, Zhao J, Bi X. Modeling Hybrid Traits for Comorbidity and Genetic Studies of Alcohol and Nicotine Co-Dependence. Ann Appl Stat 2018; 12:2359-2378. [PMID: 30666272 PMCID: PMC6338437 DOI: 10.1214/18-aoas1156] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/21/2023]
Abstract
We propose a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well-developed. Recent rank-based methods have been utilized for conducting association analyses of hybrid traits but do not inform the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well-developed nor extensively used in practice due to their reliance on complicated likelihood functions that have high computational complexity. Many existing parametric frameworks tend to instead use pseudo-likelihoods to reduce computational burdens. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood can control the type-I error rate, and gains power and improves the effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. After analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.
Affiliation(s)
- Heping Zhang
- Susan Dwight Bliss Professor, Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520
| | - Dungang Liu
- Assistant Professor, Department of Operations, Business Analytics and Information Systems, University of Cincinnati Lindner College of Business, Cincinnati, OH 45221
| | - Jiwei Zhao
- Assistant Professor, Department of Biostatistics, State University of New York at Buffalo, Buffalo, NY 14214
| | - Xuan Bi
- Postdoctoral Associate, Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520
|
50
|
Srivastava S, DePalma G, Liu C. An Asynchronous Distributed Expectation Maximization Algorithm for Massive Data: The DEM Algorithm. J Comput Graph Stat 2018. [DOI: 10.1080/10618600.2018.1497512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Affiliation(s)
- Sanvesh Srivastava
- Department of Statistics and Actuarial Science, The University of Iowa, Iowa City, IA
| | - Glen DePalma
- Department of Statistics, Purdue University, West Lafayette, IN
| | - Chuanhai Liu
- Department of Statistics, Purdue University, West Lafayette, IN
|