1
Dey D, Datta A, Banerjee S. Graphical Gaussian Process Models for Highly Multivariate Spatial Data. Biometrika 2022; 109:993-1014. [PMID: 36643962 PMCID: PMC9838617 DOI: 10.1093/biomet/asab061] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/19/2023]
Abstract
For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are Matérn GPs, and conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.
Affiliation(s)
- Debangan Dey
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Sudipto Banerjee
  - Department of Biostatistics, University of California Los Angeles
2
Bottolo L, Banterle M, Richardson S, Ala-Korpela M, Järvelin MR, Lewin A. A computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional quantitative trait loci discovery. J R Stat Soc Ser C Appl Stat 2021; 70:886-908. [PMID: 35001978 PMCID: PMC7612194 DOI: 10.1111/rssc.12490] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/02/2023]
Abstract
Our work is motivated by the search for metabolite quantitative trait loci (QTL) in a cohort of more than 5000 people. There are 158 metabolites measured by NMR spectroscopy in the 31-year follow-up of the Northern Finland Birth Cohort 1966 (NFBC66). These metabolites, as with many multivariate phenotypes produced by high-throughput biomarker technology, exhibit strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate QTL analysis generally ignore phenotypic correlations or make restrictive assumptions about the associations between phenotypes and genetic loci. We present a computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional data, with cell-sparse variable selection and sparse graphical structure for covariance selection. Cell sparsity allows different phenotype responses to be associated with different genetic predictors and the graphical structure is used to represent the conditional dependencies between phenotype variables. To achieve feasible computation of the large model space, we exploit a factorisation of the covariance matrix. Applying the model to the NFBC66 data with 9000 directly genotyped single nucleotide polymorphisms, we are able to simultaneously estimate genotype-phenotype associations and the residual dependence structure among the metabolites. The R package BayesSUR with full documentation is available at https://cran.r-project.org/web/packages/BayesSUR/.
Affiliation(s)
- Leonardo Bottolo
  - Department of Medical Genetics, University of Cambridge, Cambridge, UK
  - The Alan Turing Institute, London, UK
  - MRC Biostatistics Unit, Cambridge, UK
- Marco Banterle
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
- Sylvia Richardson
  - The Alan Turing Institute, London, UK
  - MRC Biostatistics Unit, Cambridge, UK
- Mika Ala-Korpela
  - Computational Medicine, Faculty of Medicine, University of Oulu and Biocenter Oulu, Oulu, Finland
  - NMR Metabolomics Laboratory, School of Pharmacy, University of Eastern Finland, Kuopio, Finland
- Marjo-Riitta Järvelin
  - Center for Life Course Health Research, University of Oulu, Oulu, Finland
  - Biocenter Oulu, University of Oulu, Oulu, Finland
  - Department of Epidemiology and Biostatistics, Imperial College London, London, UK
  - MRC-PHE Centre for Environment and Health, Imperial College London, London, UK
  - Department of Life Sciences, Brunel University London, Uxbridge, UK
- Alex Lewin
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
3
Affiliation(s)
- Xuan Cao
  - Department of Mathematical Sciences, University of Cincinnati
5
Ni Y, Müller P, Ji Y. Bayesian Double Feature Allocation for Phenotyping with Electronic Health Records. J Am Stat Assoc 2019; 115:1620-1634. [PMID: 38111606 PMCID: PMC10727496 DOI: 10.1080/01621459.2019.1686985] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/04/2018] [Revised: 10/04/2019] [Accepted: 10/17/2019] [Indexed: 10/25/2022]
Abstract
Electronic health records (EHR) provide opportunities for deeper understanding of human phenotypes - in our case, latent disease - based on statistical modeling. We propose a categorical matrix factorization method to infer latent diseases from EHR data. A latent disease is defined as an unknown biological aberration that causes a set of common symptoms for a group of patients. The proposed approach is based on a novel double feature allocation model which simultaneously allocates features to the rows and the columns of a categorical matrix. Using a Bayesian approach, available prior information on known diseases (e.g., hypertension and diabetes) greatly improves identifiability and interpretability of the latent diseases. We assess the proposed approach by simulation studies including mis-specified models and comparison with sparse latent factor models. In the application to a Chinese EHR data set, we identify 10 latent diseases, each of which is shared by groups of subjects with specific health traits related to lipid disorder, thrombocytopenia, polycythemia, anemia, bacterial and viral infections, allergy, and malnutrition. The identification of the latent diseases can help healthcare officials better monitor the subjects' ongoing health conditions and look into potential risk factors and approaches for disease prevention. We cross-check the reported latent diseases with medical literature and find agreement between our discovery and reported findings elsewhere. We provide an R package "dfa" implementing our method and an R shiny web application reporting the findings.
Affiliation(s)
- Yang Ni
  - Department of Statistics, Texas A&M University
  - Department of Statistics and Data Sciences, The University of Texas at Austin
- Peter Müller
  - Department of Mathematics, The University of Texas at Austin
- Yuan Ji
  - Department of Public Health Sciences, The University of Chicago
6
Kundu S, Mallick BK, Baladandayuthapan V. Efficient Bayesian Regularization for Graphical Model Selection. Bayesian Analysis 2019; 14:449-476. [PMID: 33123305 PMCID: PMC7592715 DOI: 10.1214/17-ba1086] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 06/11/2023]
Abstract
There has been an intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse-Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.
Affiliation(s)
- Suprateek Kundu
  - Department of Biostatistics & Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, Georgia 30322, U.S.A.
- Bani K Mallick
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, Texas 77843-3143, U.S.A.
- Veera Baladandayuthapan
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030, U.S.A.
7
Olsson J, Pavlenko T, Rios FL. Bayesian learning of weakly structural Markov graph laws using sequential Monte Carlo methods. Electron J Stat 2019. [DOI: 10.1214/19-ejs1585] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
8
Abstract
BACKGROUND: Computational network biology is an emerging interdisciplinary research area. Among many other network approaches, probabilistic graphical models provide a comprehensive probabilistic characterization of interaction patterns between molecules and the associated uncertainties. RESULTS: In this article, we first review graphical models, including directed, undirected, and reciprocal graphs (RGs), with an emphasis on the RG models that are curiously under-utilized in the biostatistics and bioinformatics literature. RGs strictly contain chain graphs as a special case and are suitable for modeling reciprocal causality, such as feedback mechanisms in molecular networks. We then extend the RG approach to modeling molecular networks by integrating DNA-, RNA- and protein-level data. We apply the extended RG method to The Cancer Genome Atlas multi-platform ovarian cancer data and reveal several interesting findings. CONCLUSIONS: This study aims to review the basics of different probabilistic graphical models as well as recent developments in RG approaches for network modeling. The extension presented in this paper provides a principled and efficient way of integrating DNA copy number, DNA methylation, mRNA gene expression and protein expression.
Affiliation(s)
- Yang Ni
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, USA
- Peter Müller
  - Department of Mathematics, The University of Texas at Austin, Austin, TX 78712, USA
- Lin Wei
  - NorthShore University HealthSystem, Evanston, IL 60201, USA
- Yuan Ji
  - NorthShore University HealthSystem, Evanston, IL 60201, USA
  - Department of Public Health Sciences, The University of Chicago, Chicago, IL 60637, USA
9
Green PJ, Thomas A. A structural Markov property for decomposable graph laws that allows control of clique intersections. Biometrika 2017. [DOI: 10.1093/biomet/asx072] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Affiliation(s)
- Peter J Green
  - School of Mathematical and Physical Sciences, University of Technology Sydney, Broadway, Sydney, New South Wales 2007, Australia
- Alun Thomas
  - Division of Genetic Epidemiology, Department of Internal Medicine, University of Utah, 391 Chipeta Way, Suite D, Salt Lake City, Utah 84108, U.S.A.
10
Ni Y, Müller P, Zhu Y, Ji Y. Heterogeneous reciprocal graphical models. Biometrics 2017; 74:606-615. [PMID: 29023632 DOI: 10.1111/biom.12791] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 12/01/2016] [Revised: 07/01/2017] [Accepted: 09/01/2017] [Indexed: 12/27/2022]
Abstract
We develop novel hierarchical reciprocal graphical models to infer gene networks from heterogeneous data. In the case of data that can be naturally divided into known groups, we propose to connect graphs by introducing a hierarchical prior across group-specific graphs, including a correlation on edge strengths across graphs. Thresholding priors are applied to induce sparsity of the estimated networks. In the case of unknown groups, we cluster subjects into subpopulations and jointly estimate cluster-specific gene networks, again using similar hierarchical priors across clusters. We illustrate the proposed approach by simulation studies and three applications with multiplatform genomic data for multiple cancers.
Affiliation(s)
- Yang Ni
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Texas, U.S.A.
- Peter Müller
  - Department of Mathematics, The University of Texas at Austin, Texas, U.S.A.
- Yitan Zhu
  - Program for Computational Genomics and Medicine, NorthShore University HealthSystem, Illinois, U.S.A.
- Yuan Ji
  - Program for Computational Genomics and Medicine, NorthShore University HealthSystem, Illinois, U.S.A.
  - Department of Public Health Sciences, The University of Chicago, Illinois, U.S.A.
11
Jones E, Didelez V. Thinning a Triangulation of a Bayesian Network or Undirected Graph to Create a Minimal Triangulation. Int J Uncertain Fuzz 2017. [DOI: 10.1142/s0218488517500143] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
In one procedure for finding the maximal prime decomposition of a Bayesian network or undirected graphical model, the first step is to create a minimal triangulation of the network, and a common and straightforward way to do this is to create a triangulation that is not necessarily minimal and then thin this triangulation by removing excess edges. We show that the algorithm for thinning proposed in several previous publications is incorrect. A different version of this algorithm is available in the R package gRbase, but its correctness has not previously been proved. We prove that this version is correct and provide a simpler version, also with a proof. We compare the speed of the two corrected algorithms in three ways and find that asymptotically their speeds are the same, neither algorithm is consistently faster than the other, and in a computer experiment the algorithm used by gRbase is faster when the original graph is large, dense, and undirected, but usually slightly slower when it is directed.
Affiliation(s)
- Edmund Jones
  - School of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, UK
  - Department of Public Health & Primary Care, University of Cambridge, Worts’ Causeway, Cambridge, CB1 8RN, UK
- Vanessa Didelez
  - Leibniz Institute for Prevention Research and Epidemiology – BIPS, Achterstr. 30, D-28359 Bremen, Germany
12
Consonni G, La Rocca L, Peluso S. Objective Bayes Covariate-Adjusted Sparse Graphical Model Selection. Scand Stat Theory Appl 2017. [DOI: 10.1111/sjos.12273] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 11/30/2022]
Affiliation(s)
- Guido Consonni
  - Department of Statistical Sciences, Università Cattolica del Sacro Cuore
- Luca La Rocca
  - Department of Physics, Informatics and Mathematics, Università di Modena e Reggio Emilia
- Stefano Peluso
  - Department of Statistical Sciences, Università Cattolica del Sacro Cuore
13
Jones E, Didelez V. Inequalities on partial correlations in Gaussian graphical models containing star shapes. Commun Stat-Theor M 2016. [DOI: 10.1080/03610926.2014.953696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]