1
Gilbert B, Datta A. Visibility graph-based covariance functions for scalable spatial analysis in non-convex partially Euclidean domains. Biometrics 2024; 80:ujae089. [PMID: 39248123 DOI: 10.1093/biomtc/ujae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Revised: 05/07/2024] [Accepted: 08/14/2024] [Indexed: 09/10/2024]
Abstract
We present a new method for constructing valid covariance functions of Gaussian processes for spatial analysis in irregular, non-convex domains such as bodies of water. Standard covariance functions based on geodesic distances are not guaranteed to be positive definite on such domains, while existing non-Euclidean approaches fail to respect the partially Euclidean nature of these domains where the geodesic distance agrees with the Euclidean distances for some pairs of points. Using a visibility graph on the domain, we propose a class of covariance functions that preserve Euclidean-based covariances between points that are connected in the domain while incorporating the non-convex geometry of the domain via conditional independence relationships. We show that the proposed method preserves the partially Euclidean nature of the intrinsic geometry on the domain while maintaining validity (positive definiteness) and marginal stationarity of the covariance function over the entire parameter space, properties which are not always fulfilled by existing approaches to construct covariance functions on non-convex domains. We provide useful approximations to improve computational efficiency, resulting in a scalable algorithm. We compare the performance of our method with those of competing state-of-the-art methods using simulation studies on synthetic non-convex domains. The method is applied to data regarding acidity levels in the Chesapeake Bay, showing its potential for ecological monitoring in real-world spatial applications on irregular domains.
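The claim that geodesic-distance covariances can fail to be positive definite can be checked on a toy case. The sketch below is an idealized illustration, not the authors' construction: it assumes four stations equally spaced around a thin ring-shaped channel (for example, water surrounding an island) with unit arc length between neighbors, a Gaussian (squared-exponential) covariance, and a hypothetical length scale of 2. It evaluates the quadratic form v'Kv for one vector v under geodesic and under Euclidean distances.

```python
import math

def gaussian_cov(d, length_scale):
    """Gaussian (squared-exponential) covariance as a function of distance."""
    return math.exp(-(d / length_scale) ** 2)

def quad_form(dist, v, length_scale):
    """Compute v' K v, where K[i][j] = gaussian_cov(dist[i][j])."""
    n = len(v)
    return sum(
        v[i] * v[j] * gaussian_cov(dist[i][j], length_scale)
        for i in range(n)
        for j in range(n)
    )

# Assumed toy domain: a thin ring of circumference 4, stations a quarter arc apart.
# Geodesic (along-the-ring) distances: neighbors 1, opposite stations 2.
geodesic = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

# Euclidean (straight-line) distances between the same four stations.
radius = 2 / math.pi                 # circumference 4 = 2 * pi * radius
adjacent = radius * math.sqrt(2)     # chord between neighbors
opposite = 2 * radius                # chord between opposite stations
euclidean = [
    [0, adjacent, opposite, adjacent],
    [adjacent, 0, adjacent, opposite],
    [opposite, adjacent, 0, adjacent],
    [adjacent, opposite, adjacent, 0],
]

v = [1, -1, 1, -1]
length_scale = 2.0  # hypothetical value chosen for illustration

q_geodesic = quad_form(geodesic, v, length_scale)
q_euclidean = quad_form(euclidean, v, length_scale)

# A valid covariance matrix must give a nonnegative quadratic form for every v.
print("geodesic quadratic form:", q_geodesic)
print("euclidean quadratic form:", q_euclidean)
```

In this toy setting the geodesic version yields a negative quadratic form, so the matrix is not a valid covariance, while the Euclidean version stays positive but ignores the barrier in the middle of the ring. That tension is the one the abstract describes.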
Affiliation(s)
- Brian Gilbert
  - Department of Population Health, NYU Grossman School of Medicine, New York, New York, 10016, United States
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, 21205, United States
2
Chen C, Kim HJ, Yang P. Evaluating spatially variable gene detection methods for spatial transcriptomics data. Genome Biol 2024; 25:18. [PMID: 38225676 PMCID: PMC10789051 DOI: 10.1186/s13059-023-03145-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 12/14/2023] [Indexed: 01/17/2024]
Abstract
BACKGROUND: The identification of genes that vary across spatial domains in tissues and cells is an essential step for spatial transcriptomics data analysis. Given the critical role it serves for downstream data interpretations, various methods for detecting spatially variable genes (SVGs) have been proposed. However, the lack of benchmarking complicates the selection of a suitable method.
RESULTS: Here we systematically evaluate a panel of popular SVG detection methods on a large collection of spatial transcriptomics datasets, covering various tissue types, biotechnologies, and spatial resolutions. We address questions including whether different methods select a similar set of SVGs, how reliable is the reported statistical significance from each method, how accurate and robust is each method in terms of SVG detection, and how well the selected SVGs perform in downstream applications such as clustering of spatial domains. Besides these, practical considerations such as computational time and memory usage are also crucial for deciding which method to use.
CONCLUSIONS: Our study evaluates the performance of each method from multiple aspects and highlights the discrepancy among different methods when calling statistically significant SVGs across diverse datasets. Overall, our work provides useful considerations for choosing methods for identifying SVGs and serves as a key reference for the future development of related methods.
Affiliation(s)
- Carissa Chen
  - Computational Systems Biology Group, Faculty of Medicine and Health, Children's Medical Research Institute, The University of Sydney, Westmead, NSW, 2145, Australia
- Hani Jieun Kim
  - Computational Systems Biology Group, Faculty of Medicine and Health, Children's Medical Research Institute, The University of Sydney, Westmead, NSW, 2145, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Faculty of Medicine and Health, Children's Medical Research Institute, The University of Sydney, Westmead, NSW, 2145, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
3
Dey D, Datta A, Banerjee S. Modeling Multivariate Spatial Dependencies Using Graphical Models. The New England Journal of Statistics in Data Science 2023; 1:283-295. [PMID: 37817840 PMCID: PMC10563032 DOI: 10.51387/23-nejsds47] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2023]
Abstract
Graphical models have witnessed significant growth and usage in spatial data science for modeling data referenced over a massive number of spatial-temporal coordinates. Much of this literature has focused on a single or relatively few spatially dependent outcomes. Recent attention has focused on modeling and inference for a substantially large number of outcomes. While spatial factor models and multivariate basis expansions occupy a prominent place in this domain, this article elucidates a recent approach, graphical Gaussian Processes, that exploits the notion of conditional independence among a very large number of spatial processes to build scalable graphical models for fully model-based Bayesian analysis of multivariate spatial data.
Affiliation(s)
- Debangan Dey
  - Department of Biostatistics, Johns Hopkins University, USA
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, USA
- Sudipto Banerjee
  - Department of Biostatistics, University of California Los Angeles, USA
4
Weber LM, Saha A, Datta A, Hansen KD, Hicks SC. nnSVG for the scalable identification of spatially variable genes using nearest-neighbor Gaussian processes. Nat Commun 2023; 14:4059. [PMID: 37429865 PMCID: PMC10333391 DOI: 10.1038/s41467-023-39748-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 06/15/2022] [Accepted: 06/23/2023] [Indexed: 07/12/2023]
Abstract
Feature selection to identify spatially variable genes or other biologically informative genes is a key step during analyses of spatially-resolved transcriptomics data. Here, we propose nnSVG, a scalable approach to identify spatially variable genes based on nearest-neighbor Gaussian processes. Our method (i) identifies genes that vary in expression continuously across the entire tissue or within a priori defined spatial domains, (ii) uses gene-specific estimates of length scale parameters within the Gaussian process models, and (iii) scales linearly with the number of spatial locations. We demonstrate the performance of our method using experimental data from several technological platforms and simulations. A software implementation is available at https://bioconductor.org/packages/nnSVG.
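The linear scaling of nearest-neighbor Gaussian processes comes from replacing one dense n-by-n likelihood with a product of small conditional densities. The sketch below is a deliberately reduced illustration and not the nnSVG implementation: it assumes a zero-mean process on sorted 1-D locations, an exponential covariance, and a single neighbor (the previous location). All function names and the toy data are hypothetical. A dense Cholesky-based likelihood is included only for comparison.

```python
import math

def exp_cov(d, sigma2=1.0, phi=1.0):
    """Exponential covariance: sigma2 * exp(-|d| / phi)."""
    return sigma2 * math.exp(-abs(d) / phi)

def nngp_loglik_1d(xs, ys, sigma2=1.0, phi=1.0):
    """Log-likelihood as a product of conditionals, each location
    conditioning only on the previous one (xs must be sorted and distinct).
    Cost grows linearly with the number of locations."""
    ll = 0.0
    for i in range(len(xs)):
        if i == 0:
            mean, var = 0.0, sigma2
        else:
            w = exp_cov(xs[i] - xs[i - 1], sigma2, phi) / sigma2
            mean = w * ys[i - 1]
            var = sigma2 * (1.0 - w * w)
        ll += -0.5 * (math.log(2 * math.pi * var) + (ys[i] - mean) ** 2 / var)
    return ll

def dense_loglik(xs, ys, sigma2=1.0, phi=1.0):
    """Full multivariate normal log-likelihood via a dense Cholesky factor.
    Cost grows cubically with the number of locations."""
    n = len(xs)
    K = [[exp_cov(xs[i] - xs[j], sigma2, phi) for j in range(n)] for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: solve L z = y.
    z = []
    for i in range(n):
        z.append((ys[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + logdet + sum(v * v for v in z))

# Made-up toy data for illustration only.
xs = [0.0, 0.5, 1.2, 2.0, 3.1]
ys = [0.3, -0.1, 0.8, 0.4, -0.5]

ll_nn = nngp_loglik_1d(xs, ys, sigma2=1.5, phi=0.8)
ll_full = dense_loglik(xs, ys, sigma2=1.5, phi=0.8)
print(ll_nn, ll_full)
```

In this special case (exponential covariance in one dimension, which is Markov) the one-neighbor factorization reproduces the dense likelihood. In two or more dimensions, or with other covariance families, the neighbor-based likelihood is an approximation whose quality depends on the number of neighbors.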
Affiliation(s)
- Lukas M Weber
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Arkajyoti Saha
  - Department of Statistics, University of Washington, Seattle, WA, USA
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Kasper D Hansen
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
5
Dey D, Datta A, Banerjee S. Graphical Gaussian Process Models for Highly Multivariate Spatial Data. Biometrika 2022; 109:993-1014. [PMID: 36643962 PMCID: PMC9838617 DOI: 10.1093/biomet/asab061] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/19/2023]
Abstract
For multivariate spatial Gaussian process (GP) models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence among the variables. This is undesirable, especially for highly multivariate settings, where popular cross-covariance functions such as the multivariate Matérn suffer from a "curse of dimensionality" as the number of parameters and floating point operations scale up in quadratic and cubic order, respectively, in the number of variables. We propose a class of multivariate "Graphical Gaussian Processes" using a general construction called "stitching" that crafts cross-covariance functions from graphs and ensures process-level conditional independence among variables. For the Matérn family of functions, stitching yields a multivariate GP whose univariate components are Matérn GPs, and conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn GP to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.
Affiliation(s)
- Debangan Dey
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
- Sudipto Banerjee
  - Department of Biostatistics, University of California Los Angeles
6
Saha A, Datta A, Banerjee S. Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes. Journal of Data Science: JDS 2022; 20:533-544. [PMID: 37786782 PMCID: PMC10544813 DOI: 10.6339/22-jds1073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2023]
Abstract
Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
Affiliation(s)
- Arkajyoti Saha
  - Department of Statistics, University of Washington, Seattle, WA, USA
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Sudipto Banerjee
  - UCLA Department of Biostatistics, 650 Charles E. Young Drive South, University of California Los Angeles, CA 90095-1772, USA
7
Affiliation(s)
- Arkajyoti Saha
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Sumanta Basu
  - Department of Statistics and Data Science, Cornell University, Ithaca, NY
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
8
Datta A, Saha A, Zamora ML, Buehler C, Hao L, Xiong F, Gentner DR, Koehler K. Statistical field calibration of a low-cost PM2.5 monitoring network in Baltimore. Atmospheric Environment (Oxford, England: 1994) 2020; 242:117761. [PMID: 32922146 PMCID: PMC7480820 DOI: 10.1016/j.atmosenv.2020.117761] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 05/04/2023]
Abstract
Low-cost air pollution monitors are increasingly being deployed to enrich knowledge about ambient air-pollution at high spatial and temporal resolutions. However, unlike regulatory-grade (FEM or FRM) instruments, universal quality standards for low-cost sensors are yet to be established and their data quality varies widely. This mandates thorough evaluation and calibration before any responsible use of such data. This study presents evaluation and field-calibration of the PM2.5 data from a network of low-cost monitors currently operating in Baltimore, MD, which has only one regulatory PM2.5 monitoring site within city limits. Co-location analysis at this regulatory site in Oldtown, Baltimore revealed high variability and significant overestimation of PM2.5 levels by the raw data from these monitors. Universal laboratory corrections reduced the bias in the data, but only partially mitigated the high variability. Eight months of field co-location data at Oldtown were used to develop a gain-offset calibration model, recast as a multiple linear regression. The statistical model offered substantial improvement in prediction quality over the raw or lab-corrected data. The results were robust to the choice of the low-cost monitor used for field-calibration, as well as to different seasonal choices of training period. The raw, lab-corrected and statistically-calibrated data were evaluated for a period of two months following the training period. The statistical model had the highest agreement with the reference data, producing a 24-hour average root-mean-square-error (RMSE) of around 2 μg m⁻³. To assess transferability of the calibration equations to other monitors in the network, a cross-site evaluation was conducted at a second co-location site in suburban Essex, MD. The statistically calibrated data once again produced the lowest RMSE. The calibrated PM2.5 readings from the monitors in the low-cost network provided insights into the intra-urban spatiotemporal variations of PM2.5 in Baltimore.
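The calibrate-then-evaluate workflow described in the abstract can be sketched in its simplest form. The example below is a simplified single-predictor illustration, not the study's multiple regression model: it assumes calibrated = offset + gain × raw, fits the two coefficients by ordinary least squares on a training period, and reports RMSE on a later held-out period. All readings and function names are invented for illustration.

```python
import math

def fit_gain_offset(raw, reference):
    """Ordinary least squares fit of reference ~ offset + gain * raw."""
    n = len(raw)
    mean_raw = sum(raw) / n
    mean_ref = sum(reference) / n
    sxx = sum((x - mean_raw) ** 2 for x in raw)
    sxy = sum((x - mean_raw) * (y - mean_ref) for x, y in zip(raw, reference))
    gain = sxy / sxx
    offset = mean_ref - gain * mean_raw
    return gain, offset

def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)

# Made-up co-location data (low-cost monitor vs. reference, same units).
train_raw = [12.0, 20.0, 31.0, 44.0, 58.0]
train_ref = [7.0, 11.0, 16.5, 23.0, 30.0]

# Made-up held-out period following the training period.
test_raw = [25.0, 40.0]
test_ref = [13.0, 21.5]

gain, offset = fit_gain_offset(train_raw, train_ref)
test_calibrated = [offset + gain * x for x in test_raw]

rmse_raw = rmse(test_raw, test_ref)
rmse_calibrated = rmse(test_calibrated, test_ref)
print("gain:", gain, "offset:", offset)
print("RMSE raw:", rmse_raw, "RMSE calibrated:", rmse_calibrated)
```

Evaluating on a period that was not used for fitting, as the study does, keeps the reported RMSE from being flattered by the training data. A fuller model would add further predictors to the regression.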
Affiliation(s)
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University
- Misti Levy Zamora
  - Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, Maryland 21205
  - SEARCH (Solutions for Energy, Air, Climate and Health) Center, Yale University, New Haven, CT, USA
- Colby Buehler
  - SEARCH (Solutions for Energy, Air, Climate and Health) Center, Yale University, New Haven, CT, USA
  - Department of Chemical & Environmental Engineering, Yale University, School of Engineering and Applied Science, New Haven, Connecticut 06511, USA
- Lei Hao
  - Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, Maryland 21205
- Fulizi Xiong
  - SEARCH (Solutions for Energy, Air, Climate and Health) Center, Yale University, New Haven, CT, USA
  - Department of Chemical & Environmental Engineering, Yale University, School of Engineering and Applied Science, New Haven, Connecticut 06511, USA
- Drew R Gentner
  - SEARCH (Solutions for Energy, Air, Climate and Health) Center, Yale University, New Haven, CT, USA
  - Department of Chemical & Environmental Engineering, Yale University, School of Engineering and Applied Science, New Haven, Connecticut 06511, USA
- Kirsten Koehler
  - Department of Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, Maryland 21205
  - SEARCH (Solutions for Energy, Air, Climate and Health) Center, Yale University, New Haven, CT, USA