1
Ver Hoef JM, Dumelle M, Higham M, Peterson EE, Isaak DJ. Indexing and partitioning the spatial linear model for large data sets. PLoS One 2023; 18:e0291906. [PMID: 37910525 PMCID: PMC10619847 DOI: 10.1371/journal.pone.0291906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Accepted: 09/07/2023] [Indexed: 11/03/2023]
Abstract
We consider four main goals when fitting spatial linear models: 1) estimating covariance parameters, 2) estimating fixed effects, 3) kriging (making point predictions), and 4) block-kriging (predicting the average value over a region). Each of these goals can present different challenges when analyzing large spatial data sets. Current research uses a variety of methods, including spatial basis functions (reduced rank), covariance tapering, etc., to achieve these goals. However, spatial indexing, which is very similar to composite likelihood, offers some advantages. We develop a simple framework for all four goals listed above by using indexing to create a block covariance structure and nearest-neighbor predictions while maintaining a coherent linear model. We show exact inference for fixed effects under this block covariance construction. Spatial indexing is very fast, and simulations are used to validate methods and compare to another popular method. We study various sample designs for indexing and our simulations showed that indexing leading to spatially compact partitions is best over a range of sample sizes, autocorrelation values, and generating processes. Partitions can be kept small, on the order of 50 samples per partition. We use nearest-neighbors for kriging and block kriging, finding that 50 nearest-neighbors is sufficient. In all cases, confidence intervals for fixed effects, and prediction intervals for (block) kriging, have appropriate coverage. Some advantages of spatial indexing are that it is available for any valid covariance matrix, can take advantage of parallel computing, and easily extends to non-Euclidean topologies, such as stream networks. We use stream networks to show how spatial indexing can achieve all four goals, listed above, for very large data sets, in a matter of minutes, rather than days, for an example data set.
Affiliation(s)
- Jay M. Ver Hoef
- Marine Mammal Laboratory, NOAA-NMFS Alaska Fisheries Science Center, Seattle, WA, United States of America
- Michael Dumelle
- United States Environmental Protection Agency, Corvallis, Oregon, United States of America
- Matt Higham
- St. Lawrence University Department of Mathematics, Computer Science, and Statistics, Canton, New York, United States of America
- Erin E. Peterson
- Australian Research Council Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS), Queensland University of Technology, Brisbane, Queensland, Australia
- Daniel J. Isaak
- Rocky Mountain Research Station, U.S. Forest Service, Boise, ID, United States of America
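The block covariance idea described in the abstract of entry 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-mean model, the exponential covariance with a nugget, the parameter values, and the helper names `exp_cov`, `gauss_loglik`, `grid_index`, and `indexed_loglik` are all assumptions made for illustration. The grid rule is only a crude stand-in for a spatially compact partition. The point it shows is that once observations are indexed into partitions and cross-partition covariances are set to zero, the Gaussian log-likelihood becomes a sum of small, independent per-partition terms.

```python
import numpy as np

def exp_cov(coords, sill=1.0, range_=0.3, nugget=0.1):
    """Exponential covariance plus nugget (an illustrative model choice)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sill * np.exp(-d / range_) + nugget * np.eye(len(coords))

def gauss_loglik(y, cov):
    """Zero-mean multivariate normal log-likelihood via a Cholesky factor."""
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, y)
    return -0.5 * (z @ z) - np.log(np.diag(chol)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def grid_index(coords, cells_per_side):
    """Assign each point in the unit square to a grid cell (hypothetical indexing rule)."""
    ij = np.minimum((coords * cells_per_side).astype(int), cells_per_side - 1)
    return ij[:, 0] * cells_per_side + ij[:, 1]

def indexed_loglik(y, coords, index, **cov_kwargs):
    """Sum of per-partition log-likelihoods, i.e. a block-diagonal covariance."""
    total = 0.0
    for k in np.unique(index):
        members = index == k
        total += gauss_loglik(y[members], exp_cov(coords[members], **cov_kwargs))
    return total

# Simulate one zero-mean realization on 400 random locations in the unit square.
rng = np.random.default_rng(0)
coords = rng.random((400, 2))
y = np.linalg.cholesky(exp_cov(coords)) @ rng.standard_normal(400)

# Nine grid cells, so each partition holds roughly 40-50 points.
index = grid_index(coords, cells_per_side=3)
approx = indexed_loglik(y, coords, index)
```

Each term in the loop only needs a Cholesky factor of a small matrix, and the terms are independent, which is why this kind of construction lends itself to parallel evaluation.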
2
Mukerjee R. Improving upon the effective sample size based on Godambe information for block likelihood inference. Comput Stat 2023. [DOI: 10.1007/s00180-023-01328-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
3
Saha A, Datta A, Banerjee S. Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes. Journal of Data Science 2022; 20:533-544. [PMID: 37786782 PMCID: PMC10544813 DOI: 10.6339/22-jds1073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2023]
Abstract
Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
Affiliation(s)
- Arkajyoti Saha
- Department of Statistics, University of Washington, Seattle, WA, USA
- Abhirup Datta
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA
- Sudipto Banerjee
- UCLA Department of Biostatistics, 650 Charles E. Young Drive South, University of California Los Angeles, CA 90095-1772, USA
4
Moran KR, Wheeler MW. Fast increased fidelity samplers for approximate Bayesian Gaussian process regression. J R Stat Soc Series B Stat Methodol 2022; 84:1198-1228. [PMID: 36570797 PMCID: PMC9770094 DOI: 10.1111/rssb.12494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Gaussian processes (GPs) are common components in Bayesian non-parametric models having a rich methodological literature and strong theoretical grounding. The use of exact GPs in Bayesian models is limited to problems containing several thousand observations due to their prohibitive computational demands. We develop a posterior sampling algorithm using $\mathcal{H}$-matrix approximations that scales at $O(n \log^2 n)$. We show that this approximation's Kullback-Leibler divergence to the true posterior can be made arbitrarily small. Though multidimensional GPs could be used with our algorithm, d-dimensional surfaces are modeled as tensor products of univariate GPs to minimize the cost of matrix construction and maximize computational efficiency. We illustrate the performance of this fast increased fidelity approximate GP, FIFA-GP, using both simulated and non-synthetic data sets.
5
Davies TM, Banerjee S, Martin AP, Turnbull RE. A nearest‐neighbour Gaussian process spatial factor model for censored, multi‐depth geochemical data. J R Stat Soc Ser C Appl Stat 2022. [DOI: 10.1111/rssc.12565] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
Affiliation(s)
- Tilman M. Davies
- Department of Mathematics & Statistics, University of Otago, Dunedin, New Zealand
- Sudipto Banerjee
- Department of Biostatistics, University of California Los Angeles, Los Angeles, USA
6
A bioinformatic analysis of WFDC2 (HE4) expression in high grade serous ovarian cancer reveals tumor-specific changes in metabolic and extracellular matrix gene expression. Med Oncol 2022; 39:71. [PMID: 35568777 PMCID: PMC9107348 DOI: 10.1007/s12032-022-01665-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2021] [Accepted: 01/22/2022] [Indexed: 10/31/2022]
Abstract
Human epididymis protein-4 (HE4/WFDC2) has been well-studied as an ovarian cancer clinical biomarker. To improve our understanding of its functional role in high grade serous ovarian cancer, we determined transcriptomic differences between ovarian tumors with high- versus low-WFDC2 mRNA levels in The Cancer Genome Atlas dataset. High-WFDC2 transcript levels were significantly associated with reduced survival in stage III/IV serous ovarian cancer patients. Differential expression and correlation analyses revealed secretory leukocyte peptidase inhibitor (SLPI/WFDC4) as the gene most positively correlated with WFDC2, while A kinase anchor protein-12 was most negatively correlated. WFDC2 and SLPI were strongly correlated across many cancers. Gene ontology analysis revealed enrichment of oxidative phosphorylation in differentially expressed genes associated with high-WFDC2 levels, while extracellular matrix organization was enriched among genes associated with low-WFDC2 levels. Immune cell subsets found to be positively correlated with WFDC2 levels were B cells and plasmacytoid dendritic cells, while neutrophils and endothelial cells were negatively correlated with WFDC2. Results were compared with DepMap cell culture gene expression data. Gene ontology analysis of k-means clustering revealed that genes associated with low-WFDC2 were also enriched in extracellular matrix and adhesion categories, while high-WFDC2 genes were enriched in epithelial cell proliferation and peptidase activity. These results support previous findings regarding the effect of HE4/WFDC2 on ovarian cancer pathogenesis in cell lines and mouse models, while adding another layer of complexity to its potential functions in ovarian tumor tissue. Further experimental explorations of these findings in the context of the tumor microenvironment are merited.
7
Affiliation(s)
- Arkajyoti Saha
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Sumanta Basu
- Department of Statistics and Data Science, Cornell University, Ithaca, NY
- Abhirup Datta
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD
8
Chen J, Stein ML. Linear-Cost Covariance Functions for Gaussian Random Fields. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1919122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Jie Chen
- MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA
9
Katzfuss M, Guinness J. A General Framework for Vecchia Approximations of Gaussian Processes. Stat Sci 2021. [DOI: 10.1214/19-sts755] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Indexed: 11/19/2022]
10
Liu H, Ong YS, Shen X, Cai J. When Gaussian Process Meets Big Data: A Review of Scalable GPs. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:4405-4423. [PMID: 31944966 DOI: 10.1109/tnnls.2019.2957109] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Indexed: 05/26/2023]
Abstract
The vast quantity of information brought by big data as well as the evolving computer hardware encourages success stories in the machine learning community. In the meanwhile, it poses challenges for the Gaussian process regression (GPR), a well-known nonparametric, and interpretable Bayesian model, which suffers from cubic complexity to data size. To improve the scalability while retaining desirable prediction quality, a variety of scalable GPs have been presented. However, they have not yet been comprehensively reviewed and analyzed to be well understood by both academia and industry. The review of scalable GPs in the GP community is timely and important due to the explosion of data size. To this end, this article is devoted to reviewing state-of-the-art scalable GPs involving two main categories: global approximations that distillate the entire data and local approximations that divide the data for subspace learning. Particularly, for global approximations, we mainly focus on sparse approximations comprising prior approximations that modify the prior but perform exact inference, posterior approximations that retain exact prior but perform approximate inference, and structured sparse approximations that exploit specific structures in kernel matrix; for local approximations, we highlight the mixture/product of experts that conducts model averaging from multiple local experts to boost predictions. To present a complete review, recent advances for improving the scalability and capability of scalable GPs are reviewed. Finally, the extensions and open issues of scalable GPs in various scenarios are reviewed and discussed to inspire novel ideas for future research avenues.
11
Davis BJK, Curriero FC. Development and Evaluation of Geostatistical Methods for Non-Euclidean-Based Spatial Covariance Matrices. Mathematical Geosciences 2019; 51:767-791. [PMID: 31827631 PMCID: PMC6905632 DOI: 10.1007/s11004-019-09791-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/16/2018] [Accepted: 02/23/2019] [Indexed: 06/01/2023]
Abstract
Customary and routine practice of geostatistical modeling assumes that inter-point distances are a Euclidean metric (i.e., as the crow flies) when characterizing spatial variation. There are many real-world settings, however, in which the use of a non-Euclidean distance is more appropriate, for example in complex bodies of water. However, if such a distance is used with current semivariogram functions, the resulting spatial covariance matrices are no longer guaranteed to be positive-definite. Previous attempts to address this issue for geostatistical prediction (i.e., kriging) models transform the non-Euclidean space into a Euclidean metric, such as through multi-dimensional scaling (MDS). However, these attempts estimate spatial covariances only after distances are scaled. An alternative method is proposed to re-estimate a spatial covariance structure originally based on a non-Euclidean distance metric to ensure validity. This method is compared to the standard use of Euclidean distance, as well as a previously utilized MDS method. All methods are evaluated using cross-validation assessments on both simulated and real-world experiments. Results show a high level of bias in prediction variance for the previously developed MDS method that has not been highlighted previously. Conversely, the proposed method offers a preferred tradeoff between prediction accuracy and prediction variance and at times outperforms the existing methods for both sets of metrics. Overall results indicate that this proposed method can provide improved geostatistical predictions while ensuring valid results when the use of non-Euclidean distances is warranted.
Affiliation(s)
- Benjamin J K Davis
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Spatial Science for Public Health Center, Johns Hopkins University, Baltimore, MD 21205, USA
- Frank C Curriero
- Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Spatial Science for Public Health Center, Johns Hopkins University, Baltimore, MD 21205, USA
12
Gelfand AE, Shirota S. Preferential sampling for presence/absence data and for fusion of presence/absence data with presence‐only data. Ecol Monogr 2019. [DOI: 10.1002/ecm.1372] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/06/2022]
Affiliation(s)
- Alan E. Gelfand
- Department of Statistical Science, Duke University, 214 Old Chemistry Building, Box 90251, Durham, North Carolina 27708‐0251, USA
- Shinichiro Shirota
- Department of Biostatistics, UCLA, 650 Charles E. Young Drive South, 51‐254 CHS, Box 951772, Los Angeles, California 90095‐1772, USA
13
Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, Banerjee S. Efficient algorithms for Bayesian Nearest Neighbor Gaussian Processes. J Comput Graph Stat 2019; 28:401-414. [PMID: 31543693 PMCID: PMC6753955 DOI: 10.1080/10618600.2018.1537924] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 06/01/2017] [Revised: 07/18/2018] [Accepted: 10/09/2018] [Indexed: 10/27/2022]
Abstract
We consider alternate formulations of recently proposed hierarchical Nearest Neighbor Gaussian Process (NNGP) models (Datta et al., 2016a) for improved convergence, faster computing time, and more robust and reproducible Bayesian inference. Algorithms are defined that improve CPU memory management and exploit existing high-performance numerical linear algebra libraries. Computational and inferential benefits are assessed for alternate NNGP specifications using simulated datasets and remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska. The resulting data product is the first statistically robust map of forest canopy for the TIU.
14
Taylor-Rodriguez D, Finley AO, Datta A, Babcock C, Andersen HE, Cook BD, Morton DC, Banerjee S. Spatial Factor Models for High-Dimensional and Large Spatial Data: An Application in Forest Variable Mapping. Stat Sin 2019; 29:1155-1180. [PMID: 33311955 PMCID: PMC7731981 DOI: 10.5705/ss.202018.0005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
Gathering information about forest variables is an expensive and arduous activity. As such, directly collecting the data required to produce high-resolution maps over large spatial domains is infeasible. Next generation collection initiatives of remotely sensed Light Detection and Ranging (LiDAR) data are specifically aimed at producing complete-coverage maps over large spatial domains. Given that LiDAR data and forest characteristics are often strongly correlated, it is possible to make use of the former to model, predict, and map forest variables over regions of interest. This entails dealing with the high-dimensional ($\sim 10^2$) spatially dependent LiDAR outcomes over a large number of locations ($\sim 10^5$-$10^6$). With this in mind, we develop the Spatial Factor Nearest Neighbor Gaussian Process (SF-NNGP) model, and embed it in a two-stage approach that connects the spatial structure found in LiDAR signals with forest variables. We provide a simulation experiment that demonstrates inferential and predictive performance of the SF-NNGP, and use the two-stage modeling strategy to generate complete-coverage maps of forest variables with associated uncertainty over a large region of boreal forests in interior Alaska.
Affiliation(s)
- Andrew O. Finley
- Department of Forestry, Michigan State University, East Lansing, MI
- Abhirup Datta
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Chad Babcock
- School of Environmental and Forest Sciences, University of Washington, Seattle, WA
- Bruce D. Cook
- Biospheric Sciences Laboratory, NASA Goddard Space Flight Center, Greenbelt, MD
- Douglas C. Morton
- Biospheric Sciences Laboratory, NASA Goddard Space Flight Center, Greenbelt, MD
- Sudipto Banerjee
- Department of Biostatistics, University of California Los Angeles, Los Angeles, CA
15
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, Zammit-Mangion A. A Case Study Competition Among Methods for Analyzing Large Spatial Data. Journal of Agricultural, Biological, and Environmental Statistics 2018; 24:398-425. [PMID: 31496633 PMCID: PMC6709111 DOI: 10.1007/s13253-018-00348-w] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 11/30/2018] [Accepted: 12/05/2018] [Indexed: 10/27/2022]
Abstract
The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has led to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low-rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given locations and each was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online. Electronic supplementary material: supplementary materials for this article are available at 10.1007/s13253-018-00348-w.
Affiliation(s)
- Furong Sun
- Brigham Young University, Provo, UT USA
16
Guinness J. Permutation and Grouping Methods for Sharpening Gaussian Process Approximations. Technometrics 2018; 60:415-429. [PMID: 31447491 DOI: 10.1080/00401706.2018.1437476] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 10/18/2022]
Abstract
Vecchia's approximate likelihood for Gaussian process parameters depends on how the observations are ordered, which has been cited as a deficiency. This article takes the alternative standpoint that the ordering can be tuned to sharpen the approximations. Indeed, the first part of the paper includes a systematic study of how ordering affects the accuracy of Vecchia's approximation. We demonstrate the surprising result that random orderings can give dramatically sharper approximations than default coordinate-based orderings. Additional ordering schemes are described and analyzed numerically, including orderings capable of improving on random orderings. The second contribution of this paper is a new automatic method for grouping calculations of components of the approximation. The grouping methods simultaneously improve approximation accuracy and reduce computational burden. In common settings, reordering combined with grouping reduces Kullback-Leibler divergence from the target model by more than a factor of 60 compared to ungrouped approximations with default ordering. The claims are supported by theory and numerical results with comparisons to other approximations, including tapered covariances and stochastic partial differential equations. Computational details are provided, including the use of the approximations for prediction and conditional simulation. An application to space-time satellite data is presented.
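The dependence on ordering mentioned in the abstract of entry 16 can be made concrete with a minimal sketch of a Vecchia-type log-likelihood. This is an illustration under stated assumptions, not the paper's code: the zero-mean model, the exponential covariance with a nugget, the parameter values, and the helper names `exp_cov` and `vecchia_loglik` are hypothetical, and the grouping method from the paper is not shown. Each observation is conditioned on at most `m` of its nearest predecessors in a chosen ordering, so different orderings give different approximations.

```python
import numpy as np

def exp_cov(coords, range_=0.3, nugget=0.05):
    """Exponential covariance plus nugget (an illustrative model choice)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-d / range_) + nugget * np.eye(len(coords))

def vecchia_loglik(y, coords, order, m):
    """Sum of univariate conditional log-densities; each point conditions on
    its m nearest predecessors under the ordering `order`."""
    y, coords = y[order], coords[order]
    total = 0.0
    for i in range(len(y)):
        prev = np.arange(i)
        if i > m:
            # Keep only the m closest earlier points.
            dist = np.linalg.norm(coords[prev] - coords[i], axis=1)
            prev = prev[np.argsort(dist)[:m]]
        cov = exp_cov(coords[np.append(prev, i)])
        c_pp, c_pi, c_ii = cov[:-1, :-1], cov[:-1, -1], cov[-1, -1]
        if len(prev) > 0:
            w = np.linalg.solve(c_pp, c_pi)
            mean, var = w @ y[prev], c_ii - w @ c_pi
        else:
            mean, var = 0.0, c_ii
        total += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return total

# Simulate one zero-mean realization at 60 random locations.
rng = np.random.default_rng(1)
n = 60
coords = rng.random((n, 2))
y = np.linalg.cholesky(exp_cov(coords)) @ rng.standard_normal(n)

# Two orderings of the same data: sorted on the first coordinate, and random.
coord_order = np.argsort(coords[:, 0])
random_order = rng.permutation(n)
ll_coord = vecchia_loglik(y, coords, coord_order, m=10)
ll_random = vecchia_loglik(y, coords, random_order, m=10)
```

With `m = n - 1` every point conditions on all of its predecessors and the sum reduces to the exact joint log-likelihood for any ordering, which gives a simple sanity check for the sketch.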
17
Gelfand AE, Banerjee S. Bayesian Modeling and Analysis of Geostatistical Data. Annual Review of Statistics and Its Application 2017; 4:245-266. [PMID: 29392155 PMCID: PMC5790124 DOI: 10.1146/annurev-statistics-060116-054155] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 05/22/2023]
Abstract
The most prevalent spatial data setting is, arguably, that of so-called geostatistical data, data that arise as random variables observed at fixed spatial locations. Collection of such data in space and in time has grown enormously in the past two decades. With it has grown a substantial array of methods to analyze such data. Here, we attempt a review of a fully model-based perspective for such data analysis, the approach of hierarchical modeling fitted within a Bayesian framework. The benefit, as with hierarchical Bayesian modeling in general, is full and exact inference, with proper assessment of uncertainty. Geostatistical modeling includes univariate and multivariate data collection at sites, continuous and categorical data at sites, static and dynamic data at sites, and datasets over very large numbers of sites and long periods of time. Within the hierarchical modeling framework, we offer a review of the current state of the art in these settings.
Affiliation(s)
- Alan E Gelfand
- Department of Statistical Science, Duke University, Durham, North Carolina 27708-0251
- Sudipto Banerjee
- Department of Biostatistics, University of California, Los Angeles, California 90095-1772