1
Schwob MR, Hooten MB, Narasimhan V. Composite dyadic models for spatio-temporal data. Biometrics 2024; 80:ujae107. [PMID: 39360904 DOI: 10.1093/biomtc/ujae107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 08/10/2024] [Accepted: 09/11/2024] [Indexed: 10/05/2024]
Abstract
Mechanistic statistical models are commonly used to study the flow of biological processes. For example, in landscape genetics, the aim is to infer spatial mechanisms that govern gene flow in populations. Existing statistical approaches in landscape genetics do not account for temporal dependence in the data and may be computationally prohibitive. We infer mechanisms with a Bayesian hierarchical dyadic model that scales well with large data sets and that accounts for spatial and temporal dependence. We construct a fully connected network comprising spatio-temporal data for the dyadic model and use normalized composite likelihoods to account for the dependence structure in space and time. We develop a dyadic model to account for physical mechanisms commonly found in physical-statistical models and apply our methods to ancient human DNA data to infer the mechanisms that affected human movement in Bronze Age Europe.
Affiliation(s)
- Michael R Schwob
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, United States
- Mevin B Hooten
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, United States
- Vagheesh Narasimhan
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, United States
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, United States
  - Department of Population Health, Dell Medical School, Austin, TX 78712, United States
2
Abdel-Aty Y, Kayid M, Alomani G. Selection effect of learning rate parameter on estimators of k exponential populations under the joint hybrid censoring. Heliyon 2024; 10:e34087. [PMID: 39071643 PMCID: PMC11277392 DOI: 10.1016/j.heliyon.2024.e34087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 06/30/2024] [Accepted: 07/03/2024] [Indexed: 07/30/2024] Open
Abstract
A Bayesian method based on the learning rate parameter η is called a generalized Bayesian method. In this study, joint hybrid censored type I and type II samples from k exponential populations were examined to determine the influence of the parameter η on the estimation results. To investigate the selection effects of the learning rate and the loss parameters on the estimation results, we considered two additional loss functions in the Bayesian approach: the linear and the generalized entropy loss functions. We then compared the generalized Bayesian algorithm with the traditional Bayesian algorithm. We performed Monte Carlo simulations to compare the performance of the estimation results with the losses and different values of η. The effects of different losses with different values and learning rate parameters are examined using an example.
Affiliation(s)
- Yahia Abdel-Aty
  - Department of Mathematics, College of Science, Taibah University, Saudi Arabia
  - Department of Mathematics, Faculty of Science, Al-Azhar University, Nasr City, 11884, Egypt
- Mohamed Kayid
  - Department of Statistics and Operations Research, College of Science, King Saud University, P.O. Box 2455, Riyadh, 11451, Saudi Arabia
- Ghadah Alomani
  - Department of Mathematical Sciences, College of Science, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
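The abstract of entry 2 describes a generalized Bayesian method in which a learning rate η tempers the likelihood. As a rough illustration only, the sketch below raises an exponential likelihood to the power η and combines it with a conjugate gamma prior. It assumes a complete (uncensored) sample from a single population, a Gamma(shape, rate) prior, and squared error loss, so it does not reproduce the paper's joint hybrid censoring scheme or its linear and generalized entropy losses. The names `GammaPosterior` and `generalized_bayes_exponential` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GammaPosterior:
    """Gamma distribution in the shape/rate parameterization."""
    shape: float
    rate: float

    @property
    def mean(self) -> float:
        # Posterior mean, i.e. the Bayes estimate under squared error loss.
        return self.shape / self.rate


def generalized_bayes_exponential(data, eta, prior_shape=1.0, prior_rate=1.0):
    """Generalized posterior for the rate of an exponential distribution.

    The likelihood theta**n * exp(-theta * sum(data)) is raised to the
    power eta and multiplied by a Gamma(prior_shape, prior_rate) prior,
    which gives Gamma(prior_shape + eta * n, prior_rate + eta * sum(data)).
    Assumption: complete data, no censoring.
    """
    if eta <= 0:
        raise ValueError("eta must be positive")
    n = len(data)
    total = sum(data)
    return GammaPosterior(prior_shape + eta * n, prior_rate + eta * total)


sample = [0.5, 1.5, 1.0, 2.0]
standard = generalized_bayes_exponential(sample, eta=1.0)  # ordinary Bayes
tempered = generalized_bayes_exponential(sample, eta=0.5)  # down-weighted data
print(standard.mean, tempered.mean)
```

Setting η = 1 recovers the traditional Bayesian update, while η < 1 shrinks the contribution of the data relative to the prior.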
3
Chakraborty A, Bhattacharya A, Pati D. A Gibbs Posterior Framework for Fair Clustering. ENTROPY (BASEL, SWITZERLAND) 2024; 26:63. [PMID: 38248188 PMCID: PMC10814285 DOI: 10.3390/e26010063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 12/28/2023] [Accepted: 01/09/2024] [Indexed: 01/23/2024]
Abstract
The rise of machine learning-driven decision-making has sparked a growing emphasis on algorithmic fairness. Within the realm of clustering, the notion of balance is utilized as a criterion for attaining fairness, which characterizes a clustering mechanism as fair when the resulting clusters maintain a consistent proportion of observations representing individuals from distinct groups delineated by protected attributes. Building on this idea, the literature has rapidly incorporated a myriad of extensions, devising fair versions of the existing frequentist clustering algorithms, e.g., k-means, k-medoids, etc., that aim at minimizing specific loss functions. These approaches lack uncertainty quantification associated with the optimal clustering configuration and only provide clustering boundaries without quantifying the probabilities associated with each observation belonging to the different clusters. In this article, we intend to offer a novel probabilistic formulation of the fair clustering problem that facilitates valid uncertainty quantification even under mild model misspecifications, without incurring substantial computational overhead. Mixture model-based fair clustering frameworks facilitate automatic uncertainty quantification, but tend to showcase brittleness under model misspecification and involve significant computational challenges. To circumnavigate such issues, we propose a generalized Bayesian fair clustering framework that inherently enjoys decision-theoretic interpretation. Moreover, we devise efficient computational algorithms that crucially leverage techniques from the existing literature on optimal transport and clustering based on loss functions. The gain from the proposed technology is showcased via numerical experiments and real data examples.
Affiliation(s)
- Abhisek Chakraborty
  - Department of Statistics, Texas A&M University, College Station, TX 77843, USA; (A.B.); (D.P.)
4
Kasa SR, Rajan V. Avoiding inferior clusterings with misspecified Gaussian mixture models. Sci Rep 2023; 13:19164. [PMID: 37932317 PMCID: PMC10628229 DOI: 10.1038/s41598-023-44608-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2023] [Accepted: 10/10/2023] [Indexed: 11/08/2023] Open
Abstract
Clustering is a fundamental tool for exploratory data analysis, and is ubiquitous across scientific disciplines. Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
Affiliation(s)
- Siva Rajesh Kasa
  - School of Computing, National University of Singapore, COM1, 13, Computing Dr, Singapore, 117417, Singapore
- Vaibhav Rajan
  - School of Computing, National University of Singapore, COM1, 13, Computing Dr, Singapore, 117417, Singapore
5
Baek Y, Aquino W, Mukherjee S. Generalized Bayes approach to inverse problems with model misspecification. INVERSE PROBLEMS 2023; 39:105011. [PMID: 37990698 PMCID: PMC10659580 DOI: 10.1088/1361-6420/acf51c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023]
Abstract
We propose a general framework for obtaining probabilistic solutions to PDE-based inverse problems. Bayesian methods are attractive for uncertainty quantification but assume knowledge of the likelihood model or data generation process. This assumption is difficult to justify in many inverse problems, where the specification of the data generation process is not obvious. We adopt a Gibbs posterior framework that directly posits a regularized variational problem on the space of probability distributions of the parameter. We propose a novel model comparison framework that evaluates the optimality of a given loss based on its "predictive performance". We provide cross-validation procedures to calibrate the regularization parameter of the variational objective and compare multiple loss functions. Some novel theoretical properties of Gibbs posteriors are also presented. We illustrate the utility of our framework via a simulated example, motivated by dispersion-based wave models used to characterize arterial vessels in ultrasound vibrometry.
Affiliation(s)
- Youngsoo Baek
  - Department of Statistical Science, Duke University, Durham, NC, United States of America
- Wilkins Aquino
  - Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, United States of America
- Sayan Mukherjee
  - Department of Statistical Science, Duke University, Durham, NC, United States of America
  - Department of Mathematics, Computer Science, Biostatistics & Bioinformatics, Durham, NC, United States of America
  - Center for Scalable Data Analytics and Artificial Intelligence, Universität Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
6
Duan LL, Roy A. Spectral Clustering, Bayesian Spanning Forest, and Forest Process. J Am Stat Assoc 2023; 119:2140-2153. [PMID: 39583343 PMCID: PMC11580821 DOI: 10.1080/01621459.2023.2250098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Revised: 07/23/2023] [Accepted: 08/04/2023] [Indexed: 11/26/2024]
Abstract
Spectral clustering views the similarity matrix as a weighted graph, and partitions the data by minimizing a graph-cut loss. Since it minimizes the across-cluster similarity, there is no need to model the distribution within each cluster. As a result, one reduces the chance of model misspecification, which is often a risk in mixture model-based clustering. Nevertheless, compared to the latter, spectral clustering has no direct ways of quantifying the clustering uncertainty (such as the assignment probability), or allowing easy model extensions for complicated data applications. To fill this gap, we propose the Bayesian forest model as a generative graphical model for spectral clustering. This is motivated by our discovery that the posterior connecting matrix in a forest model has almost the same leading eigenvectors, as the ones used by normalized spectral clustering. To induce a distribution for the forest, we develop a "forest process" as a graph extension to the urn process, while we carefully characterize the differences in the partition probability. We derive a simple Markov chain Monte Carlo algorithm for posterior estimation, and demonstrate superior performance compared to existing algorithms. We illustrate several model-based extensions useful for data applications, including high-dimensional and multi-view clustering for images.
Affiliation(s)
- Leo L. Duan
  - Department of Statistics, University of Florida
7
Lui A, Lee J, Thall PF, Daher M, Rezvani K, Basar R. A Bayesian feature allocation model for identifying cell subpopulations using CyTOF data. J R Stat Soc Ser C Appl Stat 2023; 72:718-738. [PMID: 37325776 PMCID: PMC10264057 DOI: 10.1093/jrsssc/qlad029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 04/02/2023] [Indexed: 06/17/2023]
Abstract
A Bayesian feature allocation model (FAM) is presented for identifying cell subpopulations based on multiple samples of cell surface or intracellular marker expression level data obtained by cytometry by time of flight (CyTOF). Cell subpopulations are characterized by differences in marker expression patterns, and cells are clustered into subpopulations based on their observed expression levels. A model-based method is used to construct cell clusters within each sample by modeling subpopulations as latent features, using a finite Indian buffet process. Non-ignorable missing data due to technical artifacts in mass cytometry instruments are accounted for by defining a static missingship mechanism. In contrast with conventional cell clustering methods, which cluster observed marker expression levels separately for each sample, the FAM-based method can be applied simultaneously to multiple samples, and also identify important cell subpopulations likely to be otherwise missed. The proposed FAM-based method is applied to jointly analyse three CyTOF datasets to study natural killer (NK) cells. Because the subpopulations identified by the FAM may define novel NK cell subsets, this statistical analysis may provide useful information about the biology of NK cells and their potential role in cancer immunotherapy which may lead, in turn, to development of improved NK cell therapies.
Affiliation(s)
- Arthur Lui
  - Department of Statistics, Baskin School of Engineering, University of California Santa Cruz, 1156 High Street, Santa Cruz, CA, 95064, USA
- Juhee Lee
  - Department of Statistics, University of California at Santa Cruz, Santa Cruz, CA, USA
- Peter F Thall
  - Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX, USA
- May Daher
  - Department of Stem Cell Transplantation and Cellular Therapy, M.D. Anderson Cancer Center, Houston, TX, USA
- Katy Rezvani
  - Department of Stem Cell Transplantation and Cellular Therapy, M.D. Anderson Cancer Center, Houston, TX, USA
- Rafet Basar
  - Department of Stem Cell Transplantation and Cellular Therapy, M.D. Anderson Cancer Center, Houston, TX, USA
8
Wade S. Bayesian cluster analysis. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2023; 381:20220149. [PMID: 36970819 PMCID: PMC10041359 DOI: 10.1098/rsta.2022.0149] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/22/2022] [Accepted: 01/03/2023] [Indexed: 06/18/2023]
Abstract
Bayesian cluster analysis offers substantial benefits over algorithmic approaches by providing not only point estimates but also uncertainty in the clustering structure and patterns within each cluster. An overview of Bayesian cluster analysis is provided, including both model-based and loss-based approaches, along with a discussion on the importance of the kernel or loss selected and prior specification. Advantages are demonstrated in an application to cluster cells and discover latent cell types in single-cell RNA sequencing data to study embryonic cellular development. Lastly, we focus on the ongoing debate between finite and infinite mixtures in a model-based approach and robustness to model misspecification. While much of the debate and asymptotic theory focuses on the marginal posterior of the number of clusters, we empirically show that quite a different behaviour is obtained when estimating the full clustering structure. This article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.
Affiliation(s)
- S. Wade
  - School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, James Clerk Maxwell Building, Edinburgh, UK
9
Rios Insua D, Naveiro R, Gallego V, Poulos J. Adversarial Machine Learning: Bayesian Perspectives. J Am Stat Assoc 2023. [DOI: 10.1080/01621459.2023.2183129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
10
Liu Y, Goudie RJB. Generalized Geographically Weighted Regression Model within a Modularized Bayesian Framework. BAYESIAN ANALYSIS 2023; -1:1-36. [PMID: 36714467 PMCID: PMC7614111 DOI: 10.1214/22-ba1357] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/18/2023]
Abstract
Geographically weighted regression (GWR) models handle geographical dependence through a spatially varying coefficient model and have been widely used in applied science, but its general Bayesian extension is unclear because it involves a weighted log-likelihood which does not imply a probability distribution on data. We present a Bayesian GWR model and show that its essence is dealing with partial misspecification of the model. Current modularized Bayesian inference models accommodate partial misspecification from a single component of the model. We extend these models to handle partial misspecification in more than one component of the model, as required for our Bayesian GWR model. Information from the various spatial locations is manipulated via a geographically weighted kernel and the optimal manipulation is chosen according to a Kullback-Leibler (KL) divergence. We justify the model via an information risk minimization approach and show the consistency of the proposed estimator in terms of a geographically weighted KL divergence.
Affiliation(s)
- Yang Liu
  - MRC Biostatistics Unit, University of Cambridge, UK
11
Martin GM, Frazier DT, Robert CP. Approximating Bayes in the 21st Century. Stat Sci 2023. [DOI: 10.1214/22-sts875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
Affiliation(s)
- Gael M. Martin
  - Gael M. Martin is Professor, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
- David T. Frazier
  - David T. Frazier is Associate Professor, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
12
Zito A, Rigon T, Dunson DB. Inferring taxonomic placement from DNA barcoding aiding in discovery of new taxa. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.14009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Alessandro Zito
  - Department of Statistical Science, Duke University, Durham, North Carolina, USA
- Tommaso Rigon
  - Department of Economics, Management and Statistics, University of Milano‐Bicocca, Milan, Italy
- David B. Dunson
  - Department of Statistical Science, Duke University, Durham, North Carolina, USA
13
Jewson J, Rossell D. General Bayesian loss function selection and the use of improper models. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Jack Jewson
  - Department of Business and Economics, Universitat Pompeu Fabra, Barcelona, Spain
  - Data Science Center, Barcelona School of Economics, Barcelona, Spain
- David Rossell
  - Department of Business and Economics, Universitat Pompeu Fabra, Barcelona, Spain
  - Data Science Center, Barcelona School of Economics, Barcelona, Spain
14
Privé F, Arbel J, Aschard H, Vilhjálmsson BJ. Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. HGG ADVANCES 2022; 3:100136. [PMID: 36105883 PMCID: PMC9465343 DOI: 10.1016/j.xhgg.2022.100136] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 04/21/2022] [Accepted: 08/11/2022] [Indexed: 11/18/2022] Open
Abstract
Publicly available genome-wide association studies (GWAS) summary statistics exhibit uneven quality, which can impact the validity of follow-up analyses. First, we present an overview of possible misspecifications that come with GWAS summary statistics. Then, in both simulations and real-data analyses, we show that additional information such as imputation INFO scores, allele frequencies, and per-variant sample sizes in GWAS summary statistics can be used to detect possible issues and correct for misspecifications in the GWAS summary statistics. One important motivation for us is to improve the predictive performance of polygenic scores built from these summary statistics. Unfortunately, owing to the lack of reporting standards for GWAS summary statistics, this additional information is not systematically reported. We also show that using well-matched linkage disequilibrium (LD) references can improve model fit and translate into more accurate prediction. Finally, we discuss how to make polygenic score methods such as lassosum and LDpred2 more robust to these misspecifications to improve their predictive power.
Affiliation(s)
- Florian Privé
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark
- Julyan Arbel
  - Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
- Hugues Aschard
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
  - Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Bjarni J. Vilhjálmsson
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark
  - Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, Denmark
15
Kurniawan Y, Petrie CL, Williams KJ, Transtrum MK, Tadmor EB, Elliott RS, Karls DS, Wen M. Bayesian, frequentist, and information geometric approaches to parametric uncertainty quantification of classical empirical interatomic potentials. J Chem Phys 2022; 156:214103. [DOI: 10.1063/5.0084988] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/31/2023] Open
Abstract
In this paper, we consider the problem of quantifying parametric uncertainty in classical empirical interatomic potentials (IPs) using both Bayesian (Markov Chain Monte Carlo) and frequentist (profile likelihood) methods. We interface these tools with the Open Knowledgebase of Interatomic Models and study three models based on the Lennard-Jones, Morse, and Stillinger–Weber potentials. We confirm that IPs are typically sloppy, i.e., insensitive to coordinated changes in some parameter combinations. Because the inverse problem in such models is ill-conditioned, parameters are unidentifiable. This presents challenges for traditional statistical methods, as we demonstrate and interpret within both Bayesian and frequentist frameworks. We use information geometry to illuminate the underlying cause of this phenomenon and show that IPs have global properties similar to those of sloppy models from fields, such as systems biology, power systems, and critical phenomena. IPs correspond to bounded manifolds with a hierarchy of widths, leading to low effective dimensionality in the model. We show how information geometry can motivate new, natural parameterizations that improve the stability and interpretation of uncertainty quantification analysis and further suggest simplified, less-sloppy models.
Affiliation(s)
- Yonatan Kurniawan
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84604, USA
- Cody L. Petrie
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84604, USA
- Kinamo J. Williams
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84604, USA
- Mark K. Transtrum
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84604, USA
- Ellad B. Tadmor
  - Department of Aerospace Engineering and Mechanics, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Ryan S. Elliott
  - Department of Aerospace Engineering and Mechanics, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Daniel S. Karls
  - Department of Aerospace Engineering and Mechanics, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Mingjian Wen
  - Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
16
Matsubara T, Knoblauch J, Briol F, Oates CJ. Robust generalised Bayesian inference for intractable likelihoods. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12500] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/27/2022]
Affiliation(s)
- Takuo Matsubara
  - Newcastle University, Newcastle upon Tyne, UK
  - The Alan Turing Institute, London, UK
- Chris J. Oates
  - Newcastle University, Newcastle upon Tyne, UK
  - The Alan Turing Institute, London, UK
17
McGoff K, Mukherjee S, Nobel AB. Gibbs posterior convergence and the thermodynamic formalism. ANN APPL PROBAB 2022. [DOI: 10.1214/21-aap1685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Kevin McGoff
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte
- Sayan Mukherjee
  - Departments of Statistical Science, Mathematics, Computer Science, and Biostatistics & Bioinformatics, Duke University
- Andrew B. Nobel
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
18
Beraha M, Argiento R, Møller J, Guglielmi A. MCMC Computations for Bayesian Mixture Models Using Repulsive Point Processes. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2021.2000424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Mario Beraha
  - Department of Mathematics, Politecnico di Milano, Milano, Italy
  - Department of Computer Science, Università di Bologna, Bologna, Italy
- Raffaele Argiento
  - Department of Economics, Università degli Studi di Bergamo, Milano, Italy
- Jesper Møller
  - Department of Mathematical Sciences, Aalborg University, Aalborg, Denmark
19
Liu Y, Goudie RJB. Stochastic approximation cut algorithm for inference in modularized Bayesian models. STATISTICS AND COMPUTING 2021; 32:7. [PMID: 35125678 PMCID: PMC7612314 DOI: 10.1007/s11222-021-10070-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/22/2021] [Accepted: 11/06/2021] [Indexed: 06/14/2023]
Abstract
Bayesian modelling enables us to accommodate complex forms of data and make a comprehensive inference, but the effect of partial misspecification of the model is a concern. One approach in this setting is to modularize the model and prevent feedback from suspect modules, using a cut model. After observing data, this leads to the cut distribution which normally does not have a closed form. Previous studies have proposed algorithms to sample from this distribution, but these algorithms have unclear theoretical convergence properties. To address this, we propose a new algorithm called the stochastic approximation cut (SACut) algorithm as an alternative. The algorithm is divided into two parallel chains. The main chain targets an approximation to the cut distribution; the auxiliary chain is used to form an adaptive proposal distribution for the main chain. We prove convergence of the samples drawn by the proposed algorithm and present the exact limit. Although SACut is biased, since the main chain does not target the exact cut distribution, we prove this bias can be reduced geometrically by increasing a user-chosen tuning parameter. In addition, parallel computing can be easily adopted for SACut, which greatly reduces computation time.
Affiliation(s)
- Yang Liu
  - MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
20
Fiksel J, Datta A, Amouzou A, Zeger S. Generalized Bayes Quantification Learning under Dataset Shift. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1909599] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Jacob Fiksel
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Agbessi Amouzou
  - Department of International Health, Johns Hopkins University, Baltimore, MD
- Scott Zeger
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
21
Duan LL, Dunson DB. Bayesian Distance Clustering. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2021; 22:224. [PMID: 35782785 PMCID: PMC9245927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.
Affiliation(s)
- Leo L Duan
  - Department of Statistics, University of Florida, Gainesville, FL 32611, USA
- David B Dunson
  - Department of Statistical Science, Duke University, Durham, NC 27708, USA
|
22
|
Andrade D, Fukumizu K. Disjunct support spike‐and‐slab priors for variable selection in regression under quasi‐sparseness. Stat (Int Stat Inst) 2020. [DOI: 10.1002/sta4.307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Affiliation(s)
- Daniel Andrade
  - Data Science Research Laboratories, NEC, Kanagawa 211‐8666, Japan
- Kenji Fukumizu
  - The Institute of Statistical Mathematics, Tokyo 190‐0014, Japan
|
23
|
Observational nonidentifiability, generalized likelihood and free energy. Int J Approx Reason 2020. [DOI: 10.1016/j.ijar.2020.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
24
|
Mukhopadhyay M, Li D, Dunson DB. Estimating densities with non‐linear support by using Fisher–Gaussian kernels. J R Stat Soc Series B Stat Methodol 2020; 82:1249-1271. [DOI: 10.1111/rssb.12390] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
|
25
|
Abstract
We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as "universal." The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call "the split likelihood-ratio test" (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings when computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.
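The split LRT described in this abstract can be illustrated with a small sketch. It rests on assumptions that are not stated in the abstract itself: a unit-variance normal model, the composite null hypothesis θ ≤ 0, and the rejection threshold 1/α, which is our understanding of the universal-inference construction. The function names are hypothetical.

```python
import math
import random


def normal_loglik(data, theta):
    """Log-likelihood of N(theta, 1) data, dropping constants that cancel in ratios."""
    return -0.5 * sum((x - theta) ** 2 for x in data)


def split_lrt_rejects(data, alpha, rng):
    """Split LRT for H0: theta <= 0 in a N(theta, 1) model (illustrative assumption).

    The data are split into two halves D0 and D1. Any estimator may be fitted
    on D1; the likelihood ratio is then evaluated on D0 only.
    """
    shuffled = list(data)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    d0, d1 = shuffled[:half], shuffled[half:]

    # Unrestricted estimate computed from D1 (here, the sample mean).
    theta1 = sum(d1) / len(d1)
    # MLE under the null, computed on D0: the sample mean clipped to theta <= 0.
    theta0 = min(sum(d0) / len(d0), 0.0)

    # Log of the split likelihood-ratio statistic L0(theta1) / L0(theta0).
    log_t = normal_loglik(d0, theta1) - normal_loglik(d0, theta0)
    # Reject when the statistic exceeds 1 / alpha (assumed threshold).
    return log_t > math.log(1.0 / alpha)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(2.0, 1.0) for _ in range(200)]
    print("reject H0:", split_lrt_rejects(sample, alpha=0.05, rng=rng))
```

Only the null-restricted MLE on D0 is required, which is why the abstract stresses that the method applies whenever that maximization (or an upper bound on it) is feasible.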
|
26
|
Robust Bayesian Regression with Synthetic Posterior Distributions. ENTROPY 2020; 22:e22060661. [PMID: 33286432 PMCID: PMC7517196 DOI: 10.3390/e22060661] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/29/2020] [Revised: 06/04/2020] [Accepted: 06/10/2020] [Indexed: 11/17/2022]
Abstract
Although linear regression models are fundamental tools in statistical science, the estimation results can be sensitive to outliers. While several robust methods have been proposed in frequentist frameworks, statistical inference is not necessarily straightforward. We here propose a Bayesian approach to robust inference on linear regression models using synthetic posterior distributions based on γ-divergence, which enables us to naturally assess the uncertainty of the estimation through the posterior distribution. We also consider the use of shrinkage priors for the regression coefficients to carry out robust Bayesian variable selection and estimation simultaneously. We develop an efficient posterior computation algorithm by adopting the Bayesian bootstrap within Gibbs sampling. The performance of the proposed method is illustrated through simulation studies and applications to famous datasets.
|
27
|
Ho N, Nguyen X, Ritov Y. Robust estimation of mixing measures in finite mixture models. BERNOULLI 2020. [DOI: 10.3150/18-bej1087] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
28
|
|
29
|
Abstract
Summary
In a 1970 Biometrika paper, W. K. Hastings developed a broad class of Markov chain algorithms for sampling from probability distributions that are difficult to sample from directly. The algorithm draws a candidate value from a proposal distribution and accepts the candidate with a probability that can be computed using only the unnormalized density of the target distribution, allowing one to sample from distributions known only up to a constant of proportionality. The stationary distribution of the corresponding Markov chain is the target distribution one is attempting to sample from. The Hastings algorithm generalizes the Metropolis algorithm to allow a much broader class of proposal distributions instead of just symmetric cases. An important class of applications for the Hastings algorithm corresponds to sampling from Bayesian posterior distributions, which have densities given by a prior density multiplied by a likelihood function and divided by a normalizing constant equal to the marginal likelihood. The marginal likelihood is typically intractable, presenting a fundamental barrier to implementation in Bayesian statistics. This barrier can be overcome by Markov chain Monte Carlo sampling algorithms. Amazingly, even after 50 years, the majority of algorithms used in practice today involve the Hastings algorithm. This article provides a brief celebration of the continuing impact of this ingenious algorithm on the 50th anniversary of its publication.
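The accept/reject step summarized above can be sketched in a few lines. This is a minimal illustration, not code from the article: the Gamma(3, 1) target (specified only up to its normalizing constant), the log-normal random-walk proposal, the step size `SIGMA`, and all function names are assumptions chosen to show an asymmetric proposal, where the Hastings correction matters.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_q, x0, n_steps, rng):
    """Run a Metropolis-Hastings chain and return the visited states.

    log_target(x): log of the UNNORMALIZED target density.
    propose(x, rng): draws a candidate y from the proposal q(. | x).
    log_q(y, x): log q(y | x), up to a constant common to both directions.
    """
    x, chain = x0, []
    for _ in range(n_steps):
        y = propose(x, rng)
        # Log Hastings ratio: pi(y) q(x | y) / (pi(x) q(y | x)).
        log_ratio = (log_target(y) + log_q(x, y)) - (log_target(x) + log_q(y, x))
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            x = y  # accept the candidate; otherwise the chain stays at x
        chain.append(x)
    return chain


SIGMA = 0.8  # proposal scale on the log axis (illustrative choice)


def log_gamma3(x):
    """Unnormalized log density of Gamma(shape=3, rate=1): x^2 * exp(-x)."""
    return 2.0 * math.log(x) - x if x > 0 else -math.inf


def propose_lognormal(x, rng):
    """Multiplicative random walk: asymmetric, always proposes positive values."""
    return x * math.exp(SIGMA * rng.gauss(0.0, 1.0))


def log_q_lognormal(y, x):
    """Log-normal proposal density log q(y | x), omitting constant terms."""
    return -math.log(y) - (math.log(y) - math.log(x)) ** 2 / (2.0 * SIGMA ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    chain = metropolis_hastings(
        log_gamma3, propose_lognormal, log_q_lognormal, 1.0, 20000, rng
    )
    kept = chain[2000:]  # discard an initial burn-in segment
    print("posterior mean estimate:", sum(kept) / len(kept))
```

With a symmetric proposal the two `log_q` terms cancel and the update reduces to the Metropolis algorithm, which is the generalization the abstract refers to.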
Affiliation(s)
- D B Dunson
  - Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27707, U.S.A
- J E Johndrow
  - Department of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut St, Philadelphia, Pennsylvania 19104, U.S.A
|
30
|
Lee K, Lee J, Lin L. Minimax posterior convergence rates and model selection consistency in high-dimensional DAG models based on sparse Cholesky factors. Ann Stat 2019. [DOI: 10.1214/18-aos1783] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/19/2022]
|
31
|
Bernton E, Jacob PE, Gerber M, Robert CP. Approximate Bayesian computation with the Wasserstein distance. J R Stat Soc Series B Stat Methodol 2019. [DOI: 10.1111/rssb.12312] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Indexed: 01/24/2023]
Affiliation(s)
- Christian P. Robert
  - Ceremade, Université Paris‐Dauphine, Université de Recherche Paris Sciences et Lettres, France
  - University of Warwick, Coventry, UK
|
32
|
Linero AR, Yang Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. J R Stat Soc Series B Stat Methodol 2018. [DOI: 10.1111/rssb.12293] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 11/29/2022]
Affiliation(s)
- Yun Yang
  - University of Illinois at Urbana–Champaign, Champaign, USA
|
33
|
Principles of Bayesian Inference Using General Divergence Criteria. ENTROPY 2018; 20:e20060442. [PMID: 33265532 PMCID: PMC7512964 DOI: 10.3390/e20060442] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 02/02/2018] [Revised: 05/25/2018] [Accepted: 05/28/2018] [Indexed: 11/22/2022]
Abstract
When it is acknowledged that all candidate parameterised statistical models are misspecified relative to the data generating process, the decision maker (DM) must currently concern themselves with inference for the parameter value minimising the Kullback–Leibler (KL)-divergence between the model and this process (Walker, 2013). However, it has long been known that minimising the KL-divergence places a large weight on correctly capturing the tails of the sample distribution. As a result, the DM is required to worry about the robustness of their model to tail misspecifications if they want to conduct principled inference. In this paper we alleviate these concerns for the DM. We advance recent methodological developments in general Bayesian updating (Bissiri, Holmes & Walker, 2016) to propose a statistically well-principled Bayesian updating of beliefs targeting the minimisation of more general divergence criteria. We improve both the motivation and the statistical foundations of existing Bayesian minimum divergence estimation (Hooker & Vidyashankar, 2014; Ghosh & Basu, 2016), allowing the well-principled Bayesian to target predictions from the model that are close to the genuine model in terms of some alternative divergence measure to the KL-divergence. Our principled formulation allows us to consider a broader range of divergences than have previously been considered. In fact, we argue that defining the divergence measure forms an important, subjective part of any statistical analysis, and aim to provide some decision-theoretic rationale for this selection. We illustrate how targeting alternative divergence measures can impact the conclusions of simple inference tasks, and then discuss how our methods might apply to more complicated, high-dimensional models.
|