1
Xi X, Ruffieux H. A modeling framework for detecting and leveraging node-level information in Bayesian network inference. Biostatistics 2024:kxae021. [PMID: 38916966 DOI: 10.1093/biostatistics/kxae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 03/11/2024] [Accepted: 06/02/2024] [Indexed: 06/27/2024]
Abstract
Bayesian graphical models are powerful tools to infer complex relationships in high dimension, yet are often fraught with computational and statistical challenges. If exploited in a principled way, the increasing information collected alongside the data of primary interest constitutes an opportunity to mitigate these difficulties by guiding the detection of dependence structures. For instance, gene network inference may be informed by the use of publicly available summary statistics on the regulation of genes by genetic variants. Here we present a novel Gaussian graphical modeling framework to identify and leverage information on the centrality of nodes in conditional independence graphs. Specifically, we consider a fully joint hierarchical model to simultaneously infer (i) sparse precision matrices and (ii) the relevance of node-level information for uncovering the sought-after network structure. We encode such information as candidate auxiliary variables using a spike-and-slab submodel on the propensity of nodes to be hubs, which allows hypothesis-free selection and interpretation of a sparse subset of relevant variables. As efficient exploration of large posterior spaces is needed for real-world applications, we develop a variational expectation conditional maximization algorithm that scales inference to hundreds of samples, nodes and auxiliary variables. We illustrate and exploit the advantages of our approach in simulations and in a gene network study which identifies hub genes involved in biological pathways relevant to immune-mediated diseases.
Affiliation(s)
- Xiaoyue Xi
  - MRC Biostatistics Unit, University of Cambridge, East Forvie Building, Forvie Site, Robinson Way, Cambridge CB2 0SR, United Kingdom
- Hélène Ruffieux
  - MRC Biostatistics Unit, University of Cambridge, East Forvie Building, Forvie Site, Robinson Way, Cambridge CB2 0SR, United Kingdom
2
Reynolds M, Chaudhary T, Eshaghzadeh Torbati M, Tudorascu DL, Batmanghelich K. ComBat Harmonization: Empirical Bayes versus fully Bayes approaches. Neuroimage Clin 2023; 39:103472. [PMID: 37506457 PMCID: PMC10412957 DOI: 10.1016/j.nicl.2023.103472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 07/05/2023] [Accepted: 07/06/2023] [Indexed: 07/30/2023]
Abstract
Studying small effects or subtle neuroanatomical variation requires data with large sample sizes. As a result, combining neuroimaging data from multiple datasets is necessary. Variation in acquisition protocols, magnetic field strength, scanner build, and many other non-biologically related factors can introduce undesirable bias into studies. Hence, harmonization is required to remove the bias-inducing factors from the data. ComBat is one of the most common methods applied to features from structural images. ComBat models the data using a hierarchical Bayesian model and uses the empirical Bayes approach to infer the distribution of the unknown factors. The empirical Bayes harmonization method is computationally efficient and provides valid point estimates. However, it tends to underestimate uncertainty. This paper investigates a new approach, fully Bayesian ComBat, where Monte Carlo sampling is used for statistical inference. When comparing fully Bayesian and empirical Bayesian ComBat, we found that empirical Bayesian ComBat more effectively removed scanner strength information and was much more computationally efficient. Conversely, fully Bayesian ComBat better preserved biological disease and age-related information while performing more accurate harmonization on traveling subjects. The fully Bayesian approach generates a rich posterior distribution, which is useful for generating simulated imaging features for improving classifier performance in a limited data setting. We show the generative capacity of our model for augmenting and improving the detection of patients with Alzheimer's disease. Posterior distributions for harmonized imaging measures can also be used for brain-wide uncertainty comparison and more principled downstream statistical analysis. Code for our new fully Bayesian ComBat extension is available at https://github.com/batmanlab/BayesComBat.
Affiliation(s)
- Maxwell Reynolds
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Blvd. Suite 500, Pittsburgh, PA 15206, USA.
- Tigmanshu Chaudhary
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Blvd. Suite 500, Pittsburgh, PA 15206, USA.
- Mahbaneh Eshaghzadeh Torbati
  - Intelligent System Program, University of Pittsburgh School of Computing and Information, 210 South Bouquet Street, Pittsburgh, PA 15260, USA.
- Dana L Tudorascu
  - Department of Psychiatry, University of Pittsburgh School of Medicine, 3811 O'Hara Street, Pittsburgh, PA 15213, USA; Department of Biostatistics, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA 15213, USA.
- Kayhan Batmanghelich
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, 5607 Baum Blvd. Suite 500, Pittsburgh, PA 15206, USA.
3
van Nee MM, van de Brug T, van de Wiel MA. Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022; 32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]
Abstract
Elastic net penalization is widely used in high-dimensional prediction and variable selection settings. Auxiliary information on the variables, for example, groups of variables, is often available. Group-adaptive elastic net penalization exploits this information to potentially improve performance by estimating group penalties, thereby penalizing important groups of variables less than other groups. Estimating these group penalties is, however, hard due to the high dimension of the data. Existing methods are computationally expensive or not generic in the type of response. Here we present a fast method for estimation of group-adaptive elastic net penalties for generalized linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show, by using asymptotic normality of the linear predictors, that this marginal likelihood approximates that of elastic net models. The ridge group penalties are then transformed to elastic net group penalties by matching the ridge prior variance to the elastic net prior variance as a function of the group penalties. The method allows for overlapping groups and unpenalized variables, and is easily extended to other penalties. For a model-based simulation study and two cancer genomics applications, we demonstrate a substantially decreased computation time and improved or matching performance compared to other methods. Supplementary materials for this article are available online.
Affiliation(s)
- Mirrelijn M. van Nee
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Tim van de Brug
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Mark A. van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
4
He H, Guo X, Yu J, Ai C, Shi S. Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 2022; 38:1542-1549. [PMID: 34908103 DOI: 10.1093/bioinformatics/btab848] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/05/2021] [Revised: 12/08/2021] [Accepted: 12/13/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Efficiently identifying genes based on gene expression level has been studied to help classify different cancer types and improve prediction performance. A logistic regression model based on a regularization technique is often an effective approach for simultaneously realizing prediction and feature (gene) selection in genomic data of high dimensionality. However, standard methods ignore biological group structure and generally result in poorer predictive models. RESULTS: In this article, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection based on the sparse group lasso penalty by stacking. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing one ratio, which could help to overcome the inadaptability of sparse group lasso for some data. Considering that stacking weakens feature selection, we perform a post hoc feature selection, which might slightly reduce predictive performance but is superior in feature selection. Experimental results on simulated data demonstrate that our approach enjoys competitive and stable classification performance and a lower false discovery rate in feature selection for varying sets of data compared with other regularization methods. In addition, our method presents better accuracy in three public cancer datasets and identifies more powerful discriminatory and potential mutation genes for thyroid carcinoma. AVAILABILITY AND IMPLEMENTATION: The real data underlying this article are available from https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Huan He
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Xinyun Guo
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Jialin Yu
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Chen Ai
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Shaoping Shi
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
5
van Nee MM, Wessels LFA, van de Wiel MA. Flexible co-data learning for high-dimensional prediction. Stat Med 2021; 40:5910-5925. [PMID: 34438466 PMCID: PMC9292202 DOI: 10.1002/sim.9162] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/09/2020] [Revised: 05/18/2021] [Accepted: 07/29/2021] [Indexed: 02/06/2023]
Abstract
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data are high-dimensional, but additional information, like domain knowledge and previously published studies, may be helpful to improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or P-values from external studies. We use multiple and various co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalized linear and Cox models. Available group-adaptive methods primarily target settings with few groups, and therefore likely overfit for non-informative, correlated or many groups, and do not account for known structure at the group level. To handle these issues, our method combines empirical Bayes estimation of the hyperparameters with an extra level of flexible shrinkage. This renders a uniquely flexible framework as any type of shrinkage can be used on the group level. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalized covariates and posterior variable selection. For three cancer genomics applications we demonstrate improvements compared to other models in terms of performance, variable selection stability and validation.
Affiliation(s)
- Mirrelijn M van Nee
  - Epidemiology & Data Science
  - Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Lodewyk F A Wessels
  - Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands; Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands; Intelligent Systems, Delft University of Technology, Delft, The Netherlands
- Mark A van de Wiel
  - Epidemiology & Data Science
  - Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands; MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
6
Rauschenberger A, Glaab E, van de Wiel MA. Predictive and interpretable models via the stacked elastic net. Bioinformatics 2021; 37:2012-2016. [PMID: 32437519 PMCID: PMC8336997 DOI: 10.1093/bioinformatics/btaa535] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/10/2019] [Revised: 04/30/2020] [Accepted: 05/18/2020] [Indexed: 12/18/2022]
Abstract
MOTIVATION: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques. RESULTS: Here, we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularization. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability. AVAILABILITY AND IMPLEMENTATION: The R package starnet is available on GitHub (https://github.com/rauschenberger/starnet) and CRAN (https://CRAN.R-project.org/package=starnet).
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg; Department of Epidemiology and Data Science, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Mark A van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands; MRC Biostatistics Unit, University of Cambridge, CB2 0SR Cambridge, UK
7
McDonald S, Campbell D. A review of uncertainty quantification for density estimation. Statistics Surveys 2021. [DOI: 10.1214/21-ss130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Affiliation(s)
- Shaun McDonald
  - Department of Statistics & Actuarial Science, Simon Fraser University, Room SC K10545, 8888 University Drive, Burnaby, B.C., Canada V5A 1S6
- David Campbell
  - School of Mathematics and Statistics, 4302 Herzberg Laboratories, Carleton University, 1125 Colonel By Drive, Ottawa, ON, K1S 5B6
8
Münch MM, van de Wiel MA, Richardson S, Leday GGR. Drug sensitivity prediction with normal inverse Gaussian shrinkage informed by external data. Biom J 2020; 63:289-304. [PMID: 33155717 PMCID: PMC7891636 DOI: 10.1002/bimj.201900371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/30/2019] [Revised: 04/30/2020] [Accepted: 06/03/2020] [Indexed: 11/09/2022]
Abstract
In precision medicine, a common problem is drug sensitivity prediction from cancer tissue cell lines. These types of problems entail modelling multivariate drug responses on high-dimensional molecular feature sets in typically >1000 cell lines. The dimensions of the problem require specialised models and estimation methods. In addition, external information on both the drugs and the features is often available. We propose to model the drug responses through a linear regression with shrinkage enforced through a normal inverse Gaussian prior. We let the prior depend on the external information, and estimate the model and external information dependence in an empirical-variational Bayes framework. We demonstrate the usefulness of this model in both a simulated setting and in the publicly available Genomics of Drug Sensitivity in Cancer data.
Affiliation(s)
- Magnus M Münch
  - Department of Epidemiology & Biostatistics, Amsterdam UMC, VU University, Amsterdam, The Netherlands; Mathematical Institute, Leiden University, Leiden, The Netherlands; MRC Biostatistics Unit, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Robinson Way, Cambridge Biomedical Campus, Cambridge, United Kingdom
- Mark A van de Wiel
  - Department of Epidemiology & Biostatistics, Amsterdam UMC, VU University, Amsterdam, The Netherlands; MRC Biostatistics Unit, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Robinson Way, Cambridge Biomedical Campus, Cambridge, United Kingdom
- Sylvia Richardson
  - MRC Biostatistics Unit, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Robinson Way, Cambridge Biomedical Campus, Cambridge, United Kingdom
- Gwenaël G R Leday
  - MRC Biostatistics Unit, University of Cambridge, Cambridge Institute of Public Health, Forvie Site, Robinson Way, Cambridge Biomedical Campus, Cambridge, United Kingdom
9
Ruffieux H, Davison AC, Hager J, Inshaw J, Fairfax BP, Richardson S, Bottolo L. A Global-Local Approach for Detecting Hotspots in Multiple-Response Regression. Ann Appl Stat 2020; 14:905-928. [PMID: 34992707 PMCID: PMC7612176 DOI: 10.1214/20-aoas1332] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
Abstract
We tackle modelling and inference for variable selection in regression problems with many predictors and many responses. We focus on detecting hotspots, that is, predictors associated with several responses. Such a task is critical in statistical genetics, as hotspot genetic variants shape the architecture of the genome by controlling the expression of many genes and may initiate decisive functional mechanisms underlying disease endpoints. Existing hierarchical regression approaches designed to model hotspots suffer from two limitations: their discrimination of hotspots is sensitive to the choice of top-level scale parameters for the propensity of predictors to be hotspots, and they do not scale to large predictor and response vectors, for example, of dimensions 10^3-10^5 in genetic applications. We address these shortcomings by introducing a flexible hierarchical regression framework that is tailored to the detection of hotspots and scalable to the above dimensions. Our proposal implements a fully Bayesian model for hotspots based on the horseshoe shrinkage prior. Its global-local formulation shrinks noise globally and, hence, accommodates the highly sparse nature of genetic analyses while being robust to individual signals, thus leaving the effects of hotspots unshrunk. Inference is carried out using a fast variational algorithm coupled with a novel simulated annealing procedure that allows efficient exploration of multimodal distributions.
Affiliation(s)
- Jamie Inshaw
  - Wellcome Centre for Human Genetics, Oxford, University of Oxford
- Benjamin P. Fairfax
  - Department of Oncology, MRC Weatherall Institute for Molecular Medicine, University of Oxford
- Sylvia Richardson
  - MRC Biostatistics Unit, University of Cambridge
  - Alan Turing Institute
- Leonardo Bottolo
  - MRC Biostatistics Unit, University of Cambridge
  - Alan Turing Institute
  - Department of Medical Genetics, University of Cambridge
10
Nam K, Henderson NC, Rohan P, Russek-Cohen E. Penalized Logistic Regression Likelihood Ratio Test Analysis to Detect Signals of Adverse Events From Interactions in Postmarket Safety Surveillance. Stat Biopharm Res 2020. [DOI: 10.1080/19466315.2020.1752299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Patricia Rohan
  - Division of Epidemiology, Office of Biostatistics and Epidemiology, CBER, FDA, Silver Spring, MD
11
Vincenzi S, Jesensek D, Crivelli AJ. Biological and statistical interpretation of size-at-age, mixed-effects models of growth. Royal Society Open Science 2020; 7:192146. [PMID: 32431890 PMCID: PMC7211857 DOI: 10.1098/rsos.192146] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/13/2019] [Accepted: 03/16/2020] [Indexed: 06/11/2023]
Abstract
The differences in life-history traits and processes between organisms living in the same or different populations contribute to their ecological and evolutionary dynamics. We developed mixed-effect model formulations of the popular size-at-age von Bertalanffy and Gompertz growth functions to estimate individual and group variation in body growth, using as a model system four freshwater fish populations, where tagged individuals were sampled for more than 10 years. We used the software Template Model Builder to estimate the parameters of the mixed-effect growth models. Tests on data that were not used to estimate model parameters showed good predictions of individual growth trajectories using the mixed-effects models and starting from one single observation of body size early in life; the best models had R^2 > 0.80 over more than 500 predictions. Estimates of asymptotic size from the Gompertz and von Bertalanffy models were not significantly correlated, but their predictions of size-at-age of individuals were strongly correlated (r > 0.99), which suggests that choosing between the best models of the two growth functions would have negligible effects on the predictions of size-at-age of individuals. Model results pointed to size ranks that are largely maintained throughout the lifetime of individuals in all populations.
Affiliation(s)
- Dusan Jesensek
  - Tolmin Angling Association, Most Na Soci, Tolmin, Slovenia
- Alain J. Crivelli
  - Station Biologique de la Tour du Valat, Le Sambuc 13200, Arles, France