1
Zhang B, Griesbach C, Bergherr E. Bayesian learners in gradient boosting for linear mixed models. Int J Biostat 2024; 20:123-141. [PMID: 36473129 DOI: 10.1515/ijb-2022-0029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2022] [Accepted: 11/15/2022] [Indexed: 02/17/2024]
Abstract
Selection of relevant fixed and random effects without prior choices made from possibly insufficient theory is important in mixed models. Inference with current boosting techniques suffers from biased estimates of random effects and the inflexibility of random effects selection. This paper proposes a new inference method "BayesBoost" that integrates a Bayesian learner into gradient boosting with simultaneous estimation and selection of fixed and random effects in linear mixed models. The method introduces a novel selection strategy for random effects, which allows for computationally fast selection of random slopes even in high-dimensional data structures. Additionally, the new method not only overcomes the shortcomings of Bayesian inference in giving precise and unambiguous guidelines for the selection of covariates by benefiting from boosting techniques, but also provides Bayesian ways to construct estimators for the precision of parameters such as variance components or credible intervals, which are not available in conventional boosting frameworks. The effectiveness of the new approach can be observed via simulation and in a real-world application.
Affiliation(s)
- Boyao Zhang
- Chair of Spatial Data Science and Statistical Learning, Georg-August-Universität Göttingen, Göttingen, Germany
- Colin Griesbach
- Chair of Spatial Data Science and Statistical Learning, Georg-August-Universität Göttingen, Göttingen, Germany
- Elisabeth Bergherr
- Chair of Spatial Data Science and Statistical Learning, Georg-August-Universität Göttingen, Göttingen, Germany
2
Speller J, Staerk C, Mayr A. Robust statistical boosting with quantile-based adaptive loss functions. Int J Biostat 2022:ijb-2021-0127. [PMID: 35950232 DOI: 10.1515/ijb-2021-0127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 06/20/2022] [Indexed: 11/15/2022]
Abstract
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions or the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy and resulted in very sparse models.
Affiliation(s)
- Jan Speller
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Christian Staerk
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Andreas Mayr
- Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
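
The adaptive threshold described in the Speller et al. abstract above (re-setting the Huber threshold in each boosting iteration from a fixed quantile of the current residual sizes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the interpolating quantile rule, and the quantile level `tau` are assumptions made for this example, and only the pseudo-residual step of one iteration is shown.

```python
import math


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def huber_negative_gradient(residuals, delta):
    """Negative gradient of the Huber loss: each residual clipped to [-delta, delta]."""
    return [max(-delta, min(delta, r)) for r in residuals]


def adaptive_huber_step(y, fitted, tau=0.9):
    """Pseudo-residuals for one iteration with a data-driven threshold.

    delta is set to the tau-quantile of the current absolute residuals;
    tau is a hypothetical tuning value chosen for this sketch.
    """
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    delta = quantile([abs(r) for r in residuals], tau)
    return delta, huber_negative_gradient(residuals, delta)


y = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value acts as a vertical outlier
fitted = [0.0] * 5               # current fit before the first update
delta, pseudo_residuals = adaptive_huber_step(y, fitted, tau=0.5)
print(delta, pseudo_residuals)
```

With the median as quantile level, the outlying residual of 100 is clipped to the threshold, so it cannot dominate the base-learner fit in that iteration.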
3
Staerk C, Mayr A. Randomized boosting with multivariable base-learners for high-dimensional variable selection and prediction. BMC Bioinformatics 2021; 22:441. [PMID: 34530737 PMCID: PMC8447543 DOI: 10.1186/s12859-021-04340-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/24/2021] [Accepted: 08/24/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Statistical boosting is a computational approach to select and estimate interpretable prediction models for high-dimensional biomedical data, leading to implicit regularization and variable selection when combined with early stopping. Traditionally, the set of base-learners is fixed for all iterations and consists of simple regression learners including only one predictor variable at a time. Furthermore, the number of iterations is typically tuned by optimizing the predictive performance, leading to models which often include unnecessarily large numbers of noise variables. RESULTS We propose three consecutive extensions of classical component-wise gradient boosting. In the first extension, called Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. To compensate for the larger flexibility, the ultimate selection of base-learners is based on information criteria leading to an automatic stopping of the algorithm. As the second extension, Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling the scalability to high-dimensional data. In a third extension, called Adaptive Subspace Boosting (AdaSubBoost), an adaptive random preselection of base-learners is considered, focusing on base-learners which have proven to be predictive in previous iterations. Simulation results show that the multivariable updates in the three subspace algorithms are particularly beneficial in cases of high correlations among signal covariates. In several biomedical applications the proposed algorithms tend to yield sparser models than classical statistical boosting, while showing a very competitive predictive performance also compared to penalized regression approaches like the (relaxed) lasso and the elastic net. 
CONCLUSIONS The proposed randomized boosting approaches with multivariable base-learners are promising extensions of statistical boosting, particularly suited for highly-correlated and sparse high-dimensional settings. The incorporated selection of base-learners via information criteria induces automatic stopping of the algorithms, promoting sparser and more interpretable prediction models.
Affiliation(s)
- Christian Staerk
- Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Andreas Mayr
- Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
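
For orientation, the classical scheme that the Staerk and Mayr abstract above takes as its starting point (a fixed set of single-predictor base-learners, one of which is updated per iteration, with early stopping acting as regularization) can be sketched as follows. This is a plain component-wise L2 boosting sketch, not SubBoost, RSubBoost or AdaSubBoost; the function name, the step length `nu` and the iteration count `n_iter` are assumptions made for this example.

```python
def componentwise_l2_boost(X, y, n_iter=50, nu=0.1):
    """Component-wise gradient boosting with simple linear base-learners.

    X is a list of columns (each a list of floats, assumed centred) and y a
    list of floats. Returns the intercept and one coefficient per column;
    columns that are never selected keep a coefficient of exactly 0.
    """
    n = len(y)
    intercept = sum(y) / n
    coef = [0.0] * len(X)
    resid = [yi - intercept for yi in y]
    for _ in range(n_iter):
        best_j, best_beta, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(X):
            sxx = sum(x * x for x in col)
            if sxx == 0.0:
                continue  # a constant column carries no information
            # Least-squares fit of the current residuals on this single column.
            beta = sum(x * r for x, r in zip(col, resid)) / sxx
            rss = sum((r - beta * x) ** 2 for x, r in zip(col, resid))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        if best_j is None:
            break
        # Update only the best-fitting component, shrunken by the step length.
        coef[best_j] += nu * best_beta
        resid = [r - nu * best_beta * x for x, r in zip(X[best_j], resid)]
    return intercept, coef


x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]   # informative predictor
x2 = [1.0, -1.0, 0.0, -1.0, 1.0]   # noise predictor, orthogonal to x1
y = [3.0 * a + 1.0 for a in x1]

early = componentwise_l2_boost([x1, x2], y, n_iter=5)
late = componentwise_l2_boost([x1, x2], y, n_iter=200)
print(early, late)
```

Stopping after five iterations leaves the coefficient of `x1` shrunken towards zero, while the noise predictor is never selected in either run; this is the implicit regularization and variable selection that the abstract attributes to early stopping.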
4
Berger M, Schmid M. Flexible modeling of ratio outcomes in clinical and epidemiological research. Stat Methods Med Res 2019; 29:2250-2268. [DOI: 10.1177/0962280219891195] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/02/2023]
Abstract
In medical studies one frequently encounters ratio outcomes. For modeling these right-skewed positive variables, two approaches are in common use. The first one assumes that the outcome follows a normal distribution after transformation (e.g. a log-normal distribution), and the second one assumes gamma distributed outcome values. Classical regression approaches relate the mean ratio to a set of explanatory variables and treat the other parameters of the underlying distribution as nuisance parameters. Here, more flexible extensions for modeling ratio outcomes are proposed that allow all the distribution parameters to be related to explanatory variables. The models are embedded into the framework of generalized additive models for location, scale and shape (GAMLSS), and can be fitted using a component-wise gradient boosting algorithm. The added value of the new modeling approach is demonstrated by the analysis of the LDL/HDL cholesterol ratio, which is a strong predictor of cardiovascular events, using data from the German Chronic Kidney Disease Study. In particular, our results confirm various important findings on risk factors for cardiovascular events.
Affiliation(s)
- Moritz Berger
- Department of Medical Biometry, Informatics and Epidemiology, University of Bonn/University Hospital Bonn, Bonn, Germany
- Matthias Schmid
- Department of Medical Biometry, Informatics and Epidemiology, University of Bonn/University Hospital Bonn, Bonn, Germany
5
Mayr A, Weinhold L, Hofner B, Titze S, Gefeller O, Schmid M. The betaboost package-a software tool for modelling bounded outcome variables in potentially high-dimensional epidemiological data. Int J Epidemiol 2019; 47:1383-1388. [PMID: 30380092 DOI: 10.1093/ije/dyy093] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/25/2018] [Accepted: 05/11/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation: To provide an integrated software environment for model fitting and variable selection in regression models with a bounded outcome variable. Implementation: The proposed modelling framework is implemented in the add-on package betaboost of the statistical software environment R. General features: The betaboost methodology is based on beta-regression, which is a state-of-the-art method for modelling bounded outcome variables. By combining traditional model fitting techniques with recent advances in statistical learning and distributional regression, betaboost allows users to carry out data-driven variable and/or confounder selection in potentially high-dimensional epidemiological data. The software package implements a flexible routine to incorporate linear and non-linear predictor effects in both the mean and the precision parameter (relating inversely to the variance) of a beta-regression model. Availability: The software is hosted publicly at [http://github.com/boost-R/betaboost] and has been published under General Public License (GPL) version 3 or newer.
Affiliation(s)
- Andreas Mayr
- Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany
- Leonie Weinhold
- Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany
- Benjamin Hofner
- Section Biostatistics, Paul-Ehrlich-Institut, Langen, Germany
- Stephanie Titze
- Department of Nephrology and Hypertension, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany
- Olaf Gefeller
- Department of Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany
- Matthias Schmid
- Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany
6
Hepp T, Schmid M, Mayr A. Significance Tests for Boosted Location and Scale Models with Linear Base-Learners. Int J Biostat 2019; 15:/j/ijb.ahead-of-print/ijb-2018-0110/ijb-2018-0110.xml. [PMID: 30990787 DOI: 10.1515/ijb-2018-0110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/25/2018] [Accepted: 03/21/2019] [Indexed: 11/15/2022]
Abstract
Generalized additive models for location, scale and shape (GAMLSS) offer very flexible solutions to a wide range of statistical analysis problems, but can be challenging in terms of proper model specification. This complex task can be simplified using regularization techniques such as gradient boosting algorithms, but the estimates derived from such models are shrunken towards zero and it is consequently not straightforward to calculate proper confidence intervals or test statistics. In this article, we propose two strategies to obtain p-values for linear effect estimates for Gaussian location and scale models based on permutation tests and a parametric bootstrap approach. These procedures can provide a solution for one of the remaining problems in the application of gradient boosting algorithms for distributional regression in biostatistical data analyses. Results from extensive simulations indicate that in low-dimensional data both suggested approaches are able to hold the type-I error threshold and provide reasonable test power comparable to the Wald-type test for maximum likelihood inference. In high-dimensional data, when gradient boosting is the only feasible inference for this model class, the power decreases but the type-I error is still under control. In addition, we demonstrate the application of both tests in an epidemiological study to analyse the impact of physical exercise on both the average and the stability of the lung function of elderly people in Germany.
Affiliation(s)
- Tobias Hepp
- Institut für medizinische Biometrie, Informatik und Epidemiologie, Medizinische Fakultät, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany; Institut für Medizininformatik, Biometrie und Epidemiologie, Medizinische Fakultät, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Matthias Schmid
- Institut für medizinische Biometrie, Informatik und Epidemiologie, Medizinische Fakultät, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Andreas Mayr
- Institut für medizinische Biometrie, Informatik und Epidemiologie, Medizinische Fakultät, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
7
Adib-Hajbaghery M, Nabizadeh-Gharghozar Z, Nasirpour P. Bias in clinical trials into the effects of complementary and alternative medicine therapies on hemodialysis patients. J Family Med Prim Care 2019; 8:2179-2183. [PMID: 31463227 PMCID: PMC6691419 DOI: 10.4103/jfmpc.jfmpc_186_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
Abstract
Background: Chronic renal failure is among the major health challenges in the world. Many clinical trials have been conducted to assess the effects of complementary and alternative therapies on hemodialysis-related outcomes. However, a number of biases may affect the results of these studies. Aims: This study aimed to assess biases in randomized clinical trials into the effects of complementary and alternative therapies on hemodialysis patients. Settings and Design: A critical review of clinical trials into the effects of complementary and alternative therapies on hemodialysis patients. Materials and Methods: This study was conducted on 114 randomized clinical trials, published in 2012–2017, into the effects of complementary and alternative therapies on hemodialysis patients. The Cochrane Risk of Bias Tool was employed to assess biases in the included trials. The collected data were presented using descriptive statistics, namely absolute and relative frequencies. Results: Among the 114 included trials, 71.05% (81 trials) had used low-bias methods for random sequence generation, while 60.52% (69 trials) had provided no clear information about allocation concealment. Moreover, with respect to blinding, 57.89% of trials (66 trials) had a low risk of bias. Around 60.52% of trials (69 trials) had no attrition between randomization and final follow-up assessment, and 84.21% (96 trials) had apparently reported all intended outcomes. Conclusions: This study shows that 50% of randomized clinical trials into the effects of complementary and alternative therapies on hemodialysis patients have low bias. Yet, quality improvement is still needed to produce more conclusive evidence.
Affiliation(s)
- Mohsen Adib-Hajbaghery
- Department of Nursing, Trauma Nursing Research Center, Kashan University of Medical Sciences, Kashan, Iran
- Parisa Nasirpour
- Department of Nursing and Midwifery, Shiraz University of Medical Sciences, Shiraz, Iran
8
Abstract
Boosting algorithms were originally developed for machine learning but were later adapted to estimate statistical models, offering various practical advantages such as automated variable selection and implicit regularization of effect estimates. The interpretation of the resulting models, however, remains the same as if they had been fitted by classical methods. Boosting hence makes it possible to use an advanced machine learning scheme to estimate various types of statistical models. This tutorial aims to highlight how boosting can be used for semi-parametric modelling, what practical implications follow from the design of the algorithm and what kind of drawbacks data analysts have to expect. We illustrate the application of boosting in the analysis of a stunting score from children in India and a high-dimensional dataset of tumour DNA to develop a biomarker for the occurrence of metastases in breast cancer patients.
Affiliation(s)
- Andreas Mayr
- Institut für Statistik, Ludwig-Maximilians-Universität, München, Germany
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
9
Brockhaus S, Fuest A, Mayr A, Greven S. Signal regression models for location, scale and shape with an application to stock returns. J R Stat Soc Ser C Appl Stat 2017. [DOI: 10.1111/rssc.12252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Affiliation(s)
- Andreas Mayr
- Friedrich‐Alexander‐Universität Erlangen‐Nürnberg, Germany
10
Mayr A, Hofner B, Waldmann E, Hepp T, Meyer S, Gefeller O. An Update on Statistical Boosting in Biomedicine. Computational and Mathematical Methods in Medicine 2017; 2017:6083072. [PMID: 28831290 PMCID: PMC5558647 DOI: 10.1155/2017/6083072] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/24/2017] [Accepted: 06/08/2017] [Indexed: 01/16/2023]
Abstract
Statistical boosting algorithms have triggered a lot of research during the last decade. They combine a powerful machine learning approach with classical statistical modelling, offering various practical advantages like automated variable selection and implicit regularization of effect estimates. They are extremely flexible, as the underlying base-learners (regression functions defining the type of effect for the explanatory variables) can be combined with any kind of loss function (target function to be optimized, defining the type of regression setting). In this review article, we highlight the most recent methodological developments on statistical boosting regarding variable selection, functional regression, and advanced time-to-event modelling. Additionally, we provide a short overview on relevant applications of statistical boosting in biomedicine.
Affiliation(s)
- Andreas Mayr
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Institut für Statistik, Ludwig-Maximilians-Universität München, Munich, Germany
- Elisabeth Waldmann
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Tobias Hepp
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Sebastian Meyer
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Olaf Gefeller
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
11
Mayr A, Hofner B, Schmid M. Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinformatics 2016; 17:288. [PMID: 27444890 PMCID: PMC4957316 DOI: 10.1186/s12859-016-1149-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/19/2015] [Accepted: 07/13/2016] [Indexed: 12/15/2022] Open
Abstract
Background: When constructing new biomarker or gene signature scores for time-to-event outcomes, the underlying aims are to develop a discrimination model that helps to predict whether patients have a poor or good prognosis and to identify the most influential variables for this task. In practice, this is often done by fitting Cox models. Those are, however, not necessarily optimal with respect to the resulting discriminatory power and are based on restrictive assumptions. We present a combined approach to automatically select and fit sparse discrimination models for potentially high-dimensional survival data based on boosting a smooth version of the concordance index (C-index). Due to this objective function, the resulting prediction models are optimal with respect to their ability to discriminate between patients with longer and shorter survival times. The gradient boosting algorithm is combined with the stability selection approach to enhance and control its variable selection properties. Results: The resulting algorithm fits prediction models based on the rankings of the survival times and automatically selects only the most stable predictors. The performance of the approach, which works best for small numbers of informative predictors, is demonstrated in a large scale simulation study: C-index boosting in combination with stability selection is able to identify a small subset of informative predictors from a much larger set of non-informative ones while controlling the per-family error rate. In an application to discover biomarkers for breast cancer patients based on gene expression data, stability selection yielded sparser models and the resulting discriminatory power was higher than with lasso penalized Cox regression models. Conclusion: The combination of stability selection and C-index boosting can be used to select small numbers of informative biomarkers and to derive new prediction rules that are optimal with respect to their discriminatory power. Stability selection controls the per-family error rate, which makes the new approach also appealing from an inferential point of view, as it provides an alternative to classical hypothesis tests for single predictor effects. Due to the shrinkage and variable selection properties of statistical boosting algorithms, the latter tests are typically infeasible for prediction models fitted by boosting. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1149-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Andreas Mayr
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Waldstr. 6, Erlangen, 91054, Germany; Institut für Medizinische Biometrie, Informatik und Epidemiologie, Rheinische Friedrich-Wilhelms-Universität Bonn, Sigmund-Freud-Str. 25, Bonn, 53105, Germany
- Benjamin Hofner
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Waldstr. 6, Erlangen, 91054, Germany
- Matthias Schmid
- Institut für Medizinische Biometrie, Informatik und Epidemiologie, Rheinische Friedrich-Wilhelms-Universität Bonn, Sigmund-Freud-Str. 25, Bonn, 53105, Germany
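
The "smooth version of the concordance index" mentioned in the Mayr, Hofner and Schmid abstract above can be illustrated by replacing the pairwise indicator in the C-index with a sigmoid, which makes the criterion differentiable in the predicted scores. The sketch below is a simplified illustration under strong assumptions: it ignores censoring entirely, and the function names, the sign convention (higher score means longer survival) and the smoothing parameter `sigma` are choices made for this example rather than the authors' estimator.

```python
import math


def c_index(times, scores):
    """Share of pairs with different times that the scores rank concordantly."""
    concordant = total = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j]:
                total += 1
                concordant += scores[i] < scores[j]
    return concordant / total


def smooth_c_index(times, scores, sigma=0.1):
    """Same quantity with the indicator replaced by a sigmoid in the score difference."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j]:
                den += 1.0
                # Close to 1 when scores[i] is clearly below scores[j], close to 0 otherwise.
                num += 1.0 / (1.0 + math.exp((scores[i] - scores[j]) / sigma))
    return num / den


times = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.4, 0.3, 0.9]   # one discordant pair: the 2nd and 3rd observations
hard = c_index(times, scores)
soft = smooth_c_index(times, scores, sigma=0.01)
print(hard, soft)
```

For small `sigma` the smooth value approaches the ordinary C-index; larger values give a smoother but more biased surrogate, which is the trade-off any such approximation has to manage.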