1
Chen T, Zhang H, Mazumder R, Lin X. Fast and scalable ensemble learning method for versatile polygenic risk prediction. Proc Natl Acad Sci U S A 2024; 121:e2403210121. [PMID: 39110727 PMCID: PMC11331062 DOI: 10.1073/pnas.2403210121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 07/11/2024] [Indexed: 08/21/2024] Open
Abstract
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, including computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, with 15 times faster computation and half the memory than the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
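For orientation, an L0L2-penalized regression of the kind named in this abstract is commonly written as below. This is a generic sketch, not the authors' exact formulation: the symbols $y$, $X$, $\beta$, $\lambda_0$, $\lambda_2$, $r$, and $R$ are illustrative assumptions, and the summary-statistic form assumes standardized genotypes, with $r = X^\top y / n$ standing in for marginal GWAS effect estimates and $R = X^\top X / n$ for a linkage disequilibrium matrix.

```latex
% Individual-level form (generic L0L2-penalized least squares)
\min_{\beta} \; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \lambda_0 \lVert \beta \rVert_0
  + \lambda_2 \lVert \beta \rVert_2^2

% Equivalent summary-level form, up to a constant not depending on \beta,
% with r = X^\top y / n and R = X^\top X / n (assumed notation)
\min_{\beta} \; \frac{1}{2}\,\beta^\top R \beta - \beta^\top r
  + \lambda_0 \lVert \beta \rVert_0
  + \lambda_2 \lVert \beta \rVert_2^2
```

Under this reading, the ensemble step described in the abstract would combine the solutions obtained over a grid of $(\lambda_0, \lambda_2)$ values rather than selecting a single pair.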
Affiliation(s)
- Tony Chen
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20814
- Rahul Mazumder
  - Operations Research and Statistics Group, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215
  - Department of Statistics, Harvard University, Cambridge, MA 02138
2
Rauschenberger A, Landoulsi Z, van de Wiel MA, Glaab E. Penalized regression with multiple sources of prior effects. Bioinformatics 2023; 39:btad680. [PMID: 37951587 PMCID: PMC10699841 DOI: 10.1093/bioinformatics/btad680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/19/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open
Abstract
MOTIVATION In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. RESULTS We propose an approach for integrating multiple sources of such prior information into penalized regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. AVAILABILITY AND IMPLEMENTATION The proposed method is implemented in the R package transreg (https://github.com/lcsb-bds/transreg, https://cran.r-project.org/package=transreg).
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Zied Landoulsi
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Mark A van de Wiel
  - Department of Epidemiology and Data Science (EDS), Amsterdam University Medical Centers (Amsterdam UMC), 1081 HV Amsterdam, The Netherlands
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
3
Chen T, Zhang H, Mazumder R, Lin X. Ensembled best subset selection using summary statistics for polygenic risk prediction. bioRxiv 2023:2023.09.25.559307. [PMID: 37886515 PMCID: PMC10602024 DOI: 10.1101/2023.09.25.559307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2023]
Abstract
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, yet existing methods face a tradeoff between predictive power and computational efficiency. We introduce ALL-Sum, a fast and scalable PRS method that combines an efficient summary statistic-based L0L2 penalized regression algorithm with an ensembling step that aggregates estimates from different tuning parameters for improved prediction performance. In extensive large-scale simulations across a wide range of polygenicity and genome-wide association studies (GWAS) sample sizes, ALL-Sum consistently outperforms popular alternative methods in terms of prediction accuracy, runtime, and memory usage. We analyze 27 published GWAS summary statistics for 11 complex traits from 9 reputable data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen, evaluated using individual-level UKBB data. ALL-Sum achieves the highest accuracy for most traits, particularly for GWAS with large sample sizes. We provide ALL-Sum as a user-friendly command-line software with pre-computed reference data for streamlined user-end analysis.
4
Rauschenberger A, Glaab E. Predicting dichotomised outcomes from high-dimensional data in biomedicine. J Appl Stat 2023; 51:1756-1771. [PMID: 38933137 PMCID: PMC11198132 DOI: 10.1080/02664763.2023.2233057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Accepted: 06/28/2023] [Indexed: 06/28/2024]
Abstract
In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical to binary outcomes (loss of information) to conduct logistic regression (probabilistic interpretation). We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).
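The combination step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the cornet implementation: the abstract does not say how predicted values are transformed to probabilities or how the two sets of probabilities are combined, so the Gaussian transformation with a hypothetical scale `sigma` and the weighted average with a hypothetical `weight` are assumptions, as are all function names.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_to_probability(y_hat, threshold, sigma):
    """Assumed transformation: P(Y > threshold) when Y ~ Normal(y_hat, sigma^2)."""
    return normal_cdf((y_hat - threshold) / sigma)


def combine_probabilities(p_logistic, p_linear, weight):
    """Assumed combination: weighted average, with weight in [0, 1] on the logistic model."""
    return weight * p_logistic + (1.0 - weight) * p_linear


# Hypothetical inputs: a predicted numerical outcome from linear regression,
# a predicted probability from logistic regression, and a known threshold.
p_linear = linear_to_probability(y_hat=1.8, threshold=1.0, sigma=2.0)
p_combined = combine_probabilities(p_logistic=0.55, p_linear=p_linear, weight=0.5)
print(p_linear, p_combined)
```

In a sketch like this, `sigma` and `weight` would be tuned on held-out data (for example by cross-validation) rather than fixed in advance.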
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
5
Kober KM, Roy R, Conley Y, Dhruva A, Hammer MJ, Levine J, Olshen A, Miaskowski C. Prediction of morning fatigue severity in outpatients receiving chemotherapy: less may still be more. Support Care Cancer 2023; 31:253. [PMID: 37039882 DOI: 10.1007/s00520-023-07723-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2022] [Accepted: 04/01/2023] [Indexed: 04/12/2023]
Abstract
INTRODUCTION Fatigue is the most common and debilitating symptom experienced by cancer patients undergoing chemotherapy (CTX). Prediction of symptom severity can assist clinicians to identify high-risk patients and provide education to decrease symptom severity. The purpose of this study was to predict the severity of morning fatigue in the week following the administration of CTX. METHODS Outpatients (n = 1217) completed questionnaires 1 week prior to and 1 week following administration of CTX. Morning fatigue was measured using the Lee Fatigue Scale (LFS). Separate prediction models for morning fatigue severity were created using 157 demographic, clinical, symptom, and psychosocial adjustment characteristics and either morning fatigue scores or individual fatigue item scores. Prediction models were created using two regression and five machine learning approaches. RESULTS Elastic net models provided the best fit across all models. For the EN model using individual LFS item scores, two of the 13 individual LFS items (i.e., "worn out," "exhausted") were the strongest predictors. CONCLUSIONS This study is the first to use machine learning techniques to accurately predict the severity of morning fatigue from prior to through the week following the administration of CTX using total and individual item scores from the Lee Fatigue Scale (LFS). Our findings suggest that the language used to assess clinical fatigue in oncology patients is important and that two simple questions may be used to predict morning fatigue severity.
Affiliation(s)
- Kord M Kober
  - School of Nursing, University of California, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
- Ritu Roy
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
- Yvette Conley
  - School of Nursing, University of Pittsburgh, Pittsburgh, PA, USA
- Anand Dhruva
  - School of Medicine, University of California, San Francisco, CA, USA
- Jon Levine
  - School of Medicine, University of California, San Francisco, CA, USA
- Adam Olshen
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
  - School of Medicine, University of California, San Francisco, CA, USA
- Christine Miaskowski
  - School of Nursing, University of California, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA, USA
6
van Nee MM, van de Brug T, van de Wiel MA. Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022; 32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]
Abstract
Elastic net penalization is widely used in high-dimensional prediction and variable selection settings. Auxiliary information on the variables, for example, groups of variables, is often available. Group-adaptive elastic net penalization exploits this information to potentially improve performance by estimating group penalties, thereby penalizing important groups of variables less than other groups. Estimating these group penalties is, however, hard due to the high dimension of the data. Existing methods are computationally expensive or not generic in the type of response. Here we present a fast method for estimation of group-adaptive elastic net penalties for generalized linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that this marginal likelihood approximates that of elastic net models. The ridge group penalties are then transformed to elastic net group penalties by matching the ridge prior variance to the elastic net prior variance as function of the group penalties. The method allows for overlapping groups and unpenalized variables, and is easily extended to other penalties. For a model-based simulation study and two cancer genomics applications we demonstrate a substantially decreased computation time and improved or matching performance compared to other methods. Supplementary materials for this article are available online.
Affiliation(s)
- Mirrelijn M. van Nee
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Tim van de Brug
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Mark A. van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
7
Diaz-Uriarte R, Gómez de Lope E, Giugno R, Fröhlich H, Nazarov PV, Nepomuceno-Chamorro IA, Rauschenberger A, Glaab E. Ten quick tips for biomarker discovery and validation analyses using machine learning. PLoS Comput Biol 2022; 18:e1010357. [PMID: 35951526 PMCID: PMC9371329 DOI: 10.1371/journal.pcbi.1010357] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 11/19/2022] Open
Affiliation(s)
- Ramon Diaz-Uriarte
  - Department of Biochemistry, School of Medicine, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (UAM-CSIC), Madrid, Spain
- Elisa Gómez de Lope
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Rosalba Giugno
  - Department of Computer Science, University of Verona, Verona, Italy
- Holger Fröhlich
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Centre for IT (b-it), Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Petr V. Nazarov
  - Department of Cancer Research, Luxembourg Institute of Health, Strassen, Luxembourg
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
8
He H, Guo X, Yu J, Ai C, Shi S. Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 2022; 38:1542-1549. [PMID: 34908103 DOI: 10.1093/bioinformatics/btab848] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/05/2021] [Revised: 12/08/2021] [Accepted: 12/13/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Efficiently identifying genes based on gene expression level has been studied to help classify different cancer types and improve prediction performance. Logistic regression models based on regularization techniques are often among the effective approaches for simultaneously realizing prediction and feature (gene) selection in genomic data of high dimensionality. However, standard methods ignore biological group structure and generally result in poorer predictive models. RESULTS In this article, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection based on sparse group lasso penalty by stacking. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing one ratio, which could help to overcome the inadaptability of sparse group lasso for some data. Considering that stacking weakens feature selection, we perform a post hoc feature selection, which might slightly reduce predictive performance but is superior in feature selection. Experimental results on simulation demonstrate that our approach enjoys competitive and stable classification performance and a lower false discovery rate in feature selection for varying sets of data compared with other regularization methods. In addition, our method presents better accuracy in three public cancer datasets and identifies more powerful discriminatory and potential mutation genes for thyroid carcinoma. AVAILABILITY AND IMPLEMENTATION The real data underlying this article are available from https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
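The role of the mixing parameter mentioned in this abstract can be illustrated with a short sketch of the sparse group lasso penalty in its commonly used form. This is an assumption-labelled illustration, not code from Stacked SGL: the function name, the `alpha` and `lam` parameters, and the square-root group-size weighting follow the standard textbook formulation rather than anything stated in the abstract.

```python
import math


def sparse_group_lasso_penalty(beta, groups, alpha, lam=1.0):
    """Penalty lam * (alpha * lasso term + (1 - alpha) * group lasso term).

    beta   : list of coefficients
    groups : list of index lists, one per (non-overlapping) feature group
    alpha  : mixing parameter in [0, 1]; 1 gives the lasso, 0 the group lasso
    lam    : overall penalty strength
    """
    lasso_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for indices in groups:
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in indices))
        # Assumed convention: weight each group by the square root of its size.
        group_term += math.sqrt(len(indices)) * group_norm
    return lam * (alpha * lasso_term + (1.0 - alpha) * group_term)


# Hypothetical coefficients in two groups, evaluated over a grid of ratios.
beta = [3.0, 4.0, 0.0]
groups = [[0, 1], [2]]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, sparse_group_lasso_penalty(beta, groups, alpha))
```

Stacking, as the abstract describes it, would fit one model per value of `alpha` on such a grid and learn how to combine their predictions, instead of committing to a single ratio.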
Affiliation(s)
- Huan He
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Xinyun Guo
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Jialin Yu
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Chen Ai
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
- Shaoping Shi
  - Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
9
Quality assessment of web-based information on type 2 diabetes. Online Information Review 2021. [DOI: 10.1108/oir-02-2021-0089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
PURPOSE This paper proposes a framework that automatically assesses content coverage and information quality of health websites for end-users. DESIGN/METHODOLOGY/APPROACH The study investigates the impact of textual and content-based features in predicting the quality of health-related texts. Content-based features were acquired using an evidence-based practice guideline in diabetes. A set of textual features inspired by professional health literacy guidelines and the features commonly used for assessing information quality in other domains were also used. In this study, 60 websites about type 2 diabetes were methodically selected for inclusion. Two general practitioners used DISCERN to assess each website in terms of its content coverage and quality. FINDINGS The proposed framework outputs were compared with the experts' evaluation scores. For coverage assessment, the best accuracy was 88% with textual features and 92% with content-based features. When both types of features were used, the proposed framework achieved 90% accuracy. For information quality assessment, the content-based features resulted in a higher accuracy of 92% against 88% obtained using the textual features. RESEARCH LIMITATIONS/IMPLICATIONS The experiments were conducted for websites about type 2 diabetes. As the whole process is costly and requires extensive expert human labelling, the study was carried out in a single domain. However, the methodology is generalizable to other health domains for which evidence-based practice guidelines are available. PRACTICAL IMPLICATIONS Finding high-quality online health information is becoming increasingly difficult due to the high volume of information generated by non-experts in the area. The search engines fail to rank objective health websites higher within the search results. The proposed framework can aid search engine and information platform developers to implement better retrieval techniques, in turn facilitating end-users' access to high-quality health information. SOCIAL IMPLICATIONS Erroneous, biased or partial health information is a serious problem for end-users who need access to objective information on their health problems. Such information may cause patients to stop their treatments provided by professionals. It might also have adverse financial implications by causing unnecessary expenditures on ineffective treatments. The ability to access high-quality health information has a positive effect on the health of both individuals and the whole society. ORIGINALITY/VALUE The paper demonstrates that automatic assessment of health websites is a domain-specific problem, which cannot be addressed with the general information quality assessment methodologies in the literature. Content coverage of health websites has also been studied in the health domain for the first time in the literature.
10
Rauschenberger A, Glaab E. Predicting correlated outcomes from molecular data. Bioinformatics 2021; 37:3889-3895. [PMID: 34358294 DOI: 10.1093/bioinformatics/btab576] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/01/2021] [Revised: 07/14/2021] [Accepted: 08/05/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Multivariate (multi-target) regression has the potential to outperform univariate (single-target) regression at predicting correlated outcomes, which frequently occur in biomedical and clinical research. Here we implement multivariate lasso and ridge regression using stacked generalisation. RESULTS Our flexible approach leads to predictive and interpretable models in high-dimensional settings, with a single estimate for each input-output effect. In the simulation, we compare the predictive performance of several state-of-the-art methods for multivariate regression. In the application, we use clinical and genomic data to predict multiple motor and non-motor symptoms in Parkinson's disease patients. We conclude that stacked multivariate regression, with our adaptations, is a competitive method for predicting correlated outcomes. AVAILABILITY AND IMPLEMENTATION The R package joinet is available on GitHub (https://github.com/rauschenberger/joinet) and CRAN (https://cran.r-project.org/package=joinet). SUPPLEMENTARY INFORMATION Supplementary tables and figures are available at Bioinformatics online.
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, 4362, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, 4362, Luxembourg