1
Mayne GC, Woodman RJ, Watson DI, Bright T, Gan S, Lord RV, Bourke MJ, Levert-Mignon A, Bastian I, Irvine T, Schloithe A, Martin M, Sheehan-Hennessy L, Hussey DJ. A Method for Increasing the Robustness of Stable Feature Selection for Biomarker Discovery in Molecular Medicine Developed Using Serum Small Extracellular Vesicle Associated miRNAs and the Barrett's Oesophagus Disease Spectrum. Int J Mol Sci 2023; 24:ijms24087068. [PMID: 37108236 PMCID: PMC10139127 DOI: 10.3390/ijms24087068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 04/05/2023] [Accepted: 04/09/2023] [Indexed: 04/29/2023]
Abstract
The biomarker development field within molecular medicine remains limited by the methods that are available for building predictive models. We developed an efficient method for conservatively estimating confidence intervals for the cross-validation-derived prediction errors of biomarker models. This new method was investigated for its ability to improve the capacity of our previously developed method, StaVarSel, for selecting stable biomarkers. Compared with the standard cross-validation method, StaVarSel markedly improved the estimated generalisable predictive capacity of serum miRNA biomarkers for the detection of disease states that are at increased risk of progressing to oesophageal adenocarcinoma. The incorporation of our new method for conservatively estimating confidence intervals into StaVarSel resulted in the selection of less complex models with increased stability and improved or similar predictive capacities. The methods developed in this study have the potential to improve progress from biomarker discovery to biomarker-driven translational research.
Affiliation(s)
- George C Mayne
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Richard J Woodman
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
- David I Watson
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Tim Bright
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Susan Gan
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Reginald V Lord
  - Gastroesophageal Cancer Research Program, St. Vincent's Centre for Applied Medical Research, Darlinghurst, NSW 2010, Australia
- Michael J Bourke
  - Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Angelique Levert-Mignon
  - Gastroesophageal Cancer Research Program, St. Vincent's Centre for Applied Medical Research, Darlinghurst, NSW 2010, Australia
- Isabell Bastian
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Tanya Irvine
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Ann Schloithe
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
- Marian Martin
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
- Lorraine Sheehan-Hennessy
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
- Damian J Hussey
  - Flinders Health and Medical Research Institute-Cancer Program, Flinders University, Bedford Park, SA 5042, Australia
  - Department of Surgery, Flinders Medical Centre, Bedford Park, SA 5042, Australia
2
Waldmann P. On the Use of the Pearson Correlation Coefficient for Model Evaluation in Genome-Wide Prediction. Front Genet 2019; 10:899. [PMID: 31632436 PMCID: PMC6781837 DOI: 10.3389/fgene.2019.00899] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 05/03/2019] [Accepted: 08/23/2019] [Indexed: 01/24/2023]
Abstract
The large number of markers in genome-wide prediction demands the use of methods with regularization and model comparison based on some hold-out test prediction error measure. In quantitative genetics, it is common practice to calculate the Pearson correlation coefficient (r²) as a standardized measure of the predictive accuracy of a model. Based on arguments from the bias-variance trade-off theory in statistical learning, we show that shrinkage of the regression coefficients (i.e., QTL effects) reduces the prediction mean squared error (MSE) by introducing model bias compared with the ordinary least squares method. We also show that the LASSO and the adaptive LASSO (ALASSO) can reduce the model bias and prediction MSE by adding model variance. In an application of ridge regression, the LASSO and ALASSO to a simulated example based on results for 9,723 SNPs and 3,226 individuals, the best model selected was with the LASSO when r² was used as a measure. However, when model selection was based on test MSE and coefficient of determination R², the ALASSO proved to be the best method. Hence, use of r² may lead to selection of the wrong model and therefore also nonoptimal ranking of phenotype predictions and genomic breeding values. Instead, we propose use of the test MSE for model selection and R² as a standardized measure of the accuracy.
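One way to see why the squared Pearson correlation can mislead model selection is that it is unchanged by any rescaling or shifting of the predictions, whereas the test MSE and the coefficient of determination are not. The following is a minimal sketch on synthetic data, not the simulation from the article; the helper names (`pearson_r2`, `mse`, `r2_score`) and the toy predictions `good` and `shrunk` are assumptions introduced purely for illustration.

```python
import random
from statistics import fmean

def pearson_r2(y_true, y_pred):
    # Squared Pearson correlation between observations and predictions.
    my, mp = fmean(y_true), fmean(y_pred)
    cov = sum((a - my) * (b - mp) for a, b in zip(y_true, y_pred))
    var_y = sum((a - my) ** 2 for a in y_true)
    var_p = sum((b - mp) ** 2 for b in y_pred)
    return cov * cov / (var_y * var_p)

def mse(y_true, y_pred):
    # Mean squared prediction error.
    return fmean((a - b) ** 2 for a, b in zip(y_true, y_pred))

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    my = fmean(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

random.seed(0)
# Hypothetical phenotypes and two sets of hypothetical predictions.
y = [random.gauss(0.0, 1.0) for _ in range(200)]
good = [v + random.gauss(0.0, 0.5) for v in y]
# Same ranking as `good`, but heavily shrunk and offset.
shrunk = [0.1 * g + 3.0 for g in good]

for name, pred in [("good", good), ("shrunk", shrunk)]:
    print(name, pearson_r2(y, pred), mse(y, pred), r2_score(y, pred))
```

Both prediction sets give the same squared correlation, yet the shrunk and offset predictions have a much larger MSE and a lower coefficient of determination, which is the kind of disagreement between measures that the abstract describes.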
Affiliation(s)
- Patrik Waldmann
- Department of Animal Breeding and Genetics, The Swedish University of Agricultural Sciences, SLU, Uppsala, Sweden
3
Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc Natl Acad Sci U S A 2019; 116:15849-15854.
Abstract
Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine-learning practice. The bias-variance trade-off implies that a model should balance underfitting and overfitting: Rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered overfitted, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double-descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine-learning models delineates the limits of classical analyses and has implications for both the theory and the practice of machine learning.
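The interpolation regime described above can be explored with a small experiment. The sketch below is not the authors' experimental setup: it assumes a hypothetical one-dimensional regression task, random cosine features, and a minimum-norm least-squares fit via `numpy.linalg.pinv`, and it simply records training and test error as the number of features `p` grows past the number of training points. All names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical ground-truth function for the toy task.
    return np.sin(2 * np.pi * x)

def random_features(x, w, b):
    # Random cosine features cos(w * x + b), one column per feature.
    return np.cos(np.outer(x, w) + b)

n_train, n_test = 15, 200
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + rng.normal(scale=0.1, size=n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

results = {}
for p in [2, 5, 10, 15, 20, 40, 100, 400]:
    # Draw p random frequencies and phases (model capacity = p).
    w = rng.normal(scale=10.0, size=p)
    b = rng.uniform(0, 2 * np.pi, size=p)
    phi_train = random_features(x_train, w, b)
    phi_test = random_features(x_test, w, b)
    # The pseudoinverse gives the least-squares fit for p < n_train and
    # the minimum-norm interpolating fit for p >= n_train.
    coef = np.linalg.pinv(phi_train) @ y_train
    train_mse = float(np.mean((phi_train @ coef - y_train) ** 2))
    test_mse = float(np.mean((phi_test @ coef - y_test) ** 2))
    results[p] = (train_mse, test_mse)
    print(f"p={p:4d}  train MSE={train_mse:.3e}  test MSE={test_mse:.3e}")
```

Once `p` is well above `n_train`, the training error falls to numerical zero because the model interpolates the data. Whether the test error then shows a clear second descent depends on the noise level, the feature scale, and the random seed, so a single run of this sketch should be read as exploratory rather than as evidence.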
Affiliation(s)
- Mikhail Belkin
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
  - Department of Statistics, The Ohio State University, Columbus, OH 43210
- Daniel Hsu
  - Computer Science Department and Data Science Institute, Columbia University, New York, NY 10027
- Siyuan Ma
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
- Soumik Mandal
  - Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210