1. Zhu F, Zhang XY, Cheng Z, Liu CL. Revisiting Confidence Estimation: Towards Reliable Failure Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024; 46:3370-3387. [PMID: 38090830] [DOI: 10.1109/tpami.2023.3342285]
Abstract
Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident in their incorrect predictions, i.e., misclassified samples from known classes and out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we identify a general, widespread, yet largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings, including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge connecting calibration, OOD detection, and failure prediction.
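The failure-prediction setting described here can be made concrete with a small sketch: score each prediction by its maximum softmax probability (MSP), the common baseline confidence, and measure how well that score separates correct from misclassified samples, e.g., via AUROC. This is a generic illustration of the evaluation, not the authors' flat-minima method; the arrays `logits` and `labels` are hypothetical inputs.

```python
# Minimal sketch: measuring confidence separation for failure prediction.
# A reliable confidence should rank correct predictions above errors.
import numpy as np
from sklearn.metrics import roc_auc_score

def msp_failure_prediction_auroc(logits, labels):
    # Softmax over classes (shifted for numerical stability).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)              # MSP confidence score
    correct = (probs.argmax(axis=1) == labels)  # True = correctly classified
    # AUROC of confidence as a score for "correct": higher means a cleaner
    # separation between correct and misclassified samples.
    return roc_auc_score(correct.astype(int), confidence)

# Toy usage with hypothetical random logits:
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
labels = rng.integers(0, 10, size=1000)
print(msp_failure_prediction_auroc(logits, labels))
```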
2. Li L, Carver R, Lopez-Gomez I, Sha F, Anderson J. Generative emulation of weather forecast ensembles with diffusion models. Science Advances 2024; 10:eadk4489. [PMID: 38552014] [PMCID: PMC10980268] [DOI: 10.1126/sciadv.adk4489]
Abstract
Uncertainty quantification is crucial to decision-making. A prominent example is probabilistic forecasting in numerical weather prediction. The dominant approach to representing uncertainty in weather forecasting is to generate an ensemble of forecasts by running physics-based simulations under different conditions, which is a computationally costly process. We propose to amortize the computational cost by emulating these forecasts with deep generative diffusion models learned from historical data. The learned models scale well on high-performance computing accelerators and can sample thousands of realistic weather forecasts at low cost. When designed to emulate operational ensemble forecasts, the generated ensembles are similar to physics-based ensembles in their statistical properties and predictive skill. When designed to correct biases present in the operational forecasting system, the generated ensembles show improved probabilistic forecast metrics: they are more reliable and forecast probabilities of extreme weather events more accurately. While we focus on weather forecasting, this methodology may enable the creation of large climate projection ensembles for climate risk assessment.
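As a minimal illustration of the probabilistic forecast metrics mentioned above, the sketch below computes the sample-based continuous ranked probability score (CRPS) of an ensemble at a single point; lower is better, and the score rewards ensembles that are both accurate and well-spread. The ensemble values are hypothetical, and this is generic verification practice rather than the paper's evaluation pipeline.

```python
# Minimal sketch: sample-based CRPS for an ensemble forecast at one point.
import numpy as np

def crps_ensemble(members, obs):
    """CRPS estimate: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    members = np.asarray(members, dtype=float)
    accuracy = np.mean(np.abs(members - obs))
    spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return accuracy - spread

# Toy usage: a hypothetical 50-member temperature ensemble vs. one observation.
rng = np.random.default_rng(0)
ensemble = rng.normal(loc=15.0, scale=2.0, size=50)
print(crps_ensemble(ensemble, obs=16.3))
```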
Affiliation(s)
- Lizao Li: Google Research, Mountain View, CA, USA
- Fei Sha: Google Research, Mountain View, CA, USA
3. Ojeda FM, Jansen ML, Thiéry A, Blankenberg S, Weimar C, Schmid M, Ziegler A. Calibrating machine learning approaches for probability estimation: A comprehensive comparison. Stat Med 2023; 42:5451-5478. [PMID: 37849356] [DOI: 10.1002/sim.9921]
Abstract
Statistical prediction models have gained popularity in applied research. One challenge is the transfer of a prediction model to a different population, which may be structurally different from the one for which the model was developed. An adaptation to the new population can be achieved by calibrating the model to the characteristics of the target population, for which numerous calibration techniques exist. In view of this diversity, we performed a systematic evaluation of various popular calibration approaches used by the statistical and machine learning communities for estimating two-class probabilities. In this work, we first provide a review of the literature and, second, present the results of a comprehensive simulation study. The calibration approaches are compared with respect to their empirical properties and relationships, their ability to generalize precise probability estimates to external populations, and their availability in terms of easy-to-use software implementations. Third, we provide code from a real data analysis so that researchers can apply the approaches themselves. Logistic calibration and beta calibration, which estimate an intercept plus one and two slope parameters, respectively, consistently showed the best results in the simulation studies. Calibration on logit-transformed probability estimates generally outperformed calibration methods on nontransformed estimates. In the case of structural differences between training and validation data, re-estimation of the entire prediction model should be weighed against the sample size of the validation data. We recommend regression-based calibration approaches using transformed probability estimates, where at least one slope is estimated in addition to an intercept, for updating probability estimates in validation studies.
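The two recommended approaches can be sketched briefly: logistic calibration fits an intercept plus one slope on logit-transformed estimates, while beta calibration fits an intercept plus two slopes on log(p) and -log(1-p). Below is a minimal scikit-learn sketch under those definitions; the array names are hypothetical, the unconstrained beta fit omits the nonnegativity constraints of the full method, and the authors' own code accompanies the article.

```python
# Minimal sketch: logistic and beta calibration of two-class probabilities,
# both operating on transformed estimates as recommended in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

EPS = 1e-12

def logit(p):
    p = np.clip(p, EPS, 1 - EPS)
    return np.log(p / (1 - p))

def fit_logistic_calibration(p_val, y_val):
    """Logistic calibration: intercept + one slope on logit-transformed estimates."""
    model = LogisticRegression(C=1e6)  # large C: effectively unregularized
    model.fit(logit(p_val)[:, None], y_val)
    return lambda p: model.predict_proba(logit(p)[:, None])[:, 1]

def fit_beta_calibration(p_val, y_val):
    """Beta calibration: intercept + two slopes on log(p) and -log(1 - p).

    The full method constrains both slopes to be nonnegative; that
    refinement is omitted here for brevity.
    """
    p = np.clip(p_val, EPS, 1 - EPS)
    X = np.column_stack([np.log(p), -np.log(1 - p)])
    model = LogisticRegression(C=1e6)
    model.fit(X, y_val)

    def calibrate(p_new):
        q = np.clip(p_new, EPS, 1 - EPS)
        Xn = np.column_stack([np.log(q), -np.log(1 - q)])
        return model.predict_proba(Xn)[:, 1]

    return calibrate

# Toy usage on a hypothetical validation split with miscalibrated scores:
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
p_val = np.clip(0.5 * y_val + 0.25 + 0.1 * rng.normal(size=500), 0.01, 0.99)
recalibrate = fit_logistic_calibration(p_val, y_val)
print(recalibrate(np.array([0.2, 0.5, 0.8])))
```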
Affiliation(s)
- Francisco M Ojeda: Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Max L Jansen: Cardio-CARE, Medizincampus Davos, Davos, Switzerland
- Stefan Blankenberg: Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
- Christian Weimar: BDH-Klinik Elzach, Baden-Wuerttemberg, Germany; Institute for Medical Informatics, Biometry and Epidemiology, University of Duisburg-Essen, Essen, North Rhine-Westphalia, Germany
- Matthias Schmid: Institute of Medical Biometry, Informatics and Epidemiology, Faculty of Medicine, University of Bonn, Bonn, North Rhine-Westphalia, Germany
- Andreas Ziegler: Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Cardio-CARE, Medizincampus Davos, Davos, Switzerland; School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa; Swiss Institute of Bioinformatics, Lausanne, Waadt, Switzerland
4. Xenopoulos P, Rulff J, Nonato LG, Barr B, Silva C. Calibrate: Interactive Analysis of Probabilistic Model Output. IEEE Transactions on Visualization and Computer Graphics 2023; 29:853-863. [PMID: 36166523] [DOI: 10.1109/tvcg.2022.3209489]
Abstract
Analyzing classification model performance is a crucial task for machine learning practitioners. While practitioners often use count-based metrics derived from confusion matrices, like accuracy, many applications, such as weather prediction, sports betting, or patient risk prediction, rely on a classifier's predicted probabilities rather than predicted labels. In these instances, practitioners are concerned with producing a calibrated model, that is, one which outputs probabilities that reflect those of the true distribution. Model calibration is often analyzed visually through static reliability diagrams; however, this traditional visualization can suffer from a variety of drawbacks due to the strong aggregation it necessitates. Furthermore, count-based approaches are unable to sufficiently analyze model calibration. We present Calibrate, an interactive reliability diagram that addresses the aforementioned issues. Calibrate constructs a reliability diagram that is resistant to the drawbacks of traditional approaches and allows for interactive subgroup analysis and instance-level inspection. We demonstrate the utility of Calibrate through use cases on both real-world and synthetic data. We further validate Calibrate by presenting the results of a think-aloud experiment with data scientists who routinely analyze model calibration.
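For context, the traditional static reliability diagram that Calibrate improves on can be produced in a few lines: bin predictions by confidence and compare the mean predicted probability with the observed frequency in each bin. A minimal sketch with hypothetical data:

```python
# Minimal sketch: a traditional (static, binned) reliability diagram,
# the aggregation-heavy baseline that Calibrate is designed to improve on.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Hypothetical labels and somewhat miscalibrated predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.6 * y_true + 0.2 + 0.15 * rng.normal(size=2000), 0.0, 1.0)

# Bin by predicted probability; compare mean confidence with observed frequency.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
plt.plot(mean_pred, frac_pos, "o-", label="model")
plt.xlabel("Mean predicted probability (per bin)")
plt.ylabel("Observed frequency of positives")
plt.legend()
plt.show()
```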
5. Henzi A, Ziegel JF, Gneiting T. Isotonic distributional regression. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12450]
Affiliation(s)
- Tilmann Gneiting: Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Karlsruhe Institute of Technology, Karlsruhe, Germany