1
Singh G, Moncrieff G, Venter Z, Cawse-Nicholson K, Slingsby J, Robinson TB. Uncertainty quantification for probabilistic machine learning in earth observation using conformal prediction. Sci Rep 2024; 14:16166. [PMID: 39003341 PMCID: PMC11246475 DOI: 10.1038/s41598-024-65954-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 06/25/2024] [Indexed: 07/15/2024]
Abstract
Machine learning is increasingly applied to Earth Observation (EO) data to obtain datasets that contribute towards international accords. However, these datasets contain inherent uncertainty that needs to be quantified reliably to avoid negative consequences. In response to the increased need to report uncertainty, we bring attention to the promise of conformal prediction within the domain of EO. Unlike previous uncertainty quantification methods, conformal prediction offers statistically valid prediction regions while concurrently supporting any machine learning model and data distribution. To support the need for conformal prediction, we reviewed EO datasets and found that only 22.5% of the datasets incorporated a degree of uncertainty information, with unreliable methods prevalent. Current open implementations require moving large amounts of EO data to the algorithms. We introduced Google Earth Engine native modules that bring conformal prediction to the data and compute, facilitating the integration of uncertainty quantification into existing traditional and deep learning modelling workflows. To demonstrate the versatility and scalability of these tools we apply them to valued EO applications spanning local to global extents, regression, and classification tasks. Subsequently, we discuss the opportunities arising from the use of conformal prediction in EO. We anticipate that accessible and easy-to-use tools, such as those provided here, will drive wider adoption of rigorous uncertainty quantification in EO, thereby enhancing the reliability of downstream uses such as operational monitoring and decision-making.
Affiliation(s)
- Geethen Singh
  - Department of Botany and Zoology, Centre for Invasion Biology, Stellenbosch University, Stellenbosch, South Africa
- Glenn Moncrieff
  - Global Science, The Nature Conservancy, Cape Town, 7945, South Africa
  - Department of Statistical Sciences, Centre for Statistics in Ecology, Environment and Conservation, University of Cape Town, Private Bag X3, Rondebosch, Cape Town, 7701, South Africa
- Zander Venter
  - Norwegian Institute for Nature Research-NINA, Sognsveien 68, 0855, Oslo, Norway
- Kerry Cawse-Nicholson
  - Carbon Cycles and Ecosystems, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
- Jasper Slingsby
  - Department of Biological Sciences and Centre for Statistics in Ecology, Environment and Conservation, University of Cape Town, Private Bag X3, Rondebosch, Cape Town, 7701, South Africa
  - Fynbos Node, South African Environmental Observation Network, Centre for Biodiversity Conservation, Cape Town, South Africa
- Tamara B Robinson
  - Department of Botany and Zoology, Centre for Invasion Biology, Stellenbosch University, Stellenbosch, South Africa
2
Qiu H, Dobriban E, Tchetgen Tchetgen E. Prediction sets adaptive to unknown covariate shift. J R Stat Soc Series B Stat Methodol 2023; 85:1680-1705. [PMID: 38312527 PMCID: PMC10837005 DOI: 10.1093/jrsssb/qkad069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 06/17/2023] [Accepted: 06/20/2023] [Indexed: 02/06/2024]
Abstract
Predicting sets of outcomes, instead of unique outcomes, is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift, a prevalent issue in practice, poses a serious unsolved challenge. In this article, we show that prediction sets with finite-sample coverage guarantee are uninformative and propose a novel flexible distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is asymptotically probably approximately correct, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
Affiliation(s)
- Hongxiang Qiu
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Edgar Dobriban
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
- Eric Tchetgen Tchetgen
  - Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA
3
Kuchibhotla AK, Berk RA. Nested conformal prediction sets for classification with applications to probation data. Ann Appl Stat 2023. [DOI: 10.1214/22-aoas1650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Affiliation(s)
- Arun K. Kuchibhotla
  - Department of Statistics and Data Science, Carnegie Mellon University (https://arun-kuchibhotla.github.io)
- Richard A. Berk
  - Department of Criminology, Department of Statistics, University of Pennsylvania (https://crim.sas.upenn.edu/people/richard-berk)
4
Candès E, Lei L, Ren Z. Conformalized survival analysis. J R Stat Soc Series B Stat Methodol 2023. [DOI: 10.1093/jrsssb/qkac004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2023]
Abstract
In this paper, we develop an inferential method based on conformal prediction, which can wrap around any survival prediction algorithm to produce calibrated, covariate-dependent lower predictive bounds on survival times. In the Type I right-censoring setting, when the censoring times are completely exogenous, the lower predictive bounds have guaranteed coverage in finite samples without any assumptions other than that of operating on independent and identically distributed data points. Under a more general conditionally independent censoring assumption, the bounds satisfy a doubly robust property which states the following: marginal coverage is approximately guaranteed if either the censoring mechanism or the conditional survival function is estimated well. The validity and efficiency of our procedure are demonstrated on synthetic data and real COVID-19 data from the UK Biobank.
Affiliation(s)
- Emmanuel Candès
  - Department of Mathematics, Stanford University, Stanford, CA, USA
  - Department of Statistics, Stanford University, Stanford, CA, USA
- Lihua Lei
  - Department of Statistics, Stanford University, Stanford, CA, USA
  - Graduate School of Business, Stanford University, Stanford, CA, USA
- Zhimei Ren
  - Department of Statistics, University of Chicago, Chicago, IL, USA
5
Wang W, Qiao X. Set-Valued Support Vector Machine with Bounded Error Rates. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2089573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Wenbo Wang
  - Department of Mathematical Sciences at Binghamton University, State University of New York, Binghamton, New York, 13902
- Xingye Qiao
  - Department of Mathematical Sciences at Binghamton University, State University of New York, Binghamton, New York, 13902
6
Bergquist S, Brooks GA, Landrum MB, Keating NL, Rose S. Uncertainty in lung cancer stage for survival estimation via set-valued classification. Stat Med 2022; 41:3772-3788. [PMID: 35675972 PMCID: PMC9540678 DOI: 10.1002/sim.9448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2021] [Revised: 02/16/2022] [Accepted: 05/13/2022] [Indexed: 11/22/2022]
Abstract
The difficulty in identifying cancer stage in health care claims data has limited oncology quality of care and health outcomes research. We fit prediction algorithms for classifying lung cancer stage into three classes (stages I/II, stage III, and stage IV) using claims data, and then demonstrate a method for incorporating the classification uncertainty in survival estimation. Leveraging set‐valued classification and split conformal inference, we show how a fixed algorithm developed in one cohort of data may be deployed in another, while rigorously accounting for uncertainty from the initial classification step. We demonstrate this process using SEER cancer registry data linked with Medicare claims data.
Affiliation(s)
- Savannah Bergquist
  - Haas School of Business, University of California, Berkeley, Berkeley, California, USA
- Gabriel A Brooks
  - The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine, Hanover, New Hampshire, USA
- Mary Beth Landrum
  - Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Nancy L Keating
  - Department of Health Care Policy, Harvard Medical School, Boston, Massachusetts, USA
- Sherri Rose
  - Department of Health Policy and Center for Health Policy, Stanford University, Stanford, California, USA
7
Dunn R, Wasserman L, Ramdas A. Distribution-Free Prediction Sets for Two-Layer Hierarchical Models. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2060112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Robin Dunn
  - Novartis Pharmaceuticals Corporation, Advanced Methodology and Data Science, East Hanover, NJ, USA
- Larry Wasserman
  - Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
- Aaditya Ramdas
  - Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
8
Guan L, Tibshirani R. Prediction and outlier detection in classification problems. J R Stat Soc Series B Stat Methodol 2022; 84:524-546. [PMID: 35910400 PMCID: PMC9305480 DOI: 10.1111/rssb.12443] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/25/2019] [Accepted: 12/12/2020] [Indexed: 11/29/2022]
Abstract
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers x as often as possible. BCOPS returns no prediction (corresponding to C(x) equal to the empty set) if it infers x to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.
9
Feng J, Sondhi A, Perry J, Simon N. Selective prediction-set models with coverage rate guarantees. Biometrics 2021. [PMID: 34854476 DOI: 10.1111/biom.13612] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Revised: 10/19/2021] [Accepted: 11/10/2021] [Indexed: 11/30/2022]
Abstract
The current approach to using machine learning (ML) algorithms in healthcare is to either require clinician oversight for every use case or use their predictions without any human oversight. We explore a middle ground that lets ML algorithms abstain from making a prediction to simultaneously improve their reliability and reduce the burden placed on human experts. To this end, we present a general penalized loss minimization framework for training selective prediction-set (SPS) models, which choose to either output a prediction set or abstain. The resulting models abstain when the outcome is difficult to predict accurately, such as on subjects who are too different from the training data, and achieve higher accuracy on those they do give predictions for. We then introduce a model-agnostic, statistical inference procedure for the coverage rate of an SPS model that ensembles individual models trained using K-fold cross-validation. We find that SPS ensembles attain prediction-set coverage rates closer to the nominal level and have narrower confidence intervals for their marginal coverage rate. We apply our method to train neural networks that abstain more for out-of-sample images on the MNIST digit prediction task and achieve higher predictive accuracy for ICU patients compared to existing approaches.
Affiliation(s)
- Jean Feng
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, USA
- Jessica Perry
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Noah Simon
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
10
Chzhen E, Denis C, Hebiri M. Minimax semi-supervised set-valued approach to multi-class classification. Bernoulli 2021. [DOI: 10.3150/20-bej1313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Evgenii Chzhen
  - CNRS, Inria, Laboratoire de Mathématiques d’Orsay, Université Paris-Saclay, 91405 Orsay, France
- Christophe Denis
  - Laboratoire d’Analyse et de Mathématiques Appliquées, Université Gustave Eiffel, 77454 Marne-la-Vallée cedex 2, France
- Mohamed Hebiri
  - Laboratoire d’Analyse et de Mathématiques Appliquées, Université Gustave Eiffel, 77454 Marne-la-Vallée cedex 2, France
11
Campagner A, Cabitza F, Berjano P, Ciucci D. Three-way decision and conformal prediction: Isomorphisms, differences and theoretical properties of cautious learning approaches. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 10/20/2022]
12
Lei L, Candès EJ. Conformal inference of counterfactuals and individual treatment effects. J R Stat Soc Series B Stat Methodol 2021. [DOI: 10.1111/rssb.12445] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Affiliation(s)
- Lihua Lei
  - Department of Statistics, Stanford University, Stanford, California, USA
- Emmanuel J. Candès
  - Department of Statistics, Stanford University, Stanford, California, USA
  - Department of Mathematics, Stanford University, Stanford, California, USA
13
Chzhen E. Optimal Rates for Nonparametric F-Score Binary Classification via Post-Processing. Mathematical Methods of Statistics 2021. [DOI: 10.3103/s1066530720020027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
14
15
16
Coscrato V, Izbicki R, Stern RB. Agnostic tests can control the type I and type II errors simultaneously. Braz J Probab Stat 2020. [DOI: 10.1214/19-bjps431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
17