1
Buick JE, Austin PC, Cheskes S, Ko DT, Atzema CL. Prediction models in prehospital and emergency medicine research: How to derive and internally validate a clinical prediction model. Acad Emerg Med 2023; 30:1150-1160. [PMID: 37266925 DOI: 10.1111/acem.14756]
Abstract
Clinical prediction models are created to help clinicians with medical decision making, aid in risk stratification, and improve diagnosis and/or prognosis. With growing availability of both prehospital and in-hospital observational registries and electronic health records, there is an opportunity to develop, validate, and incorporate prediction models into clinical practice. However, many prediction models have high risk of bias due to poor methodology. Given that there are no methodological standards aimed at developing prediction models specifically in the prehospital setting, the objective of this paper is to describe the appropriate methodology for the derivation and validation of clinical prediction models in this setting. What follows can also be applied to the emergency medicine (EM) setting. There are eight steps that should be followed when developing and internally validating a prediction model: (1) problem definition, (2) coding of predictors, (3) addressing missing data, (4) ensuring adequate sample size, (5) variable selection, (6) evaluating model performance, (7) internal validation, and (8) model presentation. Subsequent steps include external validation, assessment of impact, and cost-effectiveness. By following these steps, researchers can develop a prediction model with the methodological rigor and quality required for prehospital and EM research.
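Step 6 of the workflow above, evaluating model performance, typically begins with discrimination. A minimal pure-Python sketch of the c-statistic (concordance probability) is shown below; the risk and outcome vectors are hypothetical illustrations, not data from the paper:

```python
def c_statistic(risks, outcomes):
    """Probability that a randomly chosen patient with the event has a
    higher predicted risk than a randomly chosen patient without it;
    ties count as 1/2. Equivalent to the area under the ROC curve."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevents = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordance = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in nonevents
    )
    return concordance / (len(events) * len(nonevents))

# Hypothetical predicted risks and observed outcomes (1 = event)
risks = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(round(c_statistic(risks, outcomes), 3))  # → 0.792
```

The pairwise count is O(n²) and fine for a sketch; validation studies on large registries would use a rank-based implementation instead.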
Affiliation(s)
- Jason E Buick
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Peter C Austin
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
- Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Sheldon Cheskes
- Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Division of Emergency Medicine, Department of Family and Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Dennis T Ko
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
- Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Clare L Atzema
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
- Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Division of Emergency Medicine, Department of Medicine, University of Toronto, Toronto, Ontario, Canada
2
Verbakel JY, Steyerberg EW, Uno H, De Cock B, Wynants L, Collins GS, Van Calster B. ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. J Clin Epidemiol 2020; 126:207-216. [PMID: 32712176 DOI: 10.1016/j.jclinepi.2020.01.028]
Abstract
OBJECTIVES Receiver operating characteristic (ROC) curves show how well a risk prediction model discriminates between patients with and without a condition. We aim to investigate how ROC curves are presented in the literature and to discuss and illustrate their potential limitations. STUDY DESIGN AND SETTING We conducted a pragmatic literature review of contemporary publications that externally validated clinical prediction models. We illustrated the limitations of ROC curves using a testicular cancer case study and simulated data. RESULTS Of 86 identified prediction modeling studies, 52 (60%) presented ROC curves without thresholds and one (1%) presented an ROC curve with only a few thresholds. We illustrate that ROC curves in their standard form withhold threshold information, have an unstable shape even for the same area under the curve (AUC), and are problematic for comparing model performance conditional on threshold. We compare ROC curves with classification plots, which show sensitivity and specificity conditional on risk thresholds. CONCLUSION ROC curves do not offer more information than the AUC for indicating discriminative ability. To assess a model's performance for decision making, results should be provided conditional on risk thresholds. Therefore, if discriminative ability must be visualized, classification plots are attractive.
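The threshold-conditional reporting the authors advocate reduces to sensitivity and specificity at each risk threshold. A small sketch of what a classification plot would tabulate, on hypothetical risks and outcomes rather than the paper's data:

```python
def sens_spec(risks, outcomes, threshold):
    """Classify estimated risk >= threshold as positive;
    return (sensitivity, specificity)."""
    tp = sum(1 for r, y in zip(risks, outcomes) if y == 1 and r >= threshold)
    fn = sum(1 for r, y in zip(risks, outcomes) if y == 1 and r < threshold)
    tn = sum(1 for r, y in zip(risks, outcomes) if y == 0 and r < threshold)
    fp = sum(1 for r, y in zip(risks, outcomes) if y == 0 and r >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical validation data (1 = event)
risks = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

# The (threshold, sensitivity, specificity) triples a classification
# plot would draw, instead of the threshold-free ROC curve
for t in (0.2, 0.4, 0.6, 0.8):
    se, sp = sens_spec(risks, outcomes, t)
    print(f"threshold={t:.1f}  sensitivity={se:.2f}  specificity={sp:.2f}")
```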
Affiliation(s)
- Jan Y Verbakel
- KU Leuven, Department of Public Health and Primary Care, Leuven, Belgium; Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
- Ewout W Steyerberg
- Department of Biomedical Data Sciences, Leiden University Medical Centre (LUMC), Leiden, the Netherlands
- Hajime Uno
- Division of Population Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Bavo De Cock
- KU Leuven, Department of Development and Regeneration, Leuven, Belgium
- Laure Wynants
- KU Leuven, Department of Development and Regeneration, Leuven, Belgium
- Gary S Collins
- Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK; Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- Ben Van Calster
- Department of Biomedical Data Sciences, Leiden University Medical Centre (LUMC), Leiden, the Netherlands; KU Leuven, Department of Development and Regeneration, Leuven, Belgium.
3
Grellety E, Golden MH. Severely malnourished children with a low weight-for-height have a higher mortality than those with a low mid-upper-arm-circumference: I. Empirical data demonstrates Simpson's paradox. Nutr J 2018; 17:79. [PMID: 30217205 PMCID: PMC6138885 DOI: 10.1186/s12937-018-0384-4]
Abstract
BACKGROUND According to the WHO, childhood severe acute malnutrition (SAM) is diagnosed when the weight-for-height Z-score (WHZ) is <-3Z of the WHO2006 standards, the mid-upper-arm circumference (MUAC) is < 115 mm, there is nutritional oedema, or any combination of these parameters is present. Recently there has been a move to eliminate WHZ as a diagnostic criterion, on the assertions that children meeting the WHZ criterion are healthy, that MUAC is universally a superior prognostic indicator of mortality, and that adding WHZ to the assessment does not improve the prediction; these assertions have led to controversy concerning the role of WHZ in the diagnosis of SAM. METHODS We examined the mortality experience of 76,887 severely malnourished children aged 6-60 months admitted for treatment to in-patient, out-patient or supplementary feeding facilities in 18 African countries, of whom 3588 died. They were divided into 7 diagnostic categories for analysis of mortality rates by comparison of case fatality rates, relative risk of death, and meta-analysis of the difference between children admitted using MUAC and WHZ criteria. RESULTS The mortality rate was higher in children fulfilling the WHO2006 WHZ criterion than in those fulfilling the MUAC criterion. This was the case for younger as well as older children and in all regions except for marasmic children in East Africa. Those fulfilling both criteria had a higher mortality. Nutritional oedema increased the risk of death. Having oedema and a low WHZ dramatically increased the mortality rate, whereas addition of the MUAC criterion to either oedema alone or oedema plus a low WHZ did not further increase the mortality rate. The data were subject to extreme confounding giving rise to Simpson's paradox, which reversed the apparent mortality rates when children fulfilling both WHZ and MUAC criteria were included in the estimation of the risk of death of those fulfilling either the WHZ or MUAC criterion alone.
CONCLUSIONS Children with a low WHZ, but a MUAC above the SAM cut-off point are at high risk of death. Simpson's paradox due to confounding from oedema and mathematical coupling may make previous statistical analyses which failed to distinguish the diagnostic groups an unreliable guide to policy. WHZ needs to be retained as an independent criterion for diagnosis of SAM and methods found to identify those children with a low WHZ, but not a low MUAC, in the community.
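The WHO case definition in this abstract is a simple disjunction of three criteria; the study's point is that the groups each criterion selects overlap but are distinct. A sketch of the definition (function name hypothetical, thresholds as stated above):

```python
def is_sam(whz, muac_mm, oedema):
    """WHO diagnosis of severe acute malnutrition: WHZ < -3 (WHO2006
    standards), MUAC < 115 mm, nutritional oedema, or any combination."""
    return whz < -3 or muac_mm < 115 or oedema

# The criteria select overlapping but distinct groups of children
print(is_sam(whz=-3.5, muac_mm=120, oedema=False))  # low WHZ only → True
print(is_sam(whz=-2.0, muac_mm=110, oedema=False))  # low MUAC only → True
print(is_sam(whz=-2.0, muac_mm=120, oedema=False))  # neither → False
```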
Affiliation(s)
- Emmanuel Grellety
- Research Center Health Policy and Systems - International Health, School of Public Health, Université Libre de Bruxelles, Bruxelles, Belgium
- Michael H. Golden
- Department of Medicine and Therapeutics, University of Aberdeen, Aberdeen, Scotland
4
Chipman J, Braun D. Simpson's paradox in the integrated discrimination improvement. Stat Med 2016; 36:4468-4481. [PMID: 29160558 DOI: 10.1002/sim.6862]
Abstract
The integrated discrimination improvement (IDI) is commonly used to compare two risk prediction models; it summarizes the extent a new model increases risk in events and decreases risk in non-events. The IDI averages risks across events and non-events and is therefore susceptible to Simpson's paradox. In some settings, adding a predictive covariate to a well calibrated model results in an overall negative (positive) IDI. However, if stratified by that same covariate, the strata-specific IDIs are positive (negative). Meanwhile, the calibration (observed to expected ratio and Hosmer-Lemeshow Goodness of Fit Test), area under the receiver operating characteristic curve, and Brier score improve overall and by stratum. We ran extensive simulations to investigate the impact of an imbalanced covariate upon metrics (IDI, area under the receiver operating characteristic curve, Brier score, and R2), provide an analytic explanation for the paradox in the IDI, and use an investigative metric, a Weighted IDI, to better understand the paradox. In simulations, all instances of the paradox occurred under stratum-specific mis-calibration, yet there were mis-calibrated settings in which the paradox did not occur. The paradox is illustrated on Cancer Genomics Network data by calculating predictions based on two versions of BRCAPRO, a Mendelian risk prediction model for breast and ovarian cancer. In both simulations and the Cancer Genomics Network data, overall model calibration did not guarantee stratum-level calibration. We conclude that the IDI should only assess model performance among a clinically relevant subset when stratum-level calibration is strictly met and recommend calculating additional metrics to confirm the direction and conclusions of the IDI.
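The averaging over events and non-events described above is what exposes the IDI to Simpson's paradox. A minimal sketch of the overall IDI, on hypothetical risk vectors rather than the BRCAPRO data:

```python
def idi(old_risks, new_risks, outcomes):
    """Integrated discrimination improvement: (rise in mean predicted
    risk among events) minus (rise in mean predicted risk among
    non-events) when moving from the old model to the new one."""
    def mean(xs):
        return sum(xs) / len(xs)
    ev_old = [p for p, y in zip(old_risks, outcomes) if y == 1]
    ev_new = [p for p, y in zip(new_risks, outcomes) if y == 1]
    ne_old = [p for p, y in zip(old_risks, outcomes) if y == 0]
    ne_new = [p for p, y in zip(new_risks, outcomes) if y == 0]
    return (mean(ev_new) - mean(ev_old)) - (mean(ne_new) - mean(ne_old))

# Hypothetical predictions from an old and a new model (1 = event)
old = [0.2, 0.3, 0.6, 0.7]
new = [0.1, 0.4, 0.8, 0.6]
outcomes = [0, 0, 1, 1]
print(round(idi(old, new, outcomes), 3))  # → 0.05
```

The paradox arises because this overall average can have the opposite sign from the stratum-specific IDIs computed on subsets of the same data.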
Affiliation(s)
- J Chipman
- Department of Biostatistics, Vanderbilt School of Medicine, Nashville, TN 37203, U.S.A
- D Braun
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, U.S.A.; Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215, U.S.A.
5
Petretta M, Pellegrino T, Cuocolo A. The "gray zone" for the heart to mediastinum MIBG uptake ratio. J Nucl Cardiol 2014; 21:921-4. [PMID: 24810428 DOI: 10.1007/s12350-014-9894-4]
Affiliation(s)
- Mario Petretta
- Department of Translational Medicine, University Federico II, Naples, Italy
6
Abstract
Decision-analytic measures to assess clinical utility of prediction models and diagnostic tests incorporate the relative clinical consequences of true and false positives without the need for external information such as monetary costs. Net Benefit is a commonly used metric that weights the relative consequences in terms of the risk threshold at which a patient would opt for treatment. Theoretical results demonstrate that clinical utility is affected by a model's calibration, the extent to which estimated risks correspond to observed event rates. We analyzed the effects of different types of miscalibration on Net Benefit and investigated whether and under what circumstances miscalibration can make a model clinically harmful. Clinical harm is defined as a lower Net Benefit compared with classifying all patients as positive or negative by default. We used simulated data to investigate the effect of overestimation, underestimation, overfitting (estimated risks too extreme), and underfitting (estimated risks too close to baseline risk) on Net Benefit for different choices of the risk threshold. In accordance with theory, we observed that miscalibration always reduced Net Benefit. Harm was sometimes observed when models underestimated risk at a threshold below the event rate (as in underestimation and overfitting) or overestimated risk at a threshold above event rate (as in overestimation and overfitting). Underfitting never resulted in a harmful model. The impact of miscalibration decreased with increasing discrimination. Net Benefit was less sensitive to miscalibration for risk thresholds close to the event rate than for other thresholds. We illustrate these findings with examples from the literature and with a case study on testicular cancer diagnosis. Our findings strengthen the importance of obtaining calibrated risk models.
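Net Benefit at a risk threshold t weights false positives by the odds t/(1-t), and a model is clinically harmful in the sense above when its Net Benefit falls below the treat-all or treat-none (Net Benefit 0) defaults. A minimal sketch on hypothetical data, not the paper's simulations:

```python
def net_benefit(risks, outcomes, t):
    """Net Benefit of treating patients with estimated risk >= t:
    true-positive fraction minus false-positive fraction weighted
    by the threshold odds t/(1-t)."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= t and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= t and y == 0)
    return tp / n - (fp / n) * t / (1 - t)

def net_benefit_treat_all(outcomes, t):
    """Default strategy of classifying every patient as positive."""
    rate = sum(outcomes) / len(outcomes)
    return rate - (1 - rate) * t / (1 - t)

# Hypothetical predicted risks and outcomes (1 = event);
# treat-none has Net Benefit 0 by definition
risks = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(round(net_benefit(risks, outcomes, 0.3), 3))       # → 0.229
print(round(net_benefit_treat_all(outcomes, 0.3), 3))    # → 0.143
```

Here the model beats both defaults at t = 0.3; a miscalibrated model of the kinds studied above can drop below them.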
Affiliation(s)
- Ben Van Calster
- KU Leuven, Department of Development and Regeneration, Leuven, Belgium
- Andrew J. Vickers
- Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY
7
Steyerberg EW, Vedder MM, Leening MJG, Postmus D, D'Agostino RB, Van Calster B, Pencina MJ. Graphical assessment of incremental value of novel markers in prediction models: From statistical to decision analytical perspectives. Biom J 2014; 57:556-70. [PMID: 25042996 DOI: 10.1002/bimj.201300260]
Abstract
New markers may improve prediction of diagnostic and prognostic outcomes. We aimed to review options for graphical display and summary measures to assess the predictive value of markers over standard, readily available predictors. We illustrated various approaches using previously published data on 3264 participants from the Framingham Heart Study, where 183 developed coronary heart disease (10-year risk 5.6%). We considered performance measures for the incremental value of adding HDL cholesterol to a prediction model. An initial assessment may consider statistical significance (HR = 0.65, 95% confidence interval 0.53 to 0.80; likelihood ratio p < 0.001), and distributions of predicted risks (densities or box plots) with various summary measures. A range of decision thresholds is considered in predictiveness and receiver operating characteristic curves, where the area under the curve (AUC) increased from 0.762 to 0.774 by adding HDL. We can furthermore focus on reclassification of participants with and without an event in a reclassification graph, with the continuous net reclassification improvement (NRI) as a summary measure. When we focus on one particular decision threshold, the changes in sensitivity and specificity are central. We propose a net reclassification risk graph, which allows us to focus on the number of reclassified persons and their event rates. Summary measures include the binary AUC, the two-category NRI, and decision analytic variants such as the net benefit (NB). Various graphs and summary measures can be used to assess the incremental predictive value of a marker. Important insights for impact on decision making are provided by a simple graph for the net reclassification risk.
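The continuous NRI summarized in the reclassification graph counts net upward risk movement among events minus net upward movement among non-events. A sketch with hypothetical model outputs, not the Framingham data:

```python
def continuous_nri(old_risks, new_risks, outcomes):
    """Continuous (category-free) NRI: net proportion of events whose
    predicted risk rises under the new model, minus the net proportion
    of non-events whose predicted risk rises."""
    def net_up(pairs):
        up = sum(1 for o, n in pairs if n > o)
        down = sum(1 for o, n in pairs if n < o)
        return (up - down) / len(pairs)
    triples = list(zip(old_risks, new_risks, outcomes))
    events = [(o, n) for o, n, y in triples if y == 1]
    nonevents = [(o, n) for o, n, y in triples if y == 0]
    return net_up(events) - net_up(nonevents)

# Hypothetical predictions from a baseline model and a model adding a marker
old = [0.20, 0.30, 0.40, 0.60, 0.70, 0.80]
new = [0.10, 0.35, 0.30, 0.70, 0.75, 0.60]
outcomes = [0, 0, 0, 1, 1, 1]
print(round(continuous_nri(old, new, outcomes), 3))  # → 0.667
```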
Affiliation(s)
- Ewout W Steyerberg
- Department of Public Health, Erasmus MC: University Medical Center Rotterdam, Rotterdam, The Netherlands
- Moniek M Vedder
- Department of Public Health, Erasmus MC: University Medical Center Rotterdam, Rotterdam, The Netherlands
- Maarten J G Leening
- Department of Epidemiology, Erasmus MC: University Medical Center Rotterdam, Rotterdam, The Netherlands; Department of Cardiology, Erasmus MC: University Medical Center Rotterdam, Rotterdam, The Netherlands
- Douwe Postmus
- Department of Epidemiology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Ben Van Calster
- Department of Public Health, Erasmus MC: University Medical Center Rotterdam, Rotterdam, The Netherlands; Department of Development and Regeneration, KU Leuven, Leuven, Belgium
- Michael J Pencina
- Department of Biostatistics and Bioinformatics, Duke Clinical Research Institute, Duke University, Durham, NC, USA
8
McGeechan K, Macaskill P, Irwig L, Bossuyt PMM. An assessment of the relationship between clinical utility and predictive ability measures and the impact of mean risk in the population. BMC Med Res Methodol 2014; 14:86. [PMID: 24989719 PMCID: PMC4105158 DOI: 10.1186/1471-2288-14-86]
Abstract
BACKGROUND Measures of clinical utility (net benefit and event free life years) have been recommended in the assessment of a new predictor in a risk prediction model. However, it is not clear how they relate to measures of predictive ability and reclassification, such as the c-statistic and Net Reclassification Improvement (NRI), or how these measures are affected by differences in mean risk between populations when a fixed cutpoint to define high risk is assumed. METHODS We examined the relationship between measures of clinical utility (net benefit, event free life years) and predictive ability (c-statistic, binary c-statistic, continuous NRI(0), NRI with two cutpoints, binary NRI) using simulated data and the Framingham dataset. RESULTS In the analysis of simulated data, the addition of a new predictor tended to result in more people being treated when the mean risk was less than the cutpoint, and fewer people being treated when the mean risk was beyond the cutpoint. The reclassification and clinical utility measures showed similar relationships with mean risk when the mean risk was less than the cutpoint and the baseline model was not strong. However, when the mean risk was greater than the cutpoint, or the baseline model was strong, the reclassification and clinical utility measures diverged in their relationship with mean risk. Although the risk of CVD was lower for women than for men in the Framingham dataset, the measures of predictive ability, reclassification, and clinical utility were all larger for women. This difference was due, in part, to the larger hazard ratio associated with the additional risk predictor (systolic blood pressure) for women. CONCLUSION Measures such as the c-statistic and the measures of reclassification do not capture the consequences of implementing different prediction models. We do not recommend their use in evaluating which new predictors may be clinically useful in a particular population. We recommend instead that a measure such as net benefit or event free life years (EFLY) is calculated and, where appropriate, weighted to account for differences in the distribution of risks between the study population and the population in which the new predictors will be implemented.
Affiliation(s)
- Kevin McGeechan
- Sydney School of Public Health, The University of Sydney, Sydney, Australia
- Petra Macaskill
- Sydney School of Public Health, The University of Sydney, Sydney, Australia
- The Screening and Test Evaluation Program, The University of Sydney, Sydney, Australia
- Les Irwig
- Sydney School of Public Health, The University of Sydney, Sydney, Australia
- The Screening and Test Evaluation Program, The University of Sydney, Sydney, Australia
- Patrick MM Bossuyt
- Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Centre (AMC), University of Amsterdam, Amsterdam, The Netherlands