Original Article
A calibration hierarchy for risk models was defined: from utopia to empirical data

https://doi.org/10.1016/j.jclinepi.2015.12.005

Abstract

Objective

Calibrated risk models are vital for valid decision support. We define four levels of calibration and describe implications for model development and external validation of predictions.

Study Design and Setting

We present results based on simulated data sets.

Results

A common definition of calibration is “having an event rate of R% among patients with a predicted risk of R%,” which we refer to as “moderate calibration.” Weaker forms of calibration only require the average predicted risk (mean calibration) or the average prediction effects (weak calibration) to be correct. “Strong calibration” requires that the event rate equals the predicted risk for every covariate pattern. This implies that the model is fully correct for the validation setting. We argue that this is unrealistic: the model type may be incorrect, the linear predictor is only asymptotically unbiased, and all nonlinear and interaction effects should be correctly modeled. In addition, we prove that moderate calibration guarantees nonharmful decision making. Finally, results indicate that a flexible assessment of calibration in small validation data sets is problematic.

Conclusion

Strong calibration is desirable for individualized decision support but unrealistic and counterproductive because it stimulates the development of overly complex models. Model development and external validation should focus on moderate calibration.

Introduction

There is increasing attention to the use of risk prediction models to support medical decision making. Discriminatory performance is commonly the main focus in the evaluation of performance, whereas calibration often receives less attention [1]. A prediction model is calibrated in a given population if the predicted risks are reliable, that is, if they correspond to observed proportions of the event. Commonly, calibration is defined as “for patients with a predicted risk of R%, on average R out of 100 should indeed suffer from the disease or event of interest.” Calibration is a pivotal aspect of model performance [2], [3], [4]: “For informing patients and medical decision making, calibration is the primary requirement” [2]; “If the model is not […] well calibrated, it must be regarded as not having been validated […]. To evaluate classification performance […] is inappropriate” [4].
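
As a simple numerical illustration of this common definition (our own sketch, not taken from the article), the following Python snippet groups simulated patients by deciles of predicted risk and compares the mean predicted risk with the observed event proportion in each group:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated validation set: outcomes drawn from the (here perfectly calibrated) predicted risks.
n = 5000
predicted_risk = rng.uniform(0.05, 0.95, size=n)  # hypothetical predictions
y = rng.binomial(1, predicted_risk)               # observed events

# "Among patients with a predicted risk of R%, about R out of 100 should have the event":
# compare mean predicted risk with observed event rate within deciles of predicted risk.
edges = np.quantile(predicted_risk, np.linspace(0, 1, 11))
group = np.digitize(predicted_risk, edges[1:-1])
for g in range(10):
    mask = group == g
    print(f"decile {g}: mean predicted {predicted_risk[mask].mean():.2f}, "
          f"observed proportion {y[mask].mean():.2f}")
```

In practice a smoothed calibration curve (e.g., based on a loess smoother) is often preferred over fixed risk groups, but the grouped comparison conveys the idea.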

Recently, a stronger definition of calibration has been emphasized, in contrast to the common definition given above [4], [5]. Models are considered strongly calibrated if predicted risks are accurate for each and every covariate pattern. In this paper, we aim to define different levels of calibration and describe implications for model development, external validation of predictions, and clinical decision making. We focus on predicting binary end points (event vs. no event) and assume that a logistic regression model is developed in a derivation sample with performance assessment in a validation sample. We expand on examples used in recent work by Vach [5].

Section snippets

Methods

We assume that the predicted risks are obtained from a previously developed prediction model for outcome Y (1 = event, 0 = nonevent), for example, based on logistic regression analysis. The model provides a constant (model intercept) and a set of effects (model coefficients). The linear combination of the coefficients with the covariate values in a validation set defines the linear predictor L: L = a + b1×x1 + b2×x2 + … + bi×xi, where a is the model intercept, b1 to bi a set of regression coefficients, and x1 to xi the corresponding covariate values.
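
As a minimal sketch of how such a previously developed model yields predicted risks in a validation set (all coefficient and covariate values below are hypothetical), the linear predictor is combined with the inverse logit transformation:

```python
import numpy as np

# Hypothetical intercept and coefficients of a previously developed logistic regression model.
a = -2.3                        # model intercept
b = np.array([0.8, -0.5, 1.2])  # regression coefficients b1..b3

def predicted_risk(x):
    """Linear predictor L = a + b1*x1 + ... + bi*xi, mapped to a risk via the inverse logit."""
    L = a + x @ b
    return 1.0 / (1.0 + np.exp(-L))

# Covariate values (x1, x2, x3) for three hypothetical validation patients.
X_val = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.5],
                  [2.0, 1.0, 1.0]])
print(predicted_risk(X_val))  # predicted event probabilities
```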

Calibration, decision making, and clinical utility

Strong calibration implies that an accurate risk prediction is obtained for every covariate pattern. Hence, a strongly calibrated model allows the communication of accurate risks to every individual patient. In contrast, a moderately calibrated model allows the communication of a reliable average risk for patients with the same predicted risk: among patients with a predicted risk of 70%, on average 70 of 100 have the event, although there may exist relevant subgroups with different covariate patterns and different event rates.
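
The distinction can be made concrete with a small simulation of our own (the numbers are purely illustrative): two subgroups that the model cannot distinguish both receive a predicted risk of 70%, while their true risks are 60% and 80%.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical subgroups the model cannot tell apart: both get a predicted risk of 0.70,
# but their true risks differ (0.60 vs. 0.80).
n_per_group = 50_000
true_risk = np.r_[np.full(n_per_group, 0.60), np.full(n_per_group, 0.80)]
predicted = np.full(2 * n_per_group, 0.70)
y = rng.binomial(1, true_risk)

# Moderate calibration holds at the level of the predicted risk ...
print("mean predicted:", predicted.mean(), "observed event rate:", round(y.mean(), 3))  # both ~0.70

# ... but strong calibration fails: each covariate pattern has its own event rate.
print("subgroup event rates:", round(y[:n_per_group].mean(), 3), round(y[n_per_group:].mean(), 3))
```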

Strong calibration: realistic or utopic?

In line with Vach's work [5], we find that moderate calibration does not imply that the prediction model is “valid” in a strong sense. In principle, we should aim for strong calibration because this makes predictions accurate at the individual patient's level as well as at the group level, leading to better decisions on average. However, we consider four problems in empirical analyses. First, strong calibration requires that the model form (e.g., a generalized linear model such as logistic regression) is correctly specified.
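
This first problem can be illustrated with a simulation sketch of our own: if the data-generating mechanism contains an interaction that the fitted logistic model omits, the predictions cannot be correct for every covariate pattern, even though the average predicted risk still matches the average event rate in the derivation data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical data-generating model with an interaction between two binary covariates.
x1 = rng.binomial(1, 0.5, n)
x2 = rng.binomial(1, 0.5, n)
true_logit = -1.0 + 1.0 * x1 + 1.0 * x2 - 2.0 * x1 * x2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Fit a main-effects-only logistic model: the model form is wrong (interaction omitted).
X = sm.add_constant(np.column_stack([x1, x2]))
pred = sm.Logit(y, X).fit(disp=0).predict(X)

print("mean predicted:", round(pred.mean(), 3), "mean observed:", round(y.mean(), 3))  # mean calibration holds
for a in (0, 1):
    for b_ in (0, 1):
        m = (x1 == a) & (x2 == b_)
        print(f"pattern x1={a}, x2={b_}: predicted {pred[m].mean():.2f}, observed {y[m].mean():.2f}")
```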

Moderate calibration: a pragmatic guarantee for nonharmful decision making

Focusing on finding at least moderately calibrated models has several advantages. First, it is a realistic goal in epidemiologic research, where empirical data sets are often of relatively limited size and the signal-to-noise ratio is unfavorable [31]. Second, moderate calibration guarantees that decision making based on the model is not clinically harmful. Conversely, it is an important observation that calibration in a weak sense may still result in harmful decision making [18]. Third, …
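
The link with clinical utility can be made explicit via the net benefit measure from decision-curve analysis; the sketch below is ours and only illustrates the computation at a single decision threshold.

```python
import numpy as np

def net_benefit(y, predicted_risk, threshold):
    """Net benefit of treating patients whose predicted risk exceeds the threshold:
    true positives per patient minus false positives weighted by the threshold odds."""
    treat = predicted_risk >= threshold
    n = len(y)
    tp = np.sum(treat & (y == 1)) / n
    fp = np.sum(treat & (y == 0)) / n
    return tp - fp * threshold / (1.0 - threshold)

def net_benefit_treat_all(y, threshold):
    """Reference strategy 'treat everyone'; 'treat no one' has a net benefit of 0 by definition."""
    prevalence = np.mean(y)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Nonharmful decision making then means that, at the chosen threshold, the model's net benefit is at least as high as that of both reference strategies.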

A link with model updating

In model updating, we adapt a model that showed poor performance at external validation [34]. Basic updating approaches include, in order of complexity, intercept adjustment, recalibration, and refitting [34], [35]. There are parallels between updating methods and levels of calibration. Intercept adjustment updates the linear predictor L to a + L. This addresses only calibration-in-the-large and does not guarantee weak calibration. A more complex updating method is logistic recalibration, which updates L to a + b×L and thereby targets weak calibration.
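
A hedged sketch of these two basic updating steps, assuming the validation outcomes y and the original linear predictor L are available as arrays (the function names are ours):

```python
import numpy as np
import statsmodels.api as sm

def intercept_adjustment(y, L):
    """Re-estimate only the intercept, keeping the original coefficients fixed via an offset.
    Updated linear predictor: a + L (targets calibration-in-the-large)."""
    const = np.ones((len(y), 1))
    fit = sm.GLM(y, const, family=sm.families.Binomial(), offset=L).fit()
    return fit.params[0]  # a

def logistic_recalibration(y, L):
    """Re-estimate an intercept and a slope for the linear predictor.
    Updated linear predictor: a + b*L (targets weak calibration)."""
    fit = sm.GLM(y, sm.add_constant(L), family=sm.families.Binomial()).fit()
    return fit.params  # (a, b)
```

Refitting, the most complex of the three approaches, would instead re-estimate all model coefficients in the validation data.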

Statistical testing for calibration

We mainly focused on conceptual issues in assessing the calibration of predictions from statistical models. We did not consider statistical testing in detail; in this area, the assessment of statistical power needs further study. In previous simulations, the Hosmer-Lemeshow test showed such poor performance that it may not be recommended for routine use [7], [37]. In practice, indications of uncertainty such as confidence intervals are far more important than a statistical test.
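
For reference, a minimal sketch of the Hosmer-Lemeshow statistic evaluated in those simulations (we use the common decile-based grouping; the grouping choice and function name are ours):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, n_groups=10):
    """Hosmer-Lemeshow statistic: compare observed and expected event counts within
    groups defined by quantiles of the predicted risk p; returns (statistic, p-value)."""
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n_g = len(idx)
        observed = y[idx].sum()   # observed events in the group
        expected = p[idx].sum()   # expected events in the group
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
    # n_groups - 2 degrees of freedom is the conventional choice.
    return stat, chi2.sf(stat, df=n_groups - 2)
```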

Conclusion and recommendations

We conclude that strong calibration, although desirable for individual risk communication, is unrealistic in empirical medical research. Focusing on obtaining prediction models that are calibrated in the moderate sense is a more attainable goal, in line with the most common definition of the notion of “calibration of predictions.” In support of this view, we proved that moderate calibration guarantees that clinically nonharmful decisions are made based on the model. This guarantee cannot be given by the weaker levels of calibration (mean or weak calibration).

Acknowledgments

The authors thank Laure Wynants for proofreading the article.

References (37)

  • F.E. Harrell. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis (2001)
  • D.W. Hosmer et al. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med (1997)
  • D.P. Ankerst et al. Evaluating the PCPT risk calculator in ten international biopsy cohorts: results from the Prostate Biopsy Collaborative Group. World J Urol (2012)
  • D.R. Cox. Two further applications of a model for binary regression. Biometrika (1958)
  • M.E. Miller et al. Validation of probabilistic predictions. Med Decis Making (1993)
  • P.C. Austin et al. Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers. Stat Med (2014)
  • G.S. Collins et al. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat Med (2016)
  • E.W. Steyerberg et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology (2010)

Funding: This study was supported in part by the Research Foundation Flanders (FWO) (grants G049312N and G0B4716N) and by Internal Funds KU Leuven (grant C24/15/037).

Conflict of interest: None.
