Original article
External validation and recalibration of the Brock model to predict probability of cancer in pulmonary nodules using NLST data
  1. Audrey Winter,
  2. Denise R Aberle,
  3. William Hsu
  1. Department of Radiological Sciences, Medical Imaging Informatics, University of California, Los Angeles, California, USA
  1. Correspondence to Dr Audrey Winter, Department of Radiological Sciences, Medical Imaging Informatics, University of California, Los Angeles, CA 90095, USA; audrey.winter89{at}gmail.com

Abstract

Introduction We performed an external validation of the Brock model using the National Lung Screening Trial (NLST) data set, following strict guidelines set forth by the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis statement. We report how external validation results can be interpreted and highlight the role of recalibration and model updating.

Materials and methods We assessed model discrimination and calibration using the NLST data set. Adhering to the inclusion/exclusion criteria reported by McWilliams et al, we identified 7879 non-calcified nodules discovered at the baseline low-dose CT screen with 2 years of follow-up. We characterised differences between Pan-Canadian Early Detection of Lung Cancer Study and NLST cohorts. We calculated the slope on the prognostic index and the intercept coefficient by fitting the original Brock model to NLST. We also assessed the impact of model recalibration and the addition of new covariates such as body mass index, smoking status, pack-years and asbestos.

Results While the area under the curve (AUC) of the model was good, 0.905 (95% CI 0.882 to 0.928), a histogram plot showed that the model poorly differentiated between benign and malignant cases. The calibration plot showed that the model overestimated the probability of cancer. In recalibrating the model, the coefficients for emphysema, spiculation and nodule count were updated. The updated model had an improved calibration and achieved an optimism-corrected AUC of 0.912 (95% CI 0.891 to 0.932). Only pack-year history was found to be significant (p<0.01) among the new covariates evaluated.

Conclusion While the Brock model achieved a high AUC when validated on the NLST data set, the model benefited from updating and recalibration. Nevertheless, covariates used in the model appear to be insufficient to adequately discriminate malignant cases.

  • lung cancer
  • prediction
  • external validation
  • Brock model
  • recalibration

Key messages

What is the key question?

  • How does following the guidelines set forth in the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement help appraise the suitability of using the Brock model by assessing both its discrimination and calibration on a target screening population?

What is the bottom line?

  • While the Brock model achieves high area under the curve (AUC) on an external data set (National Lung Screening Trial), further inspection of the model’s discrimination and calibration reveals opportunities to improve the transportability of the model through recalibration, revision and extension.

Why read on?

  • We perform an external validation of the Brock model, adhering to the guidelines set forth by the TRIPOD statement. While metrics such as the AUC are consistent between the derivation and validation data sets, a lack of calibration was noted. Moreover, the positive predictive value, influenced by the disease prevalence, was low. We demonstrate the impact of model recalibration, revision and/or extension when seeking to adopt the Brock model on a target population that is different from the derivation population.

Introduction

Lung cancer screening plays a critical role along with smoking cessation in reducing the mortality associated with lung cancer. In the National Lung Screening Trial (NLST),1 in which participants underwent up to three low-dose CT (LDCT) screening exams, a 20% reduction in lung cancer mortality was observed in comparison with individuals who underwent comparable screening with chest X-ray. However, using a nodule diameter threshold of 4 mm to define positive screens, the trial also reported that 96.4% of positive screens in the LDCT arm were false positives. It is well recognised that false positive screens are major contributors to the potential harms of screening due to unnecessary downstream invasive procedures that may be associated with complications. A reliable diagnostic prediction model that estimates the probability of lung cancer in the setting of LDCT-detected indeterminate nodules would better define the level of risk of lung cancer and inform clinical decision-making. To aid radiologists with assessing the clinical significance of nodules identified on an LDCT or chest X-ray exam, several diagnostic prediction models have been proposed to estimate the probability of lung cancer for a patient given the presence of one or more nodules.2–10 However, individual radiologists and institutions must decide if the model is appropriate for their patient population. Determining whether or not a prediction model is capable of being generalised to a new population is called external validation. External validation is strongly recommended for all prediction models, as stated in the ‘Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis’ (TRIPOD) statement.11 TRIPOD provides guidelines for authors on key information needed to clearly describe a diagnostic or prognostic prediction model and for readers on how to interpret the validity and generalisability of these models.
When evaluating the performance of a model, two fundamental aspects should be considered: (1) discrimination, which is the ability to discriminate among different outcomes (eg, patients predicted to be at higher risk should exhibit higher event rates than those considered at lower risk); and (2) calibration, which is the agreement between predicted and observed probabilities (eg, a well-calibrated risk score or prediction rule assigns the correct event probability at all levels of predicted risk).12–14 In practice, however, many of these models are not externally validated and even when external validation is purportedly performed, the process and metrics used to perform external validation may not be fully accurate.

While a number of studies have attempted to externally validate prediction models for pulmonary nodules,9 15–27 none of them completely adheres to the methods recommended by TRIPOD.11 Among these models, the Brock model2 has been recommended by organisations such as the British Thoracic Society28 and has the most external validation studies reported.15–24 Table 1 summarises what has been done in those studies.

Table 1

Articles about application of the full Brock model2 on external data sets

Only four of those studies15 17 19 21 present both discrimination and calibration metrics when applying the model to a new population. In addition, only five of these studies17 18 20 21 23 had a sufficient sample size, for which ideally at least 200 events and 200 non-events are recommended.29 Moreover, five of these studies16 18 21 23 24 applied the Brock model to non-screening patients, which is not consistent with the inclusion/exclusion criteria of the original model.

In this study, we evaluate the external validity of the Brock model2 using the NLST data set. Compared with prior studies that have conducted a similar analysis,15–24 we assess approaches for measuring the discrimination and calibration of the Brock model2 and explore how issues identified by these metrics can be addressed through model recalibration, revision and extension,14 30 following the TRIPOD statement11 (checklist in online supplementary file 1).

Materials and methods

Study population

The NLST was a randomised controlled trial in which participants underwent three screenings with either LDCT or chest X-ray. Eligible participants were between 55 and 74 years of age at the time of randomisation; had a minimum of 30 pack-years (total years smoked × cigarettes per day/20) smoking; and if a former smoker had quit within the previous 15 years. All screening exams were completed from August 2002 through September 2007. Participants who had previously received a diagnosis of lung cancer, had undergone chest LDCT within 18 months before enrolment or had experienced haemoptysis or unexplained weight loss of more than 6.8 kg (15 lb) in the preceding year were excluded. A total of 53 452 participants were enrolled; 26 722 were randomly assigned to screening with LDCT and 26 730 to screening with chest radiography.1 Participants were followed for lung cancer diagnoses that occurred through 31 December 2009.1
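The age and smoking criteria above, including the pack-years formula, can be sketched as follows. This is a minimal illustration only: the function and argument names are assumptions (not NLST variable names), `years_since_quit=None` is a hypothetical convention for current smokers, and the exclusion criteria (prior lung cancer, recent chest LDCT, haemoptysis, weight loss) are not modelled.

```python
def pack_years(years_smoked, cigarettes_per_day):
    """Pack-years = total years smoked x cigarettes per day / 20."""
    return years_smoked * cigarettes_per_day / 20.0


def meets_age_and_smoking_criteria(age, years_smoked, cigarettes_per_day,
                                   years_since_quit=None):
    """Check the age and smoking-history criteria described in the text.

    years_since_quit is None for current smokers (hypothetical convention).
    """
    if not 55 <= age <= 74:
        return False
    if pack_years(years_smoked, cigarettes_per_day) < 30:
        return False
    # Former smokers must have quit within the previous 15 years.
    if years_since_quit is not None and years_since_quit > 15:
        return False
    return True
```

For example, a 60-year-old who smoked 20 cigarettes per day for 30 years has 30 pack-years and meets these criteria if still smoking or if they quit no more than 15 years ago.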

From the collected NLST data set, we identified the relevant data elements related to participant demographics, smoking history and LDCT screening results that were used as covariates in the Brock model (details of how data elements were identified are provided in the online supplementary table S1). As a guiding principle, the model should be applied in exactly the same way in the validation data set (NLST) as in the derivation data set (Pan-Canadian Early Detection of Lung Cancer Study [PanCan]),12 following the same inclusion/exclusion criteria from McWilliams et al 2 and using the same covariates and duration of follow-up. Nodules with baseline LDCT screens that had missing values for covariates were excluded from our analysis. Notably, the NLST data set had sufficient screen-detected lung cancers to be appropriate for validation, ideally 200 (or more) events.29

One challenge was that data from the NLST were captured at the participant level whereas the PanCan study captured data at the nodule level. Thus, the abnormality that was directly related to the lung cancer diagnosis could not be matched with certainty. To determine which nodules were malignant or not within an interval of 2 years from the baseline screening, we created an ‘event’ covariate that denoted whether the abnormality was lung cancer or not based on the reported anatomical location for the abnormality and the diagnosed cancer. Online supplementary table S1 shows the covariates used to create the event covariate. Follow-up times began at the time of the baseline screen and ended with whichever of the following events came first: diagnosis or 2-year follow-up.

Given the variety of data elements collected on NLST participants, we considered four additional covariates beyond what was used in the Brock model as part of model extension: body mass index, the smoking status at randomisation, number of pack-years and occupational exposure to asbestos for 1 or more years. We specifically targeted these covariates based on their association with lung cancer risk in previous prediction models.2–4 Participants with missing values (n=20 patients) for these additional covariates were excluded from analysis; thus, the final data set is shown in online supplementary table S2.

Prognostic index

The main output of a logistic regression model, the type of model underlying the Brock model, is a prognostic index (PI), which is a weighted sum of the covariates $x_i$ in the model, where the weights $\beta_i$ are the regression coefficients and $\alpha$ is the intercept:

$$\text{PI} = \alpha + \sum_{i=1}^{n} \beta_i x_i$$

The validity of a logistic regression model can be evaluated using both qualitative (visual) and quantitative assessments.

Visual assessment of discrimination and calibration

To assess visually the discrimination and calibration of the Brock model when applied to NLST, we used three different visualisations13:

  • Histograms of the PI per outcome (Y=0 or Y=1), where perfect discrimination would appear as non-overlapping histograms.

  • A receiver operating characteristic (ROC) curve, which plots sensitivity (SE) against one minus specificity (SPC). The discrimination performance of an ROC curve is quantified by the area under the curve (AUC) and CI. AUC values range from 0 to 1; the closer the AUC is to 1, the better the discrimination performance.

  • A calibration plot, which reports graphically predicted outcome probabilities (on the x-axis) against observed outcome (on the y-axis).11 A well-calibrated prediction implies that the curve lies on the diagonal, which means perfect agreement between the predicted probabilities and the observed outcomes.
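The quantities behind the ROC curve and the calibration plot described above can be sketched as follows. This is an illustrative implementation, not the code used in the study: the function names are assumptions, the AUC is computed as the proportion of case/non-case pairs in which the case has the higher score (ties counted as one half), and the calibration table groups nodules into equally sized bins of predicted probability.

```python
def auc_from_scores(scores, outcomes):
    """AUC: probability that a random case scores higher than a random non-case."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean predicted probability, observed event rate, n) per bin."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when there are fewer points than bins
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            obs_rate = sum(y for _, y in chunk) / len(chunk)
            rows.append((mean_pred, obs_rate, len(chunk)))
    return rows
```

Plotting the first column of the table against the second gives the points of a calibration plot; points below the diagonal indicate overestimated probabilities.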

Comparison of the derivation and validation data sets

As a quantitative assessment, we compared the distributions of the covariates used in the Brock model between the PanCan and NLST data sets using Student’s t-test, the Wilcoxon-Mann-Whitney test or the $\chi^2$ test, as appropriate. For each test, an effect size (ES) was also reported: Cramér’s V ($\varphi_c$) for $\chi^2$ tests, and Cohen’s $d$ and a related ES statistic for Student’s t-tests and Wilcoxon-Mann-Whitney tests, each graded as small, medium or large.31–33 All other factors being equal, the larger the ES, the greater the practical importance of the observed difference.
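Two of the effect sizes mentioned above can be computed from summary data, as in the sketch below. The function names are assumptions, the $\chi^2$ statistic is computed without continuity correction, and Cohen's $d$ uses the pooled SD; the study may have used different variants.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: difference in means divided by the pooled SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)
```

With large samples such as these, a very small p value can accompany a small ES, which is why both are worth reporting together.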

Assessment of the model calibration

To perform external validation of the Brock model, we estimated the regression coefficient on the PI (the calibration slope, $\beta_{\text{PI}}$) and the intercept coefficient ($\alpha_{\text{cal}}$) by fitting a logistic regression in the validation data set14 30 as follows:

$$\operatorname{logit} P(Y=1) = \alpha_{\text{cal}} + \beta_{\text{PI}} \times \text{PI}$$

If the model is perfectly calibrated, $\alpha_{\text{cal}} = 0$ and $\beta_{\text{PI}} = 1$. If $\alpha_{\text{cal}}$ and/or $\beta_{\text{PI}}$ are significantly different from 0 and 1, respectively, based on a likelihood ratio test, recalibration is needed.14 30

Of note, $\alpha_{\text{cal}} < 0$ (or $\alpha_{\text{cal}} > 0$) means that the predicted probabilities in the validation data set were too high (or too low).14 30 With respect to $\beta_{\text{PI}}$, if the calibration slope is smaller than 1, then the coefficients derived from the original model were too large, which results in overestimation of the probability of lung cancer for participants at high risk and underestimation of the probability for participants at low risk of lung cancer. Conversely, if the calibration slope is larger than 1, then the original regression coefficients were too small, and the calculated probabilities of lung cancer will be falsely low.
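The calibration intercept and slope described above can be estimated with a two-parameter logistic regression of the outcome on the PI. The sketch below fits it by Newton-Raphson using only the standard library. It is illustrative rather than the study's code: the function names are assumptions, the likelihood ratio test is assumed to compare the fitted intercept and slope jointly against 0 and 1 (2 degrees of freedom), and no robust variance adjustment is applied.

```python
import math


def _log_lik(a, b, pi_values, outcomes):
    """Bernoulli log-likelihood of logit(P) = a + b * PI."""
    ll = 0.0
    for x, y in zip(pi_values, outcomes):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll


def fit_calibration(pi_values, outcomes, n_iter=50, tol=1e-10):
    """Fit logit(P(Y=1)) = a + b * PI; return (intercept a, slope b)."""
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(pi_values, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p            # gradient w.r.t. intercept
            g_b += (y - p) * x      # gradient w.r.t. slope
            h_aa += w               # information matrix entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0:
            raise ValueError("singular information matrix")
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b


def lr_test_perfect_calibration(pi_values, outcomes):
    """Likelihood ratio test of (a, b) = (0, 1); returns (statistic, p value)."""
    a, b = fit_calibration(pi_values, outcomes)
    stat = 2.0 * (_log_lik(a, b, pi_values, outcomes)
                  - _log_lik(0.0, 1.0, pi_values, outcomes))
    stat = max(stat, 0.0)
    p_value = math.exp(-stat / 2.0)  # chi-square survival function, 2 df
    return stat, p_value
```

A fitted intercept below 0 with a slope near 1 would indicate predictions that are uniformly too high, the pattern that recalibration of the intercept alone is designed to correct.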

Model recalibration, revision and extension

When a lack of calibration was observed, we updated the model using the methods described in table 2, based on published recommendations.14 30

Table 2

Recalibration and model revision methods considered for logistic regression models for a prognostic index (PI) with n predictors (ie, methods 1–5) and extension methods considering k new predictors not included in the original model (ie, methods 6–8)14 30

These recommendations are categorised into four main methods: no model adjustment (method 1); recalibration methods (methods 2 and 3); revision methods (methods 4 and 5); and extension methods, considering new predictors not included in the original model (methods 6–8).

After updating the model using methods 2–8, we calculated three measures of performance: (1) the Brier score as a measure of overall performance; (2) the concordance statistic (c-statistic) as a measure of discrimination; and (3) the Akaike information criterion (AIC) as a measure of relative goodness of fit. The Brier score measures accuracy as the mean squared difference between predicted probabilities and actual outcomes; the lower the Brier score, the better the fit, and a perfect score is 0. The Brier score provides a measure of the absolute performance of the model. The c-statistic is equivalent to the area under the ROC curve, which varies between 0 and 1. The more closely the c-statistic approximates 1, the better the model distinguishes between those with and without lung cancer. The AIC estimates the relative goodness of fit, where the preferred model has the lowest AIC value; as such, it cannot provide information on the absolute goodness of fit of a given model.
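The Brier score and AIC defined above can be computed directly from predicted probabilities and observed outcomes, as in this minimal sketch. The function names are assumptions; `n_params` is the number of estimated coefficients, including the intercept, and predicted probabilities are assumed to lie strictly between 0 and 1 for the log-likelihood.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of the outcomes given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for p, y in zip(probs, outcomes))


def aic(probs, outcomes, n_params):
    """AIC = 2k - 2 log L, where k is the number of estimated parameters."""
    return 2 * n_params - 2 * log_likelihood(probs, outcomes)
```

Because the AIC penalises each added parameter, an extension method only achieves a lower AIC if the new covariates improve the likelihood by more than the penalty.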

Of note, regression coefficients’ variances estimated through logistic regression models (external validation and recalibration) were adjusted with the use of the Huber-White34 robust variance estimator in order to follow what was done by McWilliams et al.2

Model performance for follow-up intervals of 3 and 4 years

As the follow-up in the derivation cohort used by McWilliams et al to train the Brock model ranged from 2.1 to 4.3 years,2 we performed additional analysis examining discrimination and calibration when predicting malignancy at 3 and 4 years in the NLST population. We created two new subsets of the data, one for each follow-up duration: only patients who were followed for this period were included. Lung cancer diagnoses that occurred outside of this period were not considered.

All analyses were performed using R software, V.3.3.0 (R Development Core Team, A Language and Environment for Statistical Computing, Vienna, Austria, 2016. URL: https://www.R-project.org/).

Results

Data

A flow diagram of the participants selected for the study is presented in figure 1A. A detailed flow diagram of how the events were defined is provided in figure 1B. The mapping between covariates and the NLST data elements is shown in online supplementary table S1.

Figure 1

(A) Flow diagram and (B) event covariate construction (details of how data elements were identified are provided in the online supplementary table S1, and in NLST article).1 LDCT, low-dose CT.

Patient characteristics

Characteristics of participants and nodules, with a comparison by group of diagnosis (ie, cancer or no cancer) used in this study, are described in table 3 (and online supplementary table S2, for new covariates).

Table 3

Distribution of nodule and participant covariates according to cancer status and comparison in the NLST data set.1 Mean and SD are reported for quantitative covariates; number and percentage are reported for qualitative covariates. Student’s t-test or Wilcoxon-Mann-Whitney test or $\chi^2$ test used when appropriate

PI for Brock model

We initially calculated the Brock model2 (online supplementary figure S1) as follows:

$$P(\text{lung cancer}) = \frac{e^{\text{Brock-PI}}}{1 + e^{\text{Brock-PI}}},$$

with: Brock-PI = −6.7892+0.0287*(Age −62)

+0.6011*(1 if female, 0 otherwise)

+0.2961*(1 if family history of lung cancer, 0 otherwise)

+0.2953*(1 if emphysema, 0 otherwise)

+0.377*(1 if nodule type is part-solid, 0 otherwise)

−0.1276*(1 if nodule type is non-solid, 0 otherwise)

+0.6581*(1 if nodule is in upper lung, 0 otherwise)

−0.0824*(Nodule count −4)

+0.7729*(1 if spiculated, 0 otherwise)

−5.3854*((Nodule size in mm/10)^(−0.5) −1.58113883).
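The calculation above can be sketched in a few lines. The coefficients are those listed in the text; the nodule size transformation, $(\text{size}/10)^{-0.5} - 1.58113883$ with size in mm, is taken from the published Brock model and is an assumption here because the term is not legible in this text. Function and argument names are illustrative, and binary covariates are coded 1 or 0.

```python
import math


def brock_pi(age, female, family_history, emphysema, nodule_size_mm,
             part_solid, non_solid, upper_lobe, nodule_count, spiculated):
    """Prognostic index of the full Brock model (binary inputs are 0 or 1)."""
    # Assumed transformation of nodule size (mm), from the published model.
    size_term = (nodule_size_mm / 10.0) ** -0.5 - 1.58113883
    return (-6.7892
            + 0.0287 * (age - 62)
            + 0.6011 * female
            + 0.2961 * family_history
            + 0.2953 * emphysema
            + 0.377 * part_solid
            - 0.1276 * non_solid
            + 0.6581 * upper_lobe
            - 0.0824 * (nodule_count - 4)
            + 0.7729 * spiculated
            - 5.3854 * size_term)


def brock_probability(pi):
    """Convert a prognostic index to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-pi))
```

For a reference case (62-year-old man, no family history, no emphysema, one of four solid, non-spiculated 4 mm nodules outside the upper lobe), every centred term is approximately zero, so the PI reduces to the intercept and the predicted probability is roughly 0.1%.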

Brock model achieves good AUC but has low positive predictive value

The ROC curve showed a good discrimination with an AUC=0.905 (95% CI 0.882 to 0.928) (figure 2). The optimal combination of model SE and SPC (online supplementary figure S2) was at a cut-off probability of 6.5%.35 However, histograms in figure 2 show that the distribution of the PI, although shifted slightly to the right in the lung cancer group, is insufficient to provide strong discrimination between cases with and without lung cancer. In our study, only 2.8% of the nodules were malignant, which is not accounted for in the AUC. As such, we also examined metrics such as positive predictive value (PPV) and negative predictive value (NPV), which are influenced by disease prevalence. The original model in the NLST data set with an optimal cut-off of 6.5% achieved an SE of 80.18% (74.94–85.42) and an SPC of 89.63% (88.95–90.31) with an excellent NPV of 99.36% (99.2–99.53) but low PPV of 18.31% (16.92–19.7). Finally, the calibration plot showed significant overestimation of lung cancer with the greatest overestimation occurring at higher levels of risk (figure 2).
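The dependence of PPV and NPV on prevalence can be made explicit with Bayes' rule. The sketch below is illustrative (the function name is an assumption) and uses the rounded sensitivity, specificity and prevalence reported above as inputs.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence                # true positive fraction
    fn = (1.0 - sensitivity) * prevalence        # false negative fraction
    tn = specificity * (1.0 - prevalence)        # true negative fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false positive fraction
    return tp / (tp + fp), tn / (tn + fn)


ppv, npv = predictive_values(0.8018, 0.8963, 0.028)
```

With these rounded inputs the PPV comes out at roughly 18% and the NPV at roughly 99.4%, in line with the reported values: at a prevalence of 2.8%, even a specificity near 90% leaves false positives outnumbering true positives by about four to one.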

Figure 2

Visual assessment of calibration and discrimination of the Brock model in the National Lung Screening Trial (NLST) data set. AUC, area under the curve; ROC, receiver operating characteristic.

Covariates were different between NLST and PanCan

When we compared participants’ and nodules’ characteristics between PanCan and NLST data sets, we found that the distributions of all covariates were significantly different (table 4), reflecting basic differences between cohorts that would compromise the fit of the model to the NLST data set. The biggest differences were observed for emphysema and spiculation (p<0.001) with medium and small ES, respectively. A large ES was noted for nodule count at baseline.

Table 4

Differences in nodule and participant characteristics between the PanCan data set2 and the NLST data set1 (p values, Student’s t-test or Wilcoxon-Mann-Whitney test or $\chi^2$ test used when appropriate). For each test, an effect size (ES) was also reported: Cramér’s V ($\varphi_c$) for $\chi^2$ tests, and Cohen’s $d$ and a related ES statistic for Student’s t-tests and Wilcoxon-Mann-Whitney tests, each graded as small, medium or large

Brock model overestimates the probability of cancer in NLST data set

After fitting the original logistic regression model to the NLST data, a likelihood ratio test showed a significant difference between the estimated intercept and slope on the PI and their ideal values of 0 and 1, respectively (p<0.01). The slope was very close to 1, but the intercept was well below 0, which is consistent with the overestimation observed in the calibration plot; as a result, model recalibration was needed.

Recalibration and extension of Brock model improves overall performance

Table 5 summarises the results of the various recalibration and extension methods. The c-statistic, Brier score and AIC were very similar across the seven methods.

Table 5

Results of the updating methods14 30 for the Brock model.2 Variances of logistic regression models’ coefficients were adjusted with the use of the Huber-White robust variance estimator.34 The reference groups are the same as in McWilliams et al.2 For each model, the c-statistic, the Brier score and the Akaike information criterion (AIC) were calculated

Based on method 5, we observed that some of the regression coefficients from the NLST database were similar to those of the PanCan database, such as age, family history and nodule location, while others were very different, most notably emphysema, gender and spiculation. Based on the results of method 4, only the coefficients of gender, emphysema and spiculation require revision. This could be explained by differences in participant/nodule characteristics highlighted in table 4. Indeed, significant differences, with small ES, were observed for emphysema (60.4% vs 33.10%, p<0.01, ES=0.238) and spiculation (2.8% vs 13.1%, p<0.01, ES=0.186). Based on methods 6–8, new covariates were added to the model following the methodology described in McWilliams et al.2 Namely, if quantitative covariates had a non-linear relationship with lung cancer, they were transformed using fractional polynomials.36 However, only pack-years contributed to model performance. The results of method 6 achieved the highest c-statistic, 0.914 (95% CI 0.892 to 0.936), and the lowest AIC (1288) and Brier score (0.021). Emphysema, spiculation and nodule count regression coefficients were updated, and pack-years was added to the model. According to this recalibrated model, the probability of having lung cancer 2 years after the LDCT baseline is $P = e^{\text{PI}}/(1 + e^{\text{PI}})$, with:

PI = −1.88+0.83*(Brock-PI)

−0.35*(1 if emphysema, 0 otherwise)

−0.10*(Nodule count −4)

+0.85*(1 if spiculation, 0 otherwise)

+0.38*(pack-years term).

A likelihood ratio test performed between the recalibrated and original models (online supplementary table S2; p<0.01) suggests a significant improvement in model performance after recalibration. Notably, the AUC reported for the Brock model2 was 0.942 [95% CI 0.909 to 0.967]; the overlap of CIs between the original and our revised models suggests that they are very close in performance. When we visually compared the calibration and discrimination of the original model with our recalibrated model (figure 3), calibration clearly improved, although discrimination did not change (0.914 [95% CI 0.892 to 0.936] vs 0.905 [95% CI 0.882 to 0.928], p=0.59 based on the DeLong test37). With an optimal cut-off of 3.4%, the recalibrated model achieved an SE of 82.43% [77.43–87.44] and an SPC of 88.49% [87.78–89.21]. These results are associated with an NPV and a PPV of 99.43% [99.26–99.59] and 17.25% [16.01–18.49], respectively.

Figure 3

Visual discrimination and calibration of the original model (A) compared with the model recalibrated through method 6 (B). AUC, area under the curve; ROC, receiver operating characteristic.

Moreover, as internal validation, we performed 500 bootstrap replications for the model generated following method 6. The optimism-corrected AUC was 0.912 (95% CI 0.891 to 0.932).11
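A minimal sketch of bootstrap optimism correction follows, assuming the commonly used procedure in which the model is refitted on each bootstrap sample and its performance on that sample is compared with its performance on the original data. The text does not state the exact procedure used, so this is an assumption; `fit` and `score` are hypothetical callables supplied by the user (for example, a logistic regression fit and an AUC calculation).

```python
import random


def optimism_corrected(data, fit, score, n_boot=500, seed=0):
    """Return apparent performance minus the mean bootstrap optimism.

    data  : list of observations
    fit   : callable(data) -> model                 (hypothetical)
    score : callable(model, data) -> performance    (hypothetical, eg AUC)
    """
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    total_optimism = 0.0
    for _ in range(n_boot):
        # Resample observations with replacement, same size as the original.
        boot = [rng.choice(data) for _ in data]
        model = fit(boot)
        # Optimism: performance on the bootstrap sample minus on the original.
        total_optimism += score(model, boot) - score(model, data)
    return apparent - total_optimism / n_boot
```

The `score` callable needs to handle bootstrap samples that contain very few events, which can occur when the outcome is as rare as it is here.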

Discrimination and calibration performance declines when predicting cases with longer follow-up

For the 3-year follow-up model, we retained 7769 nodules, of which 255 were lung cancer cases. For the 4-year follow-up, there were 7660 nodules, of which 260 were lung cancer cases. Figure 4 shows the visual performance assessment. Calibration seemed to improve with longer follow-up (and hence more lung cancer events). However, the longer the follow-up, the greater the drop in discrimination. We also assessed external validation on the data sets with extended follow-ups of 3 and 4 years. Likelihood ratio tests were significant at the 5% level for both follow-up periods (p<0.01 for both).

Figure 4

Visual assessment of the Brock model for 3 years (A) and 4 years (B) of follow-up. AUC, area under the curve; ROC, receiver operating characteristic.

Discussion

The aim of this work was to assess the external validity of the diagnostic prediction model of McWilliams et al 2 in the NLST database and to assess how various recalibration methods influenced model performance. We assessed both model discrimination and calibration. Visually, despite a good AUC (0.91, 95% CI 0.89 to 0.93), a lack of calibration and discrimination was noted. This was confirmed by estimating the intercept and the slope on the PI in the validation data set, which highlighted a need for recalibration because the probabilities predicted by the original model were too high. Emphysema, nodule count at baseline and nodule spiculation had a different effect in the NLST cohort compared with the PanCan cohort. Among the new covariates, only pack-year history was added. Finally, by using recalibration methods, we proposed a new prediction score based on recalibration of the Brock model.

Three groups applied the McWilliams model to the NLST data set,17 19 22 while others applied it to different external data sets. External validation of a prediction model requires the exploration of both discrimination and calibration. As noted previously, AUC is insufficient to fully assess the discrimination of a prediction model. This measure can only provide a general overview of the apparent discrimination but does not constitute an external validation by itself. Moreover, the assessment of calibration is equally important to model performance when applying an existing prediction model to an external data set. Of note, the Hosmer-Lemeshow test is not recommended to assess calibration because of its limited power and interpretability.38–40 To perform an external validation properly, the recommended method is to test the fit of the original model on an independent external data set by estimating the intercept and the slope on the PI.11 14 30 For this purpose, the external validation data set has to be ‘comparable’ to the original data set in terms of exclusion and inclusion criteria and follow-up duration. This last point is challenging to follow because, as noted, the NLST database was patient based, while the PanCan was nodule based. In their article, Nair and colleagues17 considered only the largest nodule observed on baseline screening LDCT to be malignant and excluded all patients with lung cancer diagnoses made after the T1 screen (first incidence screen). Schreuder et al 19 created their own prediction model using the NLST data by aggregating all nodule characteristics at the participant level. Their outcome was lung cancer diagnosis during the year following the T1 screen. White et al 22 excluded from their analysis all participants with multiple nodules only in the cancer group and used all the follow-up data available in the NLST data set.
From our perspective, the exclusion of all patients with cancer with multiple observed nodules or all participants diagnosed after the T1 screen, or inclusion of only one occurrence per participant, led to substantial selection bias. Moreover, it is not clear that they considered censoring given the use of a logistic regression model. Furthermore, exclusion and inclusion criteria were not explicitly described, which means that the model may not have been applied exactly in the same way as in the derivation data set. Finally, recalibration was not considered in any of the prior validation studies.

Some limitations of our study should be noted. To apply the Brock score, we assumed that the diagnostic score was derived from patients followed for 2 years (and only 2 years) after their baseline screen. Logistic regression models require that all participants be followed for the entire time period,11 which has to be the same for all participants, and thus do not account for censoring. Survival models (eg, a Cox model, $h(t \mid x) = h_0(t)\exp(\beta^{\top}x)$) depend on time and analyse time-to-event outcomes; in contrast, logistic regression models address only binary responses. Moreover, survival models explain more variability in the data than logistic regression models,41 which is why logistic regression is suited to short follow-ups while survival models are preferred for longer follow-ups.11 The task of predicting probabilities of cancer at 2, 3 or 4 years is not the same statistically. As such, care should be taken when attempting to estimate the probability of cancer beyond a 2-year follow-up period. Moreover, while the probability of malignancy may increase over a longer follow-up, the type of cancer identified over longer periods may be biologically distinct (more indolent).

Moreover, as NLST is a patient-based data set, we were not able to qualify the status of each nodule. We made some assumptions based on nodule locations and ultimate lung cancer locations to construct our ‘event covariate’. Based on those assumptions, some nodules were classified as not malignant, but we cannot assess the veracity of those assumptions, and this may have contributed to selection bias. Similarly, 3432 (30%) of LDCT screenings were excluded due to missing data, which may also have contributed to selection bias. In PanCan data, LDCT screenings were classified as positive for nodules ≥1 mm, which was not the case in NLST (≥4 mm). This could explain the differences noticed between the derivation and the validation data sets. Differences between NLST and PanCan data sets have been highlighted (table 4); nevertheless, we temper this finding, in part because of the ES observed. Finally, given that this was an exercise in external validation, we applied the prediction model in the NLST data set in the exact same way as in the PanCan data set. As such, we did not verify original assumptions made on the covariates included in the original logistic regression model, such as the linearity of a covariate with the outcome, which may have impaired the interpretation of the model.

At the patient level in the target population, the original score avoids 165 unnecessary screenings for each missed cancer diagnosis (ie, the ratio of true negatives to false negatives). This number reaches 176 with the recalibrated model. In both models, the PPV is low and could lead, in clinical practice, to many false positive cases. Nevertheless, the NPV is very good and limits the number of false negative cases, which could help prevent unnecessary screening. Of note, thresholds are useful for clinical decision-making but should be used with caution and interpreted in the context of their own population; they are not universal. For example, using the Brock model, van Riel et al,42 in a Danish population, used two thresholds to define three groups of increasing risk (<6%: low risk, 6%–30%: intermediate risk and ≥30%: high risk), whereas the British Thoracic Society28 recommended a positron emission tomography-CT scan for patients with an estimated probability greater than 10%. The thresholds given in this article are examples and can be increased or decreased, either to limit the number of false negatives or the number of false positives and thus improve NPV or PPV, respectively, according to the needs and context of each setting. Examples of thresholds and their impact on SE, SPC, NPV and PPV are given in online supplementary table S3.
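
To make the relationship between a threshold and these measures concrete, the following is a minimal Python sketch, not the code used in this study. The function name `threshold_metrics`, the key `TN_per_FN` and the toy probabilities and outcomes are all hypothetical and chosen only for illustration; the thresholds of 6%, 10% and 30% are those quoted above from van Riel et al and the British Thoracic Society. A nodule is treated as test-positive when its predicted probability is at or above the threshold.

```python
from typing import Dict, Sequence


def threshold_metrics(
    probs: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Return SE, SPC, PPV, NPV and the TN/FN ratio at one probability threshold.

    A case is called positive when its predicted probability >= threshold.
    labels use 1 for cancer and 0 for no cancer.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "SE": ratio(tp, tp + fn),    # sensitivity
        "SPC": ratio(tn, tn + fp),   # specificity
        "PPV": ratio(tp, tp + fp),   # positive predictive value
        "NPV": ratio(tn, tn + fn),   # negative predictive value
        "TN_per_FN": ratio(tn, fn),  # true negatives per false negative
    }


# Made-up predicted probabilities and outcomes, for illustration only.
toy_probs = [0.01, 0.02, 0.04, 0.08, 0.12, 0.20, 0.35, 0.60]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1]

for t in (0.06, 0.10, 0.30):
    print(t, threshold_metrics(toy_probs, toy_labels, t))
```

Raising the threshold moves cases from the positive to the negative group, which tends to reduce false positives at the cost of more false negatives; the appropriate trade-off depends on the population and clinical context.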

As mentioned in ref 43, there are two types of models, validated statistically and clinically. Toll et al 44 distinguished three phases in multivariable prediction research: (1) development of a prediction score; (2) external validation of this score; and (3) the study of the clinical impact. In this article, we outlined the importance of the statistical external validation of prediction models. This procedure assesses the validity of an existing prediction score in an independent external data set. Our study focused on a geographical external validation.45 As such, the Brock model was applied in the exact same way in the validation data set as in the derivation data set. More than a statistical exercise, we sought to provide insights on how to interpret the results of external validation, providing readers with practical guidelines on how to evaluate and adopt published prediction models in clinical decision-making. As the number of prediction models increases, the importance of following reporting guidelines such as TRIPOD cannot be overstated, as model designs and reported results become increasingly complex and difficult to reproduce and interpret. Of note, our intent was not to promote a new prediction model over the current Brock model, but rather to illustrate the effect of recalibration and updating techniques on model performance in the NLST data set. As with any revision to a prediction model, further validation is required before use.

In conclusion, while the Brock model achieved a high AUC when validated on NLST, overall model performance benefited from both updating and recalibration, with no difference in model discrimination, which cannot be altered by the recalibration process.12 Recalibrating an existing model with data from the target population can improve its transportability to other individuals.45 We found that the process of external validation, and hence the assessment of the generalisability of a prediction model, requires thoughtful accounting of both discrimination and calibration and the consideration of recalibration methods.
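
The statement that recalibration leaves discrimination unchanged can be illustrated with a short Python sketch. It assumes that recalibration means re-estimating an intercept and a slope on the prognostic index (the linear predictor of the logistic model); the values of `intercept` and `slope`, the helper names and the toy data below are hypothetical and are not estimates from NLST. Because the recalibrated probability is a strictly increasing function of the prognostic index whenever the slope is positive, the ranking of cases, and therefore the AUC, is preserved, while the average predicted probability shifts.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(lp: Sequence[float], intercept: float, slope: float) -> List[float]:
    """Logistic recalibration: p = sigmoid(intercept + slope * prognostic index)."""
    return [sigmoid(intercept + slope * v) for v in lp]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random non-case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up prognostic index values and outcomes (1 = cancer), for illustration only.
toy_lp = [-5.0, -4.2, -3.1, -2.5, -1.0, -0.4, 0.3, 1.2]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]

original = [sigmoid(v) for v in toy_lp]
# Hypothetical recalibration coefficients; in practice they are estimated
# by regressing the observed outcome on the prognostic index.
updated = recalibrate(toy_lp, intercept=-0.8, slope=0.7)

print("AUC before and after:", auc(original, toy_labels), auc(updated, toy_labels))
print("Mean predicted risk before and after:",
      sum(original) / len(original), sum(updated) / len(updated))
```

Updating individual coefficients, unlike this intercept-and-slope recalibration, can change the ranking of cases, which is why small differences in AUC may appear after model updating.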

Acknowledgments

The authors thank the National Cancer Institute (NCI) for access to the NCI’s data collected by the National Lung Screening Trial as part of an existing data transfer agreement (NLST-93). The authors also thank Panayiotis Petousis and Lew Andrada for their comments.

References

Footnotes

  • Correction notice This article has been corrected since it was published. The third sentence in the Introduction section was re-worded.

  • Contributors Substantial contributions to the conception or design of the work, or the acquisition, analysis or interpretation of data: AW, DRA, WH. Drafting the work or revising it critically for important intellectual content: AW, DRA, WH. Final approval of the version published: AW, DRA, WH.

  • Funding Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award number R01CA210360 and by the Department of Radiological Sciences under the Integrated Diagnostics Program.

  • Disclaimer The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement These data are restricted and cannot be made publicly available, but access can be requested through this website (https://biometry.nci.nih.gov/cdas/).
