Relearning an old lesson: stopping trials early
Najib M Rahman, Robert J O Davies
UKCRC Oxford Respiratory Trials Unit, University of Oxford and Oxford Centre for Respiratory Medicine, Oxford Radcliffe Hospital, Oxford, UK
Correspondence to Robert J O Davies, Oxford Radcliffe Hospital and Oxford University, Churchill Hospital, Headington, Oxford OX3 7LJ, UK; robert.davies{at}ndm.ox.ac.uk

Well designed and delivered clinical trials are the main tool for defining whether medical interventions ‘work’, and how well. As such, they are potent weapons in the armoury of medical progress and, like all potent weapons, need to be used with care.

In this month's Thorax, Koegelenberg et al (see page 857) report the findings of a trial comparing the diagnostic accuracy of closed pleural biopsy (Abrams needle) and cutting needle pleural biopsy after thoracic ultrasound, for the diagnosis of pleural tuberculosis (TB).1 This question is clearly important given the global significance of TB, and the key role of pleural biopsy in the diagnosis and microbiological assessment of its pleural presentations. To date, there are no published studies assessing ultrasound-guided pleural biopsy for the diagnosis of TB-related pleural effusions. This study is a continuation of this group's research programme which has a track record of delivering valuable evidence in the diagnosis of pleural TB, not least their previous study showing that thoracoscopy is superior to closed pleural biopsy in this disease.2

Their studies are conducted in an area with a high prevalence of TB, and every recruited subject receives each of the diagnostic tests being compared, allowing a direct within-patient comparison of diagnostic results. In this new study, all patients underwent both Abrams biopsy and cutting needle biopsy, performed in random order. The design is simple, logical and efficient, with the diagnostic accuracy for TB assessed against accepted ‘gold standards’. Given the biology of pleural TB, which manifests as diffuse pleural involvement, it is reasonable to propose that the two biopsy techniques may have similar diagnostic yields. As such, the results of the study should be clinically important, but on this occasion caution needs to be exercised in interpreting them in view of a flaw in study delivery.

The study was stopped prematurely, after recruiting 89 (40%) of a planned 220 subjects. It was originally, and appropriately, planned as a non-inferiority design requiring 220 patients to demonstrate ‘equivalence’ of the two techniques within a 10% threshold. It was halted when a statistically significant difference was identified at a preplanned interim analysis, with closed pleural biopsy appearing more sensitive than cutting needle biopsy (81.8% vs 65.2%, p=0.022). Unfortunately, this early cessation makes it difficult to interpret the result of the study and to assess the possible benefit in favour of closed biopsy.

A clinical trial would normally only be stopped early in the face of ‘overwhelming evidence’ of a difference in outcome between the study groups (conventionally a p value of about <0.0001), a criterion generally known as the Peto–Haybittle rule.3 4 The stopping decision would normally be taken by an independent group who are not part of the core investigator team, to avoid bias. This is key to the ability of the trial to fulfil its primary functions of delivering a result that is highly likely to be true, and is useful in quantifying the magnitude of any benefits seen. It is worth revisiting why this is so.

There are several reasons why trials should not stop when a statistically significant difference is first seen. ‘Statistically significant’ differences commonly arise by chance early during trial recruitment, and disappear later. This is the reason for the demanding (p<0.0001) threshold in the Peto–Haybittle rule. In this study, the statistical signal was p=0.022, creating the possibility that the result is a statistical fluke and hence wrong. If the next two cases recruited to the study happened to favour cutting needle biopsy, the statistical significance would have disappeared (p=0.08).
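The tendency of early ‘significant’ differences to vanish can be illustrated with a small simulation. The sketch below is illustrative only: it assumes a simple unpaired two-arm trial with a binary outcome (not the paired design of the study discussed here), an arbitrary true success rate of 0.7 in both arms, 110 subjects per arm, and a single interim look at 40% of recruitment. All function and parameter names are hypothetical.

```python
import math
import random


def two_sided_p(successes_a, successes_b, n_per_arm):
    """Two-sided p value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (successes_a + successes_b) / (2 * n_per_arm)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation in outcomes, so no evidence of a difference
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (successes_a - successes_b) / n_per_arm / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_trials(n_trials=4000, n_final=110, interim_fraction=0.4,
                         true_rate=0.7, alpha=0.05, seed=1):
    """Simulate trials in which both arms truly perform identically.

    Returns (fraction of trials 'significant' at the interim look,
             fraction of those that are no longer significant at completion).
    """
    rng = random.Random(seed)
    n_interim = round(n_final * interim_fraction)
    early_hits = 0
    vanished = 0
    for _ in range(n_trials):
        arm_a = [rng.random() < true_rate for _ in range(n_final)]
        arm_b = [rng.random() < true_rate for _ in range(n_final)]
        p_interim = two_sided_p(sum(arm_a[:n_interim]), sum(arm_b[:n_interim]), n_interim)
        if p_interim < alpha:
            early_hits += 1
            p_final = two_sided_p(sum(arm_a), sum(arm_b), n_final)
            if p_final >= alpha:
                vanished += 1
    return early_hits / n_trials, (vanished / early_hits if early_hits else 0.0)


early_rate, vanished_fraction = simulate_null_trials()
print(f"'Significant' at interim look: {early_rate:.1%}")
print(f"Of those, not significant at completion: {vanished_fraction:.1%}")
```

Under these assumptions, a trial stopped at its first ‘significant’ interim result would report a difference that does not exist, whereas continuing to the planned sample size allows many such chance signals to disappear.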

Our Unit's trial of adjuvant intrapleural streptokinase in pleural infection5 is a real-life example of early and misleading statistical significance. During recruitment to this trial the independent Data Monitoring Committee first reviewed the data for safety assessment reasons after ∼40% of the recruitment. At this review there was a ‘statistically significant’ difference between the study groups in the frequency of death and surgery, with a p value of ∼0.01. This was not thought to constitute ‘overwhelming evidence’ of a treatment difference, was not communicated to the trial team and recruitment was allowed to proceed. By the next review this difference had disappeared and the eventual trial result was completely negative. With the benefit of hindsight, if the study had been stopped and published when this first significant p value was identified, we would have reported a result that was untrue.

Secondly, if a statistically significant difference in the trial outcome between study arms is used as a reason to stop, it is tempting to assess the data repeatedly in order to deliver the study result rapidly. Unfortunately, every ‘peek’ at the data increases the likelihood of a false-positive result. The conventional (arbitrary) threshold for statistical significance is p<0.05, which means that, if there were no true difference, a result at least this extreme would arise by chance fewer than 1 time in 20 (5%). However, if the results are assessed on 20 different occasions, a p value of <0.05 would be expected to occur by chance alone on about one of them. The investigators in this study did preplan their interim analysis, but they did not adjust their p values for this extra ‘peek’ at the data. A correction for one ‘peek’ would imply a significance threshold of ∼0.025, which was only barely achieved when the trial was stopped.
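The arithmetic behind this argument can be sketched in a few lines. The example below makes two simplifying assumptions: each look at the data is treated as an independent test (interim looks at accumulating data are in fact correlated, so the true inflation is somewhat smaller), and the correction shown is a simple Bonferroni-style division of the threshold by the number of looks. The function names are illustrative.

```python
def familywise_error(alpha, looks):
    """Chance of at least one p < alpha across `looks` independent tests
    when there is no true difference."""
    return 1 - (1 - alpha) ** looks


def bonferroni_threshold(alpha, looks):
    """Per-look p value threshold that keeps the overall error rate at or below alpha."""
    return alpha / looks


# Number of looks, chance of at least one false positive, corrected per-look threshold
for looks in (1, 2, 5, 20):
    print(looks, round(familywise_error(0.05, looks), 3), bonferroni_threshold(0.05, looks))
```

With two looks (one interim analysis plus the final analysis), the corrected threshold is 0.025, which is the figure against which the observed p=0.022 should be judged.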

Finally, stopping trials earlier than planned reduces the precision of the estimate of the outcome being assessed. This is most commonly expressed using the 95% CI for the result. In this study, the observed difference in diagnostic sensitivity between the two techniques is 16.7%; however, the 95% CI implies that the true difference could lie anywhere between 1.9% and 31.5%. This means that the advantage of using Abrams biopsy for the diagnosis of TB could be anything from enormous and vital to trivial and unimportant. If recruitment had been completed to the original target of 220 and the difference between the groups had remained unaltered, the result would have allowed us to be reasonably confident that the advantage to Abrams biopsy would lie between 10.9% and 27.3%, a more clinically useful estimate.
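The general principle, that a CI narrows as recruitment increases, can be sketched with a rough rule of thumb: under a normal approximation, the half-width of a CI shrinks in proportion to 1/√n. The helper below (a hypothetical name) applies only that rule to the reported interval. It is not the calculation used for the projection quoted above, which presumably drew on the paired trial data, so its output should not be expected to match those figures.

```python
import math


def rescale_ci(estimate, lower, upper, n_observed, n_planned):
    """Shrink a CI around `estimate`, assuming its half-width is proportional to 1/sqrt(n).

    This is a rough approximation that also assumes the point estimate stays unchanged.
    """
    half_width = (upper - lower) / 2
    scaled = half_width * math.sqrt(n_observed / n_planned)
    return estimate - scaled, estimate + scaled


# Reported values: difference 16.7%, 95% CI 1.9% to 31.5%, at 89 of 220 planned subjects
low, high = rescale_ci(16.7, 1.9, 31.5, n_observed=89, n_planned=220)
print(f"Approximate 95% CI at 220 subjects: {low:.1f}% to {high:.1f}%")
```

Whatever the exact method, the qualitative message is the same: completing recruitment would have excluded the smallest, clinically trivial differences that the truncated trial cannot rule out.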

So how should the clinician respond to these results? Despite the above limitations, these are the first high quality randomised data in this area and the authors should be congratulated on conducting such a difficult and intensive trial. The study also provides interesting data on the high diagnostic sensitivity for pleural malignancy using an ultrasound-guided closed pleural biopsy strategy (100% for combined Abrams and Tru-cut); this aspect now warrants further specific study. The results of this study suggest that there may be an advantage to Abrams biopsy over cutting needle biopsy for the diagnosis of TB. As this disease is highly prevalent in resource-poor areas, this would be an important finding with implications for practice. However, we cannot be confident that this advantage is real, or of its magnitude if it is. Accordingly, clinical practice should not change on the basis of these results. This study demonstrates that such a trial is deliverable, and that there is a significant possibility that a simpler diagnostic strategy may be advantageous, consistent with previous evidence on the relatively high yield of Abrams biopsies in the diagnosis of pleural TB. Given the lack of data in this field, a fully completed randomised study of similar design would be the appropriate next step to advance diagnostic strategies in this important disease, and this study provides accurate data on the sample size required for such a trial. If such a study again demonstrated high diagnostic yields for both TB and malignancy with an ultrasound-guided blind technique, future assessment of this strategy compared with thoracoscopy or CT-guided biopsy would be warranted.

Footnotes

  • Linked articles 125146.

  • Competing interests None.

  • Provenance and peer review Commissioned; not externally peer reviewed.
