

HRCT diagnosis of diffuse parenchymal lung disease: inter-observer variation
Z A Aziz,1 A U Wells,2 D M Hansell,1 G A Bain,3 S J Copley,4 S R Desai,5 S M Ellis,6 F V Gleeson,7 S Grubnic,8 A G Nicholson,9 S P G Padley,10 K S Pointon,11 J H Reynolds,12 R J H Robertson,13 M B Rubens1

1 Department of Radiology, Royal Brompton Hospital, London, UK
2 Interstitial Lung Unit, Royal Brompton Hospital, London, UK
3 Department of Radiology, Central Middlesex Hospital, London, UK
4 Department of Radiology, Hammersmith Hospital, London, UK
5 Department of Radiology, King’s College Hospital, London, UK
6 Department of Radiology, London Chest Hospital, London, UK
7 Department of Radiology, Churchill Hospital, Oxford, UK
8 Department of Radiology, St George’s Hospital, London, UK
9 Department of Histopathology, Royal Brompton Hospital, London, UK
10 Department of Radiology, Chelsea and Westminster Hospital, London, UK
11 Department of Radiology, Nottingham City Hospital, Nottingham, UK
12 Department of Radiology, Birmingham Heartlands Hospital, Birmingham, UK
13 Department of Radiology, Leeds General Infirmary, Leeds, UK

Correspondence to:
Professor D M Hansell, Department of Radiology, Royal Brompton Hospital, London SW3 6NP, UK


Background: This study was designed to measure inter-observer variation between thoracic radiologists in the diagnosis of diffuse parenchymal lung disease (DPLD) using high resolution computed tomography (HRCT) and to identify areas of difficulty where expertise, in the form of national panels, would be of particular value.

Methods: HRCT images of 131 patients with DPLD (from a tertiary referral hospital (n = 66) and regional teaching centres (n = 65)) were reviewed by 11 thoracic radiologists. Inter-observer variation for the first choice diagnosis was quantified using the unadjusted kappa coefficient of agreement. Observers stated differential diagnoses and assigned a percentage likelihood to each. A weighted kappa was calculated for the likelihood of each of the six most frequently diagnosed disease entities.

Results: Observer agreement on the first choice diagnosis was moderate for the entire cohort (κ = 0.48) and was higher for cases from regional centres (κ = 0.60) than for cases from the tertiary referral centre (κ = 0.34). 62% of cases from regional teaching centres were diagnosed with high confidence and good observer agreement (κ = 0.77). Non-specific interstitial pneumonia (NSIP) was in the differential diagnosis in most disagreements (55%). Weighted kappa values quantifying the likelihood of specific diseases were moderate to good (mean 0.57, range 0.49–0.70).

Conclusion: There is good agreement between thoracic radiologists for the HRCT diagnosis of DPLD encountered in regional teaching centres. However, cases diagnosed with low confidence, particularly where NSIP is considered as a differential diagnosis, may benefit from the expertise of a reference panel.

Keywords:
  • inter-observer variation
  • interstitial lung disease
  • high resolution computed tomography

Abbreviations:
  • AIP, acute interstitial pneumonia
  • COP, cryptogenic organising pneumonia
  • DPLD, diffuse parenchymal lung disease
  • EAA, extrinsic allergic alveolitis
  • IPF, idiopathic pulmonary fibrosis
  • LAM, lymphangioleiomyomatosis
  • LCH, Langerhans’ cell histiocytosis
  • NSIP, non-specific interstitial pneumonia
  • SRILD, smoking related interstitial lung disease


The evaluation of high resolution computed tomography (HRCT) as a diagnostic test has centred on its diagnostic accuracy. One aspect of HRCT that has not been comprehensively evaluated is observer variation in the context of diffuse parenchymal lung disease (DPLD), particularly among radiologists at regional teaching centres. Observer variation is an important aspect of the reproducibility of a diagnostic test, and is relevant to both radiological and histological evaluations which rely on subjective interpretation. In both disciplines the essential skill is pattern recognition and the classification of abnormal morphological patterns. Currently, the majority of patients with non-sarcoid interstitial lung disease are managed on the basis of HRCT observations, without histological data, making knowledge of observer variation pivotal to the routine clinical use of HRCT.

The review of HRCT by expert consensus opinion has been advocated in the recent American Thoracic Society/European Respiratory Society statement on the idiopathic interstitial pneumonias.1 British Thoracic Society guidelines for the investigation and management of diffuse parenchymal lung disease have also recommended that national panels should be formed in order to standardise both HRCT and histopathological evaluation.2 Although a national panel of UK histopathologists has existed for several years, a similar initiative has yet to be undertaken by radiologists with expertise in HRCT.

The aim of our study was to determine the level of observer variation for the HRCT diagnosis of diffuse lung disease and to identify areas of difficulty where expertise, in the form of a national panel, would be of particular use. This question was addressed by quantifying the extent to which experienced radiologists agree with each other in the HRCT diagnosis of diffuse lung disease in general, and in separate subgroups of (a) consecutive unselected cases and (b) cases posing greater diagnostic difficulties in which surgical biopsy was undertaken.


Methods

Patient population

The patient population (n = 131) consisted of two cohorts: 66 consecutive patients undergoing HRCT at a single tertiary referral hospital between January 1996 and December 1998 in whom a surgical lung biopsy was performed within 1 month and a histological diagnosis of diffuse lung disease was made (group A), and 65 consecutive patients undergoing HRCT at regional teaching centres in whom appearances were considered compatible with diffuse lung disease by the radiologist providing the case (group B). Ten of the 11 participating radiologists were each asked to provide six or seven consecutive cases, so that the two groups contained similar numbers of cases. Cases of predominantly airway disease (such as bronchiectasis or constrictive bronchiolitis) or infection were excluded.

HRCT scanning protocol

All HRCT scans at the tertiary centre (group A, n = 66) were obtained on a CT scanner (Imatron Inc, San Francisco, CA, USA) with 1.5 mm collimation at full inspiration. Scans were obtained at 10 mm intervals in the supine position and images were reconstructed with a high spatial frequency algorithm and photographed at window settings appropriate for viewing the lung parenchyma (window centre −550 HU; window width 1500 HU). Currently accepted protocols3 for the acquisition of the HRCT scans in group B (n = 65) were used by the participating centres. All the images were evaluated on hard copy.

Observer characteristics and evaluation of images

The participating radiologists, all working at teaching hospitals, had completed their general radiological training 5–26 years previously and all had a declared interest in the HRCT diagnosis of diffuse lung disease. Details of each observer’s experience are summarised in table 1. The HRCT images were reviewed without the provision of any clinical information. Differential diagnoses were specified with percentage likelihoods (censored at 5%, summing to 100% in each case). Observers were free to diagnose any disease entity that they considered classifiable as a diffuse lung disease, the only stipulation being that the recent ATS/ERS classification and terminology for the idiopathic interstitial pneumonias was used when applicable.1 In addition, the likelihood that the disease was reversible as judged by the HRCT appearances4–6 was graded on a scale of 1–5 (<5%, 5–25%, 30–65%, 70–90%, and 95–100%, respectively).

Table 1

Details of the 11 observers

Data analysis

For the purposes of analysis, diagnostic statements were categorised into 17 diagnostic subgroups (box 1). All statistical analyses were performed using STATA data analysis software (Computing Resource Centre, Santa Monica, CA, USA).

Box 1 Categories of disease used for statistical analyses

  • Idiopathic pulmonary fibrosis;

  • Non-specific interstitial pneumonia;

  • Smoking related interstitial lung disease (respiratory-bronchiolitis interstitial lung disease and desquamative interstitial pneumonia);

  • Cryptogenic organising pneumonia;

  • Lymphoid interstitial pneumonia;

  • Acute interstitial pneumonia;

  • Sarcoidosis;

  • Extrinsic allergic alveolitis;

  • Asbestosis;

  • Drug induced lung disease;

  • Langerhans’ cell histiocytosis;

  • Lymphangitis carcinomatosa;

  • Eosinophilic pneumonia;

  • Lymphangioleiomyomatosis;

  • Bronchoalveolar cell carcinoma;

  • Other (amyloidosis, silicosis, follicular bronchiolitis, idiopathic pulmonary haemorrhage, lipoid pneumonia, pulmonary oedema, Churg-Strauss syndrome and Wegener’s granulomatosis);

  • Unclassifiable.

In each case the diagnosis of first choice was assigned a confidence rating of 1 (diagnostic likelihood <70% = low confidence), 2 (diagnostic likelihood 70–95% = high confidence), or 3 (100% = pathognomonic). The categories chosen were based on those used to assess the clinical probability of pulmonary embolism in the PIOPED study.7
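
As a minimal sketch of this grading rule (the function name is ours, not the study's), the mapping from diagnostic likelihood to confidence rating can be written as:

```python
def confidence_rating(likelihood_pct):
    """Map a first choice diagnostic likelihood (%) to the study's
    confidence rating: 1 = low (<70%), 2 = high (70-95%),
    3 = pathognomonic (100%)."""
    if likelihood_pct == 100:
        return 3
    if likelihood_pct >= 70:
        return 2
    return 1
```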

Unadjusted kappa coefficients of agreement (κ) were computed (a) in the entire cohort, (b) in separate subgroups with summed confidence scores above and below the median value (that is, cases diagnosed with high and low confidence), and (c) in groups A and B.
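
The paper does not state which multi-rater generalisation of kappa was used; Fleiss' kappa is one common formulation for multiple observers each assigning a case to a single category, and a sketch of it (assuming every case is rated by the same number of observers) is:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for n raters assigning each case to one of k categories.
    `ratings` is a list of per-case lists of category labels (one per rater)."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({c for case in ratings for c in case})
    # counts[i][j] = number of raters assigning case i to category j
    counts = [[Counter(case)[cat] for cat in categories] for case in ratings]
    # per-case observed agreement among rater pairs
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_cases
    # overall proportion of assignments falling in each category
    p_j = [sum(row[j] for row in counts) / (n_cases * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)  # chance-expected agreement
    return (p_bar - p_e) / (1 - p_e)
```

Complete agreement on every case returns 1.0; values fall towards (and below) zero as agreement approaches what chance alone would produce.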

The weighted kappa coefficient of agreement (κw) was then used to calculate the observer variation for the estimation of the probability of each of the six most frequently diagnosed conditions. In order to do this, the percentage likelihood given to each diagnosis was assigned a grade of 0–4 representing clinically useful probabilities: grade 0 = condition not included in the differential diagnosis, grade 1 = low probability (5–25%), grade 2 = intermediate probability (30–65%), grade 3 = high probability (70–95%), and grade 4 = pathognomonic (100%). Weighted kappa values were calculated for idiopathic pulmonary fibrosis (IPF), non-specific interstitial pneumonia (NSIP), smoking related interstitial lung disease (SRILD), cryptogenic organising pneumonia (COP), sarcoidosis, and extrinsic allergic alveolitis (EAA). Weighted kappa values were calculated between paired observers and hence κw is expressed as median values with ranges for the 55 possible combinations of 11 observers (observer 1 v observer 2, observer 1 v observer 3, etc). κw values for the prediction of reversibility of disease were calculated for the entire cohort.
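
The grading and the pairwise weighted kappa might be sketched as follows; the paper does not specify its weighting scheme, so linear weights (a common convention) are assumed here, and the function names are ours:

```python
def probability_grade(pct):
    """Map a percentage likelihood to the study's 0-4 probability grade."""
    if pct == 0:
        return 0   # condition not in the differential diagnosis
    if pct <= 25:
        return 1   # low probability (5-25%)
    if pct <= 65:
        return 2   # intermediate probability (30-65%)
    if pct < 100:
        return 3   # high probability (70-95%)
    return 4       # pathognomonic (100%)

def weighted_kappa(a, b, n_grades=5):
    """Linearly weighted kappa between two observers' grade lists (0-4).
    Disagreement weight grows with the distance between grades."""
    n = len(a)
    w = lambda i, j: abs(i - j) / (n_grades - 1)
    observed = sum(w(x, y) for x, y in zip(a, b)) / n
    # marginal grade frequencies for each observer
    fa = [a.count(g) / n for g in range(n_grades)]
    fb = [b.count(g) / n for g in range(n_grades)]
    expected = sum(fa[i] * fb[j] * w(i, j)
                   for i in range(n_grades) for j in range(n_grades))
    return 1.0 - observed / expected
```

Note that 45% and 55% both map to grade 2, so a close-call swap of percentages between two observers registers as agreement at this granularity, which is the motivation for the weighted analysis discussed later.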

Data were interrogated to identify the sources of inter-observer variation, and cases in which divergent diagnoses were made by at least two observers were tabulated. For example, a case rated IPF (n = 6), NSIP (n = 2), and EAA (n = 3) was categorised as diagnostic discordance between IPF and NSIP, IPF and EAA, and EAA and NSIP.
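
The tabulation rule can be illustrated with a short sketch (names are ours): each case's diagnoses made by at least two observers are expanded into unordered discordant pairs.

```python
from collections import Counter
from itertools import combinations

def discordance_pairs(cases):
    """Tabulate pairwise diagnostic discordance. Each case is a dict of
    first choice diagnosis -> number of observers; diagnoses offered by
    at least two observers are expanded into unordered discordant pairs."""
    pairs = Counter()
    for counts in cases:
        divergent = sorted(d for d, n in counts.items() if n >= 2)
        for pair in combinations(divergent, 2):
            pairs[pair] += 1
    return pairs
```

For the worked example in the text, `discordance_pairs([{"IPF": 6, "NSIP": 2, "EAA": 3}])` yields one disagreement each for IPF/NSIP, IPF/EAA, and EAA/NSIP.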

Observer agreement was categorised as poor, fair, moderate, good, or excellent according to κ values of <0.20, 0.20–0.39, 0.40–0.59, 0.60–0.79, and >0.80, respectively.8
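
These bands translate directly into code (the boundary value 0.80 itself is assigned to "excellent" here; the published bands leave it unstated):

```python
def agreement_category(kappa):
    """Verbal category for a kappa coefficient, per the bands
    cited in the study."""
    if kappa < 0.20:
        return "poor"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "good"
    return "excellent"
```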


Results

The prevalence of the conditions in groups A and B, as judged by the diagnosis offered most frequently by the radiologists for each case, is shown in table 2.

Table 2

Prevalence of individual diseases in groups A and B based on the diagnosis offered most frequently by the radiologists for each case

Prevalence of pathognomonic, high confidence, and low confidence observations

The median prevalence of first choice diagnoses thought to be pathognomonic or made with high confidence (>70% likelihood) was 69% (range 41–79%); appearances were considered by the 11 radiologists to be pathognomonic in a median of 25% of cases (range 0–49%). First choice diagnoses were made with low confidence (<70% likelihood) in a median of 31% of cases (range 21–59%). When the cohort was subdivided, first choice diagnoses were made with high confidence in a median of 60% (range 27–80%) and 77% (range 55–82%) for groups A and B, respectively.

Variation in first choice diagnosis

There was moderate agreement (κ = 0.48) on the first choice diagnosis for the entire cohort. Agreement for first choice diagnoses of the six most frequently offered diagnoses (IPF, NSIP, sarcoidosis, EAA, COP, and SRILD) was moderate to good, with agreement greatest for an HRCT diagnosis of sarcoidosis (table 3).

Table 3

Unweighted kappa (κ) coefficients of agreement for the first choice diagnosis for the six most frequently diagnosed conditions

Observer agreement was substantially higher in unselected consecutive cases from regional teaching centres (group B, κ = 0.60) than in biopsied tertiary referral cases (group A, κ = 0.34). Similarly, observer agreement was substantially higher in cases diagnosed with high confidence (κ = 0.68) than in those diagnosed with low confidence (κ = 0.28). For group B cases diagnosed with high confidence (40/65 = 62%), agreement was good to excellent (κ = 0.77).

Variation in diagnostic probabilities and in the probability of reversible disease

There was moderate to good agreement on the probability of the six most prevalent diagnoses, as shown in table 4. Weighted kappa values were moderate for the likelihood of COP, NSIP, and SRILD, and good for EAA, IPF, and sarcoidosis.

Table 4

Weighted kappa coefficients (κw) for the six most frequently diagnosed diffuse lung diseases

Agreement on the likelihood that disease was reversible was good (median κw = 0.61; 25th to 75th percentile 0.56–0.67).

Sources of inter-observer variation

The sources of inter-observer variation are shown in table 5. There were 138 individual disagreements in the entire cohort, of which 31 (22%) related to IPF/NSIP discordance. Overall, the diagnosis of NSIP was a frequent source of observer variation, being involved in 76 of the 138 disagreements (55%).

Table 5

Noise analysis where divergent observations were made on four or more occasions


Discussion

HRCT is the major diagnostic advance of the past two decades in diffuse lung disease,3 yet the inter-observer agreement in HRCT reporting has not been fully evaluated. Quantifying the observer agreement of a diagnostic test should form part of its formal evaluation;9 it is an important insight into a test’s usefulness and may disclose strengths and expose weaknesses of the test that are not readily apparent from more conventional diagnostic accuracy studies. Hence, the aim of this study was to quantify the level of observer agreement among practising thoracic radiologists in the diagnosis of diffuse lung disease in order to determine the need for a reference panel.

For most cases from regional centres the first choice diagnosis was made with high confidence and good observer agreement (κ = 0.77). However, for the approximately one third of regional cases in which the diagnosis was made with low confidence, reference panel review is likely to be beneficial, especially if NSIP is suspected.

An important facet of this study is the use of weighted kappa to evaluate agreement in the estimation of diagnostic probabilities. In certain cases the diagnosis may be a close call—for example, observer 1 may state IPF 45%, NSIP 55% and observer 2 IPF 55%, NSIP 45%. The use of the unweighted kappa in this scenario would give the impression of spurious disagreement between the two observers, despite the fact that the percentage probabilities assigned to each condition were very similar. By converting percentage probabilities into five categories and then applying the weighted kappa, we were able to assess agreement across a range of clinically useful probabilities in specific diffuse lung diseases. The good weighted kappa value for IPF (κw = 0.63) therefore reflects agreement among radiologists for the exclusion of the disease. This is of particular relevance in view of the poor prognosis of patients with a typical HRCT appearance of IPF.10 Weighted kappa values were good or moderate for the other five most frequently offered diagnoses (range 0.49–0.70).

We also analysed the sources of inter-observer variation by identifying the frequency with which specific diseases were offered as a differential diagnosis. Our data indicate that 55% of observer noise was related to the diagnosis of NSIP. The greatest area of disagreement was in the distinction between IPF and NSIP (22% of overall noise), but there were also problems distinguishing NSIP from other diseases, particularly EAA, SRILD, COP, and sarcoidosis. The difficulties in making a diagnosis of NSIP most likely stem from differences in the HRCT descriptions of NSIP in the published literature.11–14 In addition, several studies have emphasised the significant overlap between NSIP and IPF, NSIP and EAA, and NSIP and COP.12,15,16

Agreement in identifying reversible disease on HRCT scans was good (κ = 0.61). Studies have established that parenchymal consolidation,17 nodules,18–20 and ground glass opacity not associated with traction bronchiectasis or bronchiolectasis21 are signs of reversible lung disease. The good kappa value indicates that radiologists are aware of, and agreed on, the features of disease reversibility on the HRCT scan. Arguably, a statement on disease reversibility in some cases is as useful as the diagnosis itself.

To the best of our knowledge, there are no similar studies that have evaluated observer agreement for the HRCT diagnosis of consecutive cases of diffuse lung disease. A study by Collins et al assessed observer variation in pattern type and disease extent in fibrosing alveolitis on HRCT scans,22 but agreement on individual CT patterns does not necessarily translate into overall diagnostic agreement. The requisites of an observer agreement study include a large number of observers, cases that are representative of those encountered in everyday clinical practice, and observers who are not all academic radiologists but who provide a substantial proportion of the total number of opinions on cases of DPLD for the patient population.

Some of the early studies that evaluated the diagnostic accuracy of HRCT also included statements on observer variation, but they are limited because they predate the recent classification of the idiopathic interstitial pneumonias, unusual diagnoses were over-represented,23,24 and the number of observers was small.24–26 Kappa values were 0.78 and 0.75 for the studies by Grenier et al23 and Lee et al,25 respectively; apparently much higher than in the present study, but comparison of κ values between studies in which disease prevalence varies considerably is fraught as the κ value is highly dependent on disease prevalence.27 More recently, a study by Johkoh et al achieved a κ value of 0.55 for a correct HRCT diagnosis,16 although in this study all the observers were aware that the differential diagnosis was limited to just the five types of idiopathic interstitial pneumonia which probably increased the value of κ. In a study assessing the need for a lung biopsy in patients with suspected IPF,28 agreement for the presence or absence of IPF was similar to that found in our study (κ = 0.54 and 0.50, respectively).

One of the strengths of this study was the comparison between cases from a tertiary centre and those from secondary practices. Reliance on tertiary referral practice cases alone would have produced biased results; indeed, this has been a criticism of early studies that have quantified observer variation in cases that have not been representative of those encountered in routine clinical practice. The inclusion of cases from both secondary and tertiary practices provides a more representative picture of the observer variation that actually exists. The difference in κ values for the tertiary cases (group A) compared with regional centre cases (group B) was striking (0.34 and 0.60, respectively). A possible explanation is that cases at a referral centre are more likely to represent those at the unusual end of the spectrum and, by virtue of referral patterns, comprise the more challenging cases. Additionally, all these cases (group A) had a surgical biopsy implying that the HRCT appearances were not characteristic, although it is possible that referrals to a tertiary centre trigger a biopsy response more readily than at regional centres. Nevertheless, the difference between the two groups is clear.

Our results also show, not surprisingly, that diagnoses were made with low confidence in a greater proportion of group A cases than group B cases and, in addition, that observer agreement was highest in cases diagnosed with high confidence. The study by Mathieson et al24 in the early 1990s established that, when a confident HRCT diagnosis was made, it was correct in 93% of cases. The link between confidence and accuracy suggests that cases where the diagnosis is made with low confidence may benefit from interpretation by a panel of radiologists with particular expertise in HRCT.

A recent study published in a companion paper has evaluated the observer variation between pathologists in diffuse lung disease.29 The basic design of the studies was similar although, in the study of inter-observer variation between pathologists, observers chose a diagnosis from a specified list of 15 categories. In the present HRCT study observers could state any disease entity that was classified as a diffuse lung disease. This difference in methodology may have artificially increased inter-observer agreement for the pathologists. Nevertheless, κ values for tertiary referral cases were similar for radiologists and pathologists (0.34 and 0.38, respectively). These results highlight the fact that, in difficult cases of DPLD, reliance on either imaging or pathology in isolation is inadvisable. As suggested by the ATS/ERS guidelines,1 a concerted effort should be made to integrate clinical information, HRCT findings, and the pathology (if this is available) before a final diagnosis is formulated.

Several issues surrounding this study require clarification. Firstly, we included cases without a histological diagnosis to allow for a comparison of observer variation between cases from a tertiary centre that come to biopsy with non-biopsied secondary practice cases. In addition, with increasing reliance on HRCT, biopsy cases are no longer representative of the larger population of patients with interstitial lung disease. Indeed, our results show that the quantification of observer variation exclusively in biopsied cases of interstitial lung disease produces biased results. Secondly, all observer variation studies are inherently artificial because the test under evaluation is assessed in isolation, without the assimilation of clinical information that contributes to the diagnostic process. This is necessary as all aspects of a diagnostic test—observer variation being no exception—need to be evaluated without clinical information so that the results reflect the true properties of the test under scrutiny. The integration of clinical information with radiological assessment would be inappropriate in this study where the aim was specifically to quantify the observer noise for HRCT.

Non-thoracic radiologists were not included in the study because we felt it would be more appropriate to assess observer variation among thoracic radiologists who report HRCT scans on a regular basis and provide the opinions on which decisions are made. A reference panel is only warranted when experts disagree, so we specifically included thoracic radiologists with a designated interest in interstitial lung disease. Finally, the diagnoses used throughout the study represent the radiologists’ diagnoses based on HRCT appearances. There was no independent “gold standard” as this concept is irrelevant to a study assessing observer agreement. This study has not attempted to measure the accuracy of HRCT in diffuse lung disease; importantly, a high level of agreement is not equivalent to high accuracy. There are mathematical models that may be used to estimate accuracy from agreement30 but, as a general rule, agreement (as measured by kappa) should not be used as a surrogate for accuracy.

This type of study is subject to differences in how individuals behave when asked to express confidence as numerical probabilities; in our group of radiologists, for example, one observer never used 100% probability. A further related factor is that qualitative expressions of probability have different numerical meanings to different individuals, even those in the medical profession.31 A radiologist interpreting an HRCT scan may conclude that the appearances are “likely” to represent sarcoidosis and state the likelihood of this diagnosis to be 80%, whereas another radiologist who also thinks that sarcoidosis is “likely” may record 60%. However, it is hoped that, by placing the percentages into clinically useful categories (for the weighted kappa analyses), some of the noise introduced by the different perceptions of the observers will have been minimised.

In conclusion, we have shown that thoracic radiologists are within the clinically acceptable range of observer variation for cases of diffuse lung disease encountered in regional teaching centres. However, the low agreement observed for cases diagnosed with low confidence justifies a reference panel that would parallel the existing UK pathology group. The purpose of the reference panel would not be to improve the accuracy of HRCT against biopsy, but to provide an opportunity to reach consensus and standardise diagnoses in areas of contention.

