Background The grading of radiological severity in clinical trials in tuberculosis (TB) remains unstandardised. The aim of this study was to generate and validate a numerical score for grading chest x-ray (CXR) severity and predicting response to treatment in adults with smear-positive pulmonary TB.
Methods At a TB clinic in Papua, Indonesia, serial CXRs were performed at diagnosis, 2 and 6 months in 115 adults with smear-positive pulmonary TB. Radiographic findings predictive of 2-month sputum microscopy status were used to generate a score. The validity of the score was then assessed in a second data set of 139 comparable adults with TB, recruited 4 years later at the same site. Relationships between the CXR score and other measures of TB severity were examined.
Results The estimated proportion of lung affected and presence of cavitation, but not cavity size or other radiological findings, significantly predicted outcome and were combined to derive a score given by percentage of lung affected plus 40 if cavitation was present. As well as predicting 2-month outcome, scores were significantly associated with sputum smear grade at diagnosis (p<0.001), body mass index, lung function, haemoglobin, exercise tolerance and quality of life (p<0.02 for each). In the validation data set, baseline CXR score predicted 2-month smear status significantly more accurately than did the proportion of lung affected alone. In both data sets, CXR scores decreased over time (p<0.001).
Conclusion This simple, validated method for grading CXR severity in adults with smear-positive pulmonary TB correlates with baseline clinical and microbiological severity and response to treatment, and is suitable for use in clinical trials.
- chest radiograph
Sputum smear microscopy, and culture where available, are standardised modalities for diagnosing and monitoring treatment response in pulmonary tuberculosis (TB). Chest radiography (CXR) provides useful information regarding disease extent and progress, but there is no agreed-upon, validated system for grading the severity of CXR abnormalities in bacteriologically proven pulmonary TB. Several methods were devised for this purpose at the time of early TB treatment trials, such as those described by the Madras TB Chemotherapy Centre in 1960,1 Simon in 19662 and the National TB and Respiratory Disease Association of the USA in 1969.3 Despite this, no system has been validated in predicting outcome in more than one patient population. Recent randomised controlled trials (RCTs)4–7 and observational studies in adults with TB8–13 illustrate this lack of standardisation in the grading of radiological severity, with each of these studies utilising different non-validated investigator-generated systems to grade CXR severity.
The same problem of non-standardised radiological reporting has recently been articulated in relation to TB screening by Dawson et al, who evaluated and recommended the Chest Radiographic Reading and Reporting System14 for TB screening in HIV-positive people.15 However, this and other screening tools16–18 seek to identify the presence of latent TB infection or active disease, and are not useful for researchers wishing to accurately document severity or response to treatment in active TB.
Problems in CXR reporting arise from the heterogeneous CXR manifestations of pulmonary TB (eg, in primary vs postprimary disease, adults vs children, immunocompetent vs immunocompromised)19–21 and from inaccuracies inherent in CXR performance and interpretation,2 including limited interobserver agreement on CXR findings.22 23 Despite these shortcomings, the utility of CXR is well established in TB diagnosis and clinical monitoring.
Associations between radiological extent and other measures such as forced expiratory volume in 1 s (FEV1), age or multidrug-resistant (MDR)-TB have previously been identified,8 24 but a standard, simple, numerical score, validated against TB outcome, in repeated data sets, is lacking. We therefore aimed to devise a simple CXR score for use in adults with smear-positive pulmonary TB, which predicts outcome and correlates with bacteriological and clinical severity markers, for the purpose of grading severity and monitoring treatment response in the context of TB clinical trials. We then determined the utility of the score in a separate, comparable patient population.
The study was conducted at a community-based TB clinic in Timika, Papua Province, Indonesia. Timika has a population of ∼200 000 and an estimated TB incidence of 311/100 000.25
Adults (>15 years) diagnosed with sputum smear-positive pulmonary TB who gave written informed consent were eligible for enrolment in the study. Study participants were recruited during two time periods: 2003–2004 (the derivation data set) and 2008–2009 (the validation data set). The demographic, clinical and microbiological findings and outcomes in the first data set have been reported previously.26 27
Standard full-size posteroanterior CXRs were performed at the time of TB diagnosis and 2 and 6 months thereafter, with reports provided by a clinician at the field site (first data set, PMK; second data set, APR) and, additionally for the first data set, by one of two radiologists (MJW or GD). During the first data collection period, the presence of small (1–2 mm) or large (>2 mm) nodules, patchy or confluent consolidation, cavitation, bronchial lesions or fibrosis was reported for each of three zones (upper, mid or lower zones) in each lung. The presence of effusion or lymphadenopathy was reported, the total percentage of each lung affected by any pathology was estimated, total cavity size in millimetres was recorded and the effusion volume (percentage of lung field) was estimated. To grade the percentage of affected lung, visual estimation of the extent of opacification, cavitation or other pathology as a percentage of visible lung was made; dense opacification of a zone was graded as 100% of that zone, while patchy opacification within a zone attracted scores <100% depending on the extent of opacification. Other remarks including presence of miliary disease were recorded. During the second data collection period, a simplified CXR report method was used (percentage lung affected, cavitation (0, <4 cm, ≥4 cm), effusion (0, <25%, ≥25% of hemithorax), presence of consolidation, fibrosis, nodules, miliary disease). Reporters were blinded to HIV status, bacteriological and clinical parameters and treatment outcome.
Sputum microscopy and clinical evaluations
Baseline sputum microscopy was performed at the onsite laboratory and repeated at the reference laboratory on samples collected at 0, 2 and 6 months, and the density of acid-fast bacilli (AFB) was graded as 1, 2 or 3+ according to standard protocols.26 27 Baseline and follow-up evaluations included: body mass index (BMI), FEV1 (spirometry performed using ML3535C, MicroLoop, MicroMedical, Chatham, UK), haemoglobin (Hb), measured using point-of-care HemoCue (Ängelholm, Sweden) or iSTAT (Abbott Park, Illinois, USA) tests, 6 min walk test (distance walked in 6 min on a straight walking track), measured according to American Thoracic Society guidelines, and St George's Respiratory Questionnaire (SGRQ) modified to reflect local conditions and translated into Indonesian.27 28 Standard definitions were used for nutritional category (normal, mild malnutrition, moderate malnutrition or severe malnutrition) according to BMI,29 and for TB treatment outcome at 6 months (cured, completed, transferred, defaulted, failed or died).30 Impairment in FEV1 as a percentage of predicted values was calculated using previously established local reference ranges.31
The outcome measure used in this study is 2-month sputum AFB microscopy status. Two-month smear positivity has been previously shown to predict unfavourable outcomes including treatment failure and death,32–34 and determines the need for continued intensive-phase treatment versus switching to continuation-phase therapy.30 Although an imperfect predictor of outcome,35 in the absence of suitable alternatives, it remains a commonly used surrogate end point.
Statistical calculations were performed using Intercooled Stata 10.1 (StataCorp, College Station, Texas, USA); graphs were created in GraphPad Prism 5 (GraphPad, La Jolla, California, USA). Statistical tests were two sided, with a p value of <0.05 indicating statistical significance. Intergroup differences in means or medians were compared using two-sample t tests, Wilcoxon rank sum tests, analysis of variance or Kruskal–Wallis tests as appropriate.
Agreement between reporters in the derivation data set was tested using the concordance correlation coefficient (ρc) for continuous variables or the kappa statistic for categorical variables. Prevalence-adjusted, bias-adjusted kappa values were calculated according to the method described by Byrt et al.36 Kappa values were interpreted according to guidelines given by Landis and Koch37 (kappa ≤0.00, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect).
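As an illustration of the agreement statistics described above, the following minimal sketch computes Cohen's kappa and the prevalence-adjusted, bias-adjusted kappa (PABAK) for two readers making a present/absent call on each film. It is not the study's Stata analysis: the function names and the 20 example readings are hypothetical, and the PABAK formula used (2 × observed agreement − 1) applies only to two-category findings.

```python
def agreement_stats(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa, PABAK) for two binary raters.

    Each argument is a sequence of 0/1 calls (finding absent/present), one per film.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of films on which both readers make the same call.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each reader's own rate of reporting the finding.
    rate_a = sum(rater_a) / n
    rate_b = sum(rater_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")
    # PABAK for a two-category finding: chance agreement is fixed at 0.5.
    pabak = 2 * p_observed - 1
    return p_observed, kappa, pabak


def landis_koch(kappa):
    """Label a kappa value using the Landis and Koch bands quoted in the text."""
    if kappa <= 0.0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical readings of 20 films for a finding that both readers report often.
reader_1 = [1] * 16 + [1, 1] + [0] + [0]
reader_2 = [1] * 16 + [0, 0] + [1] + [0]
p_observed, kappa, pabak = agreement_stats(reader_1, reader_2)
print(f"observed agreement {p_observed:.2f}")
print(f"kappa {kappa:.2f} ({landis_koch(kappa)})")
print(f"PABAK {pabak:.2f} ({landis_koch(pabak)})")
```

In this invented example the readers agree on 17 of 20 films, yet the unadjusted kappa is low because the finding is reported in most films. This is the situation in which a prevalence-adjusted value can indicate stronger agreement than the raw kappa.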
The relationships between radiographic findings and clinical outcome were examined by multivariable regression analysis, using a forward stepwise approach in which any radiological variable found to be significant (p<0.05) in univariate analysis was included in the initial model. Goodness of fit of final models was assessed using the Hosmer–Lemeshow test and compared using the likelihood ratio test. The weighting for a numerical radiological score was derived from the regression coefficients. Its ability to predict outcome in the validation data set was determined using receiver operating characteristic (ROC) analysis and the area under the curve (AUC). The relationships between this score and demographic, biological and clinical variables were determined in data sets 1 and 2 with regression models built on the same principles.
Approval was granted by the ethics committees of the National Institute of Health Research and Development (Jakarta, Indonesia), Menzies School of Health Research (Darwin, Australia) and the Australian National University (Canberra, Australia). Written informed consent was obtained from participants in Indonesian or an appropriate Papuan language.
Characteristics of study participants in the two data collection phases are shown in table 1. All participants had smear-positive pulmonary TB (≥2 AFB smear-positive sputum samples); the result of an additional sample provided for microscopy and culture on the day of treatment commencement is reported here. This was negative in 5.7% and 7.2% of participants in the two data sets, respectively, despite their previous samples being positive. Initial smear grade predicted the likelihood of smear conversion by 2 months. In the derivation data set, failure to convert to smear negative by 2 months was observed in 60.9% of patients with a baseline smear grade of 3+ and in 38.7% of patients with a baseline smear grade of <3+ (p=0.051). In the validation data set, failure to convert to smear negative by 2 months was observed in 48.4% of patients with a baseline smear grade of 3+ and in 11.8% of patients with a baseline smear grade of <3+ (p<0.001).
CXR reports were available at baseline, 2 and 6 months for 112, 76 and 76 study participants in the first data set, and 136, 93 and 76 study participants in the second data set (incomplete in the second data set as 30 of 139 had not yet completed 6 months) (table 2). Reasons for missing CXR included patient failure to attend (died, defaulted or transferred prior to appointment), inability to obtain CXR (eg, electricity failure), CXR date >3 weeks before or after the due follow-up date, or CXR unavailable for reporting.
Agreement on radiological abnormalities
Agreement between reporters on radiological abnormalities in the derivation data set is shown in table 3. Agreement was relatively low overall. More substantial agreement was achieved for some variables after adjusting kappa values for variable prevalence and reporter bias.
Development of score using the training data set
Two-month sputum smear status in the initial data set (n=115) was significantly predicted in univariate logistic regression models by the presence of baseline cavitation (OR 3.26, 95% CI 1.11 to 9.56) and the total percentage of lung affected (OR 1.9, 95% CI 1.3 to 2.7, calculated per each 20% increment of affected lung), but not by cavitation size, presence or number of nodules, fibrosis, effusion or lymph nodes (table 4).
The relationships between those radiological findings which were independently predictive of 2-month outcome (cavitation and percentage of lung affected) and baseline clinical and bacteriological measures were then examined. Cavitary disease on CXR at TB diagnosis was significantly associated with higher baseline AFB density in sputum (ie, smear microscopy grade) (p=0.007, χ2 test for trend), and people with cavitary disease had worse lung function, with a mean percentage predicted FEV1 of 59.0 (95% CI 54.4 to 63.6) in cavitary disease versus 68.7 (95% CI 60.6 to 76.7) in non-cavitary disease (p=0.03, two-sample t test). People with cavitary disease had slightly lower BMI (18.5 kg/m2, 95% CI 18.0 to 19.1) compared with those with non-cavitary disease (19.2 kg/m2, 95% CI 18.5 to 20.0), but this difference was not statistically significant (mean between-group difference 0.70 kg/m2, 95% CI −0.18 to 1.59). No significant associations were identified between cavitary disease and exercise tolerance (6 min walk distance), quality of life (SGRQ total or individual domain scores) or Hb.
The amount (%) of lung affected significantly predicted all clinical and laboratory variables. Specifically, greater proportions of affected lung were significantly associated with decreasing BMI category (p=0.002, Kruskal–Wallis test), lung function category (p<0.001, Kruskal–Wallis test), 6 min walk distance (p=0.001, linear regression) and Hb in males (p=0.003, linear regression), though not in females (p=0.4). A greater proportion of lung affected on the baseline CXR was also significantly associated with SGRQ total scores (p<0.001, linear regression) and with sputum smear grade at diagnosis (p<0.001, Kruskal–Wallis test).
To create a CXR score, cavitation and percentage of lung affected were included as independent variables in a logistic regression model for 2-month sputum smear status. The model containing both variables was significantly better than the model containing cavitation alone (likelihood ratio test p<0.001) or proportion of lung affected alone (likelihood ratio test p=0.016) at predicting 2-month sputum smear status. Regression coefficients were 0.03167 for proportion of lung affected and 1.26151 for presence/absence of cavitation, indicating a relative weighting of 40.27 for cavitation (1.26151÷0.03167), thereby generating an equation for the weighted score as follows:
CXR score = proportion of total lung affected (%) + 40 if cavitation is present.
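The weighting step can be sketched in a few lines of code. The coefficients are those quoted above; the function and constant names are hypothetical, and rounding the weight to 40 follows the score as stated in the abstract. Dividing the coefficients as printed gives a value slightly below the quoted 40.27, presumably through rounding of the published figures; both round to 40.

```python
# Logistic regression coefficients quoted in the text.
COEF_PERCENT_AFFECTED = 0.03167  # per 1% of lung affected
COEF_CAVITATION = 1.26151        # cavitation present vs absent

# Weight of cavitation expressed in units of "percent of lung affected".
CAVITATION_WEIGHT = round(COEF_CAVITATION / COEF_PERCENT_AFFECTED)


def cxr_score(percent_lung_affected, cavitation_present):
    """Weighted CXR score: percent of lung affected, plus 40 if cavitation is present."""
    if not 0 <= percent_lung_affected <= 100:
        raise ValueError("percent_lung_affected must lie between 0 and 100")
    return percent_lung_affected + (CAVITATION_WEIGHT if cavitation_present else 0)


# Hypothetical patient: 30% of lung affected, with cavitation.
print(cxr_score(30, True))   # 30 + 40
print(cxr_score(30, False))  # 30 + 0
```

Under this definition the score ranges from 0 (no visible disease) to 140 (whole lung affected, with cavitation).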
CXR score results
CXR score characteristics are shown in table 2 and figures 1–4. Scores did not significantly differ according to sex, ethnicity or smoking status (p>0.05, two-sample t tests), and were not significantly associated with age in univariate or multivariate analyses. Mean baseline CXR score in people with unfavourable (positive) 2-month outcomes was significantly higher (88.2; 95% CI 76.5 to 99.9) than in those with a favourable outcome (56.8; 95% CI 49.7 to 64.0), but the range of scores in each smear grade was wide (figure 1). Scores were also significantly associated with baseline microscopy grade (figure 1). CXR scores were inversely related to BMI, FEV1, Hb and 6 min walk distance, were directly related to SGRQ total score (higher SGRQ scores indicate worse quality of life) and significantly decreased over time (figures 2–4).
Performance of score using the validation data set
The weighted score calculated for the new data set showed similar characteristics (table 2), including a median baseline score of 69, no significant relationship with demographic factors, significant positive association with baseline smear grade (p=0.009, Kruskal–Wallis test) and the same relationships as were found in the initial data set between CXR score and each of the clinical/laboratory measures (BMI, Hb, FEV1, SGRQ total score and 6 min walk distance; p<0.05 in each case).
Comparing ROC curves for predicting outcome, the weighted CXR score (AUC 0.75) was significantly better at predicting 2-month smear status than the percentage of lung affected alone (AUC 0.69; p=0.013, χ2 test; figure 5). The optimal cut-off point for the weighted CXR score (the point furthest from the diagonal) was 71, at which value the sensitivity for predicting a positive sputum smear status at 2 months was 80% (95% CI 61.4 to 92.3) and specificity 67.7% (95% CI 57.3 to 77.1). Comparative sensitivity and specificity values are shown in table 5.
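The ROC quantities reported above can be illustrated with a short sketch. It assumes that a score at or above the cut-off counts as a prediction of smear positivity, and that the point furthest from the diagonal is the one maximising sensitivity + specificity − 1 (Youden's index). The ten scores and outcomes are invented for illustration and are not study data; the function names are hypothetical.

```python
def auc(scores, positives):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case (ties count half)."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores, positives, cutoff):
    """Sensitivity and specificity when score >= cutoff predicts a positive smear."""
    tp = sum(1 for s, y in zip(scores, positives) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, positives) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, positives) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, positives) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, positives):
    """Cut-off maximising Youden's index (the ROC point furthest from the diagonal)."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sens_spec(scores, positives, cutoff)
        youden = sens + spec - 1
        if best is None or youden > best[1]:
            best = (cutoff, youden)
    return best


# Hypothetical baseline CXR scores and 2-month smear status (1 = still positive).
scores = [20, 35, 50, 60, 70, 75, 90, 110, 120, 130]
smear_positive = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

cutoff, youden = best_cutoff(scores, smear_positive)
sens, spec = sens_spec(scores, smear_positive, cutoff)
print(f"AUC {auc(scores, smear_positive):.2f}")
print(f"best cut-off {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Comparing two AUCs for statistical significance, as done in the study, requires a dedicated test for correlated ROC curves and is not covered by this sketch.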
The current need for a universal and standard system for reporting CXR in pulmonary TB is acknowledged.38 In order to grade CXR severity and assess radiological treatment response, we have derived a simple equation from radiographic parameters from adults with smear-positive pulmonary TB that predicts smear positivity at 2 months and provides a single numerical score for each CXR. The score shows good correlation with baseline bacteriological and clinical severity markers, and is sensitive to changes over time. The score performs better than its individual components: it was significantly better at predicting outcome than was the percentage of lung affected alone, and was significantly associated with a broader range of baseline severity measures (BMI, Hb, exercise tolerance and quality of life) than presence of cavitation alone. Advantages of this method are that CXR assessment does not require aids, grids or rulers, and it is derived by fitting a statistical model to outcome data rather than by assigning points based on assumed relative importance of radiographic pathologies. It has been validated in an independent data set, and offers a single, standardised solution where there are currently multiple unvalidated methods in use.
The proportion of lung affected and/or cavitation feature as the most important measures in many TB CXR grading methods.1–5 7 Cavitation is well recognised to correlate with bacillary load.7 39 We confirmed the association between cavitation and bacteriological measures (baseline and 2-month sputum smear status), and additionally showed cavitary disease to be predictive of worse lung function. The proportion of lung affected was associated with both bacteriological and a range of clinical measures.
This score was derived in adult patients with TB with smear-positive pulmonary disease, in a setting with relatively low rates of HIV–TB co-infection and MDR-TB. The score requires further evaluation in populations with high HIV prevalence, in whom CXR findings characteristic of HIV–TB co-infection (subtle or absent pathology, non-cavitary disease, lower lobe infiltrates, hilar lymphadenopathy and pleural effusion)20 40 may mean that a differently weighted score is needed. Nevertheless, the score remained valid and applicable in the newer data set in which HIV–TB co-infection rates were higher (13%); the rise in HIV prevalence may account for some of the differences observed between the two data sets. The presence of MDR-TB would not be expected to alter radiographic patterns, other than being associated potentially with higher scores and smaller incremental improvements over time.
Potential limitations of the study include the use of 2-month smear status as an outcome measure (rather than a longer term measure such as 6-month outcome or recurrence).
The absence of suitable biomarkers or other surrogate end points in TB research is readily acknowledged, and recent estimates derived from meta-analysis found a sensitivity of only 57% and specificity of 81% for 2-month smear status in predicting treatment failure.35 Nevertheless, until more suitable measures become available, 2-month smear status remains a suitable outcome measure.30 32–34
Another limitation was the inherent problem of limited inter-rater agreement in CXR assessment. The low rates of clinician–radiologist agreement on CXR findings identified in the derivation data set are not unusual, with only fair or poor agreement between radiologists and clinicians also being reported elsewhere.22 23 This emphasises the importance of using simple rather than complex scores and ensuring individuals allocating CXR scores participate in continuing education to maximise agreement. The score derived from radiologist CXR evaluation in the first data set is simple. Moreover, it was shown to be valid in the second data set when used by an independent TB clinician, rather than a radiologist, confirming its practical utility in a clinical and trial setting. Some systematic differences in CXR results were noted between the two data sets; while this may represent systematic difference in reporting styles, the findings are in keeping with the possibility of less severe disease in the validation data set, as indicated by their lower bacillary burden (with slides read by the same senior laboratory technician during both data collection periods).
In summary, we have derived and validated a simple method for grading CXR severity in adults with smear-positive pulmonary TB that predicts baseline clinical and microbiological severity and response to treatment in two separate patient populations. Although finer discriminatory accuracy might be achieved by collecting more detailed CXR findings (such as cavity size), our data did not indicate this. This method can be used where a numerical score is required for the purpose of comparing radiographic severity between adults with smear-positive pulmonary TB, and to monitor an individual's improvement over time, such as in clinical trials of drug efficacy in TB.
We thank the following for their support and assistance: Dr M Okoseray, Pak Penias and Pak E Meokbun and the Timika District Health Authority; Dr Dina Bisara Lolong and Ibu Meryani Girsang and the National Institute of Health Research and Development, Jakarta; Dr P Penttinen, Dr M Bangs and Dr M Stone, Public Health & Malaria Control (PHMC) and International SOS; Pak Istanto and PHMC laboratory staff; Pak J Lempoy and Timika TB clinic staff; Dr P. Sugiarto and Mimika Community Hospital (RSMM); Natalia Dwi Haryanti, Sri Hasmunik, Sri Rahayu, G Bellatrix and clinical and laboratory staff, NIHRD-MSHR Timika research programme; Mr R Lumb and Dr I Bastian at the Institute of Medical and Veterinary Science; and Associate Professor R Price, MSHR.
Funding Australian Respiratory Council, the Royal Australasian College of Physicians (Covance award), Australian National Health and Medical Research Council.
Competing interests None.
Ethics approval This study was conducted with the approval of the Human Research Ethics Committees of the NT Department of Health & Families and Menzies School of Health Research, Australia, the Australian National University, and the National Institute for Health Research and Development, Indonesia.
Provenance and peer review Not commissioned; externally peer reviewed.