Current evidence on diagnostic accuracy of commercially based nucleic acid amplification tests for the diagnosis of pulmonary tuberculosis
- 1Dipartimento di Malattie Polmonari, Azienda Ospedaliera San Camillo-Forlanini, Rome, Italy
- 2Dipartimento di Medicina Interna, Università di Roma “Tor Vergata”, Rome, Italy
- 3Dipartimento di Epidemiologia, INMI L Spallanzani-IRCCS, Rome, Italy
- Correspondence to:
Dr S Greco
Dipartimento di Malattie Polmonari, Azienda Ospedaliera San Camillo-Forlanini, 00151 Rome, Italy;
- Received 26 October 2005
- Accepted 18 May 2006
- Published Online First 31 May 2006
Background: Even though commercial nucleic acid amplification tests (NAATs) have become the most frequently used molecular tests for laboratory diagnosis of pulmonary tuberculosis (TB), published studies report variable estimates of their diagnostic accuracy. We analysed the accuracy of commercial NAATs for the diagnosis of pulmonary TB in smear positive and smear negative respiratory samples using culture as a reference standard.
Methods: English language studies reporting data sufficient for calculating sensitivity and specificity of commercial NAATs on smear positive and/or smear negative respiratory samples were included. Meta-regression was used to analyse associations with reference test quality, the prevalence of TB, sample and test type. Predictive values for different levels of pre-test probability were quantified using Bayes’ approach.
Results: Sixty three journal articles published between 1995 and 2004 met the inclusion criteria. Pooled sensitivity and specificity were 0.96 and 0.85 among smear positive samples and 0.66 and 0.98 among smear negative samples. The number of culture media used as reference test, the inclusion of bronchial samples, and the TB prevalence were found to influence the reported accuracy. The test type had no effect on the diagnostic odds ratio but seemed to be correlated with sensitivity or specificity, probably via a threshold effect.
Conclusions: Commercial NAATs can be confidently used to exclude TB in patients with smear positive samples in which environmental mycobacteria infection is suspected and to confirm TB in a proportion of smear negative cases. The methodological characteristics of primary studies have a considerable effect on the reported diagnostic accuracy.
- AFB, acid-fast bacilli
- DOR, diagnostic odds ratio
- MTB, Mycobacterium tuberculosis
- NAAT, nucleic acid amplification test
- TB, tuberculosis
In spite of their theoretical ability to detect even a single mycobacterial cell, nucleic acid amplification tests (NAATs) are not sufficiently reliable to replace conventional diagnostic methods for pulmonary tuberculosis (TB). Both inherent test characteristics and errors in testing procedures may account for their inaccuracy.1 As for microscopy and culture, the key factor in determining NAAT false negatives is a low density of mycobacteria in the specimen, which can leave the small volumes sampled for the test without any organisms. Furthermore, the presence in respiratory secretions of enzymes capable of inhibiting amplification reactions accounts for an additional 3–25% of false negative results.2 On the other hand, false positive results arise most often from contamination of negative samples, either with organisms or target DNA from samples containing large numbers of mycobacteria or with amplicons contaminating the laboratory room.2,3
To overcome these problems, automated commercial systems were developed which were made more robust by the use of standardised procedures and reagents for sample processing, amplification, and detection. These procedures, which allow different steps of the process to take place in a single sealed tube, were intended to reduce the risk of contamination. At the same time, the use of larger sample volumes or the introduction of internal amplification controls to detect inhibitors was adopted to cut down the rate of false negative results.
Notwithstanding these precautionary measures, published studies show considerable heterogeneity in the results obtained with commercial NAATs. The US Centers for Disease Control and Prevention (CDC) recommend that commercial NAATs be used with microscopy to improve diagnostic certainty (pending culture results and/or the patient’s response to treatment) and that clinicians should rely on their clinical judgement in the interpretation of results. According to the CDC, the diagnosis of pulmonary TB can be presumed in smear positive (acid-fast bacilli (AFB)+) patients with a positive NAAT result and in smear negative (AFB−) patients with two subsequent positive NAAT results. Environmental mycobacterial disease can be hypothesised when a negative NAAT result is obtained from an AFB+ and inhibitor-free sample. However, because about 20% of TB cases can be attributed to transmission from AFB− patients,4 two negative NAAT results from two separate AFB− samples are needed to exclude contagiousness.5
Two previous meta-analyses on the accuracy of NAATs for the diagnosis of pulmonary TB, analysing mostly home grown polymerase chain reaction (PCR) based tests, found a substantial variability in both sensitivity and specificity due to the different threshold set by each investigator and to differences in study design and quality.6,7 To our knowledge, the diagnostic accuracy of commercial NAATs separately on both AFB+ and AFB− respiratory samples has never been systematically reviewed. This study was undertaken to assess the performance of NAATs in the context of a careful analysis of the impact of the type of test as well as of the methodological and clinical characteristics of published studies on the accuracy estimates.
We searched Medline to 1 July 2005 and Embase to 1 March 2005 using a search strategy designed to identify studies evaluating the use of commercial NAATs for the diagnosis of pulmonary TB. The titles and abstracts of the identified citations were screened and the references listed in the retrieved articles were scrutinised, considering any citation that did not obviously fail the inclusion criteria.
After a preliminary analysis of a sample of articles, studies considered eligible for inclusion in our meta-analysis were those that:
examined the diagnostic performance of commercial NAATs on respiratory samples (<5% of non-respiratory samples were tolerated);
used Mycobacterium tuberculosis (MTB) culture of the same sample as the reference standard for the diagnosis of pulmonary TB;
reported primary data sufficient for separately calculating both sensitivity and specificity in AFB+ and/or AFB− specimens; and
were written in English.
Articles were excluded from the meta-analysis for the following reasons:
reporting sensitivity and specificity “revised” by means of discrepant analysis as the only study results; in the case of studies which retested the samples on the basis of discrepant analysis, only the initial “unrevised” results were considered;
possible duplicate publication: when an author or research group published more than one study, the existence of overlapping study populations was ascertained by checking sample recruitment sites and/or periods or, if these were not available, contacting authors for clarification. If this was not provided, only the article reporting the largest number of samples was included;
application of commercial NAATs on gastric aspirates (>5% of total study sample); and
analysis of previous versions of commercial NAATs.
Two investigators independently evaluated studies for inclusion and abstracted relevant data. Disagreements were reconciled by consensus.
Data extraction and quality assessment
Data were abstracted using two separate data sheets, one for AFB+ and one for AFB− samples. The information recorded comprised descriptive data (author name, journal, publication year), type of respiratory sample, prevalence of MTB culture positive samples, testing procedures for commercial NAATs, culture and AFB staining, and commercial NAAT sensitivity and specificity.
According to the established methodological standards for evaluation of diagnostic tests,8 four aspects of study quality were examined: cohort assembly (population of recruitment, method of sample selection, data collection modality), technical quality of reference test (the use of at least two different culture media was considered a more reliable reference test), blinding, and study population features (clinical/demographic characteristics, pulmonary TB severity, and other diagnoses in subjects without pulmonary TB). The original studies in which data were collected (or primary studies) were classified according to whether each characteristic was present, absent, or unknown. In five multicentre studies the participating laboratories used different AFB staining and/or culture procedures: these items were scored as unknown. Three studies included a separate description of a subgroup of patients on antituberculous therapy: these data were not included in our analysis and the studies were scored as not including patients under treatment.
All statistical analyses were separately performed for AFB+ and AFB− samples. For each study we classified the commercial NAAT results as true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN) as determined by comparison with MTB culture results. We then calculated the true positive rate (TPR = TP/(TP+FN) = sensitivity), the true negative rate (TNR = TN/(FP+TN) = specificity), their odds (oddsTPR = TPR/(1−TPR) and oddsTNR = TNR/(1−TNR)), and the diagnostic odds ratio (DOR), that is, the ratio of the odds of a positive NAAT result among MTB culture positive samples to the odds of a positive result among MTB culture negative samples (DOR = oddsTPR/oddsFPR, where FPR = 1−TNR is the false positive rate, so that DOR = oddsTPR × oddsTNR). Thus, DOR values of >1 indicate that the test discriminates between culture positive and culture negative samples, with higher values indicating better performance, while DOR values of <1 indicate a test that is more frequently positive on control subjects (DOR = 1 means that the test has no discriminating ability).
The potential problems in odds calculations associated with sensitivities and/or specificities of 100% were solved by adding 0.5 to zero values.9 In articles where two or more different commercial NAATs were analysed on the same samples, both extraction of data and calculation of accuracy measures were performed by considering them as separate studies.
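The accuracy measures defined above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study: the function name and the example counts are hypothetical, and the continuity correction is assumed to add 0.5 only to cells that equal zero, as the text describes.

```python
def accuracy_measures(tp, fn, fp, tn):
    """Return sensitivity, specificity, their odds and the DOR for one 2x2 table."""
    # Assumed continuity correction: add 0.5 to any zero cell so that the
    # odds stay finite when sensitivity or specificity is 100%.
    tp, fn, fp, tn = (c + 0.5 if c == 0 else float(c) for c in (tp, fn, fp, tn))
    tpr = tp / (tp + fn)          # sensitivity
    tnr = tn / (fp + tn)          # specificity
    odds_tpr = tpr / (1 - tpr)    # odds of a positive result in culture positives
    odds_tnr = tnr / (1 - tnr)    # odds of a negative result in culture negatives
    dor = odds_tpr * odds_tnr     # same as oddsTPR divided by the odds of the FPR
    return {"tpr": tpr, "tnr": tnr, "odds_tpr": odds_tpr,
            "odds_tnr": odds_tnr, "dor": dor}


# Hypothetical study: 90 TP, 10 FN, 5 FP, 95 TN
result = accuracy_measures(90, 10, 5, 95)
print(result["tpr"], result["tnr"], result["dor"])
```

For a table without zero cells the DOR reduces to (TP × TN)/(FN × FP), which is 171 for the hypothetical counts above.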
To delineate the impact of study characteristics on diagnostic accuracy estimates, we fitted a multivariate random effect regression model using DOR as the dependent variable and study characteristics as explanatory variables (“Metareg”, Stata 8). Since each commercial NAAT fixes a well defined numerical value as the criterion for positivity, we took into account the threshold differences between studies by simply adding the test type as covariate in the regression model.9
However, clinical interpretation of DOR is not easy as the same values can relate to different combinations of sensitivity and specificity.10 The use of fixed thresholds allowed us to explore the impact of the study characteristics (including the different thresholds chosen) on sensitivity and specificity separately. We therefore constructed two further regression models using, as dependent variables, oddsTPR and oddsTNR, respectively. For all the models the dependent variables were included after logarithmic transformation.
As explanatory variables we added the clinical and methodological characteristics of the primary studies to the regression models. Since unreported items can reflect true methodological flaws or poor reporting of a methodologically sound study, we only included variables with a percentage of unreported items of <15%. As it is known that sensitivity and specificity vary with disease prevalence when an imperfect standard is used to evaluate a test, we added to the models the proportion of MTB culture positive samples (among AFB+ or AFB− samples) as a proxy of the true prevalence of pulmonary TB.11,12
The within-study variance was considered by taking weights equal to the inverse of the variance of the appropriate proportions; the between-study variance was estimated using the restricted maximum likelihood estimate (REML).13
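The weighting scheme can be illustrated with a simplified, intercept-only sketch. Two assumptions are made here and are not taken from the study: the within-study variance of the log DOR is approximated by the standard formula 1/TP + 1/FN + 1/FP + 1/TN, and the between-study variance is estimated with the DerSimonian-Laird method of moments rather than REML, because the former has a closed form. The function names are hypothetical, and no covariates are modelled.

```python
import math


def log_dor_and_variance(tp, fn, fp, tn):
    """Log DOR and its approximate variance (all cells must be non-zero)."""
    y = math.log((tp * tn) / (fn * fp))
    v = 1 / tp + 1 / fn + 1 / fp + 1 / tn
    return y, v


def random_effects_pool(effects, variances):
    """Pool effects with inverse-variance weights.

    Between-study variance (tau2) uses the DerSimonian-Laird estimator,
    a simpler stand-in for the REML estimate described in the text.
    """
    w = [1 / v for v in variances]                      # within-study weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(ws * yi for ws, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2


# Hypothetical 2x2 tables (TP, FN, FP, TN) from three studies
tables = [(90, 10, 5, 95), (40, 20, 2, 150), (60, 5, 10, 80)]
ys, vs = zip(*(log_dor_and_variance(*t) for t in tables))
pooled_log_dor, tau2 = random_effects_pool(ys, vs)
print(math.exp(pooled_log_dor), tau2)
```

Pooling on the log scale and exponentiating afterwards mirrors the logarithmic transformation of the dependent variables described above.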
We assessed the possibility of publication bias by evaluating a funnel plot for asymmetry, Begg’s adjusted rank correlation test and Egger’s regression asymmetry test (“Metabias”, Stata 8). Finally, we applied Bayes’ theorem to assess the changes in probability of pulmonary TB determined by the use of commercial NAATs.
Study description and synthesis
The study selection process, which is reported in full in the online Appendix available at http://www.thoraxjnl.com/supplemental, led to the inclusion of 63 journal articles published between 1995 and 2004.14–76 Since eight articles analysed two different commercial NAATs, a total of 71 studies were available for analysis. The commercial NAATs evaluated were: Roche Amplicor MTB (n = 25 studies), its entirely automated version, Cobas Amplicor MTB (n = 10), E-MTD (n = 14), BDProbeTecET (n = 12), and LCx (n = 10). Overall, the 63 articles examined 51 160 samples, 5729 MTB culture positive and 45 431 MTB culture negative. The median number of samples per study was 410 (interquartile range (IQR) 247–662), with a median pulmonary TB prevalence of 0.14 (IQR 0.07–0.3).
Fifty six articles analysed both sensitivity and specificity of commercial NAATs on AFB+ samples. They included 3848 MTB culture positive and 1535 MTB culture negative samples, with a median pulmonary TB prevalence of 0.77 (IQR 0.6–0.86). Five articles reviewed two commercial NAATs each, so 61 studies were available for analysis. As shown in fig 1A, sensitivity values were homogeneously high (0.96, 95% CI 0.956 to 0.968) while specificity was lower and extremely variable (0.85, 95% CI 0.84 to 0.87).
The 60 articles examining the sensitivity and specificity of commercial NAATs on AFB− samples included 1704 MTB culture positive and 43 852 MTB culture negative samples (median pulmonary TB prevalence 0.042, IQR 0.02–0.1). Eight articles reviewed two commercial NAATs each, bringing the total number of studies up to 68. Inspection of the forest plot in fig 1B indicates a high specificity but a clear heterogeneity in sensitivity values. Pooled sensitivity and specificity were 0.66 (95% CI 0.63 to 0.68) and 0.98 (95% CI 0.978 to 0.981), respectively. Pooled values of DOR, sensitivity, and specificity for each test type as well as their respective nucleic acid amplification techniques are reported in table 1.
Analysis of clinical and methodological characteristics of the primary studies (table 2) showed that many studies did not comply with the published guidelines for conducting and reporting diagnostic test evaluation. With regard to MTB culture (most frequently Lowenstein-Jensen (68%) and Bactec 12B (52%)), we found that 11% of primary studies did not provide any description of the reference test used to assess pulmonary TB diagnosis, while approximately one quarter used only one culture medium. Even if more than half of the studies declared the enrollment of patients with suspected pulmonary TB, they often included samples from patients on antituberculous treatment. The clinical spectrum of both pulmonary TB and comparative groups was rarely given and only nine primary studies applied either single or double blinding for test interpretation.
Effect of study characteristics on diagnostic accuracy of commercial NAATs
The characteristics of the studies analysing AFB+ and AFB− samples are shown in the last two columns of table 2. Those included in the meta-regression models as potential sources of heterogeneity were quality of reference test, specimen type, commercial NAAT type, and pulmonary TB prevalence. In table 3 the resulting parameter estimates of these variables are presented as relative odds. Relative odds indicate the diagnostic performance of commercial NAATs in studies with that characteristic, relative to their performance in studies without that characteristic.
For AFB+ samples, studies using at least two MTB culture media yielded DOR values approximately eight times higher than those using one culture medium, and studies including bronchial specimens yielded DOR values approximately six times higher than those analysing sputum specimens only, in both cases mainly due to an effect on oddsTNR. OddsTNR values were inversely correlated with the prevalence of pulmonary TB and were significantly lower in studies of the LCx test than in those analysing E-MTD. For AFB− samples, the relative DOR of studies using at least two MTB culture media was more than twice as high as that of studies using only one medium, mainly due to the increase in oddsTNR. The inclusion of bronchial specimens was also associated with increased oddsTNR values. In comparison with studies analysing E-MTD, those using LCx or Roche Amplicor MTB provided lower oddsTPR, while those using Cobas Amplicor MTB yielded higher oddsTNR. An inverse correlation between the prevalence of pulmonary TB and both DOR and oddsTNR values was also found.
Evaluation of publication bias showed that Egger’s test was significant both for studies on AFB+ samples (regression coefficient 1.14, p = 0.011) and for those on AFB− samples (regression coefficient 0.97, p = 0.022). Visual inspection of the two funnel plots also showed some asymmetry. Conversely, Begg’s test was not significant (see Appendix).
Post-test probability of pulmonary TB
Changes in the likelihood of pulmonary TB after performing the commercial NAATs are shown for all pre-test probabilities in fig 2A and B for AFB+ and AFB− samples, respectively. The top curves portray the positive predictive values, that is, the probabilities of pulmonary TB after obtaining a positive commercial NAAT result. The bottom curves represent the complement of the negative predictive values (1 − negative predictive value), that is, the probabilities of pulmonary TB after a negative commercial NAAT result. For example, using E-MTD on an AFB− sample from a patient in whom previous diagnostic information (history taking, clinical examination, imaging, etc) indicated a probability of pulmonary TB of about 30%, a negative result would reduce the likelihood of pulmonary TB to about 10% while a positive one would increase it to about 90%.
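The Bayesian calculation behind these curves can be sketched as follows. The function name is hypothetical, and because the E-MTD-specific estimates of table 1 are not reproduced in this text, the example uses the pooled AFB− sensitivity and specificity (0.66 and 0.98) as a stand-in, so its output differs somewhat from the E-MTD figures quoted above.

```python
def post_test_probabilities(pre_test, sensitivity, specificity):
    """Probability of disease after a positive and after a negative result.

    Assumes 0 < pre_test < 1 so that both denominators are non-zero.
    """
    p_positive = sensitivity * pre_test + (1 - specificity) * (1 - pre_test)
    p_negative = (1 - sensitivity) * pre_test + specificity * (1 - pre_test)
    after_positive = sensitivity * pre_test / p_positive        # positive predictive value
    after_negative = (1 - sensitivity) * pre_test / p_negative  # 1 - negative predictive value
    return after_positive, after_negative


# Pooled AFB- estimates from this review, pre-test probability of 30%
after_pos, after_neg = post_test_probabilities(0.30, 0.66, 0.98)
print(round(after_pos, 2), round(after_neg, 2))
```

With these pooled values a pre-test probability of 0.30 rises to about 0.93 after a positive result and falls to about 0.13 after a negative one. Evaluating the function over a grid of pre-test probabilities gives curves of the kind described for fig 2.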
Since commercial NAATs require fewer technical skills and a shorter assay time than the less expensive home grown tests, they have become the most frequently used molecular tests for laboratory diagnosis of pulmonary TB.77 In this meta-analysis we calculated pooled estimates of their sensitivity and specificity (table 1), showed that their reported accuracy is influenced by primary study characteristics, and analysed to what degree or under what conditions they add information to the diagnostic work-up of pulmonary TB.
The reference test used for diagnosing pulmonary TB was shown to have the largest impact on accuracy, both for AFB+ and AFB− samples. Since the incorporation of one or more additional units of medium is known to reduce the false negative rates of culture,78 it could, as a consequence, have determined an “artificial” improvement in commercial NAAT specificity that is estimated on samples classified by culture as MTB-free.
The small number of studies using liquid media as the only reference test did not allow us to evaluate the independent effect on accuracy of their higher MTB recovery rates compared with solid media.78
The imperfect sensitivity of culture could also explain the variation in the specificity (and the DOR) of commercial NAATs with the prevalence of TB. At a low prevalence, the number of samples containing MTB (but wrongly classified by culture as MTB-free) is in fact likely to be small and commercial NAAT (pseudo) false positives are likely to occur less frequently. However, at a high prevalence, the higher number of (pseudo) false positives reduces the specificity.12
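A small numerical sketch may make this mechanism concrete. All values are hypothetical, and two simplifying assumptions are made that the review does not state: culture is taken to have perfect specificity, and NAAT and culture errors are taken to be independent given the true disease status.

```python
def apparent_specificity(prevalence, naat_sens, naat_spec, culture_sens):
    """NAAT specificity as measured against an imperfect culture reference.

    Assumes culture specificity of 100% and conditional independence of the
    two tests given true TB status (both are illustrative assumptions).
    """
    true_negatives = 1 - prevalence                  # truly MTB-free samples
    missed_tb = prevalence * (1 - culture_sens)      # TB samples that culture calls negative
    naat_negative = true_negatives * naat_spec + missed_tb * (1 - naat_sens)
    return naat_negative / (true_negatives + missed_tb)


# Hypothetical test characteristics
for prev in (0.05, 0.20, 0.50):
    print(prev, round(apparent_specificity(prev, 0.90, 0.99, 0.85), 3))
```

Under these assumptions the apparent specificity falls from about 0.98 at a prevalence of 0.05 to about 0.87 at a prevalence of 0.50, even though the true specificity is fixed at 0.99.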
The higher accuracy in studies including bronchial samples, already reported in a previous meta-analysis on PCR-based NAATs,6 was mainly due to an increase in specificity. However, since the reported culture yield in bronchial samples varies from 12% to 87%,79,80 it is difficult to explain these data on the basis of the proportional agreement of positive and negative results between the two tests. Studies focused on diagnostic performance of both culture and commercial NAATs on different bronchial samples may help to clarify this issue in the future.
Although only Amplicor (or Cobas Amplicor) and E-MTD are currently approved by the United States FDA for clinical use, the test type did not seem to explain the heterogeneity of DOR in meta-regression. Interestingly, studies evaluating E-MTD on AFB− samples yielded higher sensitivities and lower specificities than those using Roche Amplicor MTB or Cobas Amplicor MTB (table 3). The higher sensitivity of E-MTD, the only NAAT approved by the FDA for use on AFB− samples, could be due to kit features such as the use of ribosomal RNA as a target sequence (about 2000 copies in each MTB cell),81 but our results suggest that E-MTD could use a lower positivity criterion than other commercial NAATs and the differences observed could be partly due to a “threshold effect”. The accuracy of E-MTD appeared to be higher than that of the recently withdrawn Abbott LCx, while no differences were seen with BDProbeTecET.
With respect to the diagnostic value of commercial NAATs in the evaluation of patients with suspected pulmonary TB, we observed that, because of their very high sensitivity on AFB+ samples, commercial NAATs can be confidently used to “rule out” pulmonary TB in AFB+ patients (fig 2A). Thus, particularly in settings where opportunistic infections are a concern, a negative inhibitor-free commercial NAAT in patients with AFB+ smears and suggestive radiographic abnormalities should direct suspicion towards an environmental mycobacterial pulmonary disease.
The more limited gain in likelihood of pulmonary TB after a positive result on AFB+ samples (particularly for some commercial NAATs, see fig 2) seems to limit their use as confirmatory tests in these cases. The increased false positive rates of a number of studies may be related to the inclusion of samples from patients under treatment. These studies tried to correct the errors deriving from the enrollment of an inadequate study population by applying discrepant analysis, a statistical ploy that attempts to correct sensitivity and specificity of a “new” test (that is supposed to be more accurate than the reference standard it is compared with) by involving an additional more reliable test (clinical diagnosis of pulmonary TB). This procedure, by correcting the errors hidden among conflicting results of the “new” test and the standard and by disregarding concordant errors, leads to an overestimation of the accuracy of the test.82 We therefore decided to include only “uncorrected” results, hence discussing the possible effects on accuracy of the presence of samples from treated patients. The unavailability of treatment data from 27% of primary papers prevented us from drawing conclusions by means of meta-regression. Nevertheless, the pooled specificities calculated on the subgroup of studies clearly stating the exclusion (n = 206 samples) and inclusion (n = 707 samples) of treated patients were 0.97 (95% CI 0.93 to 0.99) and 0.76 (95% CI 0.73 to 0.79), respectively, indicating that inclusion of treated patient samples was the main cause for reduced specificity in AFB+ samples.
In the case of a negative microscopy result, commercial NAATs are not sensitive enough to exclude the diagnosis of pulmonary TB and further diagnostic work-up remains mandatory in these patients. By contrast, their high specificity gives them the ability to “rule in” pulmonary TB in about two thirds of patients who will be recognised as MTB culture positive only 2–8 weeks later (fig 2B). Depending on the degree of suspicion, a positive result allows the clinician to initiate treatment or, if treatment has already begun, to continue it with more confidence. Furthermore, with regard to the risk assessment of infectivity, because commercial NAATs have a higher sensitivity than microscopy, they could guide the decision as to which AFB− patients are to be segregated, especially in facilities where HIV infected or other immunocompromised individuals are managed.83 However, this use is of limited value in patients already started on treatment, as the numbers of viable mycobacteria in the sputum are known to fall dramatically in the first few days.84
This meta-analysis has limitations. Firstly, the sensitivity and specificity estimates of commercial NAATs are hindered by the poor quality of primary studies, a common problem in diagnostic meta-analyses. Furthermore, although the Egger’s test may reveal artifactual correlations between DOR and its variance regardless of publication bias,85,86 the possibility that the studies included in our meta-analysis are a biased set cannot be ruled out. In spite of these drawbacks, we think that summary estimates of test performance are a more accurate guide for the physician than the results of any one of the primary studies.
Secondly, our decision to use the specimen as the unit of analysis could have affected accuracy because of the possible inclusion of multiple paucibacillary specimens from AFB− patients. Nevertheless, we speculated that the use of the patient as unit of analysis could have determined even wider variations in accuracy as the number of specimens per patient varied both within and between primary studies and repetitive testing is known to improve sensitivity.
Thirdly, we examined the accuracy of commercial NAATs in comparison with culture, without addressing the issue of microscopy-negative and culture-negative pulmonary TB cases diagnosed on clinical grounds only. During the systematic review of the literature we found only six studies and one FDA premarket approval application document confronting this problem.81,87–92 Out of 69 specimens (52 patients), only seven provided at least one specimen that tested positive with commercial NAATs, corresponding to a pooled sensitivity of 10% (95% CI 4 to 20). Our estimates of sensitivity on AFB− samples are therefore probably inflated compared with what can be seen in the clinical setting.
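The proportion and its confidence interval can be checked with a short sketch. The text does not say which interval method was used, so the exact (Clopper-Pearson) binomial interval is assumed here, with 7 positive results out of 69 specimens. The function names are hypothetical, and the bounds are found by bisection on the binomial distribution function rather than with a statistics library.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_binomial_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for a proportion x/n (assumed method)."""

    def solve(f, target):
        # f is decreasing in p; find p with f(p) == target by bisection
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= x | p) = alpha/2, i.e. P(X <= x-1 | p) = 1 - alpha/2
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= x | p) = alpha/2
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


lower, upper = exact_binomial_ci(7, 69)
print(7 / 69, lower, upper)
```

The point estimate 7/69 is about 0.10, and the printed bounds can be compared with the 4 to 20% interval reported above.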
Based on this systematic review, the clinical use of commercial NAATs should be limited to the exclusion of a diagnosis of TB in AFB+ patients with suspected non-tuberculous mycobacterial infection and to the confirmation of TB in a percentage of those providing AFB− samples. Further studies using rigorous methods—including careful control for treatment and use of single specimen per patient—would be highly desirable to appreciate better the operating characteristics of the commercial NAATs. Their accuracy on different bronchial specimens and on samples from patients with culture negative pulmonary TB are also important issues that remain to be addressed.
Competing interests: none declared