Background: Poor reproducibility of an outcome measure reduces power and, in an independent variable, biases results. The intraclass correlation coefficient measures loss of power and degree of bias. Information is lacking on the intraclass correlation coefficient for bronchial responsiveness and factors affecting reproducibility.
Methods: Papers containing information on reproducibility of bronchial responsiveness were identified using a Medline search and citations. Within and between person components of variance of PD20 or PC20 were expressed in doubling dose or concentration units, and the intraclass correlation coefficient calculated when not reported.
Results: Results were extracted from 32 papers. Intraclass correlation coefficients were over 0.9 in short term studies of highly selected asthmatic patients, but larger and most long term studies had lower intraclass correlation coefficients, less than 0.5 in some cases, due to greater within person or lower between person variation. Reproducibility of dose or concentration-response slope was generally higher, but still less than that of forced expiratory volume in 1 second.
Conclusions: Information is available to calculate sample size for studies with bronchial responsiveness as the outcome, but results when bronchial responsiveness is an explanatory variable may be misleading.
- bronchial responsiveness
- statistical analysis
- intraclass correlation coefficient
Clinical assessment of change in a patient should properly be made with reference to variation in change in healthy subjects.1 This variation is a combination of measurement error and true within person fluctuation. The latter component is likely to increase the longer the time interval between assessments. Reviews2,3 of short term repeatability of bronchial responsiveness (BHR) reported that the width of the 95% range for a single measurement of an individual’s BHR, expressed as the dose (PD20) or concentration (PC20) estimated to produce a 20% fall in forced expiratory volume in 1 second (FEV1), varied from under 1 to over 2.5 doubling doses or concentrations. The 95% range for change between two measurements is √2 times as wide, because the standard deviation of a difference is the within person standard deviation multiplied by √2. Despite the fact that this gives a 95% range for change that may be over 3.5 doubling doses in width, authors have considered that, in the short term, “test results vary little”.4 Limited information on longer term reproducibility has been reported,2 the width of the 95% range for a single PC20 being up to 4 doubling concentrations.
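As a rough illustration of this arithmetic, the following Python sketch converts the width of a 95% range for a single measurement into the width of the 95% range for change. It assumes the convention adopted in the methods of this review, that a 95% range is ±2 within person standard deviations. The function names are hypothetical, and the input of 2.5 doubling doses is the upper figure quoted above.

```python
import math


def within_sd_from_width(width_95: float) -> float:
    """Within person SD implied by a 95% range of the given full width.

    Assumes the range was calculated as +/- 2 SD, so width = 4 * SD.
    """
    return width_95 / 4.0


def width_for_change(width_95_single: float) -> float:
    """Full width of the 95% range for the change between two measurements.

    The SD of a difference between two measurements with independent errors
    is sqrt(2) times the within person SD, so the range widens by sqrt(2).
    """
    return width_95_single * math.sqrt(2.0)


single_width = 2.5  # doubling doses, upper figure quoted in the text
sd_within = within_sd_from_width(single_width)
change_width = width_for_change(single_width)

print(f"within person SD: {sd_within:.3f} doubling doses")
print(f"95% range for change: {change_width:.2f} doubling doses wide")
```

A width of 2.5 doubling doses for a single measurement therefore corresponds to a width of just over 3.5 for change, in line with the figure given in the text.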
A clinician needs to know not only how variable a single reading may be for an individual patient, but how the variation compares with that between patients. The greater within patient variation is in relation to between patient variation, the less information a single measurement provides for the individual. In epidemiological studies poor repeatability of a continuous outcome variable reduces power to detect relations with risk factors, as the total variance is increased. The loss of power is a function of the ratio of the total variance including error to the true variance without error, where the term “error” is used to encompass both measurement error and true within person fluctuation over the relevant time period. Hence, for the clinician and the epidemiologist, it is useful to know the ratio of true between person variation to the total variance. This quantity is known as the intraclass correlation coefficient (ICC). It is a dimensionless quantity that takes the value 1.0 when there is no error and 0.0 when there is no true variation. An increase in measurement error decreases its value, but selection of a sample of homogeneous individuals reduces true variation and hence also decreases the ICC.
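The definition can be sketched in a few lines of Python. The function name and the standard deviations below are hypothetical values chosen for illustration, not estimates from any of the reviewed studies. The second call shows how a more homogeneous sample lowers the ICC even though the within person standard deviation is unchanged.

```python
def icc_from_sds(sd_between: float, sd_within: float) -> float:
    """ICC as true between person variance divided by total variance."""
    var_between = sd_between ** 2
    var_within = sd_within ** 2
    return var_between / (var_between + var_within)


# Hypothetical values in doubling dose units, for illustration only.
icc_heterogeneous = icc_from_sds(sd_between=2.0, sd_within=1.0)
icc_homogeneous = icc_from_sds(sd_between=0.8, sd_within=1.0)

print(f"heterogeneous sample: ICC = {icc_heterogeneous:.2f}")
print(f"homogeneous sample:   ICC = {icc_homogeneous:.2f}")
```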
Poor repeatability of an explanatory variable leads to bias in effect estimates. The bias, known to statisticians as attenuation5 and to epidemiologists as regression dilution,6 is also a function of the ICC, which is often referred to as “reliability”.6 Error in the independent variable in a regression analysis produces a regression line that is less steep than the true relation. When the relative sizes of the variance due to “error” and the total variation are known, the bias can be determined.7 In the absence of covariates the true value of the regression coefficient can be estimated as the observed value divided by the ICC, although bias is difficult to predict in multiple regression.8 Hence, despite the ICC being criticised as a measure of agreement,9 it is clearly of great importance as a measure of repeatability.
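A small simulation can make the attenuation concrete. The sketch below is illustrative only: the sample size, standard deviations, true slope and residual noise are all assumed values, and `ols_slope` is a hypothetical helper. It covers simple regression with a single explanatory variable, the case in which dividing the observed slope by the ICC is expected to recover the true slope.

```python
import random


def ols_slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(1)
n = 20000
sd_between, sd_within = 1.0, 1.0  # assumed values, giving ICC = 0.5
true_slope = 1.0                  # assumed true regression coefficient

icc = sd_between ** 2 / (sd_between ** 2 + sd_within ** 2)

# True level of the explanatory variable, and one error-prone reading of it.
true_x = [random.gauss(0.0, sd_between) for _ in range(n)]
observed_x = [t + random.gauss(0.0, sd_within) for t in true_x]
# The outcome depends on the true level, not on the noisy reading.
y = [true_slope * t + random.gauss(0.0, 0.5) for t in true_x]

observed_slope = ols_slope(observed_x, y)
corrected_slope = observed_slope / icc

print(f"observed slope:  {observed_slope:.3f}")
print(f"corrected slope: {corrected_slope:.3f}")
```

With these settings the observed slope should be close to half the true slope, and the corrected slope close to the true value.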
Very limited information on ICCs has been given previously.3 This paper reports a review of the literature for estimates of repeatability of BHR, and of ICCs in particular, and assesses the implications of findings for analysis and interpretation of BHR.
A Medline search of the subject heading “bronchial hyperreactivity” or one of its synonyms as a keyword (airway/bronchial (hyper)reactivity/responsiveness), combined with “reproducibility of results”, or ICC, repeatability or reliability as a keyword, was used to obtain abstracts of potential papers. Papers were additionally identified from reference lists. Papers were limited to studies of adults and English language publication. Abstracts were read to exclude studies that did not have repeat measurement of BHR, or where change in BHR with change in treatment or other conditions was the subject of study. Provocation agents were limited to histamine and methacholine—that is, a small number of papers on reproducibility of BHR to carbachol, cold air, exercise or hypertonic saline were excluded. Measurements repeated on the same day were not included, and where methods of administration were compared in the same subjects only the preferred method was included.
Repeatability data were extracted and within and between subject components of variation were expressed in doubling dose or concentration standard deviations. Papers which did not report the within subject standard deviation or the ICC or allow either to be estimated were omitted. Unless otherwise stated, a published 95% range for a single value was assumed to be calculated as ±2 within subject standard deviations. Where limits were stated to be a “confidence interval”, statistics were derived only if it was clear from the text whether the limits were calculated from a standard deviation (that is, a 95% range) or from a standard error (that is, a true confidence interval). When data were presented only graphically they were measured from the graph, taking account of differing scales on the axes where necessary. Raw data were used if given and analysed by one way analysis of variance of dose or concentration in doubling dose units by subject, and components of variance calculated,7 from which the ICC was derived. Based on the distribution of length of follow up in the papers, an arbitrary division into short term and long term follow up was made at a cut off of 4 months.
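The calculation from raw data can be sketched as follows. This is a minimal version under stated assumptions, not the authors' own code: it handles balanced data only (the same number of repeats per subject), it truncates a negative between person variance estimate at zero, and the PD20 values are invented for illustration.

```python
import math


def variance_components(log2_values):
    """One way analysis of variance by subject for a balanced design.

    log2_values: one list per subject, each holding the same number of
    repeat measurements already expressed in doubling dose units.
    Returns (within_variance, between_variance, icc).
    """
    n = len(log2_values)
    k = len(log2_values[0])
    if any(len(row) != k for row in log2_values):
        raise ValueError("this sketch handles balanced data only")

    subject_means = [sum(row) / k for row in log2_values]
    grand_mean = sum(subject_means) / n

    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(log2_values, subject_means) for v in row
    ) / (n * (k - 1))

    var_within = ms_within
    # The estimate can be negative in small samples; truncate at zero.
    var_between = max(0.0, (ms_between - ms_within) / k)
    icc = var_between / (var_between + var_within)
    return var_within, var_between, icc


# Invented PD20 values for four subjects measured on two occasions.
pd20 = [[0.5, 1.0], [2.0, 2.0], [4.0, 8.0], [0.25, 0.125]]
log2_pd20 = [[math.log2(v) for v in row] for row in pd20]

var_w, var_b, icc = variance_components(log2_pd20)
print(f"within person SD:  {math.sqrt(var_w):.2f} doubling doses")
print(f"between person SD: {math.sqrt(var_b):.2f} doubling doses")
print(f"ICC: {icc:.2f}")
```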
The Medline search produced 101 abstracts, of which 37 potentially met the inclusion criteria and 23 were found to have useful repeatability data.4,10–31 Of the 14 exclusions, eight were found not to meet the inclusion criteria on reading the full paper, one gave data for a subset of data reported in another paper, and five did not report results in a form that allowed derivation of components of variance or the ICC. A further eight papers were identified from citations32–39 and one study primarily of other measures was included.40 Where only a measure of within person variation, or only the ICC, was stated but data were represented graphically, there was good agreement between the stated estimate and the corresponding value calculated from the measured data except in one case mentioned below.
Short term repeatability
Table 1 gives short term estimates of repeatability of PD20 or PC20 from eight studies published before 1987. These were each carried out on a small number of asthmatic patients. ICCs were above 0.9 when the within person standard deviation was less than 0.5 doubling doses, and 0.97 or more when combined with a between person standard deviation of at least 2.0 doubling doses. In one study the difference between the stated within person variation (1.0) and that derived from the graphical data (0.7) was noteworthy.38
Table 2 shows corresponding estimates from nine studies published from 1987 to 1991. Each of these gave a measure of within person variability, and most the ICC as well. Two of the studies were on population samples,13,14 but the larger study mostly comprised participants who had a measurable PD20 at the first occasion,13 and the smaller recruited participants with wheeze or asthma.14 Three studies achieved low within person variation15,16,40 but most studies had greater within person variation than the earlier studies, and hence lower ICCs. The study of hospital personnel had low between person variation and hence a low ICC. Table 3 shows estimates published from 1993 to 2001. All but one of these studies were carried out in asthmatic patients.
Long term repeatability
Long term repeatability over a period of 4 months or more was estimated in seven studies (table 4). Three of these studies were general population studies which gave lower ICCs than other studies. The largest study found an ICC for PD10 of 0.32 for asymptomatic and 0.42 for symptomatic subjects.4 New results including an extra follow up survey gave an overall ICC of 0.37, with a within person standard deviation of 1.0 doubling concentrations that was comparable to other studies, but lower between person variation. An ICC of 0.45 for PD20 was obtained by Beckett et al,30 with the largest within person standard deviation of any study. The third study, carried out in general practice, found an ICC of 0.48 for 27 subjects with complete data and measurable PD20 on six occasions, but higher ICCs (0.56 and 0.68) when all first year and all second year pairs were analysed.24 The three long term studies on selected asthmatic subjects had lower within person standard deviations and higher ICCs.10,28,37 In the study of aluminium smelter workers,39 the calculation of the ICC for PD20 included data for only 36 people, those with a 20% fall in FEV1 by the maximum dose of 6.14 μmol at each occasion. New results from the Vlagtwedde/Vlaardingen study showed increasing within person variation, and hence decreasing ICC, with increasing length of follow up (not shown).
Repeatability of dose-response slope
Table 5 shows estimates of short term repeatability of the FEV1-dose response slope from two studies and long term ICC from four. Dose or concentration-response slope was calculated from two data points,41 except for one study which used regression of percentage decline in FEV1 on dose.23 This study reported data from 104 participants which included 90 whose repeatability of PD20 was given in table 2.13 The latter only included people with a measurable PD20 on at least one occasion, while the FEV1-dose response slope was calculated for each participant who received two or more doses of histamine. The ICCs were 0.89 for slope and 0.81 for PD20. The study of aluminium smelter workers, which included data only for persons with two measurable PD20 values in the PD20-ICC, found a much higher ICC for log dose-response slope (0.73 compared with 0.28).39 Trigg et al24 found a higher ICC for the dose-response slope (0.75) than for PD20 (0.48), and Beckett et al30 slightly higher (0.54 compared with 0.45).
Variation in ICC
The early studies on short term repeatability in selected asthmatic patients achieved good repeatability, as indicated by the within person standard deviation of less than 0.5 doubling doses or concentrations and hence a high ICC. Early enthusiastic exponents of bronchial challenge may have taken greater care over procedures or selected highly cooperative patients. Many later studies, particularly the larger population studies, had a within person standard deviation of around 1.0 doubling doses or concentrations. Selection of subjects determines the between subject variation. A population study has a large majority of “non-responsive” individuals whose values are clustered at the maximum dose or concentration; even when this is high, the use of a logarithmic scale reduces the apparent variation at the upper end of the scale.
Variation is expressed on the doubling dose or concentration scale as this is the most appropriate for PD20 or PC20,42 but ICCs are independent of linear transformation—that is, they are the same on any logarithmic scale. Repeatability of histamine and methacholine BHR appears similar on a logarithmic scale. There is no agreement over scale for the dose-response slope (table 5) but correlation with log PD20 has been shown to be high when the dose-response slope is reciprocally transformed23 or log-transformed.24
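The invariance follows from a short derivation, sketched here with symbols that do not appear in the original: $\sigma_b^2$ and $\sigma_w^2$ denote the between and within person variances of $x = \log_2 \mathrm{PD}_{20}$, and $a$ and $b \neq 0$ are arbitrary constants.

```latex
\[
\mathrm{ICC}(x) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2},
\qquad
\mathrm{ICC}(a + bx)
  = \frac{b^2\sigma_b^2}{b^2\sigma_b^2 + b^2\sigma_w^2}
  = \mathrm{ICC}(x).
\]
\[
\log_{10}\mathrm{PD}_{20} = (\log_{10} 2)\,\log_2\mathrm{PD}_{20}
\quad\Longrightarrow\quad
a = 0,\; b = \log_{10} 2 .
\]
```

A change of logarithm base is therefore a linear rescaling with $a = 0$. It multiplies both standard deviations by the same constant and leaves the ICC unchanged.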
The Pearson correlation coefficient measures the degree of any linear relation between two variables. It is therefore inappropriate for repeatability studies as, for example, a change in mean BHR over time would not affect it but would lower the ICC. The unsuitability of the Pearson correlation coefficient for method comparison and repeatability was made clear in 198643 but, despite this, several later papers reported it, although within person variation was generally also reported.
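The contrast can be shown with a deliberately artificial example, in which every subject's second measurement is exactly one doubling dose higher than the first. The data and helper names are hypothetical, and the ICC is the one way analysis of variance version for two measurements per subject.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between paired measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def icc_two_occasions(x, y):
    """One way analysis of variance ICC for two measurements per subject."""
    n = len(x)
    means = [(a + b) / 2 for a, b in zip(x, y)]
    grand_mean = sum(means) / n
    ms_between = 2 * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    ms_within = sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * n)
    var_between = max(0.0, (ms_between - ms_within) / 2)
    return var_between / (var_between + ms_within)


# Hypothetical log2(PD20) values; occasion 2 is shifted by one doubling dose.
first = [-2.0, -1.0, 0.0, 1.0, 2.0]
second = [v + 1.0 for v in first]

r = pearson(first, second)
icc = icc_two_occasions(first, second)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f}")
```

The systematic shift leaves the Pearson coefficient at 1 but is counted as within person variation in the one way analysis, so the ICC falls below 1.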
There were too few long term studies with fixed follow up time to relate within person variation to length of follow up. Unpublished results from the Vlagtwedde/Vlaardingen study suggest an increase in within person variation with length of follow up. It is unclear whether the lower ICCs in table 4 compared with tables 1–3 were primarily due to longer follow up with increased within person standard deviation, or to sampling from a general population and lower between person variation. The one population study of short term repeatability had a within person standard deviation in line with short term studies of asthmatic patients13,23 but also with the long term Vlagtwedde/Vlaardingen study (table 4), while the long term population study of Beckett et al had greater within person variation.30 Hence, the effect of selection of subjects on between person variation and the effect of length of follow up on within person variation both influence the ICC. In addition, restriction of data to participants with two measurable PD20 values decreases between person variation and hence the ICC.23 The dose-response slope, which can be estimated for people who do not have a measurable PD20, had a greater ICC than PD20 in one study24 and a slightly greater ICC in two others.23,30
Many of the estimates of ICC are based on a small sample but, due to the many reasons for variability in the components of variation in BHR and hence in ICC, it would not be appropriate to pool estimates. For this reason, no attempt was made to carry out a fully systematic review. Confidence intervals for the ICC are wide; for example, Seppälä gave a 95% confidence interval of 0.70 to 0.96 for an ICC of 0.89 found for ln(PD20) in 14 responsive healthy subjects.19
BHR as outcome variable in a cross sectional study
In carefully controlled studies with selected participants an ICC of 0.99 can be achieved, as high as that for FEV1.44 However, such a high ICC is unlikely to be achieved in larger studies. In the studies which assessed repeatability of FEV1 and BHR in the same subjects,13,18,30,34,40 the ICC for BHR was lower than that for FEV1, so that studies of BHR generally require more participants than those on FEV1 to detect an equivalent size of effect. The standard deviation that should be used in a sample size calculation is the total short term variation in a study with similar participants; this can be calculated by adding the squares of the within and between person standard deviations and taking the square root of the result.
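The calculation of the total standard deviation, and its use in a sample size formula, can be sketched as below. The sample size formula is the usual normal approximation for comparing the means of two equal groups. It is an assumption made here for illustration and is not taken from the paper. The standard deviations and the target difference are likewise hypothetical.

```python
import math
from statistics import NormalDist


def total_sd(sd_within: float, sd_between: float) -> float:
    """Total SD from within and between person SDs (root sum of squares)."""
    return math.sqrt(sd_within ** 2 + sd_between ** 2)


def n_per_group(sd: float, difference: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants per group to detect a difference in means.

    Normal approximation, two sided test, two groups of equal size.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / difference) ** 2)


# Hypothetical values in doubling dose units.
sd_total = total_sd(sd_within=1.0, sd_between=1.2)
n = n_per_group(sd_total, difference=1.0)

print(f"total SD: {sd_total:.2f} doubling doses")
print(f"participants per group: {n}")
```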
Change in BHR as outcome in short term follow up studies
The standard deviation of change in any continuous outcome is calculated by multiplying the within person standard deviation by the square root of two. Appropriate within person standard deviations in tables 1–3 can therefore be used to calculate sample size or power. In randomised controlled trials the recommended analysis is of final outcome with the baseline value as a covariate,45 as this increases power and is unbiased as baseline mean values will be equal on average. However, this method is inadvisable in an observational study. The regression coefficient of final on initial value is biased towards zero. It is used to adjust the estimated means at follow up of groups that differ in mean value at baseline and so will affect the comparison of interest and can even reverse the sign of the difference.5 In addition, Schouten and Tager46 have explained why adjusting for baseline may give misleading results. The analysis of final outcome with baseline as covariate has little to recommend it in non-randomised studies and, for an outcome with an ICC that may be as low as 0.5 in some circumstances, it is definitely to be avoided in such studies.
Change in BHR as outcome in long term follow up studies
It is likely that the change in BHR over several months or years will be more variable than in the short term. Although the number of studies is small with most of the information from population based studies, lower ICCs are unlikely to be due wholly to differences in participants. Firstly, the short term population study found a relatively high ICC due to low within person variation13,23 and, secondly, the large long term study found even lower ICCs on adjustment for individual explanatory variables, as between person variation was reduced proportionally more than within person variation.4 The within person standard deviations in table 4 can be used in sample size calculations, although they will be conservative as some of the within person variation will be explained by changes in explanatory variables. On the other hand, the use of standard deviations in tables 1–3 may result in too small a sample size. The recommendation to analyse absolute change, and not final adjusted for initial value, applies even more strongly to long term than to short term observational studies.
BHR as an independent variable
A number of authors have used BHR as an independent variable, particularly as a predictor of decline in FEV1,47–49 dividing participants into “responders” and “non-responders”. Few authors have reported a kappa statistic for repeatability of dichotomised BHR, but it can be expected to be similar in value to the ICC. BHR has a unimodal continuous distribution in the general population50,51 and is not a fixed state, as many authors seem to assume. The problem, whether BHR is dichotomised or not, is the same as that of using baseline BHR as a covariate when final BHR is the outcome in a longitudinal study: there will be bias in the regression coefficient of outcome on BHR and also in the other regression coefficients in a multiple regression. Correction for bias requires estimates of the ICC for variances and covariances of the explanatory variables,6,8 which can only be determined from a repeatability study of all covariates subject to within person variability carried out on all, or a substantial random sample, of the participants unless certain assumptions are met.52
The analysis of BHR as an outcome variable is straightforward and there is considerable information to allow studies to be planned with adequate sample size to take account of the inherent variation. PD20 or PC20 are known only to be above the maximum dose or concentration (that is, “censored”) when a 20% fall in FEV1 has not occurred when the challenge is stopped. This has often led authors to express BHR as “responsive” or “not responsive” and to use logistic regression to analyse the data, but greater power is achieved if regression methods for censored data are used or a dose-response slope or other continuous outcome analysed.53 This is reinforced by ICCs for the dose-response slope being at least as high and probably greater than those for PD20.
In contrast, analysis of BHR as an explanatory variable is liable to give biased and possibly misleading results. This is true of any explanatory variable for which the short term ICC may be as low as 0.5. BHR contrasts with FEV1, for which the short term ICC can be presumed from published studies13,18,34,40 to be over 0.9 and has been reported to be 0.89 over 1–3 years.30 Lung function has been shown to be strongly associated with BHR in cross sectional studies, part of which may be due to inherent dependence of BHR summary statistics on FEV1. Analyses of BHR as the outcome therefore need to adjust for lung function even if a causal role is not assumed.
Rijcken and Weiss posed the question of whether a lower level of FEV1 is a cause or a result of increased airway responsiveness and stated that longitudinal analyses are necessary to answer the question.54 We can add to this that either multiple measurements of BHR should be made to increase precision4 or the regression coefficients should be adjusted for lack of repeatability. Unfortunately, if ICCs of variances are highly variable, those of covariances may be even harder to estimate and extrapolation from another study is unlikely to be sound. Unless researchers take steps to increase precision, the inclusion of BHR as an explanatory variable may be misleading.
No competing interests declared.