Original article
Minimum important difference of the Epworth Sleepiness Scale in obstructive sleep apnoea: estimation from three randomised controlled trials
  1. Sarah Crook1,
  2. Noriane A Sievi2,
  3. Konrad E Bloch2,3,
  4. John R Stradling4,
  5. Anja Frei1,
  6. Milo A Puhan1,
  7. Malcolm Kohler2,3
  1. 1 Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland
  2. 2 Department of Pneumology, University Hospital of Zurich, Zurich, Switzerland
  3. 3 Center of Interdisciplinary Sleep Research, University of Zurich, Zurich, Switzerland
  4. 4 Oxford Centre for Respiratory Medicine and Oxford NIHR Biomedical Research Centre, Churchill Campus, Oxford University, Oxford, UK
  1. Correspondence to Dr Malcolm Kohler, Department of Pulmonology, University Hospital Zurich, Zurich 8091, Switzerland; malcolm.kohler{at}


Background The Epworth Sleepiness Scale (ESS) is a widely used tool for assessing sleepiness in patients with obstructive sleep apnoea (OSA). We aimed to estimate the minimal important difference (MID) in patients with OSA.

Methods We used individual data from three randomised controlled trials (RCTs) in patients with OSA where the preintervention to postintervention change in ESS was used as a primary outcome. We used anchor-based linear regression and responder analysis approaches to estimate the MID. For anchors, we used the change in domains of the Functional Outcomes of Sleep Questionnaire and 36-Item Short Form Health Survey. We also used the distribution-based approaches Cohen’s effect size, SE of measurement and empirical rule effect size to support the anchor-based estimates. The final MID was determined by triangulating all estimates to a single MID.

Findings A total of 639 patients with OSA were included in our analyses across the three RCTs with a median (IQR) baseline ESS score of 10 (6–13). The median (IQR) ESS change score overall was −2 (−5 to 1). The anchor-based estimates of the MID were between −1.74 and −4.21 points and estimates from the responder analysis were between −1 and −3 points. Distribution-based estimates were smaller, ranging from −1.46 to −2.36.

Interpretation We propose an MID for the ESS of 2 points in patients with OSA with a disease severity from mild to severe. This estimate provides the means to plan trials and interpret the clinical relevance of changes in ESS.

Trial registration number Provent, NCT01332175; autoCPAP trial, NCT00280800; MOSAIC, ISRCTN (3416388).

  • sleep apnoea

Key messages

What is the key question?

  • We aimed to provide an estimate of the minimal important difference of the Epworth Sleepiness Scale (ESS) that can be used in practice to determine an important change in sleepiness in patients with a wide spectrum of obstructive sleep apnoea (OSA) severity who are undergoing different interventions.

What is the bottom line?

  • The presented manuscript, based on individual data from three different randomised controlled trials, suggests the use of a single minimal important difference of 2 points for the ESS in patients with OSA.

Why read on?

  • It facilitates the calculation of required sample sizes in the planning of future trials, as well as providing the means to interpret the clinical relevance of changes in daytime sleepiness following an intervention in both research and clinical practice.


Excessive daytime sleepiness is a key symptom of obstructive sleep apnoea (OSA) and can greatly impact patients’ everyday lives and well-being.1 2 Because of this high clinical relevance, sleepiness is frequently assessed as a primary outcome in intervention studies. One method for assessing sleepiness is the patient-reported Epworth Sleepiness Scale (ESS).3 4 The ESS is a simple, 8-item self-reported questionnaire in which patients rate how likely they are to doze off or fall asleep during sedentary activities. It is an attractive alternative to objective methods for assessing sleepiness, such as the multiple sleep latency test, which is conducted in a sleep laboratory and takes a full day to complete,5 since it is more practical and cost-efficient.

Investigations into the measurement properties of the ESS have shown it to have good test–retest reliability,6 internal consistency7 and moderate construct validity.8 9 However, only one study has investigated the minimal important difference (MID) of the ESS in patients with OSA.10 This study estimated the MID as falling between −2 and −3 points in patients with OSA undergoing CPAP for 3 months. While this provides some insight into the ESS MID, a range of estimates can be difficult to use in practice where a single estimate is required. Since there is no evidence in favour of either estimate, the final choice of MID is unclear. Furthermore, the MIDs were estimated by relating the change in ESS to a global rating of change in sleepiness questionnaire, with the score based on how much the respondents feel their sleepiness has improved from baseline. This may be susceptible to recall bias11 and can be influenced when patients are not blinded to the treatment.12 Although specific to sleepiness, it relies on a single question and answer and is not a validated instrument with an existing MID in OSA. When establishing an MID, it is also important to repeat the estimation in different populations of a disease because the MID can vary depending on the setting and population characteristics.13

Although the ESS is commonly used both in clinical practice and in randomised controlled trials (RCTs), the interpretation of changes in sleepiness remains challenging without a well-established MID. We aimed to provide an estimate of the minimum change in sleepiness measured by the ESS that is clinically important and that can be used to guide practice and design of future trials, in patients with a wide spectrum of OSA severity.



We used individual data from three separate RCTs of patients with OSA. RCT 1 (The Multi-centre Obstructive Sleep Apnoea Interventional Cardiovascular trial (MOSAIC)) investigated CPAP therapy in minimally symptomatic patients with OSA in the UK and Canada, with a follow-up of 6 months.14 Patients were aged between 45 and 75 years and had an oxygen desaturation index (ODI) >7.5/hour, but did not have daytime symptoms considered by patient and physician to be sufficient to require CPAP therapy. RCT 2 (autoCPAP) was an equivalence trial comparing the efficacy of continuous automatic mask pressure adjustment to conventional fixed mask pressure for reducing sleepiness in patients with moderate to severe OSA in Switzerland, with follow-up assessments at 3, 12 and 24 months.15 Patients were aged 18 to 75 years, had an ESS score ≥8 points, an apnoea/hypopnoea index (AHI) ≥10/hour and had completed a 2 to 4 week CPAP adaption period. RCT 3 (Provent) investigated the efficacy of Provent, an expiratory nasal resistance valve, for preventing the reoccurrence of OSA following withdrawal of CPAP therapy in patients with moderate to severe disease in the UK and Switzerland, with a follow-up of 2 weeks.16 Patients were aged 20 to 75 years, had an ODI >10/hour and had received CPAP treatment for at least 12 months prior to the trial with an average compliance of ≥4 hours per night.

Epworth Sleepiness Scale

All three RCTs assessed the change in ESS as a primary outcome. A total score was calculated for the ESS by summing the scores of the eight items, ranging from 0 to 24 points with higher values reflecting a higher level of sleepiness. The change score was calculated as the difference between the total ESS score at baseline and follow-up. For the purpose of our analyses, in the autoCPAP study the primary change in ESS was calculated as the difference between scores at baseline and 3 months, because the observed ESS change score was similar between the different time points and this choice retained the largest possible sample size.
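The scoring described above can be sketched in a few lines. This is an illustration only: it assumes the standard ESS item scoring of 0 to 3 per item, and it assumes the change score is taken as follow-up minus baseline so that a negative value means reduced sleepiness, which matches the negative MID estimates reported in this article. The function names and the example patient are hypothetical.

```python
def ess_total(items):
    """Sum the eight ESS item scores (assumed 0-3 each) into a total of 0-24."""
    if len(items) != 8:
        raise ValueError("the ESS has exactly eight items")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each ESS item is scored 0, 1, 2 or 3")
    return sum(items)


def ess_change(baseline_items, followup_items):
    """Follow-up minus baseline (assumed sign convention): negative = less sleepy."""
    return ess_total(followup_items) - ess_total(baseline_items)


# Hypothetical patient, not trial data.
baseline = [2, 1, 2, 3, 1, 0, 2, 1]
followup = [1, 1, 2, 2, 1, 0, 1, 1]
print(ess_total(baseline), ess_total(followup), ess_change(baseline, followup))
```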

Other outcomes

Other outcomes assessed at the same time points were considered as potential anchor variables if they were already validated for use in patients with OSA and had an MID. Outcomes assessed in autoCPAP and MOSAIC were the EuroQol 5 Dimensions Questionnaire (EQ-5D)17 and 36-Item Short Form Health Survey (SF-36),18 which both assess general health-related quality of life (HRQoL). The five dimensions of the EQ-5D are summarised to a utility index score ranging from −0.208 (worst possible health) to 1.000 (best possible health) and patients also rate their health on a Visual Analogue Scale on a scale of 0 (the worst health you can imagine) to 100 (the best health you can imagine). The SF-36 is composed of eight domains and two summary components on a scale of 0 to 100, with higher scores indicating better health. We used an MID of 5 points for the SF-36, based on previous estimates in COPD and rheumatoid arthritis.19–21 In autoCPAP and Provent, we used the Functional Outcomes of Sleep Questionnaire22 (FOSQ), which is specific to sleep disorders and assesses the impact of sleepiness on activities of everyday life. The FOSQ total score has a scale of 5 to 20 points with an MID of 0.75 and 5 domains measured on a scale of 1–4 with an average MID of 0.3 points, with higher scores indicating a smaller effect of sleepiness.23 The intimacy domain was not routinely assessed in the RCTs and therefore was not analysed.

Statistical analysis

We used a combination of anchor-based and distribution-based approaches to calculate MID estimates. Where possible, we pooled data from the RCTs. For the anchor-based analyses, data were pooled where studies assessed the same anchor outcomes, resulting in two datasets: (1) autoCPAP + MOSAIC (n=574) and (2) autoCPAP + Provent (n=267). As the distribution-based analyses use only the ESS score, these were conducted in all three RCTs pooled together as well as in each RCT individually.

The primary anchor-based method used linear regression with the change in ESS score as the dependent variable and change in anchor variable as the independent variable. The resulting coefficient for the anchor was multiplied by the MID of the anchor and added to the intercept coefficient to determine the change in ESS that is mathematically equivalent to an important change in the anchor.24 Domains were used as anchors if they had a meaningful Pearson correlation coefficient (≥0.3)13 with the change score of the ESS. In autoCPAP (treatment and placebo groups), we used the anchor variables most strongly correlated with the ESS to define patients as responders or non-responders based on whether they improved by more than the anchor MID or not. We calculated receiver operating characteristic (ROC) curves from logistic regression models with the response classification as the dependent variable and ESS change score as the independent variable. The optimal cut-point (MID) was determined as the ESS change score with the highest sensitivity (true positives) and specificity (true negatives) for classification of the anchor response, with equal weighting for sensitivity and specificity.25
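The regression step can be illustrated with a minimal sketch. It assumes ordinary least squares with a single anchor, screens the anchor on the absolute value of the Pearson correlation (an assumption, since the reported correlations with the ESS are negative), and uses invented data. The function names are hypothetical and this is not the Stata code used for the analyses.

```python
import math


def _centred_sums(x, y):
    """Return the means and the centred sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    _, _, sxx, syy, sxy = _centred_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def anchor_based_mid(anchor_change, ess_change, anchor_mid, min_abs_r=0.3):
    """ESS change equivalent to one anchor MID: intercept + slope * anchor_mid.

    Returns None when the anchor is too weakly correlated to be used.
    """
    if abs(pearson_r(anchor_change, ess_change)) < min_abs_r:
        return None
    mean_x, mean_y, sxx, _, sxy = _centred_sums(anchor_change, ess_change)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * anchor_mid


# Invented change scores for five patients (anchor on a 0-100 scale).
anchor_change = [0, 5, 10, 15, 20]
ess_change = [0, -1, -3, -4, -5]
print(round(anchor_based_mid(anchor_change, ess_change, anchor_mid=5), 2))
```

Including the intercept matters: it represents the ESS change expected when the anchor does not change at all, so the estimate is not simply the slope scaled by the anchor MID.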

We used three distribution-based methods to support the anchor-based approaches: Cohen’s effect size (0.5*SDΔ), empirical rule effect size (0.08*6*SDΔ) and SE of measurement (SEM) (SDbaseline*sqrt[1-intraclass correlation coefficient (ICC)]). For the SEM, we calculated the ICC from a random-effects model of two ESS assessments taken at diagnosis and study inclusion in the Provent study, for patients with less than 20 days between the two assessments.
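The three formulas can be written out directly. The sketch below is only an illustration: the change-score SD of 4.1 is the value reported for RCT combination 1 and the ICC of 0.75 is the triangulated value reported in the results, but the baseline SD of 4.5 is a hypothetical placeholder, so the printed numbers are not the estimates in table 5. Function names are hypothetical.

```python
import math


def cohen_effect_size_mid(sd_change, effect_size=0.5):
    """Cohen's effect size approach: 0.5 x SD of the change scores."""
    return effect_size * sd_change


def empirical_rule_mid(sd_change):
    """Empirical rule effect size: 0.08 x 6 x SD of the change scores."""
    return 0.08 * 6 * sd_change


def sem_mid(sd_baseline, icc):
    """SE of measurement: SD at baseline x sqrt(1 - ICC)."""
    if not 0 <= icc <= 1:
        raise ValueError("ICC must lie between 0 and 1")
    return sd_baseline * math.sqrt(1 - icc)


sd_change = 4.1    # change-score SD reported for RCT combination 1
sd_baseline = 4.5  # hypothetical baseline SD, not taken from the article
icc = 0.75         # triangulated ICC reported in the results

print(round(cohen_effect_size_mid(sd_change), 2))
print(round(empirical_rule_mid(sd_change), 2))
print(round(sem_mid(sd_baseline, icc), 2))
```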

We used recommended methods for triangulating all estimates to a single final MID estimate.26 A consensus was reached between the investigators by judging the importance of each estimate based on several criteria: the quality of the anchor MID, responsiveness of the anchor, the statistical relationship and similarity of content between the anchor and ESS, the size and characteristics of the population we estimated the MID in, and the statistical method we used. Analyses were conducted using Stata for Mac (version 14.1; StataCorp, College Station, Texas, USA), except for the ROC curves analyses, which were conducted using R (version 3.4.1;


Five hundred and seventy-four patients were included in the first pooled RCT combination (MOSAIC (n=372) plus autoCPAP (n=202)) and 267 in the second (Provent (n=65) plus autoCPAP). Baseline characteristics of each RCT combination are shown in table 1. ESS score improved by 2.4 (SD 4.1) points in RCT combination 1 and by 3.9 (SD 4.6) points in RCT combination 2 (table 2). Baseline and follow-up change scores for potential anchor instruments are presented in table 2.

Table 1

Baseline patient characteristics for each RCT combination

Table 2

Summary of ESS scores and potential anchor instruments

Pearson correlation coefficients between the change in ESS score and change in potential anchor instruments are shown in table 3. The SF-36 energy/vitality and physical component domains met the methodological criterion for use as an anchor (correlation strength ≥0.3) in study combination 1. In study combination 2, the FOSQ general productivity, activity level and vigilance domains, and total score were all negatively correlated <−0.5, whereas the social outcome domain was less strongly correlated (r=−0.31). MID estimates based on these anchors were lower when estimated by the SF-36 (−1.74 and −2.66) compared with the FOSQ (−3.03 to −4.21) (table 4).

Table 3

Pearson correlation coefficients between ESS ∆ and potential anchor ∆ scores

Table 4

MID estimates and 95% CIs for the ESS based on anchors with correlations ≥0.3

The SF-36 energy/vitality domain, FOSQ total score, activity level domain and vigilance domain were used to define responders/non-responders. Based on the anchor MIDs (SF-36=5,20 21 27 FOSQ total score=0.75, FOSQ domains=0.323), 402 out of 574 patients were classified as responders on the SF-36 energy/vitality domain, and 156, 139 and 136 out of 267 patients on the FOSQ total score, activity level domain and vigilance domain, respectively. Results of the ROC curve analyses are displayed in table 4, where the MID estimates were higher based on the FOSQ (≥2 points) than the SF-36 (1 point). Graphical displays of the ROC curves with area under the ROC curve (AUC), sensitivity and specificity for each anchor can be seen in figure 1. MID estimates were between 1 and 3 points. A cut-point of 2 points in the ESS was identified as the ESS change score that best discriminated between responders and non-responders based on the FOSQ total score MID. This estimate had the highest AUC out of the four anchors analysed (0.82).

Figure 1

Receiver operating characteristic (ROC) curves identifying the change in ESS that best classifies responders and non-responders based on a change of more than or equal to anchor MID. FOSQ total score and domains are based on data from autoCPAP and Provent, and SF-36 vitality is based on autoCPAP and MOSAIC, using the 3-month follow-up in autoCPAP. Each plot shows the sensitivity and specificity, and the AUC (95% CI). AUC,  area under the ROC curve; ESS, Epworth Sleepiness Scale; FOSQ, Functional Outcomes of Sleep Questionnaire; SF-36, 36-Item Short Form Health Survey.
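The cut-point selection behind ROC curves of this kind can be sketched as follows. This is a simplified illustration, not the R code used for the analyses: it thresholds the raw ESS change score directly rather than fitting a logistic model (with a single predictor the ranking of patients should be the same), it weights sensitivity and specificity equally by maximising their sum (Youden's J), and it treats a more negative change as improvement. The data and the function name are invented.

```python
def youden_cut_point(ess_change, responder):
    """Return (cut_point, sensitivity, specificity) maximising sens + spec - 1.

    A patient is predicted to be a responder when ess_change <= cut_point.
    Ties are resolved in favour of the most negative cut-point.
    """
    n_pos = sum(bool(r) for r in responder)
    n_neg = len(responder) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one responder and one non-responder")
    best = None
    for cut in sorted(set(ess_change)):
        predicted = [c <= cut for c in ess_change]
        tp = sum(p and bool(r) for p, r in zip(predicted, responder))
        tn = sum((not p) and (not r) for p, r in zip(predicted, responder))
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cut, sens, spec)
    return best[1], best[2], best[3]


# Invented data: ESS change scores and anchor-defined responder status.
ess_change = [-6, -5, -4, -3, -2, -2, -1, 0, 1, 2]
responder = [True, True, True, False, True, True, False, True, False, False]
print(youden_cut_point(ess_change, responder))
```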

Only 13 patients from the Provent RCT provided data sufficient for calculating the ICC (≤20 days between tests), giving an ICC of 0.65. Therefore, we identified two previous studies that calculated the ICCs as 0.78 and 0.81 in translated versions of the ESS in OSA28 29 and triangulated these with our ICC to a single ICC of 0.75 to use in our calculation of the SEM. All distribution estimates are shown in table 5. The estimates were mostly consistent across studies with estimates around two points. Overall, the distribution-based estimates were lower than the anchor-based estimates. The estimates calculated for each follow-up time point of the autoCPAP trial show minor increases of between 0.02 and 0.05 as the length of follow-up increases (table 5).

Table 5

Distribution-based estimates of the MID


This is the first study to provide a single estimate of the MID for the ESS in OSA and our results show that the MID of the ESS is 2 points in this patient population. This MID was calculated using a combination of anchor-based and distribution-based approaches in a broad population of patients with OSA from three RCTs. Anchor-based estimates were slightly higher than distribution-based estimates.

Although change score correlations were too low to use the EQ-5D as an anchor, we observed moderate correlations between the ESS and FOSQ (r≥0.5 except for the social outcome domain) and slightly weaker correlations with the SF-36 (≥0.3), where only two domains were considered as anchors. This was expected as the FOSQ assesses the impact of sleepiness specifically on daily activities, whereas the SF-36 and EQ-5D measure general HRQoL. While the anchor-based estimates differed in the two populations, the two domains of the anchors used that best reflect the content of the ESS were the FOSQ activity level and SF-36 energy/vitality domains. The estimates using these domains were on the smaller end within each population at −3.45 (95% CI −4.19 to −2.72) and −1.74 (95% CI −2.18 to −1.30), respectively. The distribution-based estimates were on average 1.9 and 1 point lower than the anchor-based estimates, with the estimates from the pooled data of all three RCTs being higher than the single RCTs. This is because a greater variability in change scores is represented, which may better reflect the true MID.13 Examination of the distribution-based estimates revealed no evidence for different MIDs dependent on the follow-up time. Provent (2 weeks) and MOSAIC (6 months) gave almost identical estimates, and in autoCPAP estimates only increased negligibly as the length of follow-up increased (table 5). There was also no clear evidence for different MIDs between studies which included patients with different OSA severities; however, this should be explored in further studies with a combination of anchor and distribution methods used.

The triangulation of all estimates weighted according to our criteria led to a final MID of 2 points. In the anchor-based estimates, we gave less weight to the FOSQ estimates because the MID used is not well established, since it was calculated in a single study using only one distribution approach.23 On the other hand, the SF-36 MID we used comes from several studies where several methods have been used to calculate it. Although the MID has not been established in patients with OSA, one study estimated the MID to be between 1 and 5 points in patients with rheumatoid arthritis,21 who often experience symptoms of daytime sleepiness and may have a similar MID to patients with OSA.30 We also gave more weight to study combination 1 (autoCPAP and MOSAIC) as the total sample size was much larger (n=574 vs n=267) and covers a wider range of patient severities. The distribution of change scores was also more evenly spread than in study combination 2 (autoCPAP and Provent), where fewer patients deteriorated. In study combination 2, the ESS was generally more responsive relative to the FOSQ, with fewer patients improving by more than the FOSQ domain MID than by more than the ESS MID. This could lead to an overestimation of the ESS MID since a larger improvement in ESS will be seen for a 1 MID improvement in the FOSQ domain. This is also true for the SF-36 physical component score and reflects the anchors which gave larger MID estimates. Based on an ESS MID of 2 points, 72% of patients improved in study combination 1 and 55% of patients improved in study combination 2. Anchor-based methods are generally considered to be superior to distribution-based methods as distribution methods are influenced by the data they are estimated in, and can underestimate the MID if based on a single study where strict inclusion criteria limit the population.
In our data, the distribution estimates are less likely to be biased by these limitations as we pooled data from three RCTs with very different inclusion criteria. Therefore, we considered these estimates to provide valuable information about the MID and considered them equally in the triangulation. Overall, after consideration of all of these criteria, we came to a consensus that 2 points would be the appropriate estimate as the evidence for a higher MID from the FOSQ was not strong enough to increase the estimate to 3 points.

Our estimate of 2 points is similar to a previous study where an important improvement was estimated to be a change of between −2 and −3 points.10 As this study was unable to determine a single value, interpretation of change remained complicated by uncertainty around which MID to use. If 3 points were prespecified as the MID in the design of an RCT, an overall group improvement of 2.5 points would be identified as not being clinically significant, whereas our results indicate that a change of 2 points is sufficient and that 3 points is too large. Using an MID that is larger than the ‘true’ MID in a sample size calculation for an RCT would also result in a smaller sample size requirement, and the trial would then lack the power to detect a statistically and clinically significant effect. One study that investigated the ESS in narcolepsy suggested a 25% change from baseline to be the threshold for improvement.31 Applied to the three RCTs in our study, this would equate to a change of 2 points in Provent and MOSAIC, which is in agreement with our MID estimate, and 3.3 points in autoCPAP, which is higher than our estimate. This difference may reflect the larger change in ESS score seen in this study of narcolepsy (mean decrease of 8 points) compared with OSA, and 25% is likely to be too high for patients with OSA.
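The effect of the chosen MID on the required sample size can be illustrated with the usual normal-approximation formula for comparing two group means, n per group = 2(z for two-sided α + z for power)² × SD² / MID². This is a sketch under stated assumptions rather than a design recommendation: a two-arm parallel trial with equal allocation, a two-sided α of 0.05, 80% power, and an SD of 4.1 borrowed from the change-score SD reported for RCT combination 1. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(mid, sd, alpha=0.05, power=0.80):
    """Patients per arm needed to detect a between-group difference of `mid`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mid ** 2
    return math.ceil(n)


sd = 4.1  # illustrative: change-score SD reported for RCT combination 1
for mid in (2, 3):
    print(mid, n_per_group(mid, sd))
```

Under these assumptions, designing around 3 points instead of 2 more than halves the number of patients per arm, which is why a trial planned with the larger value would be underpowered to detect a true difference of 2 points.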

Improving the assessment of sleepiness in patients with OSA was recently identified as a key area that future research should prioritise.32 Since the ESS is an important instrument for assessing sleepiness, it is especially important to determine its MID. The lack of a well-established MID means that, up until now, researchers using the ESS as an outcome had to make an a priori assumption about the change in ESS needed to determine the clinical significance of an intervention or rely on statistical significance. This results in an inconsistency of criteria used across studies, which can influence the final conclusion of the results and limits comparability. For example, out of the three RCTs used in this study, one assumed a change of 2 points to be important,15 and two did not report a prespecified criterion and relied on statistical significance of the treatment effect.14 16 Our robust calculation of the MID means that clinicians and researchers can now confidently use an MID that is based on strong evidence. The MID of 2 points can be used in the preparation of clinical trials to calculate the sample size required and determine the treatment effect needed to see an improvement. Clinicians can also use this MID in the management of patients over time to identify worsening of the disease or to assess response to treatment.

A major strength of this study is that we were able to include a wide spectrum of patients with OSA with different disease severities in several countries. Across the intervention and control groups in the three RCTs, patients underwent several different treatments including standard CPAP, autoadjusting CPAP, expiratory nasal resistance valve therapy after withdrawal from CPAP (actual and placebo) or standard care (without CPAP). This means that our results can be generalised to a wide range of patients and intervention study designs. We were able to pool data from the three RCTs, which increases the variability of between-patient responses. This leads to stronger correlations between the change scores of the ESS and anchors as well as higher ESS baseline and change score SDs, which in turn increases the validity of our results.33 The pooled data and increased variability also avoid underestimation of the MID in the distribution approaches. Another strength of our study is that we used a variety of methodological approaches to produce a large number of estimates (27) on which to base our findings. We followed the US Food and Drug Administration guidance for patient-reported outcome development, which recommends basing an important difference on several approaches, including anchor-based approaches using a responder definition (ie, anchor MID) and distribution-based approaches to provide supporting information.12 A limitation of our study is that we were not able to pool all three RCTs into one dataset for the anchor-based approaches due to the different instruments assessed in each RCT. Although we were still able to combine two studies for each instrument, we may have seen stronger correlations with three different populations combined.
Another limitation is that we relied on measures of HRQoL as anchors and were not able to use an anchor that reflects sleepiness specifically, such as an objective measure of daytime sleepiness, which could only be used as an anchor if it had an established MID. Using anchors that measure the same (or similar) construct increases the validity of the MID estimate and reduces the chance of misleading results. However, HRQoL is known to be associated with the degree of sleepiness in OSA, and in our data the SF-36 and FOSQ met the requirements for being an anchor (having an appreciable association and being interpretable).24 Where possible, future studies should look at measures directly relating to sleepiness as anchors. While the ESS has been found to have good reliability when assessed 7 days apart,6 a recent study identified that patients with OSA had a high within-patient variability in scores when not undergoing an intervention.34 Although all patient-reported outcomes will have some measurement error, it is necessary to know whether change is due to measurement error or due to natural fluctuation in disease status or intervention.


In conclusion, we suggest using an MID of 2 points for the ESS in patients with OSA. This MID is applicable to patients with a broad spectrum of OSA severity and to different study designs. Our estimate will facilitate the planning of clinical trials and provides the means to interpret the clinical relevance of changes in daytime sleepiness following interventions in patients with OSA.



  • Contributors SC, NAS and MK had full access to all data in the study. SC, NAS, KEB, JRS, AF, MAP and MK take responsibility for the integrity of the data and the accuracy of the data analysis. MAP and MK contributed to study design and obtained funding for the study. KEB, JRS and MK contributed to data collection. All authors contributed to analysis and interpretation of data. SC, AF, MAP and MK contributed to writing of the report. NAS, KEB, JRS, AF, MAP and MK contributed to critical revision.

  • Funding This work was supported by the Swiss National Science Foundation (32003B_162534) and an unrestricted grant by Bayer, Germany.

  • Competing interests MK and JRS declare advisory board fees from Bayer. Other authors have no competing interests to declare.

  • Patient consent Not required.

  • Ethics approval Research ethics committees in Zurich and Oxford.

  • Provenance and peer review Not commissioned; externally peer reviewed.