Introduction

Prolonged reductions in muscle strength and function are concerning disabilities following critical illness [1]. Increasingly, interventions aimed at preventing and minimising these impairments are the focus of research studies [2]. Muscle wasting occurs early and rapidly in the intensive care unit (ICU) setting [3]. A conceptual framework, the International Classification of Functioning (ICF), encompasses three core domains: impairment, activity limitations and participation restriction [4, 5], and has been proposed as a model in which measures can be organised. There are three commonly used study endpoints that sit within this model: muscle mass, strength (body systems) and function (activity limitations).

Muscle mass, strength and function are highly interconnected entities. Muscle mass is a passive non-volitional outcome which enables quantification of muscle morphology, and may relate to measurement of muscle strength and the development of intensive care unit-acquired weakness (ICU-AW) [1]. Muscle strength provides greater detail on the patient’s level of impairment, as it is a dynamic measure. At the top of the hierarchy is function, which is the most patient-centred outcome and provides information on activity limitation within the ICF framework. The measurement of function is complex, containing information about task completion (cognition), coordination, processing of visual information and central motor drive, and the activation of signalling pathways from the motor cortex to the muscle [6]. Measurement of strength and function requires patients to be alert and able to cooperate with testing. This is in contrast to measurement of muscle mass, which can be quantified using non-volitional methods such as ultrasonography.

When selecting the most appropriate measure to evaluate efficacy and change over time, clinicians and researchers need to consider whether the clinimetric properties of the measure of interest have been established. Reliability determines the ability of an instrument to obtain accurate results, which are free from measurement error when the instrument is repeated by multiple assessors (inter-rater reliability) or longitudinally (intra-rater reliability or test–retest reliability) [7, 8]. Validity determines the ability of an instrument to measure what it is intended to measure, i.e. how well an instrument obtains data, as hypothesised, when compared to an instrument measuring a similar construct (construct validity–hypotheses testing); how well an instrument performs in comparison to the “gold standard” measure (criterion–concurrent validity); and how well data from an instrument predicts a future score or outcome (criterion–predictive validity) [7, 8]. Responsiveness refers to the ability of an instrument to detect a true change in the score obtained which is statistically or clinically meaningful over time [8]. There are two main methods used to determine the minimal important difference (MID): a distribution-based method and an anchor-based method [9]. The anchor-based method takes into account the patient perception of change using anchors such as much worse and much better in a scale such as the global rating of change scale [9]. Measures developed for one setting or patient population should only be extrapolated with caution [8]. In ICU, the environment, patient alertness, sedation, delirium and severity of illness, time, resources and expertise are factors which influence the choice of measure [10], as well as the clinimetric properties of the measures [11].
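As an illustration of the distribution-based approach mentioned above, the following minimal sketch computes three quantities that are commonly derived from the spread and reliability of scores: the standard error of measurement (SEM), the minimal detectable change at 95 % confidence (MDC95) and a half-standard-deviation estimate of the MID. The formulas are widely used conventions and are assumed here for illustration; they are not necessarily those applied in the studies cited in this review, and the function names and example scores are hypothetical.

```python
import math
import statistics


def standard_error_of_measurement(scores, icc):
    """SEM = SD * sqrt(1 - ICC), using the sample standard deviation."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return statistics.stdev(scores) * math.sqrt(1.0 - icc)


def minimal_detectable_change_95(sem):
    """MDC95 = 1.96 * sqrt(2) * SEM (change exceeding measurement error)."""
    return 1.96 * math.sqrt(2.0) * sem


def half_sd_mid(scores):
    """A common distribution-based MID estimate: half the baseline SD."""
    return 0.5 * statistics.stdev(scores)


if __name__ == "__main__":
    # Hypothetical baseline scores on a 0-10 scale (illustrative only).
    baseline = [2, 4, 4, 4, 5, 5, 7, 9]
    sem = standard_error_of_measurement(baseline, icc=0.75)
    print("SEM:", round(sem, 2))
    print("MDC95:", round(minimal_detectable_change_95(sem), 2))
    print("Half-SD MID:", round(half_sd_mid(baseline), 2))
```

An anchor-based estimate would instead require a patient-reported anchor, such as the global rating of change scale, and cannot be derived from the score distribution alone.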

To date, there has been no systematic, comprehensive evaluation and synthesis of measures used to assess muscle mass, strength and function in the critically ill across the continuum of recovery including examination of the clinimetric properties of these measures. There have been two published systematic reviews addressing use of outcome measures in the critically ill [12, 13]. However, these reviews are either focused on one specific aspect of clinimetric evaluation (e.g. only reliability [12]), or one type of outcome of interest (e.g. only physical function [13]).

Therefore, the objectives of this review were to:

  • identify measures which are used to evaluate muscle mass, strength and function in the critically ill population, at any point along the trajectory of critical illness recovery (including in ICU, hospital, and post-hospitalisation settings) and

  • evaluate, synthesise and compare the clinimetric properties of the measures identified.

The consensus-based standards for the selection of health status measurement instruments (COSMIN) and the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines were followed (http://www.prisma-statement.org).

Methods

Protocol

The review was registered on PROSPERO (CRD4201400893). The search for this systematic review was conducted in two parts. Part 1 involved the identification of measures that have been used to evaluate muscle mass, strength and function in the critically ill. This first search allowed a list of measures to be generated. Part 2 involved a second search conducted to identify papers examining the clinimetric properties of measures identified in part 1.

Part 1: identification of measures

Five electronic databases were searched by one reviewer using a systematic, comprehensive and reproducible search strategy [Electronic Supplementary Material (ESM) Table E1]. Electronic databases were accessed via The University of Melbourne, Australia library with the last search run on 17 October 2014. Two independent reviewers determined eligibility against pre-determined criteria (Table 1). A list of measures was generated from the results of part 1.

Table 1 Study eligibility for inclusion in systematic review (parts 1 and 2)

Part 2: clinimetric properties of measures

Five electronic databases were searched by one reviewer (S.P.) (Fig. 1) with the last search run on 17 October 2014. The search filter adopted (ESM Table E1) was based on guidelines provided by Terwee and colleagues [14] for systematic reviews examining the clinimetric properties of measures. The study selection and data extraction followed the same methodology as described for part 1. Two independent reviewers (S.P., C.G.) used the COSMIN checklist, a validated tool, to evaluate the risk of bias of the included studies from part 2 [15]. Each study was evaluated on the relevant item(s) of the COSMIN checklist (reliability; measurement error; hypotheses testing; criterion validity and responsiveness). An overall quality score for each item was obtained by using the lowest score recorded [15]. The agreement between reviewers was estimated using percentage agreement and the kappa statistic [16].
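The agreement statistics and the lowest-score rule described above can be sketched as follows. This is a minimal illustration only: the reviewer decisions are hypothetical, the function names are not taken from any published tool, and the rating order (poor, fair, good, excellent) is assumed from the four-point COSMIN scale.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Proportion of items on which two raters gave the same rating."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)


def cohen_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    p_observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(counts1) | set(counts2)
    # Chance agreement from the product of each rater's marginal proportions.
    p_expected = sum(counts1[c] * counts2[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Assumed ordering of the four-point rating scale, worst to best.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def overall_quality(item_ratings):
    """Lowest rating recorded across the items of one checklist box."""
    return min(item_ratings, key=RATING_ORDER.index)


if __name__ == "__main__":
    # Hypothetical screening decisions from two reviewers (illustrative only).
    reviewer_a = ["include"] * 4 + ["exclude"] * 6
    reviewer_b = ["include"] * 3 + ["exclude"] * 7
    print("Agreement:", percent_agreement(reviewer_a, reviewer_b))
    print("Kappa:", round(cohen_kappa(reviewer_a, reviewer_b), 2))
    print("Overall:", overall_quality(["good", "fair", "excellent"]))
```

Kappa is lower than raw percentage agreement whenever some agreement would be expected by chance alone, which is why both statistics are usually reported together.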

Fig. 1

Flow diagram of clinimetric properties search—part 2. ADL activities of daily living, Ax assessment, CINAHL cumulative index to nursing and allied health literature, CPAx Chelsea critical care physical assessment tool, EMBASE the Excerpta Medica database, FAC functional ambulatory category, FIM Functional Independence Measure, FSS-ICU functional status score for the intensive care, IADL independence in activities of daily living, ICU intensive care unit, ISWT incremental shuttle walk test, n number, PEDRO physiotherapy evidence database, PFIT physical function in intensive care test, RMI Rivermead mobility index, SOMS surgical intensive care unit optimal mobilisation score, TUG timed-up-and-go-test, WT walk test, 6MWT 6-min walk test

Results

Part 1: identification of measures

A list of 33 measures was generated (muscle mass n = 3, strength n = 4 and function n = 26) (ESM Figure E1; Fig. 1). Percentage of agreement for title and abstract was 96 % (κ = 0.90) and for full-text was 93 % (κ = 0.86).

Part 2: clinimetric properties of measures

Study selection and study characteristics

A total of 47 articles were included (Fig. 1). Percentage of agreement for title and abstracts was 91 % (κ = 0.82) and for full-text was 92 % (κ = 0.80). The characteristics of included studies are summarized in Table 2.

Table 2 Study characteristics of studies included in part 2

Outcome measures

Clinimetric properties evaluated by studies were: reliability (studies n = 16), measurement error (n = 4), construct validity (hypothesis testing) (n = 31), criterion-predictive validity (n = 18) and responsiveness (n = 11) (Tables 3, 4, ESM Tables E2–4).

Table 3 Results: synthesis of evidence regarding clinimetric properties (comparison of outcome measures)

Risk of bias

Percentage agreement for risk of bias assessment between reviewers was 97 % (κ = 0.95). Overall, studies scored “fair” or “poor” for the measurement properties evaluated (ESM Table E2). The worst scored area amongst studies included was design requirements (sample size and lack of a priori hypotheses).

Study results

Study results are summarised in Table 3 and described in the following sections. Individual study results are presented in ESM Tables E3–5. Overall, ultrasonography, dynamometry, the physical function in intensive care test scored (PFIT-s) and the Chelsea critical care physical assessment tool (CPAx) performed the best in terms of clinimetric properties (Table 3).

Muscle mass

In the ICU, muscle mass was evaluated using three different approaches: anthropometry, bioimpedance spectroscopy (BIS) and ultrasonography (Fig. 1; Table 3). The reliability and measurement error of anthropometry have not been examined in individuals with critical illness. Circumference measures of limb size were not sensitive to change over time [22, 27] (Table 3; ESM Table E5). BIS had high intra-session and test–retest reliability when using the SFB7 Bioimped device (ICC > 0.94) [20]. A moderate to excellent relationship was established between BIS and quadriceps thickness (r² = 0.61, p ≤ 0.001) [17]. There is conflicting evidence for the predictive ability of both anthropometry and BIS in relation to mortality (Table 3; ESM Table E4).

There was excellent intra-rater reliability for measurement of muscle thickness (ICCs ≥ 0.98) and echogenicity (ICC > 0.90) using ultrasonography [17, 18] (Table 3). There was a fair to moderate correlation between upper limb muscle thickness and strength (r² = 0.43–0.52, p < 0.01) and little correlation for quadriceps thickness and strength (r² = 0.22, p = 0.07) [17] (Table 3). Muscle thickness was negatively correlated with ICU length of stay (LOS) (p < 0.001) [21] (Table 3; ESM Table E4). The criterion predictive validity of ultrasonography has not been examined, but ultrasonography was sensitive to changes in muscle thickness (with a reduction of 1.6–6 % per day in quadriceps thickness) over the ICU admission in two studies [22, 27] (Table 3; ESM Table E5).

Muscle strength

Strength has been evaluated using handgrip dynamometry (assessment of grip strength only), hand-held dynamometry (used to assess all other muscle groups), manual-muscle strength testing (MMT) and the chair-stand test (Fig. 1). The clinimetric properties of only three of these tests (hand-held dynamometry, handgrip dynamometry, MMT) have been evaluated (Table 3; ESM Tables E3–5), with reliability and measurement error the most extensively evaluated clinimetric constructs (Table 3, ESM Table E3).

Manual muscle strength using the Medical Research Council sum-score (MRC-SS) is the most commonly utilised measure for evaluating strength in the ICU setting (Table 2). Whilst there is excellent inter-rater reliability for overall MRC-SS [10, 34, 62], the inter-rater reliability for individual muscle group scores ranges from poor to excellent. Agreement for diagnosis of ICU-AW (< 48 out of 60) is inconsistent ranging from slight to substantial agreement in the ICU [10, 34, 62] and almost perfect in the ward setting [34]. The MRC-SS has a fair correlation with functional outcomes (Barthel index and elderly mobility scale) in terms of criterion validity; and there is conflicting evidence in the relationship of MMT with ICU and hospital LOS (Table 3). There is inconclusive evidence to determine if MMT can predict short- or long-term mortality (Table 3; ESM Table E4).

There is good to excellent intra- and inter-rater reliability for handgrip (ICC range 0.92–0.97) and hand-held dynamometry (ICC range 0.76–0.96). Measurement error for dynamometry was reported to be between 1.9 and 2.8 kg in one study [32]; however, external validation of these findings needs to be undertaken. Construct validity has only been reported for handgrip [30, 31], and is yet to be established for hand-held dynamometry (Table 3; ESM Table E4). Whilst good test performance has been described for the handgrip cut-off values developed for diagnosing ICU-AW [38], no external validation of these values has been undertaken.

Function

Function was evaluated using 26 different measures (Fig. 1), of which only 12 have been examined in terms of their clinimetric properties (Table 3; ESM Tables E3–E5). Six measures have been specifically developed for use in the ICU setting: CPAx [39, 47], PFIT-s [64], Perme mobility scale [42], ICU mobility scale [44], surgical intensive care unit optimal mobility scale (SOMS) [50] and the functional status score for the intensive care (FSS-ICU) [65]. Excellent reliability has been established for all these measures except the FSS-ICU tool (Table 3; ESM Table E3).

Construct and criterion predictive validity are established for the PFIT-s, CPAx and SOMS (Table 3; ESM Table E3). The PFIT-s tool has been validated in two independent patient settings in two different continents where both patient management and physiotherapy services differ [43, 64]. It had a fair to excellent correlation with the 6-min walk test (6MWT), MRC-SS, and timed-up-and-go test [64]. Additionally, higher PFIT-s scores on awakening were predictive of higher MRC-SS at ICU discharge [43, 64], and discharge to home [64]. The PFIT-s exhibits both floor and ceiling effects of around 20 %, and an MID of 1.5 points out of 10 has been established [2]. The CPAx at ICU discharge was able to discriminate between patient discharge destinations [39]. The SOMS was predictive of in-hospital mortality and had a moderate to good correlation with ICU LOS and handgrip strength [50]. The clinimetric properties of the 6MWT have been examined in one study post-hospital discharge, which demonstrated that patients could walk significantly further on the repeat 6MWT [51]. It is important to note that this testing was performed in a home-based setting. No clinimetric evaluation of the 6MWT in-hospital has been undertaken to date. There was a moderate to good correlation between 6MWT and the Short Form-36 (SF-36) physical function domain [40, 51]. There is an excellent correlation between 6MWT and timed-up-and-go test at 3 months post-ICU discharge [40]. No criterion predictive validity has been examined for the 6MWT.

The Katz activities of daily living (ADL) was the most widely utilized measure assessing function identified in part 1. The Katz ADL has not been examined in terms of reliability and measurement error specifically within the ICU setting. The Katz ADL has construct validity with SF-36 physical and mental function domains at 1-month [52], but no correlation with the Functional Independence Measure score or 12-month SF-36 scores [52, 54]. The Katz ADL is reported to be predictive of short-term mortality [41, 48, 60] but not longer-term mortality (3–6 months) [41, 53, 61].

Discussion

This systematic review focused on three commonly assessed endpoints in critical illness (muscle mass, strength and function), which represent body system impairments and functional limitations. Thirty-three different measures were identified; however, only 20 have published clinimetric properties. Ultrasonography, dynamometry, and the PFIT-s and CPAx had the strongest clinimetric properties among the measurement instruments for muscle mass, strength and function, respectively.

Based on this review, whilst anthropometry (circumference) is a simple method and easily obtainable, it should not be considered a primary end-point in clinical and research practice. It is not sensitive to change over time, due to other variables such as adiposity, oedema and hydration status affecting circumference measurement, particularly in the ICU setting [21, 66]. Non-ICU studies have demonstrated anthropometry is unreliable and under-represents muscle wasting [67].

Bioimpedance spectroscopy enables bedside quantification of body water and mass compartments including fat-free and fat mass measurements [20]. Whilst prediction equations and algorithms have been developed for some populations [68], it is recommended that raw data be utilized in ICU as no specific reference equation has been developed. There are also challenges with using BIS which need to be taken into consideration such as cost and factors which can affect impedance measurements such as fluid status and ability to obtain accurate height and weight measurements in ICU [20]. The responsiveness of BIS and what constitutes a clinically meaningful change in scores is unknown. However, because it is non-invasive, quick to use and non-volitional, further research is warranted.

The findings of this review indicate that ultrasonography has high responsiveness and excellent intra-rater reliability for measurement of muscle thickness [17] and echogenicity using the Heckmatt approach to quantify muscle echotexture changes [18]. The association between measures of muscle thickness and strength was only fair to good in one study [17]. This is in contrast to the findings in non-ICU studies where ultrasonography was shown to have strong construct validity with measures of strength [69] and has been correlated with architectural changes which occur at a cellular level (as identified by invasive muscle biopsy) [70]. The cause of muscle wasting in ICU is likely multi-factorial. However, it is generally accepted that immobilization and inflammatory stimuli are important contributing factors to the development of ICU-AW [71]. Muscles are adaptive and respond to changes in loading and inflammation in different ways depending on their composition. The response of a specific skeletal muscle will, among other factors, depend on muscle fibre composition and differing contractile properties, which may contribute to specific task and muscle dysfunction. As an example, a study in individuals with COPD demonstrated significant weakness in the quadriceps musculature and preservation of strength in the adductor pollicis muscle [72]. Therefore, it is important that we examine which muscles may be most sensitive in enabling early diagnosis of future functional impairments. There was limited association between strength and thickness measures on ultrasonography. Muscle thickness does not contain information about the neuromuscular conducting properties or dysfunction of the contractile apparatus, and is a two-dimensional representation of muscle size. It is therefore possible that muscle thickness may under-estimate the loss of strength in patients (in contrast to cross-sectional area, which may be more sensitive) and may not enable detection of changes in the quality of the muscle and nerve, which can be affected in ICU-AW.

Ultrasonography has been demonstrated to have predictive utility for survival in neuromuscular diseases [73]. Although reliability and validity have been demonstrated regardless of expertise level for image acquisition using ultrasonography in a non-ICU study [74], it is important that assessors follow a standardised methodology. A recent study in the ICU demonstrated excellent reliability regardless of expertise level for the analysis of echogenicity [75].

It is important to consider the timing of measurements/treatments particularly when comparing different regimens of muscle preserving interventions. The rate of muscle loss in ICU patients follows a logarithmic curve; as a consequence, patients will experience a higher absolute rate of muscle atrophy in early compared to later phases of their ICU stay [21]. In accordance with this, a delayed measurement may fail to identify the initial muscle loss. This is termed lead-time bias, which may therefore be an important confounder when examining treatment efficacy. It is important that timing of measurements is reported within future studies. Further research is required to determine if ultrasonography correlates to measures of strength and function and to determine if it has predictive utility in identifying individuals at risk of ICU-AW.

Manual-muscle strength testing is the most commonly utilized measure across the recovery continuum [76]. It has been used both as a diagnostic tool for identifying the presence of ICU-AW and to quantify strength [77]. Whilst excellent inter-rater reliability has been established for overall MRC-SS [10, 34, 35, 62], there is variability in terms of findings for individual muscle groups (poor to excellent reliability) and agreement for the dichotomization of the presence or absence of ICU-AW (slight to substantial in the ICU setting) [10, 34, 62]. Although the majority of studies have demonstrated MRC-SS to be predictive of ICU and hospital mortality [36–38, 63], one study found no relationship between MRC-SS and mortality [10]. The inconsistency in reliability and validity findings between studies may relate to variability in the screening methods used to determine the appropriateness and timing of testing. The task results are dependent on the patient’s level of consciousness and mental status. In the ICU setting, many patients are intermittently unable to cooperate because of a reduced level of consciousness or ability to understand, due to the critical illness itself or to the administration of sedative medications. Day-to-day variation may reflect fluctuations in motivation, attention or cognitive dysfunction rather than an increase in muscle dysfunction. This is true of all volitional measurement whether it is in the ICU or in the community. Other reasons for inconsistency in reliability and validity findings include: muscles examined, testing technique (isometric and through range) [77] and the statistical analyses used. To improve measurement accuracy it is important to use standardised phrasing and strong encouragement. Dynamometry has been shown to be a more sensitive method to quantify changes in strength over time, particularly once a patient has anti-gravity strength [12]. Normative values have been published for both handgrip and hand-held dynamometry [78, 79]. Handgrip dynamometry is quick, simple and requires minimal training to use, and cut-off values for the diagnosis of ICU-AW have been developed [38]. Further examination of the clinimetric properties is warranted, as well as standardisation of testing methodology including screening to facilitate generalisability across different trials.

Measurement of function is a primary endpoint in many research studies; however, the measures utilised vary. Twenty-six different measures have been used in research trials to date, with fewer than half having one or more established clinimetric properties reported specifically in individuals with critical illness. The PFIT-s and CPAx tools are the most robust function measures with established reliability, validity and responsiveness. However, because of the volitional nature of the tests, there are floor effects in critically ill patients that are greater for the PFIT-s than the CPAx. An MID of 1.5 out of 10 has been established statistically for the PFIT-s tool. An MRC-SS of 41.5 out of 60 had excellent sensitivity and specificity (>80 %) for predicting whether an individual will be able to perform sit-to-stand and marching components of the PFIT-s at ICU discharge [43]. The magnitude of change in muscle performance that represents a clinically meaningful change to the patient has not been calculated in relation to the MID. This is also true for all measures currently published.
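To make the sensitivity and specificity of a cut-off score concrete, the sketch below evaluates a threshold of 41.5 against a small set of hypothetical patients. The data are invented for illustration and do not reproduce the cited study; the direction of the rule (a score at or above the cut-off predicts that the task can be performed) is also an assumption.

```python
def sensitivity_specificity(scores, able_to_perform, cutoff):
    """Sensitivity and specificity of the rule 'score >= cutoff predicts able'.

    scores: strength sum-scores, one per patient.
    able_to_perform: booleans, True if the patient performed the task.
    """
    if len(scores) != len(able_to_perform):
        raise ValueError("scores and outcomes must have equal length")
    tp = fp = tn = fn = 0
    for score, able in zip(scores, able_to_perform):
        predicted_able = score >= cutoff
        if predicted_able and able:
            tp += 1
        elif predicted_able and not able:
            fp += 1
        elif not predicted_able and able:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome groups must be present")
    sensitivity = tp / (tp + fn)  # able patients correctly predicted able
    specificity = tn / (tn + fp)  # unable patients correctly predicted unable
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical sum-scores and task outcomes (illustrative only).
    sum_scores = [30, 36, 40, 42, 44, 48, 50, 54, 58, 60]
    performed = [False, False, False, False, True,
                 True, False, True, True, True]
    sens, spec = sensitivity_specificity(sum_scores, performed, cutoff=41.5)
    print("Sensitivity:", sens, "Specificity:", spec)
```

Repeating the calculation over a range of candidate cut-offs is the usual way in which a threshold that balances the two quantities is chosen.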

The ICU environment is a challenging setting in which to develop a core set of measures to evaluate changes in function. Inflammatory, metabolic and electrolyte changes can all influence muscle function. In critically ill patients, all these parameters are subject to large day-to-day variations. This heterogeneity renders stable study conditions practically impossible and responses unpredictable. This is in conjunction with fluctuations in patients’ ability to follow commands and perform volitional testing. Due to this heterogeneity, it is important that all contributing factors to the development of muscle weakness are documented and reported to ensure accurate interpretation of results and increase comparability between future studies.

Adler and colleagues in their systematic review noted that a key endpoint in studies was the time to achieve milestones and the distance ambulated [80]. These measures are not objective and have no established clinimetric properties or evidence of responsiveness over time. It is important that clinicians and researchers use standardized measures to evaluate functional recovery. Data are most commonly being extrapolated from the gerontology or neurological populations. This is evident, e.g., in the use of the Barthel and Functional Independence Measure outcomes both of which have established clinimetric properties in non-critically ill patient populations [81]. There may be key differences particularly in the early stages of the critical illness including: alertness, delirium and sedation that can affect a patient’s performance. Further, the ceiling effects of these measures have not been documented in the critically ill.

The ICF framework provides a scaffold in which clinicians and researchers can use outcome measures appropriate for different stages of recovery to capture changes in the patient’s level of impairment, activity limitations and participation restrictions. Fig. 2 presents a potential framework in which measures could be mapped across the continuum from admission to return to home within the ICF framework, including suggested measures which warrant investigation in the critically ill, such as the de Morton mobility index [82], which has been utilized in geriatric populations. In the longer term, it is important to consider the patient’s ability to achieve a safe community level of ambulation, to be able to cross at traffic lights, travel on transportation and have the physical capacity to perform day-to-day activities such as carrying shopping or walking. This review has focused on the body systems and functional activity limitations outcomes, which can be utilised to evaluate a patient’s level of recovery across the continuum of care. However, it is also important to consider cognitive, mental and psychological outcomes which can be mapped across the continuum, and which are sensitive to changes in the patient’s recovery.

Fig. 2

Suggested schematic guide to mapping of outcome measures within the ICF framework. ADL activities of daily living, CPAx Chelsea critical care physical assessment tool, DEMMI De Morton mobility index, EMG electromyography, ESWT endurance shuttle walk test, FSS-ICU functional status score for the intensive care, HGD handgrip dynamometry, HHD hand-held dynamometry, IADL independence in activities of daily living, ICU intensive care unit, IMS ICU mobility scale, MRC Medical Research Council, NCS nerve conduction study, PFIT-s physical function in intensive care test scored, SPPB short physical performance battery, TUG timed-up-and-go-test, 6MWT 6-min walk test. Asterisk methods for clinically diagnosing the presence of ICU-AW on awakening

Limitations

There is the potential for publication bias due to exclusion of non-English articles. There is also the possibility that studies with negative clinimetric findings may not have been published. For risk of bias, the majority of the included studies scored lowest for “inadequate sample size” although they may have statistically justified a smaller sample size than is considered appropriate based on the COSMIN checklist.

Conclusions

Ultrasonography, dynamometry, PFIT-s and CPAx demonstrated the strongest clinimetric properties. Further research in this area is required, including identification of a core set of measures which can be utilised across the continuum of recovery and which fits within the ICF framework. This will enable greater generalizability of findings between studies to determine efficacy of interventions. Furthermore, using the ICF model will direct measurement of tests with similar constructs to be used within each of the classification categories so that the right test is used for the outcome of interest at the most appropriate time-point [40].