 Research article
Interrater reliability of the QuIS as an assessment of the quality of staff-inpatient interactions
BMC Medical Research Methodology volume 16, Article number: 171 (2016)
Abstract
Background
Recent studies of the quality of in-hospital care have used the Quality of Interaction Schedule (QuIS) to rate interactions observed between staff and inpatients in a variety of ward conditions. The QuIS was developed and evaluated in nursing and residential care. We set out to develop methodology for summarising information from interrater reliability studies of the QuIS in the acute hospital setting.
Methods
Staff-inpatient interactions were rated by trained staff observing care delivered during two-hour observation periods. Anticipating the possibility of the quality of care varying depending on ward conditions, we selected wards and times of day to reflect the variety of daytime care delivered to patients. We estimated interrater reliability using weighted kappa, κ_{ w }, combined over observation periods to produce an overall, summary estimate, \( {\widehat{\upkappa}}_w \). Weighting schemes putting different emphasis on the severity of misclassification between QuIS categories were compared, as were different methods of combining observation period specific estimates.
Results
Estimated \( {\widehat{\upkappa}}_w \) did not vary greatly depending on the weighting scheme employed, but we found simple averaging of estimates across observation periods to produce a higher value of interrater reliability due to overweighting observation periods with fewest interactions.
Conclusions
We recommend that researchers evaluating the interrater reliability of the QuIS by observing staff-inpatient interactions during observation periods representing the variety of ward conditions in which care takes place should summarise interrater reliability by κ_{ w }, weighted according to our scheme A4. Observation period specific estimates should be combined into an overall, single summary statistic, \( {\widehat{\upkappa}}_{w\ random} \), using a random effects approach, with \( {\widehat{\upkappa}}_{w\ random} \) interpreted as the mean of the distribution of κ_{ w } across the variety of ward conditions. We draw attention to issues in the analysis and interpretation of interrater reliability studies incorporating distinct phases of data collection that may generalise more widely.
Background
The Quality of Interactions Schedule (QuIS) has its origin in observational research undertaken in 1989 by Clark & Bowling [1] in which the social content of interactions between patients and staff in nursing homes and long term stay wards for older people was rated to be positive, negative or neutral. The rating specifically relates to the social or conversational aspects of an interaction, such as the degree to which staff acknowledge the patient as a person, not to the adequacy of any care delivered during the interaction. Dean et al. [2] extended the rating by introducing distinctions within the positive and negative ratings, creating a five category scale as set out in Table 1. QuIS is now generally regarded as an ordinal scale ranging from the highest ranking, positive social interactions to the lowest ranking, negative restrictive interactions [3].
Barker et al. [4], in a feasibility study of an intervention designed to improve the compassionate/social aspects of care experienced by older people in acute hospital wards, proposed the use of the QuIS as a direct assessment of this aspect of the quality of care received. This is a different context to that for which the QuIS was originally developed and extended, and it may well perform differently: wards may be busier and more crowded, beds may be curtained off, and raters may have to position themselves more or less favourably in relation to the patients they are observing. A component of the feasibility work evaluated the suitability of the QuIS in the context of acute wards, and in particular its interrater reliability [5]. Because of the lack of alternative assessments of quality of care, it is likely that the QuIS will be used more widely, and any such use should be preceded by studies examining its suitability and its interrater reliability.
In this paper we describe the analysis of data from an interrater reliability study of the QuIS reported by McLean et al. [5]. Eighteen pairs of observers rated staff-inpatient interactions during two-hour-long observation periods purposively chosen to reflect the wide variety of conditions in which care is delivered in the hospital setting. The study should thus have captured differences in the quality of care across conditions, for example when staff were more or less busy. It is possible that interrater reliability could also vary depending on the same factors, and thus an overall statement of typical interrater reliability should reflect variability across observation periods in addition to sampling variability. We aim to establish a protocol for summarising data from interrater reliability studies of the QuIS, to facilitate consistency across future evaluations of its measurement properties. We summarise interrater reliability using kappa (κ), which quantifies the extent to which two raters agree in their ratings, over and above the agreement expected through chance alone. This is the most frequently used presentation of interrater reliability in applied health research, and is thus familiar to researchers in the area. When κ is calculated, all differences in ratings are treated equally. Varying severity of disagreement between raters depending on the categories concerned can be accommodated in weighted κ, κ_{ w }; however, standard weighting schemes give equal weight to disagreements an equal number of categories apart regardless of their position on the scale, and are thus not ideal for the QuIS. For example, a disagreement between the two adjacent positive categories is not equivalent to a disagreement between the adjacent positive care and neutral categories. Thus we aim to establish a set of weights to be used in κ_{ w } that reflects the severity of misclassification between each pair of QuIS categories.
We propose using meta-analytic techniques to combine the estimates of κ_{ w } from the different observation periods to produce a single overall estimate of κ_{ w }.
Methods
QuIS observation
Following the training described by McLean et al. [5], each of 18 pairs of research staff observed, and QuIS rated, all interactions involving either of two selected patients during a two-hour-long observation period. The 18 observation periods were selected with the intention of capturing a wide variety of conditions in which care is delivered to patients in acute wards, as this was the target of the intervention to be evaluated in a subsequent main trial. Observation was restricted to a single, large teaching hospital on the South Coast of England and took place in three wards, on weekdays, and at varying times of day between 8 am and 6 pm, including some periods when staff were expected to be busy (mornings) and others when staff might be less so.
The analysis of interrater reliability was restricted to staff-patient interactions rated by both raters, indicated by them reporting an interaction starting at the same time: interactions rated by only one rater were excluded. The percentage of interactions missed by either rater is reported, as is the Intraclass Correlation Coefficient (ICC) of total number of interactions reported by each rater in the observation periods.
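The restriction to interactions reported by both raters can be sketched as follows. This is a minimal illustration only: the dictionary layout (start time mapped to QuIS rating), the function name `match_interactions` and the example ratings are assumptions made for the sketch, not the study's data handling.

```python
def match_interactions(rater_a, rater_b):
    """Keep interactions reported by both raters, matched on start time.

    rater_a, rater_b: dicts mapping a start time to a QuIS rating
    (a hypothetical layout chosen for this sketch).
    """
    shared = sorted(set(rater_a) & set(rater_b))
    pairs = [(rater_a[t], rater_b[t]) for t in shared]
    n_any = len(set(rater_a) | set(rater_b))      # seen by at least one rater
    pct_both = 100 * len(shared) / n_any          # percentage seen by both
    return pairs, pct_both


# Made-up ratings for one observation period, not data from the study.
rater_a = {"09:02": "+social", "09:10": "+care", "09:31": "neutral"}
rater_b = {"09:02": "+social", "09:10": "neutral", "09:45": "+care"}
pairs, pct_both = match_interactions(rater_a, rater_b)
```

Only the pairs returned here would enter the cross-tabulation from which agreement is estimated; interactions seen by a single rater contribute to the percentage missed.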
κ estimates of interrater reliability
Interrater agreement was assessed as Cohen’s κ [6] calculated from the cross-tabulation of ratings into the k = 5 QuIS categories of the interactions observed by both raters:
\( \upkappa =\frac{p_o-{p}_e}{1-{p}_e} \)   (1)
with p_{ o } being the proportion of interactions with identical QuIS ratings and p_{ e } being the proportion of interactions expected to be identical (∑ _{i = 1} ^{k} p_{ i. }p_{.i}) calculated from the marginal proportions p_{ i. } and p_{ .i } of the cross-tabulation.
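As a hedged illustration of the calculation just described, the following sketch computes unweighted κ from a table of counts. The function name `cohen_kappa` and the 3-category example table are hypothetical and are not taken from the study.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa from a k x k table of counts.

    Rows index the first rater's category, columns the second rater's.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]                       # p_i.
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # p_.i
    p_o = sum(table[i][i] for i in range(k)) / n             # observed agreement
    p_e = sum(row_p[i] * col_p[i] for i in range(k))         # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical 3-category counts, not data from the study.
example = [[20, 5, 0],
           [4, 10, 1],
           [0, 2, 8]]
kappa = cohen_kappa(example)
```

In this made-up table the raters agree on 38 of 50 interactions (p_o = 0.76) against a chance expectation of p_e = 0.378, so κ is about 0.61.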
In the above, raters are only deemed to agree in their rating of an interaction if they record an identical QuIS category, and thus any ratings one point apart (for example ratings of +social and +care) are treated as disagreeing to the same extent as ratings a further distance apart (for example ratings of +social and −restrictive). To better reflect the severity of misclassification between pairs of QuIS categories, weighted κ_{ w } can be estimated as follows:
\( {\upkappa}_w=\frac{p_{o(w)}-{p}_{e(w)}}{1-{p}_{e(w)}} \)   (2)
where p_{o (w)} is the proportion of interactions observed to agree according to a set of weights w_{ ij }
\( {p}_{o(w)}={\sum}_{i=1}^k{\sum}_{j=1}^k{w}_{ij}{p}_{ij} \)   (3)
and p_{e (w)} is the proportion of interactions expected to agree according to the weights
\( {p}_{e(w)}={\sum}_{i=1}^k{\sum}_{j=1}^k{w}_{ij}{p}_{i.}{p}_{.j} \)   (4)
In (3) p_{ ij }, for i and j = 1 … k, is the proportion of interactions rated as category i by the first rater and category j by the second. A weight w_{ ij } is assigned to each combination, restricted to lie in the interval 0 ≤ w_{ ij } ≤ 1. Categories i and j, i ≠ j with w_{ ij } = 1, indicate a pair of ratings deemed to reflect perfect agreement between the two raters. Only if w_{ ij } is set at zero, w_{ ij } = 0, are the ratings deemed to indicate complete disagreement. If 0 < w_{ ij } < 1 for i ≠ j, ratings of i and j indicate ratings deemed to agree to the extent indicated by w_{ ij }. The precision of estimated κ_{ w } from a sample of size n is indicated by the Wald 100(1 − α)% confidence interval (CI):
\( {\widehat{\upkappa}}_w\pm {z}_{1-\alpha /2}\ SE\left({\widehat{\upkappa}}_w\right) \)   (5)
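Fleiss et al. ([6], section 13.1) give an estimate of the standard error of \( {\widehat{\upkappa}}_w \) as:
\( \widehat{SE}\left({\widehat{\upkappa}}_w\right)=\frac{1}{\left(1-{p}_{e(w)}\right)\sqrt{n}}\sqrt{{\sum}_{i=1}^k{\sum}_{j=1}^k{p}_{i.}{p}_{.j}{\left[{w}_{ij}-\left({\overline{w}}_{i.}+{\overline{w}}_{.j}\right)\right]}^2-{p}_{e(w)}^2} \)   (6)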
where \( {\overline{w}}_{i.}={\displaystyle {\sum}_{j=1}^k{p}_{.j}{w}_{ij}} \) and \( {\overline{w}}_{.j}={\displaystyle {\sum}_{i=1}^k{p}_{i.}{w}_{ij}} \). Unweighted κ is a special case, obtained with w_{ ij } = 1 for i = j and w_{ ij } = 0 for i ≠ j.
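The weighted calculation can be sketched in a few lines. The sketch assumes that the standard error intended is the large-sample form given by Fleiss et al. for the hypothesis of chance agreement, which uses only the marginal proportions, the weights and the averaged weights \( {\overline{w}}_{i.} \) and \( {\overline{w}}_{.j} \) defined above; the function name `weighted_kappa` and the example table are hypothetical.

```python
import math


def weighted_kappa(table, weights, z=1.96):
    """Weighted kappa, a large-sample standard error and a Wald interval.

    table   : k x k counts (rows: first rater, columns: second rater)
    weights : k x k agreement weights w_ij in [0, 1], with w_ii = 1
    The standard error is the chance-agreement (null) form, assumed here.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[table[i][j] / n for j in range(k)] for i in range(k)]
    row_p = [sum(p[i]) for i in range(k)]                        # p_i.
    col_p = [sum(p[i][j] for i in range(k)) for j in range(k)]   # p_.j
    cells = [(i, j) for i in range(k) for j in range(k)]

    p_o = sum(weights[i][j] * p[i][j] for i, j in cells)
    p_e = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)
    kappa_w = (p_o - p_e) / (1 - p_e)

    # Averaged weights: w_bar_i. uses column margins, w_bar_.j uses row margins.
    w_row = [sum(col_p[j] * weights[i][j] for j in range(k)) for i in range(k)]
    w_col = [sum(row_p[i] * weights[i][j] for i in range(k)) for j in range(k)]
    inner = sum(row_p[i] * col_p[j] * (weights[i][j] - (w_row[i] + w_col[j])) ** 2
                for i, j in cells) - p_e ** 2
    se = math.sqrt(inner) / ((1 - p_e) * math.sqrt(n))
    return kappa_w, se, (kappa_w - z * se, kappa_w + z * se)


# Identity weights reproduce unweighted kappa; hypothetical 2 x 2 counts.
identity = [[1.0 if i == j else 0.0 for j in range(2)] for i in range(2)]
kappa_w, se, ci = weighted_kappa([[5, 0], [0, 5]], identity)
```

With identity weights and balanced margins the standard error reduces to 1/√n, which gives a quick check on the implementation.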
We examined the sensitivity of \( {\widehat{\upkappa}}_w \) to the choice of weighting scheme. Firstly we considered two standard schemes (linear and quadratic) described by Fleiss et al. [6] and implemented in Stata. Linear weighting deems the severity of disagreement between raters by one point to be the same at each point on the scale, and the weighting for disagreement by more than one point is the weight for a one-point disagreement multiplied by the number of categories apart. In quadratic weighting, disagreements two or more points apart are not simple multiples of the one-point weighting, but are still invariant to position on the scale. We believe that the severity of disagreement between two QuIS ratings a given number of categories apart does depend on their position on the scale. The weighting schemes we devised as better reflections of misclassification between QuIS categories are described in Table 2. In weighting schemes A1 to A6 the severity of disagreements between each positive category and neutral, and each negative category and neutral, was weighted to be 0.5; disagreement within the two positive categories was considered to be as severe as that within the two negative categories; and we considered a range of levels of weights (0.5 to 0.9) to reflect this. In schemes B1 to B3 disagreements between each positive category and neutral, and between each negative category and neutral, were considered to be equally severe, but were given weight less than 0.5 (0.33, 0.25 and 0.00 respectively); severity of disagreement within the two positive categories was considered to be the same as that within the two negative categories. In weighting schemes C1 to C3, disagreement between the two positive categories (+social and +care) was considered to be less severe than that between the two negative categories (−protective and −restrictive).
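The difference between position-invariant and position-dependent weights can be made concrete with a short sketch. The linear and quadratic forms below are the usual agreement weights 1 − |i − j|/(k − 1) and 1 − ((i − j)/(k − 1))²; the function `quis_style_weights` and the within-category value of 0.75 are illustrative assumptions, and the actual values for each scheme should be taken from Table 2.

```python
def linear_weights(k):
    """Agreement weights 1 - |i - j| / (k - 1)."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def quadratic_weights(k):
    """Agreement weights 1 - ((i - j) / (k - 1)) ** 2."""
    return [[1 - (abs(i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]


def quis_style_weights(within_pos, within_neg, to_neutral):
    """Position-dependent weights for the five ordered QuIS categories:
    +social, +care, neutral, -protective, -restrictive.
    Positive versus negative ratings receive weight 0 (no agreement).
    """
    groups = ["pos", "pos", "neu", "neg", "neg"]
    k = len(groups)
    w = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            a, b = groups[i], groups[j]
            if i == j:
                w[i][j] = 1.0
            elif a == b == "pos":
                w[i][j] = within_pos
            elif a == b == "neg":
                w[i][j] = within_neg
            elif "neu" in (a, b):
                w[i][j] = to_neutral
    return w


# Illustrative values only; 0.75 is a placeholder, not a value from Table 2.
w_example = quis_style_weights(within_pos=0.75, within_neg=0.75, to_neutral=0.5)
```

Under linear weights a +care versus neutral disagreement scores 0.75, the same as +social versus +care, whereas the position-dependent matrix separates the two.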
Weighting scheme A4 is proposed as a good representation of the severity of disagreements between raters based on the judgement of the clinical authors (CMcL, PG and JB) for the following reasons:

i) There is an order between categories: +social > +care > neutral > −protective > −restrictive
ii) Misclassification between any positive and any negative category is absolute and should not be considered to reflect any degree of agreement
iii) The most important misclassifications are between positive (combined), neutral and negative (combined) categories
iv) There is a degree of similarity between neutral and the two positive categories, and between neutral and the two negative categories
v) Misclassifications within positive and negative categories do matter, but to a lesser extent
Variation in \( {\widehat{\upkappa}}_w \) over observation periods
We examined Spearman’s correlation between A4 weighted \( {\widehat{\upkappa}}_w \) and time of day, interactions/patient hour, mean length of interactions and percentage of interactions less than one minute. ANOVA and two-sample t-tests were used to examine differences in A4 weighted \( {\widehat{\upkappa}}_w \) between wards and between mornings and afternoons.
Overall \( {\widehat{\upkappa}}_w \) combined over observation periods
To combine g (≥2) independent estimates of κ_{ w }, we firstly considered the naive approach of collapsing over observation periods to form a single cross-tabulation containing all the pairs of QuIS ratings, shown in Table 3a). An estimate, \( {\widehat{\upkappa}}_{w\ collapsed} \), and its 95% CI, can be obtained from formulae (2) and (6).
We next considered combining the g observation period specific estimates of κ_{ w } using meta-analytic techniques. Firstly, using a fixed effects approach, the estimate \( {\widehat{\upkappa}}_{wm}={\upkappa}_w+{\varepsilon}_m \) in the m^{th} observation period is modelled as comprising the true underlying value of κ_{ w } plus a component, ε_{ m }, reflecting sampling variability dependent on the number of interactions observed within the m^{th} period, where κ_{ w } is the common overall value, and ε_{ m } is normally distributed with zero mean and variance \( {V}_{wm} = SE{\left({\widehat{\upkappa}}_{wm}\right)}^2 \). The inverse-variance estimate of κ_{ w }, based on the fixed effects model, \( {\widehat{\upkappa}}_{w\ fixed} \), is a weighted combination of the estimates from each observation period:
\( {\widehat{\upkappa}}_{w\ fixed}=\frac{\sum_{m=1}^g{\omega}_m{\widehat{\upkappa}}_{wm}}{\sum_{m=1}^g{\omega}_m} \)
with meta-analytic weights, ω_{ m }, given by:
\( {\omega}_m=\frac{1}{V_{wm}} \)
Since study specific variances are not known, estimates \( {\widehat{\omega}}_m \) with variance estimates \( {\widehat{V}}_{wm} = \widehat{SE}{\left({\widehat{\upkappa}}_{wm}\ \right)}^2 \) calculated from formula (6) for each of the g periods are used. The standard error of \( {\widehat{\upkappa}}_{w\ fixed} \) is then:
\( \widehat{SE}\left({\widehat{\upkappa}}_{w\ fixed}\right)=\frac{1}{\sqrt{\sum_{m=1}^g{\widehat{\omega}}_m}} \)
from which a 100(1 − α)% CI for \( {\widehat{\upkappa}}_{w\ fixed} \) can be obtained. \( {\widehat{\upkappa}}_{w\ fixed} \) is the estimate \( {\widehat{\upkappa}}_{w\ overall} \) combined over strata given by Fleiss et al. [6], here combining weighted \( {\widehat{\upkappa}}_{wm} \) rather than unweighted \( {\widehat{\upkappa}}_m \).
Equality of the g underlying, observation period specific values of κ_{ w } is tested using a χ^{2} test for heterogeneity:
\( {\chi}_{heterogeneity}^2={\sum}_{m=1}^g{\widehat{\omega}}_m{\left({\widehat{\upkappa}}_{wm}-{\widehat{\upkappa}}_{w\ fixed}\right)}^2 \)
to be referred to χ^{2} tables with g − 1 degrees of freedom. The hypothesis of equality in the g κ_{ wm }s is typically rejected if χ^{2}_{ heterogeneity } lies above the χ _{g − 1} ^{2} (0.95) percentile.
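The fixed effects calculation and the heterogeneity statistic can be sketched as below, using the standard inverse-variance forms. The function name `fixed_effects` and the three period estimates with their standard errors are made-up inputs for illustration, not values from the study.

```python
import math


def fixed_effects(estimates, std_errors):
    """Inverse-variance fixed effects estimate, its standard error and
    the chi-squared heterogeneity statistic (g - 1 degrees of freedom)."""
    weights = [1 / se ** 2 for se in std_errors]          # omega_m = 1 / V_wm
    total = sum(weights)
    pooled = sum(w * k for w, k in zip(weights, estimates)) / total
    se_pooled = 1 / math.sqrt(total)
    chi2 = sum(w * (k - pooled) ** 2 for w, k in zip(weights, estimates))
    return pooled, se_pooled, chi2


# Three hypothetical observation periods.
kappas = [0.4, 0.6, 0.8]
ses = [0.1, 0.1, 0.2]
pooled, se_pooled, chi2 = fixed_effects(kappas, ses)
ci_95 = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
```

The least precise period (standard error 0.2) receives a quarter of the weight of each of the others, so the pooled value of about 0.53 sits below the simple mean of 0.6.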
The fixed effects model assumes that all observation periods share a common value, κ_{ w }, with any differences in the observation period specific \( {\widehat{\upkappa}}_{wm} \) being due to sampling error. Because of our expectation that interrater reliability will vary depending on ward characteristics and other aspects of specific periods of observation, our preference is for a more flexible model incorporating underlying variation in true κ_{ wm } over the g periods within a random effects meta-analysis. The random effects model has \( {\widehat{\upkappa}}_{wm}={\upkappa}_w+{\delta}_m+{\varepsilon}_m \), where δ_{ m } is an observation period effect, independent of sampling error (the ε_{ m } terms defined as for the fixed effects model). Variability in observed \( {\widehat{\upkappa}}_{wm} \) about their underlying mean, κ_{ w }, is thus partitioned into sampling variability and a source of variation due to observation period characteristics captured by the δ_{ m } terms, which are assumed to follow a Normal distribution: δ_{ m } ~ N(0, τ^{2}), with τ^{2} the variance in κ_{ wm } across observation periods. The inverse-variance estimate of κ_{ w } for this model is:
\( {\widehat{\upkappa}}_{w\ random}=\frac{\sum_{m=1}^g{\varOmega}_m{\widehat{\upkappa}}_{wm}}{\sum_{m=1}^g{\varOmega}_m} \)
with meta-analytic weights, Ω_{ m }, given by:
\( {\varOmega}_m=\frac{1}{V_{wm}+{\tau}^2} \)
Observation period specific variance estimates \( {\widehat{V}}_{wm} \) are used, and τ^{2} also has to be estimated. A common choice is the DerSimonian-Laird estimator [7] defined as:
\( {\widehat{\tau}}^2=\frac{\chi_{heterogeneity}^2-\left(g-1\right)}{\sum_{m=1}^g{\widehat{\omega}}_m-\frac{\sum_{m=1}^g{\widehat{\omega}}_m^2}{\sum_{m=1}^g{\widehat{\omega}}_m}} \)
usually truncated at 0 if the observed χ^{2}_{ heterogeneity } < (g − 1). The estimate \( {\widehat{\upkappa}}_{w\ random} \) is then:
\( {\widehat{\upkappa}}_{w\ random}=\frac{\sum_{m=1}^g{\widehat{\varOmega}}_m{\widehat{\upkappa}}_{wm}}{\sum_{m=1}^g{\widehat{\varOmega}}_m} \)
with
\( {\widehat{\varOmega}}_m=\frac{1}{{\widehat{V}}_{wm}+{\widehat{\tau}}^2} \)
and an estimate of the standard error of \( {\widehat{\upkappa}}_{w\ random} \) is:
\( \widehat{SE}\left({\widehat{\upkappa}}_{w\ random}\right)=\frac{1}{\sqrt{\sum_{m=1}^g{\widehat{\varOmega}}_m}} \)
leading to 100(1 − α)% CIs for \( {\widehat{\upkappa}}_{w\ random} \).
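A minimal sketch of the random effects calculation follows, using the standard DerSimonian-Laird moment estimator of τ^{2} truncated at zero. The function name `dersimonian_laird` and the inputs are hypothetical; in practice the period specific estimates and standard errors would come from the weighted κ calculation.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Random effects estimate with the DerSimonian-Laird tau-squared."""
    g = len(estimates)
    w = [1 / se ** 2 for se in std_errors]                 # fixed effects weights
    fixed = sum(wi * k for wi, k in zip(w, estimates)) / sum(w)
    chi2 = sum(wi * (k - fixed) ** 2 for wi, k in zip(w, estimates))
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (chi2 - (g - 1)) / denom)              # truncated at zero
    big_w = [1 / (se ** 2 + tau2) for se in std_errors]    # random effects weights
    pooled = sum(wi * k for wi, k in zip(big_w, estimates)) / sum(big_w)
    se_pooled = 1 / math.sqrt(sum(big_w))
    return pooled, se_pooled, tau2


# Three hypothetical observation periods.
kappas = [0.4, 0.6, 0.8]
ses = [0.1, 0.1, 0.2]
pooled, se_pooled, tau2 = dersimonian_laird(kappas, ses)
ci_95 = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
```

For these made-up inputs the estimated τ^{2} is 0.015 and the pooled value is about 0.56, between the fixed effects value (about 0.53) and the simple mean (0.6), with a wider interval than the fixed effects one.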
The role of τ^{2} is that of a tuning parameter: when τ^{2} = 0 there is no variation in the underlying κ_{ w }, and the fixed effects estimate, \( {\widehat{\upkappa}}_{w\ fixed} \), is obtained. At the other extreme, as τ^{2} becomes larger, the \( {\widehat{\varOmega}}_m \) become close to constant, so that each observation period is equally weighted and \( {\widehat{\upkappa}}_{w\ random} \) becomes the simple average of observation period specific estimates:
\( {\widehat{\upkappa}}_{w\ averaged}=\frac{1}{g}{\sum}_{m=1}^g{\widehat{\upkappa}}_{wm} \)
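The two limits of τ^{2} can be checked numerically with a short sketch. The function name `pooled_for_tau2` and the inputs are hypothetical, and τ^{2} is supplied directly rather than estimated so that its effect on the weights is visible.

```python
def pooled_for_tau2(estimates, std_errors, tau2):
    """Inverse-variance estimate using weights 1 / (V_wm + tau2)."""
    weights = [1 / (se ** 2 + tau2) for se in std_errors]
    return sum(w * k for w, k in zip(weights, estimates)) / sum(weights)


# Three hypothetical observation periods.
kappas = [0.4, 0.6, 0.8]
ses = [0.1, 0.1, 0.2]

at_zero = pooled_for_tau2(kappas, ses, 0.0)      # fixed effects estimate
at_large = pooled_for_tau2(kappas, ses, 1e6)     # approaches the simple mean
simple_mean = sum(kappas) / len(kappas)
```

Intermediate values of τ^{2} move the combined estimate smoothly between these two limits.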
\( {\widehat{\upkappa}}_{w\ averaged} \) ignores the impact of number of interactions on the precision of the observation period specific estimates. The standard error for \( {\widehat{\upkappa}}_{w\ averaged} \) is estimated by:
Obtaining estimates of \( {\widehat{\upkappa}}_w \) from Stata
The inverse-variance fixed and random effects estimates can be obtained from command metan [8] in Stata by feeding in pre-calculated effect estimates (variable X1) and their standard errors (variable X2). When X1 contains the g estimates of \( {\widehat{\upkappa}}_{wm} \), X2 their standard errors \( \sqrt{\ {\widehat{V}}_{wm}} \), and variable OPERIOD (labelled “Observation Period”) an indicator of observation periods, inverse-variance estimates are obtained from the command:
metan X1 X2, second(random) lcols(OPERIOD) xlab(0, 0.2, 0.4, 0.6, 0.8, 1) effect(X1)
The “second(random)” option requests the \( {\widehat{\upkappa}}_{w\ random} \) estimate in addition to \( {\widehat{\upkappa}}_{w\ fixed} \). The “lcols” and “xlab” options control the appearance of the Forest plot of observation specific estimates, combined estimates, and their 95% CIs.
Results
Across the 18 observation periods 447 interactions were observed, of which 354 (79%) were witnessed by both raters and form the dataset from which interrater reliability was estimated. The ICC for the total number of interactions recorded by each rater for the same observation period was high (ICC = 0.97: 95%CI: 0.92 to 0.99, n = 18). The occasional absence of patients from ward areas for short periods of time resulted in interactions being recorded for 67 patient hours (compared to the planned 72 h). The mean rate of interactions was 6.7 interactions/patient/hour. More detailed results are given by McLean et al. [5].
In Table 3a) the cross-tabulation of ratings by the two raters can be seen collapsed over the 18 observation periods. Two specific observation periods are also shown: in 3b) the period demonstrating lowest unweighted \( \widehat{\upkappa} \) (\( \widehat{\upkappa} \) = 0.30); and in 3c) the period demonstrating highest unweighted \( \widehat{\upkappa} \) (\( \widehat{\upkappa} \) = 0.90). From 3a) it can be seen that the majority of interactions are rated to be positive, between 17% and 20% are rated to be neutral, and 7% as negative (from the margins of the table), and this imbalance in the marginal frequencies would be expected to reduce chance-adjusted κ.
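The effect of imbalanced margins on chance-adjusted κ can be illustrated with two hypothetical 2 × 2 tables that share the same observed agreement of 80%. The counts and the function name `cohen_kappa` are made up for this sketch and are not taken from Table 3.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa from a k x k table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row_p[i] * col_p[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)


balanced = [[40, 10],      # margins 50% / 50%
            [10, 40]]
imbalanced = [[75, 10],    # margins 85% / 15%
              [10, 5]]

kappa_balanced = cohen_kappa(balanced)
kappa_imbalanced = cohen_kappa(imbalanced)
```

Both tables have 80 of 100 ratings on the diagonal, but expected chance agreement rises from 0.50 to 0.745 when one category dominates, so κ falls from 0.60 to about 0.22.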
Scatterplots of A4 weighted \( {\widehat{\upkappa}}_{wm} \) against observation period characteristics are shown in Fig. 1. One of the characteristics (interactions/patient/hour) was sufficiently associated with A4 weighted \( {\widehat{\upkappa}}_{wm} \) to achieve statistical significance (P = 0.046).
In Table 4 it can be seen that the various combined estimates of κ_{ w } did not vary greatly depending on the method of meta-analysis or on the choice of weighting scheme. However, there was greater variability in χ^{2}_{ heterogeneity }. For all weighting schemes except unweighted, B2, B3, and C1, there was statistically significant heterogeneity by virtue of χ^{2}_{ heterogeneity } exceeding the χ _{17} ^{2} (0.95) cut-point of 27.59.
Figure 2 shows the Forest plot demonstrating the variability in \( {\widehat{\upkappa}}_{wm} \) over observation periods, \( {\widehat{\upkappa}}_{w\ fixed} \), and \( {\widehat{\upkappa}}_{w\ random} \), for the A4 weighting scheme. Estimate \( {\widehat{\upkappa}}_{w\ fixed} \) and its 95% CI is shown below observation specific estimates to the right of the plot, on the line labelled “IV Overall”. The line below labelled “D+L Overall” presents \( {\widehat{\upkappa}}_{w\ random} \) and its 95% CI. Both estimates are identical to those shown in Table 4. The final column “% Weight (IV)” relates to the meta-analytic weights, \( {\widehat{\omega}}_m \), not the A4 weighting scheme adopted for κ_{ w }.
Discussion
We consider the most appropriate estimate of interrater reliability of the QuIS to be 0.57 (95% CI 0.47 to 0.68), indicative of only moderate interrater reliability. The finding was not unexpected: the QuIS categories can be difficult to distinguish and, though positioned as closely together as possible, the two raters had different lines of view, potentially impacting on their QuIS ratings. The estimate of interrater reliability is based on our A4 weighting scheme with observation specific estimates combined using random effects meta-analysis. Combined estimates of κ_{ w } were not overly sensitive to the choice of weighting scheme amongst those we considered as plausible representations of the severity of misclassification between QuIS categories. We recommend a random effects approach to combining observation period specific estimates, \( {\widehat{\upkappa}}_{wm} \), to reflect the inherent variation anticipated over observation periods.
There are undoubtedly other weighting schemes that fulfil all the criteria on which we chose weighting scheme A4, but the evidence from our analyses suggests that it makes relatively little difference to the resultant \( {\widehat{\upkappa}}_{w\ random} \). In the absence of any other basis for determining weights, our scheme A4 has the virtue of simplicity. A key issue is that researchers should not examine the \( {\widehat{\upkappa}}_w \) resulting from a variety of weighting schemes, and then choose the scheme giving highest interrater reliability. The adoption of a standard set of weights also facilitates comparison of interrater reliability across different studies of QuIS.
We compared four approaches to estimating overall κ_{w}. We do not recommend the simplest of these, \( {\widehat{\upkappa}}_{w\ collapsed} \), based on estimating κ_{ w } from the cross-tabulation of all ratings collapsed over observation periods: generally collapsing involves a risk of confounding by stratum effects. Comparing the remaining estimates it can be seen that \( {\widehat{\upkappa}}_{w\ random} \) lies between the fixed effects, \( {\widehat{\upkappa}}_{w\ fixed} \), and the averaged estimate, \( {\widehat{\upkappa}}_{w\ averaged} \), for all the weighting schemes we considered. \( {\widehat{\upkappa}}_{w\ averaged} \) gives equal meta-analytic weight to each observation period, and thus upweights periods with highest variance compared to \( {\widehat{\upkappa}}_{w\ fixed} \). The observation periods with highest variance are those with fewest interactions/patient/hour of observation, and it can be seen from Fig. 1 that these periods tend to have highest \( {\widehat{\upkappa}}_{wm} \). A possible explanation is that with fewer interactions it is easier for observers to see and hear the interactions and thus make their QuIS ratings, which would be anticipated to result in more accuracy and agreement. Thus \( {\widehat{\upkappa}}_{w\ averaged} \) might be expected to overestimate interrater reliability and should be avoided. We recommend a random, rather than fixed, effects approach to combining because variation in κ_{ wm } across observation periods was anticipated. Observation periods were chosen with the intention of representing the broad range of situations in which staff-inpatient interactions take place. At different times of day staff will be more or less busy, and this more or less guarantees heterogeneity in observation period specific interrater reliability.
Böhning et al. [9] identified several practical issues relating to inverse variance estimators in meta-analysis. The most important is that estimation is no longer unbiased when estimated rather than known variances are used in the meta-analytic weights. This bias is less extreme for larger sample sizes in each constituent study. We included 354 interactions across the 18 observation periods, on average about 20 per period, but it is not clear whether this is sufficient for meaningful bias to be eradicated. A further issue relates to possible misunderstanding of the single combined estimate as applying to all observation periods: the correct interpretation is that the single estimate relates to the mean of the distribution of κ_{ wm } over observation periods. An alternative might be to present the range of values that κ_{ w } is anticipated to take over most observation periods. This would be an unfamiliar presentation for most researchers.
Meta-analysis of \( \widehat{\upkappa} \) over studies following a systematic review has been considered by Sun [10], where fixed and random effects approaches are described, with the latter adopting the Hedges [11] rather than the conventional DerSimonian-Laird estimate of τ^{2}. Alternatives to the DerSimonian-Laird estimator are available, including the REML estimate and the Hartung-Knapp-Sidik-Jonkman method [12]. Friede et al. [13] examine properties of the DerSimonian-Laird estimator when there are only two observation periods and conclude that in such circumstances other estimators are preferable: McLean et al.’s study [5] was based on sufficient observation periods to make these problems unlikely. Sun addressed the issue of publication bias amongst interrater reliability studies found by searching the literature. Here we included data from all observation periods, irrespective of the estimate \( {\widehat{\upkappa}}_{wm} \). Sun performed subgroup analyses of studies according to the degree of training of the raters involved, and also drew a distinction between interrater reliability studies where both raters can be considered to be equivalent and a study [14] comparing ratings from hospital nurses with those from an expert, which would more appropriately have been analysed using sensitivity, specificity and related techniques. The QuIS observations were carried out by raters who had all received the training developed by McLean et al.; though there was variation in experience of QuIS, a further source of interrater unreliability relating to the different lines of view from each rater’s position was also considered to be important.
In the interrater study we describe, in some instances the same rater was involved in more than one observation period, and this potentially violates the assumption of independence across observation periods, which would be anticipated to lead to increased variance in an overall estimate, \( {\widehat{\upkappa}}_w \). A random effects approach is more suitable in this regard as it catches some of the additional variance, coping with extra-dispersion whether it arises from unobserved heterogeneity or from correlation across observation periods.
Though we have considered analysis choices that need to be made when summarising information on the interrater reliability of the QuIS, the issues we address are relevant to interrater reliability studies more generally. Firstly, where weighted κ_{ w } rather than unweighted κ is thought to be a better summary of differing degrees of disagreement between raters, it is important that the weighting scheme be decided in advance. Secondly, where a study comprises distinct subsets of data collection, the method of combining information needs to be considered. It is likely that data in larger interrater reliability studies would need to be collected in distinct phases, but the lack of attention to combining \( {\widehat{\upkappa}}_m \) over subsets within a study suggests that researchers often ignore the issue, adopting the easiest approach of collapsing to obtain a single estimate of κ. We would advise taking account of structure in data collection by either a fixed or random effects meta-analysis approach, the latter being appropriate where variation across subsets is anticipated or plausible. Our example dataset illustrates a potential source of bias in the simple average of subset specific estimates, \( {\widehat{\upkappa}}_m \). Finally, in the context of meta-analysis over studies, Sun considered the issue of bias arising from the selection of studies for publication. In the context of combining over subsets of data collection within a study, it is possible to imagine circumstances where authors might choose to omit selected subsets, but a good reason would have to be given to justify such a step and the omitted data described.
Conclusions
Researchers using the QuIS to evaluate the quality of staff/inpatient interactions should check its suitability in new settings, and (possibly as part of staff training) its interrater reliability. In practice such studies are likely to follow a similar protocol to that adopted by McLean et al.: involving the multiple observers to be employed in a subsequent main study, over a variety of wards similar to those planned for the main study, and preferably taking place at different times of day. We recommend that interrater reliability be estimated using our A4 weighting scheme and that a random effects meta-analytic approach to combining estimates over observation periods, \( {\widehat{\upkappa}}_{w\ random} \), be adopted. The \( {\widehat{\upkappa}}_{w\ random} \) estimate should be presented with its 95% confidence interval reflecting precision of estimation achieved from the available number and length of observation periods.
Abbreviations
κ: Unweighted kappa
κ_{ w }: Weighted kappa
ICC: Intraclass correlation coefficient
QuIS: Quality of Interaction Schedule
References
1. Clark P, Bowling A. Observational Study of Quality of Life in NHS Nursing Homes and a Long-stay Ward for the Elderly. Ageing Soc. 1989;9:123–48.
2. Dean R, Proudfoot R, Lindesay J. The quality of interaction schedule (QUIS): development, reliability and use in the evaluation of two domus units. Int J Geriatr Psychiatry. 1993;8(10):819–26.
3. Skea D. SPECIAL PAPER. A Proposed Care Training System: Quality of Interaction Training with Staff and Carers. Int J Caring Sci. 2014;7(3):750–6.
4. Barker HR, Griffiths P, Mesa-Eguiagaray I, Pickering R, Gould L, Bridges J. Quantity and quality of interaction between staff and older patients in UK hospital wards: A descriptive study. Int J Nurs Stud. 2016;62:100–7. doi:10.1016/j.ijnurstu.2016.07.018.
5. McLean C, Griffiths P, Mesa-Eguiagaray I, Pickering RM, Bridges J. Reliability, feasibility, and validity of the quality of interactions schedule (QUIS) in acute hospital care: an observational study. BMC Health Services Research. (Submitted January 2016).
6. Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions: third edition. Hoboken, New Jersey: John Wiley & Sons; 2003.
7. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7:177–88.
8. Harris RJ, Bradburn MJ, Deeks JJ, Harbord RM, Altman DG, Sterne J. metan: fixed and random-effects meta-analysis. Stata J. 2008;8(1):3–28.
9. Böhning D, Malzahn U, Dietz E, Schlattmann P. Some general points in estimating heterogeneity variance with the DerSimonian-Laird estimator. Biostatistics. 2002;3:445–57.
10. Sun S. Meta-analysis of Cohen’s kappa. Health Serv Outcome Res Methodol. 2011;11:145–63.
11. Hedges LV. A random effects model for effect sizes. Psychol Bull. 1983;93:388–95.
12. IntHout J, Ioannidis JPA, Borm GF. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Med Res Methodol. 2014;14:2–12. http://www.biomedcentral.com/1471-2288/14/25.
13. Friede T, Röver C, Wandel S, Neuenschwander B. Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases. Biometrical Journal 2016 (in press).
14. Hart S, Bergquist S, Gajewski B, Dunton N. Reliability testing of the national database of nursing quality indicators pressure ulcer indicator. J Nurs Care Qual. 2006;21:256–65.
Acknowledgements
The authors would like to thank staff and patients at the participating NHS hospitals (staff being observed as well as staff raters).
Funding
The analysis presented here is based on data collected during research funded by the National Institute for Health Research (NIHR) Collaboration for Leadership in Applied Health Research and Care Wessex, the NHS South Central through funding of a clinical lecturer internship undertaken by CMcL and the National Institute for Health Research Health Services and Delivery Research programme. The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the Health Services and Delivery Research programme, NIHR, NHS or the Department of Health.
Availability of data and materials
The interrater reliability data analysed in this study is shown in the Additional file 1: Table S1.
Authors’ contributions
IME carried out statistical analysis. RMP conceived of the paper. DB contributed to the statistical methodology. CMcL, PG and JB carried out the studies from which the dataset was drawn. RMP and IME drafted the manuscript and all authors approved the final version.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Ethical approval for the QuIS interrater reliability study was obtained from Oxford ‘B’ Research Ethics Committee (Reference: 14/SC/1100). Written consent was obtained from patients prior to conducting QuIS observation, and the presence of observers was also explained to non-participating patients and visitors in the vicinity. All patient information was anonymised. Staff were made aware of the study through discussion at team meetings, and through the provision of posters and information sheets sent via email as well as being available in hard copy. Staff present at the time of observations were given opportunity to ask questions and/or decline to participate.
Additional file
Additional file 1: Table S1.
Crossclassification of ratings for each of the 18 observation periods and period specific covariates^{1}. (DOCX 36 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Keywords
 Weighted kappa
 Random effects metaanalysis
 QuIS
 Collapsing
 Averaging