Inter-rater reliability of the QuIS as an assessment of the quality of staff-inpatient interactions

Background Recent studies of the quality of in-hospital care have used the Quality of Interaction Schedule (QuIS) to rate interactions observed between staff and inpatients in a variety of ward conditions. The QuIS was developed and evaluated in nursing and residential care. We set out to develop methodology for summarising information from inter-rater reliability studies of the QuIS in the acute hospital setting.
Methods Staff-inpatient interactions were rated by trained staff observing care delivered during two-hour observation periods. Anticipating that the quality of care might vary with ward conditions, we selected wards and times of day to reflect the variety of daytime care delivered to patients. We estimated inter-rater reliability using weighted kappa, κw, combined over observation periods to produce an overall summary estimate, κ̂w. Weighting schemes putting different emphasis on the severity of misclassification between QuIS categories were compared, as were different methods of combining observation-period-specific estimates.
Results The estimated κ̂w did not vary greatly with the weighting scheme employed, but we found simple averaging of estimates across observation periods to produce a higher value of inter-rater reliability, due to over-weighting of observation periods with fewest interactions.
Conclusions We recommend that researchers evaluating the inter-rater reliability of the QuIS by observing staff-inpatient interactions during observation periods representing the variety of ward conditions in which care takes place should summarise inter-rater reliability by κw, weighted according to our scheme A4. Observation-period-specific estimates should be combined into a single overall summary statistic, κ̂w random, using a random effects approach, with κ̂w random to be interpreted as the mean of the distribution of κw across the variety of ward conditions. We draw attention to issues in the analysis and interpretation of inter-rater reliability studies incorporating distinct phases of data collection that may generalise more widely.
Electronic supplementary material The online version of this article (doi:10.1186/s12874-016-0266-4) contains supplementary material, which is available to authorized users.


Background
The Quality of Interactions Schedule (QuIS) has its origin in observational research undertaken in 1989 by Clark & Bowling [1] in which the social content of interactions between patients and staff in nursing homes and long term stay wards for older people was rated to be positive, negative or neutral. The rating specifically relates to the social or conversational aspects of an interaction, such as the degree to which staff acknowledge the patient as a person, not to the adequacy of any care delivered during the interaction. Dean et al. [2] extended the rating by introducing distinctions within the positive and negative ratings, creating a five category scale as set out in Table 1. QuIS is now generally regarded as an ordinal scale ranging from the highest ranking, positive social interactions to the lowest ranking, negative restrictive interactions [3].
Barker et al. [4], in a feasibility study of an intervention designed to improve the compassionate/social aspects of care experienced by older people in acute hospital wards, proposed the use of the QuIS as a direct assessment of this aspect of the quality of care received. This is a different context to that for which the QuIS was originally developed and extended, and it may well perform differently: wards may be busier and more crowded, beds may be curtained off, and raters may have to position themselves more or less favourably in relation to the patients they are observing. A component of the feasibility work evaluated the suitability of the QuIS in the context of acute wards, and in particular its inter-rater reliability [5]. Because of the lack of alternative assessments of quality of care it is likely that the QuIS will be used more widely, and any such use should be preceded by studies examining its suitability and its inter-rater reliability.
In this paper we describe the analysis of data from an inter-rater reliability study of the QuIS reported by McLean et al. [5]. Eighteen pairs of observers rated staff-inpatient interactions during two-hour observation periods purposively chosen to reflect the wide variety of conditions in which care is delivered in the hospital setting. The study should thus have captured differences in the quality of care across conditions, for example when staff were more or less busy. It is possible that inter-rater reliability could also vary depending on the same factors, and thus an overall statement of typical inter-rater reliability should reflect variability across observation periods in addition to sampling variability. We aim to establish a protocol for summarising data from inter-rater reliability studies of the QuIS, to facilitate consistency across future evaluations of its measurement properties. We summarise inter-rater reliability using kappa (κ), which quantifies the extent to which two raters agree in their ratings, over and above the agreement expected through chance alone. This is the most frequently used presentation of inter-rater reliability in applied health research, and is thus familiar to researchers in the area. When κ is calculated, all differences in ratings are treated equally. Varying severity of disagreement between raters depending on the categories concerned can be accommodated in weighted kappa, κw; however, standard weighting schemes give equal weight to disagreements an equal number of categories apart regardless of their position on the scale, and are thus not ideal for the QuIS. For example, a disagreement between the two adjacent positive categories is not equivalent to a disagreement between the adjacent +care and neutral categories. Thus we aim to establish a set of weights to be used in κw that reflects the severity of misclassification between each pair of QuIS categories.
We propose using meta-analytic techniques to combine the estimates of κw from the different observation periods to produce a single overall estimate of κw.

QuIS observation
Following the training described by McLean et al. [5], each of 18 pairs of research staff observed and QuIS-rated all interactions involving either of two selected patients during a two-hour observation period. The 18 observation periods were selected with the intention of capturing a wide variety of conditions in which care is delivered to patients in acute wards, as this was the target of the intervention to be evaluated in a subsequent main trial. Observation was restricted to a single, large teaching hospital on the South Coast of England and took place in three wards, on weekdays, and at varying times of day between 8 am and 6 pm, including some periods when staff were expected to be busy (mornings) and others when staff might be less so.
The analysis of inter-rater reliability was restricted to staff-patient interactions rated by both raters, indicated by them reporting an interaction starting at the same time: interactions rated by only one rater were excluded. The percentage of interactions missed by either rater is reported, as is the Intraclass Correlation Coefficient (ICC) of total number of interactions reported by each rater in the observation periods.
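As an illustration of the agreement in total interaction counts, a one-way random effects ICC for two raters can be sketched as follows. This is a minimal sketch: the paper does not specify which ICC variant was used, and the function name and the per-period totals shown are hypothetical, not study data.

```python
import numpy as np

def icc_oneway(x, y):
    """One-way random effects ICC for two raters' per-period totals x and y.
    One common ICC variant; the original analysis may have used another."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    pair_means = (x + y) / 2
    grand = pair_means.mean()
    # Between-period and within-period mean squares (k = 2 ratings per period)
    msb = 2 * ((pair_means - grand) ** 2).sum() / (n - 1)
    msw = ((x - pair_means) ** 2 + (y - pair_means) ** 2).sum() / n
    return (msb - msw) / (msb + msw)

# Hypothetical per-period interaction totals for two raters (not study data)
rater1 = [12, 25, 18, 30, 9, 22]
rater2 = [13, 24, 19, 28, 10, 23]
print(round(icc_oneway(rater1, rater2), 3))
```

With near-identical totals the ICC approaches 1, mirroring the high agreement in counts reported in the Results.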
κ estimates of inter-rater reliability
Inter-rater agreement was assessed as Cohen's κ [6], calculated from the cross-tabulation of ratings into the k = 5 QuIS categories (Table 1) of the interactions observed by both raters. The two negative categories are defined as follows:
Negative protective (−p): providing care, keeping safe or removing from danger, but in a restrictive manner, without explanation or reassurance: in a way which disregards dignity or fails to demonstrate respect for the individual.
Negative restrictive (−r): interactions that oppose or resist peoples' freedom of action without good reason, or which ignore them as a person.
Cohen's κ is estimated as

  κ̂ = (po − pe) / (1 − pe)    (1)

with po being the proportion of interactions with identical QuIS ratings and pe being the proportion of interactions expected to be identical by chance, pe = Σi pi. p.i, calculated from the marginal proportions pi. and p.i of the cross-tabulation.
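Formula (1) can be computed directly from a cross-tabulation of counts. A minimal sketch (the function name and the 2 × 2 table are illustrative, not study data):

```python
import numpy as np

def cohens_kappa(table):
    """Unweighted Cohen's kappa from a k x k cross-tabulation of counts."""
    table = np.asarray(table, dtype=float)
    p = table / table.sum()               # joint proportions p_ij
    po = np.trace(p)                      # observed agreement
    pe = p.sum(axis=1) @ p.sum(axis=0)    # chance agreement: sum_i p_i. * p_.i
    return (po - pe) / (1 - pe)

# Hypothetical 2x2 example: 35/50 identical ratings, pe = 0.5
tab = [[20, 5],
       [10, 15]]
print(round(cohens_kappa(tab), 3))  # -> 0.4
```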
In the above, raters are only deemed to agree in their rating of an interaction if they record an identical QuIS category, and thus ratings one point apart (for example +social and +care) are treated as disagreeing to the same extent as ratings a further distance apart (for example +social and −restrictive). To better reflect the severity of misclassification between pairs of QuIS categories, weighted kappa, κw, can be estimated as

  κ̂w = [po(w) − pe(w)] / [1 − pe(w)]    (2)

where po(w) is the proportion of interactions observed to agree according to a set of weights wij,

  po(w) = Σi Σj wij pij    (3)

and pe(w) is the proportion of interactions expected to agree according to the weights,

  pe(w) = Σi Σj wij pi. p.j    (4)

In (3), pij, for i and j = 1 … k, is the proportion of interactions rated as category i by the first rater and category j by the second. A weight wij is assigned to each combination, restricted to lie in the interval 0 ≤ wij ≤ 1. Categories i and j, i ≠ j, with wij = 1 indicate a pair of ratings deemed to reflect perfect agreement between the two raters. Only if wij is set at zero, wij = 0, are the ratings deemed to indicate complete disagreement. If 0 < wij < 1 for i ≠ j, ratings of i and j are deemed to agree to the extent indicated by wij. The precision of the estimated κ̂w from a sample of size n is indicated by the Wald 100(1−α)% confidence interval (CI):

  κ̂w ± z1−α/2 SE(κ̂w)    (5)

Fleiss et al. ([6], section 13.1) give an estimate of the standard error of κ̂w as:

  SE(κ̂w) = [1 / ((1 − pe(w)) √n)] √{ Σi Σj pij [wij − (w̄i. + w̄.j)(1 − κ̂w)]² − [κ̂w − pe(w)(1 − κ̂w)]² }    (6)

where w̄i. = Σj p.j wij and w̄.j = Σi pi. wij. We examined the sensitivity of κ̂w to the choice of weighting scheme. Firstly we considered two standard schemes (linear and quadratic) described by Fleiss et al. [6] and implemented in Stata. Linear weighting deems the severity of a one-point disagreement between raters to be the same at each point on the scale, with the penalty for disagreements more than one point apart being the one-point penalty multiplied by the number of categories separating the ratings.
In quadratic weighting, disagreements two or more points apart are not simple multiples of the one-point weighting, but are still invariant to position on the scale. We believe that the severity of disagreement between two QuIS ratings a given number of categories apart does depend on their position on the scale. The weighting schemes we devised as better reflections of misclassification between QuIS categories are described in Table 2. In weighting schemes A1 to A6 the severity of disagreement between each positive category and neutral, and between each negative category and neutral, was weighted at 0.5; disagreement between the two positive categories was considered to be as severe as that between the two negative categories, and we considered a range of weights (0.5 to 0.9) to reflect this. In schemes B1 to B3 disagreements between each positive category and neutral, and between each negative category and neutral, were considered equally severe but were given weight less than 0.5 (0.33, 0.25 and 0.00 respectively); the severity of disagreement between the two positive categories was considered the same as that between the two negative categories. In weighting schemes C1 to C3, disagreement between the two positive categories (+social and +care) was considered to be less severe than that between the two negative categories (−protective and −restrictive).
Weighting scheme A4 is proposed as a good representation of the severity of disagreements between raters, based on the judgement of the clinical authors (CMcL, PG and JB), for the following reasons:
i) There is an order between categories: +social > +care > neutral > −protective > −restrictive.
ii) Misclassification between any positive and any negative category is absolute and should not be considered to reflect any degree of agreement.
iii) The most important misclassifications are between the positive (combined), neutral and negative (combined) categories.
iv) There is a degree of similarity between neutral and the two positive categories, and between neutral and the two negative categories.
v) Misclassifications within the positive and within the negative categories do matter, but to a lesser extent.
In the A schemes the weight given to neutral against a positive or negative category is 0.5, and misclassification between the two positive categories is weighted equally to misclassification between the two negative categories. The within-positive and within-negative weight ranges from 1 in scheme A1 (equivalent to a three-category scale of positive, neutral and negative) down to 0.6; in scheme A4 this weight is 0.75, half way between 0.5 and 1 (Table 2).
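The weighted κw of formula (2), the Fleiss-style standard error, and our reading of scheme A4 can be sketched as follows. This is a sketch under stated assumptions: the A4 matrix below is inferred from the criteria above (0.75 within the positives and within the negatives, 0.5 between neutral and any other category, 0 across the positive/negative divide) and should be checked against the published Table 2; function names are illustrative.

```python
import numpy as np

# Assumed scheme A4 (category order: +social, +care, neutral,
# -protective, -restrictive), inferred from criteria i)-v) in the text.
W_A4 = np.array([
    [1.00, 0.75, 0.50, 0.00, 0.00],
    [0.75, 1.00, 0.50, 0.00, 0.00],
    [0.50, 0.50, 1.00, 0.50, 0.50],
    [0.00, 0.00, 0.50, 1.00, 0.75],
    [0.00, 0.00, 0.50, 0.75, 1.00],
])

def weighted_kappa(table, W):
    """kappa_w and a Fleiss-style large-sample SE from a k x k count table."""
    p = np.asarray(table, float)
    n = p.sum()
    p = p / n
    r, c = p.sum(axis=1), p.sum(axis=0)   # margins p_i., p_.j
    po_w = (W * p).sum()                  # weighted observed agreement
    pe_w = (W * np.outer(r, c)).sum()     # weighted chance agreement
    kw = (po_w - pe_w) / (1 - pe_w)
    wbar_row = W @ c                      # row-average weights w-bar_i.
    wbar_col = W.T @ r                    # column-average weights w-bar_.j
    term = (W - np.add.outer(wbar_row, wbar_col) * (1 - kw)) ** 2
    var = ((p * term).sum() - (kw - pe_w * (1 - kw)) ** 2) / (n * (1 - pe_w) ** 2)
    var = max(var, 0.0)                   # guard against tiny negative rounding
    return kw, np.sqrt(var)
```

With identity weights (wij = 1 only for i = j) the function reduces to unweighted Cohen's κ, and with perfect agreement it returns κ̂w = 1 with zero standard error.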

Variation in κ̂w over observation periods
We examined Spearman's correlation between A4-weighted κ̂w and time of day, interactions per patient hour, mean length of interactions, and percentage of interactions less than one minute. ANOVA and two-sample t-tests were used to examine differences in A4-weighted κ̂w between wards and between mornings and afternoons.
Overall κ̂w combined over observation periods
To combine g (≥2) independent estimates of κw, we firstly considered the naive approach of collapsing over observation periods to form a single cross-tabulation containing all the pairs of QuIS ratings, shown in Table 3a). An estimate, κ̂w collapsed, and its 95% CI can be obtained from formulae (2) and (6). We next considered combining the g observation-period-specific estimates of κw using meta-analytic techniques. Firstly, using a fixed effects approach, the estimate in the m-th observation period is modelled as comprising the true underlying value of κw plus a component, εm, reflecting sampling variability dependent on the number of interactions observed within the m-th period:

  κ̂wm = κw + εm    (7)

where κw is the common overall value, and εm is normally distributed with zero mean and variance Vwm = [SE(κ̂wm)]². The inverse-variance estimate of κw based on the fixed effects model, κ̂w fixed, is a weighted combination of the estimates from each observation period:

  κ̂w fixed = Σm ωm κ̂wm / Σm ωm    (8)

with meta-analytic weights, ωm, given by:

  ωm = 1 / Vwm    (9)

Since the period-specific variances are not known, estimates ω̂m with variance estimates V̂wm, calculated from formula (6) for each of the m periods, are used. The standard error of κ̂w fixed is then:

  SE(κ̂w fixed) = 1 / √(Σm ω̂m)    (10)

from which a 100(1−α)% CI for κ̂w fixed can be obtained. κ̂w fixed is the estimate κ̂w overall combined over strata given by Fleiss et al. [6], here combining weighted κ̂wm rather than unweighted κ̂m. Equality of the g underlying, observation-period-specific values of κw is tested using a χ² test for heterogeneity:

  χ²heterogeneity = Σm ω̂m (κ̂wm − κ̂w fixed)²    (11)

to be referred to χ² tables with g − 1 degrees of freedom. The hypothesis of equality of the g κwm is typically rejected if χ²heterogeneity lies above the χ²g−1(0.95) percentile.
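The fixed effects combination and heterogeneity test above can be sketched as follows (function name illustrative; inputs are the per-period estimates and their standard errors):

```python
import numpy as np

def fixed_effects(est, se):
    """Inverse-variance (fixed effects) pooling of per-period kappa_w
    estimates, plus the chi-square heterogeneity statistic (g - 1 df)."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se ** 2                       # meta-analytic weights omega_m
    k_fixed = (w * est).sum() / w.sum()
    se_fixed = 1.0 / np.sqrt(w.sum())
    q = (w * (est - k_fixed) ** 2).sum()    # heterogeneity chi-square
    return k_fixed, se_fixed, q
```

Periods with smaller standard errors (more observed interactions) receive proportionally more weight in κ̂w fixed.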
The fixed effects model assumes that all observation periods share a common value, κw, with any differences in the observation-period-specific κ̂wm being due to sampling error. Because of our expectation that inter-rater reliability will vary depending on ward characteristics and other aspects of specific periods of observation, our preference is for a more flexible model incorporating underlying variation in the true κwm over the m periods within a random effects meta-analysis. The random effects model has

  κ̂wm = κw + δm + εm

where δm is an observation period effect, independent of sampling error (the εm terms defined as for the fixed effects model). Variability in the observed κ̂wm about their underlying mean, κw, is thus partitioned into a source of variation due to observation period characteristics, captured by the δm terms, which are assumed to follow a Normal distribution, δm ~ N(0, τ²), with τ² the variance in κwm across observation periods, and sampling variability. The inverse-variance estimate of κw for this model uses meta-analytic weights, Ωm, given by:

  Ωm = 1 / (Vwm + τ²)    (12)

Observation-period-specific variance estimates V̂wm are used, and τ² also has to be estimated. A common choice is the DerSimonian-Laird estimator [7], defined as:

  τ̂² = [χ²heterogeneity − (g − 1)] / [Σm ω̂m − (Σm ω̂m²) / (Σm ω̂m)]    (13)

usually truncated at 0 if the observed χ²heterogeneity < (g − 1). The estimate κ̂w random is then:

  κ̂w random = Σm Ω̂m κ̂wm / Σm Ω̂m    (14)

and an estimate of the standard error of κ̂w random is:

  SE(κ̂w random) = 1 / √(Σm Ω̂m)    (15)

leading to 100(1−α)% CIs for κ̂w random.
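The DerSimonian-Laird estimator and the random effects combination can be sketched as follows (function name illustrative):

```python
import numpy as np

def dersimonian_laird(est, se):
    """Random effects pooling of per-period estimates with the
    DerSimonian-Laird tau^2 estimator, truncated at zero."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se ** 2                     # fixed effects weights omega_m
    g = len(est)
    k_fixed = (w * est).sum() / w.sum()
    q = (w * (est - k_fixed) ** 2).sum()  # heterogeneity chi-square
    # DL estimator, truncated at zero when Q < g - 1
    tau2 = max(0.0, (q - (g - 1)) / (w.sum() - (w ** 2).sum() / w.sum()))
    W = 1.0 / (se ** 2 + tau2)            # random effects weights Omega_m
    k_random = (W * est).sum() / W.sum()
    return k_random, 1.0 / np.sqrt(W.sum()), tau2
```

When τ̂² is truncated at zero the random effects estimate coincides with κ̂w fixed; as τ̂² grows, the weights Ω̂m flatten towards equality across periods.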
The role of τ² is that of a tuning parameter: when τ² = 0 there is no variation in the underlying κw, and the fixed effects estimate, κ̂w fixed, is obtained. At the other extreme, as τ² becomes larger, the Ω̂m become close to constant, so that each observation period is equally weighted and κ̂w random becomes the simple average of the observation-period-specific estimates:

  κ̂w averaged = (1/g) Σm κ̂wm    (16)

κ̂w averaged ignores the impact of the number of interactions on the precision of the observation-period-specific estimates. The standard error for κ̂w averaged is estimated by:

  SE(κ̂w averaged) = √(Σm V̂wm) / g    (17)

Obtaining estimates of κ̂w from Stata
The inverse-variance fixed and random effects estimates can be obtained from the command metan [8] in Stata by feeding in pre-calculated effect estimates (variable X1) and their standard errors (variable X2). When X1 contains the g estimates κ̂wm, X2 their standard errors √V̂wm, and variable OPERIOD (labelled "Observation Period") an indicator of observation periods, inverse-variance estimates are obtained from the command:

  metan X1 X2, second(random) lcols(OPERIOD) xlab(0, 0.2, 0.4, 0.6, 0.8, 1) effect(X1)

The "second(random)" option requests the κ̂w random estimate in addition to κ̂w fixed. The "lcols" and "xlab" options control the appearance of the Forest plot of observation-specific estimates, combined estimates, and their 95% CIs.
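The simple averaged estimator κ̂w averaged discussed above, and its standard error, can be sketched as follows (function name illustrative); this is the estimator the paper later cautions against, since it weights every period equally however few interactions it contains:

```python
import numpy as np

def simple_average(est, se):
    """Equally weighted mean of per-period estimates (kappa_w_averaged)
    and its SE, sqrt(sum of variances) / g."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    g = len(est)
    return est.mean(), np.sqrt((se ** 2).sum()) / g
```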

Results
Across the 18 observation periods 447 interactions were observed, of which 354 (79%) were witnessed by both raters and form the dataset from which inter-rater reliability was estimated. The ICC for the total number of interactions recorded by each rater for the same observation period was high (ICC = 0.97: 95%CI: 0.92 to 0.99, n = 18). The occasional absence of patients from ward areas for short periods of time resulted in interactions being recorded for 67 patient hours (compared to the planned 72 h). The mean rate of interactions was 6.7 interactions/patient/hour. More detailed results are given by McLean et al. [5].
In Table 3a) the cross-tabulation of ratings by the two raters can be seen collapsed over the 18 observation periods. Two specific observation periods are also shown: in 3b) the period demonstrating the lowest unweighted κ̂ (κ̂ = 0.30), and in 3c) the period demonstrating the highest unweighted κ̂ (κ̂ = 0.90). From 3a) it can be seen that the majority of interactions are rated as positive, between 17% and 20% as neutral, and 7% as negative (from the margins of the table); this imbalance in the marginal frequencies would be expected to reduce chance-adjusted κ.
Scatterplots of A4-weighted κ̂wm against observation period characteristics are shown in Fig. 1. One of the characteristics (interactions/patient/hour) was sufficiently associated with A4-weighted κ̂wm to achieve statistical significance (P = 0.046).
In Table 4 it can be seen that the various combined estimates of κw did not vary greatly depending on the method of meta-analysis or on the choice of weighting scheme. However, there was greater variability in χ²heterogeneity. For all weighting schemes except unweighted, B2, B3 and C1, there was statistically significant heterogeneity by virtue of χ²heterogeneity exceeding the χ²17(0.95) cut-point of 27.59. Figure 2 shows the Forest plot demonstrating the variability in κ̂wm over observation periods, together with κ̂w fixed and κ̂w random, for the A4 weighting scheme. The estimate κ̂w fixed and its 95% CI are shown below the observation-specific estimates to the right of the plot, on the line labelled "I-V Overall". The line below, labelled "D+L Overall", presents κ̂w random and its 95% CI. Both estimates are identical to those shown in Table 4. The final column "% Weight (I-V)" relates to the meta-analytic weights, ω̂m, not to the A4 weighting scheme adopted for κw.

Discussion
We consider the most appropriate estimate of inter-rater reliability of the QuIS to be 0.57 (95% CI 0.47 to 0.68), indicative of only moderate inter-rater reliability. The finding was not unexpected: the QuIS categories can be difficult to distinguish, and though positioned as closely together as possible, the two raters had different lines of view, potentially impacting their QuIS ratings. The estimate of inter-rater reliability is based on our A4 weighting scheme with observation-specific estimates combined using random effects meta-analysis. Combined estimates of κw were not overly sensitive to the choice of weighting scheme amongst those we considered as plausible representations of the severity of misclassification between QuIS categories. We recommend a random effects approach to combining observation-period-specific estimates, κ̂wm, to reflect the inherent variation anticipated over observation periods.
There are undoubtedly other weighting schemes that fulfil all the criteria on which we chose weighting scheme A4, but the evidence from our analyses suggests that it makes relatively little difference to the resultant κ w random . In the absence of any other basis for determining weights, our scheme A4 has the virtue of simplicity.
A key issue is that researchers should not examine the κ̂w resulting from a variety of weighting schemes and then choose the scheme giving the highest inter-rater reliability. The adoption of a standard set of weights also facilitates comparison of inter-rater reliability across different studies of the QuIS.
We compared four approaches to estimating the overall κw. We do not recommend the simplest of these, κ̂w collapsed, based on estimating κw from the cross-tabulation of all ratings collapsed over observation periods: in general, collapsing involves a risk of confounding by stratum effects. Comparing the remaining estimates, it can be seen that κ̂w random lies between the fixed effects estimate, κ̂w fixed, and the averaged estimate, κ̂w averaged, for all the weighting schemes we considered. κ̂w averaged gives equal meta-analytic weight to each observation period, and thus up-weights periods with highest variance compared to κ̂w fixed. The observation periods with highest variance are those with fewest interactions/patient/hour of observation, and it can be seen from Fig. 1 that these periods tend to have the highest κ̂wm. A possible explanation is that with fewer interactions it is easier for observers to see and hear the interactions and thus make their QuIS ratings, which would be anticipated to result in greater accuracy and agreement. Thus κ̂w averaged might be expected to over-estimate inter-rater reliability and should be avoided. We recommend a random, rather than fixed, effects approach to combining because variation in κwm across observation periods was anticipated. Observation periods were chosen with the intention of representing the broad range of situations in which staff-inpatient interactions take place. At different times of day staff will be more or less busy, and this more or less guarantees heterogeneity in observation-period-specific inter-rater reliability.
Böhning et al. [9] identified several practical issues relating to inverse variance estimators in meta-analysis. For example and most importantly, that estimation is no longer unbiased when estimated rather than known variances are used in the meta-analytic weights. This bias is less extreme for larger sample sizes in each constituent study. We included 354 interactions across the 18 observation periods, on average about 20 per period, but it is not clear whether this is sufficient for meaningful bias to be eradicated. A further issue relates to possible misunderstanding of the single combined estimate as applying to all observation periods: a correct interpretation being that the single estimate relates to the mean of the distribution of κ wm over observation periods. An alternative might be to present the range of values that κ w is anticipated to take over most observation periods. This would be an unfamiliar presentation for most researchers.
Meta-analysis of κ̂ over studies following a systematic review has been considered by Sun [10], where fixed and random effects approaches are described, the latter adopting the Hedges [11], rather than the conventional DerSimonian-Laird, estimate of τ². Alternatives to the DerSimonian-Laird estimator are available, including the REML estimate and the Hartung-Knapp-Sidik-Jonkman method [12]. Friede et al. [13] examine properties of the DerSimonian-Laird estimator when there are only two observation periods and conclude that in such circumstances other estimators are preferable; McLean et al.'s study [5] was based on sufficient observation periods to make these problems unlikely. Sun addressed the issue of publication bias amongst inter-rater reliability studies found by searching the literature. Here we included data from all observation periods, irrespective of the estimate κ̂wm. Sun performed subgroup analyses of studies according to the degree of training of the raters involved, and also drew a distinction between inter-rater reliability studies where both raters can be considered equivalent and a study [14] comparing ratings from hospital nurses with those from an expert, which would more appropriately have been analysed using sensitivity, specificity and related techniques. The QuIS observations were carried out by raters who had all received the training developed by McLean et al. [5]; though there was variation in experience of the QuIS, a further source of inter-rater unreliability relating to the different lines of view from each rater's position was also considered to be important.
In the inter-rater study we describe, in some instances the same rater was involved in more than one observation period, potentially violating the assumption of independence across observation periods, which would be anticipated to lead to increased variance in an overall estimate, κ̂w. A random effects approach is more suitable in this regard as it captures some of the additional variance, coping with extra dispersion whether it arises from unobserved heterogeneity or from correlation across observation periods.