### Data

Patient data from 2000 to 2010 were extracted from the ANZICS APD, [3, 19]. The initial dataset contained 858,758 admissions from 1,354 ICU-year units from 161 ICUs. Exclusions were patients with: unknown hospital outcome (18,244); ICU length-of-stay (LOS) ≤4 hours (9,607); age <16 years (14,752); coronary artery bypass graft (CABG) (81,166); ICU admissions for the same and separate hospital admissions (123,151); and missing Acute Physiology and Chronic Health Evaluation (APACHE) III score, age, ventilation status, diagnostic category, or ICU source (40,507). Patients with CABG and uncomplicated acute myocardial infarction are not considered in the current dataset, and the exclusion of patients with LOS ≤ 4 hours was an original APACHE III requirement [5]. To ensure stability of estimation, ICU-year units with fewer than 150 complete patient records were excluded; this corresponds to three or fewer admissions per week to an ICU in a year. The final dataset consisted of records for 523,462 patients from 984 ICU-year units from 144 ICUs. Access to the data was granted by the ANZICS Database Management Committee in accordance with standing protocols. The research was exempt from formal University of Adelaide Human Research Ethics Committee approval according to the Australian National Statement on Ethical Conduct in Human Research, 2007 and local hospital (The Queen Elizabeth Hospital, Adelaide) Human Research Ethics Committee approval was waived.

### Statistical methods

#### Performance indicator

The performance indicator is the natural logarithm of the standardised mortality ratio (log-SMR), where the SMR is the ratio of the observed to expected numbers of in-hospital deaths. We prefer the log to the raw SMR, as it provides confidence intervals with better coverage probabilities and is approximately normally distributed, [14, 20].

#### Identifying unusual ICUs

#####
**Stage 1: Risk-adjusted model**

The dataset was split randomly into 80% training and 20% test datasets for model building. Owing to the (very long) computing time required to fit three-level models to such a large dataset, the initial model selection was conducted using one-level models and the logistic command in Stata™ (Version 12, 2011, [21]). Continuous (fixed) covariates included APACHE III score, year of admission, and annual ICU volume. The inclusion of Year as a fixed effect enabled adjustment for a (systematic) trend in the response over time. The patient severity score APACHE III is the third revision of the Acute Physiology and Chronic Health Evaluation (APACHE III) score, which is computed using the patient’s worst values during the first 24 hours post-admission to the ICU, [5]. The APACHE III score is the most referenced patient severity of illness score in the clinical literature and is an important predictor of mortality, [22]. It has a range 0-299, and in the original paper, [5], the mean score was 50. A severely-ill patient would have an APACHE III score of between 50-70. It is an important (although not the sole) measure of patient illness, since it combines several physiological and chronic health status variables. Our previous work fitting random intercept and random coefficient logistic regression models to the ANZICS APD demonstrated that APACHE III is the most important predictor of mortality, [14, 19]. In the present study, APACHE III was fitted as a non-linear term (in particular, a degree four polynomial) with the inclusion of a random slope term for APACHE III as this significantly improved the goodness of fit. Additional random terms were not included to avoid increasing the complexity of the model and the associated computing time. Age was fitted as a grouped variable with six bins, which enabled better prediction for very elderly patients. Fitting splines or fractional polynomials, [23], did not improve the fit of age or APACHE III. Descriptors of ICU-admission primary-organ-system dysfunction and patient surgical status (*i.e.*, patient diagnostic category) were generated by consolidating the diagnostic categories of the APACHE III algorithm. ICU-level variables locality and hospital level were also included in the model. The discrete explanatory variables were fitted as indicator variables. Increased mortality during the winter months and at weekends has previously been observed in the ANZICS APD [24, 25]. This was modelled here by sine and cosine trigonometric terms representing yearly and weekly periods and initially included harmonic terms at six months and 3.5 days, calculated from calendar day of admission, [26, 27]. However, the harmonic terms at six months and 3.5 days were not significantly associated with mortality and were dropped from the final model. Interactions between the periodic terms and locality were also included in the model, together with other clinically meaningful two-way interactions. Continuous explanatory variables were centred for model fitting and variables were removed stepwise if the estimated *p*-value was >0.05, excepting the pairs of sine and cosine terms which were always retained together in the model. Annual volume was retained in the final model because it is important, and Year squared was included to allow for a (systematic) nonlinear trend over time. Fitted models were compared using AIC (for nested models only), BIC, area under the ROC curve, and the Hosmer-Lemeshow test statistic (the latter used with caution in this large dataset, [28]). Binned residual plots were used to assess both the covariate modelling and the overall model fit and to help select a final model, [29].

A three-level risk-adjusted hierarchical model based on the best fitting one-level model was fitted to the dataset using Stata™’s xtmelogit command and the Laplace approximation, [21]. Preliminary investigations using two-level hierarchical models and simulations demonstrated that the parameter estimates (and their standard errors) obtained using the Laplace approximation versus seven-point Gaussian quadrature were almost the same. If *Y*
_{
itj
} represents the in-hospital mortality outcome (1 for death, 0 otherwise) for patient *j* in year *t* in ICU *i*, and *P*
_{
itj
} is the binomial probability of in-hospital mortality for this patient, the log odds of death is given by

\text{logit}({P}_{\mathit{\text{itj}}})={\beta}_{0}+\sum _{k=1}^{K}{\beta}_{k}{X}_{\mathit{\text{itjk}}}+{U}_{\mathit{\text{it}}}+{U}_{i}+{U}_{i1}\text{APACHE III},

(1)

where *X*
_{
itjk
} contains the observed (fixed-effects) explanatory variables for patient *j*, *U*
_{
it
} is the random intercept for year *t* in ICU *i*, *U*
_{
i
} is the random intercept for ICU *i*, and *U*
_{
i1} is the random coefficient for APACHE III score. The indices range from *i* = 1,…,144 ICUs, *t* = 1,…,*n*
_{
i
} years within ICU *i*, and *j* = 1,…,*n*
_{
it
} patients within ICU-year *t*. The level-three (ICU-level, or between ICUs) random intercepts are assumed to be normally distributed with zero mean and variance {\sigma}_{I}^{2}; the APACHE III slopes are also assumed normally distributed with variance {\sigma}_{\mathit{\text{AP}}}^{2}, and there is a component of covariance, *σ*
_{
I,A P
}, assumed at level three. The level-two (ICU-year, or between years within ICUs) random effects are assumed to be independently normally distributed with variance {\sigma}_{\mathit{\text{IY}}}^{2}, independently of the level-three random effects. The level-three random intercepts represent (potentially unknown) differences between ICUs and the random slopes for APACHE III allow the dependence of in-hospital mortality on patient severity to vary between hospitals. The component of covariance accommodates potential dependence between the intercept and APACHE III slope terms within ICUs. The normality assumptions for the random effects were assessed using estimated gradient graphs, [30].

Using approximate cross-validation, we assessed how well the observed data from each ICU were fitted by the final model, [14]. This involved estimating an approximate *p*-value for each ICU from the three-level hierarchical model as follows. To begin, random effects were simulated from the fitted random effects distribution (1) 5,000 times. Given the simulated random effects, the probability of death for patient *j* in ICU-year *t* in ICU *i*, {\stackrel{~}{P}}_{\mathit{\text{itj}}}, was calculated using the fitted model. Then {Y}_{\mathit{\text{itj}}}^{k}, the corresponding outcome for iteration *k* = 1,…,5,000, was simulated from a Bernoulli distribution with probability of success {\stackrel{~}{P}}_{\mathit{\text{itj}}}; so {Y}_{\mathit{\text{itj}}}^{k} is equal to 0 or 1. Then the simulated number of deaths for ICU *i* was calculated as {E}_{i}^{k}={\sum}_{t=1}^{{n}_{i}}{\sum}_{j=1}^{{n}_{\mathit{\text{it}}}}{Y}_{\mathit{\text{itj}}}^{k}. For each ICU, the proportion of times the observed number of deaths, *O*
_{
i
}, exceeded the simulated number of deaths was calculated as

{p}_{i}-\text{approx}=\frac{1}{5000}\sum _{k=1}^{5000}{I}_{{E}_{i}^{k}<{O}_{i}},

where *I* is the indicator function. This gave an approximate *p*-value for each ICU under the nominal null hypothesis that the SMR is equal to one. Under this null hypothesis, we would expect the simulated number of deaths to exceed the observed number in approximately half of the simulations, [14]. Thus *p*-approx measures how well the estimated model predicts the observed number of deaths for each ICU. We chose a nominal 20% significance level for this first stage of screening for potential outliers. When *p*-approx < 0.1, an ICU was assessed to be potentially over-performing (*i.e.*, has low mortality), and when *p*-approx > 0.9, a site was potentially under-performing (*i.e.*, has high mortality). It may be helpful to plot a histogram of the *p*-values, or transformed *p*-values, to detect the presence of outliers. If the variability amongst the providers is very small with no obvious outliers, one might decide on a much lower nominal level of significance such as 5% or to proceed with a different analysis for comparison, or no analysis.

#####
**Stage 2: A null model**

The Stage 2 model was estimated by excluding the potentially unusual ICUs identified in Stage 1, then re-fitting the final model. This provided a null ‘reference’ distribution for describing usual ICU performance. Log-SMRs and their variances were again estimated for each ICU, according to the methods described in [14] and Additional file 1. The estimation of the variances of the log-SMRs is somewhat technical, but an outline of the calculations to obtain the approximate variance of the log-SMR for ICU *i* in year *t* is given in Additional file 1. The uncertainty in estimating the expected number of deaths for each ICU is therefore accounted for in our analysis, whereas this is usually treated as given. Treating the estimated expected number of deaths as a constant in the calculations under-estimates the true variance of the log-SMRs, so our analysis offers an advantage over what is usually done. Note that the potentially unusual ICUs were modelled without random effects, so for each unusual ICU, a usual ICU was randomly selected and the random effects predictions from that ICU used to calculate the expected number of deaths for the potentially unusual ICU. Extensive sensitivity analyses demonstrated that randomly selecting the random effects from the ‘usual’ distribution in this way gave the same results as stratifying on ICU level, for example.

#####
**Stage 3: Unusual ICUs**

The funnel plot was constructed as described previously, [14]. ICUs with log-SMRs lying outside the funnel were identified as performing unusually, with either higher or lower mortality than usual. All ICUs have been randomly allocated a random identity number which is shown for those lying outside the thresholds. Confidence intervals controlling the false coverage-statement rate (FCR) at 5% were also constructed for the ICUs identified as unusual, [31]. The FCR is the expected proportion of false discovery rate (FDR) selected [32] confidence intervals which do not cover their true parameter values. FCR is a property of the set of confidence intervals not covering zero and does not involve confidence intervals for the non-selected parameters. However, all confidence intervals may be plotted together by applying visual impact to distinguish the two sets of intervals (selected and non-selected) and we use bold lines to distinguish the FDR-selected intervals. The remaining intervals have FCR coverage of at most 0.05 for all parameters because the FCR offers marginal coverage of at least 0.95. We further evaluated the performance of the outlying ICUs by posing the question: is the worst ICU worse than expected, given it has arisen from the null (usual) predictive distribution, [13]? This question is answered by simulating the distribution of the predicted true worst number of deaths and comparing it to the observed worst number of deaths.