Should policy-makers and managers trust PSI? An empirical validation study of five patient safety indicators in a national health service

Background: Patient Safety Indicators (PSI) are only modestly used in Spain, partly because of concerns about their empirical properties. This paper provides evidence by answering three questions: a) Are PSI differences across hospitals systematic rather than random? b) Do PSI measure differences among hospital-providers, as opposed to differences among patients? c) Are measurements able to detect hospitals with a higher than "expected" number of cases?
Methods: An empirical validation study on administrative data was carried out. All 2005 and 2006 publicly-funded hospital discharges were used to retrieve eligible cases of five PSI: death in low-mortality DRGs (MLM); decubitus ulcer (DU); postoperative pulmonary embolism or deep-vein thrombosis (PE-DVT); catheter-related infections (CRI); and postoperative sepsis (PS). The Empirical Bayes statistic (EB) was used to estimate whether the variation was systematic; multilevel logistic modelling determined what proportion of the variation was explained by the hospital; and shrunken residuals, as provided by multilevel modelling, were plotted to flag hospitals performing worse than expected.
Results: Variation across hospitals was observed to be systematic in all indicators, with EB values ranging from 0.19 (CI95%: 0.12 to 0.28) in PE-DVT to 0.34 (CI95%: 0.25 to 0.45) in DU. A significant proportion of the variance was explained by the hospital once patient case-mix was adjusted: from 6% in MLM (CI95%: 3% to 11%) to 24% in CRI (CI95%: 20% to 30%). All PSI were able to flag hospitals with rates above the expected, although this capacity decreased when the largest hospitals were analysed.
Conclusion: Five PSI showed reasonable empirical properties to screen healthcare performance in Spanish hospitals, particularly in the largest ones.


Background
The Spanish National Health Service, like others, has been influenced by the Patient Safety movement. Evidence from two reports on Spanish hospitals, following other international works on adverse events [1][2][3][4][5][6][7], inspired the debate. The first showed an in-patient incidence of adverse events ranging from 5.6% to 16.1%, of which between 17% and 41% were avoidable [8]. The second found an incidence of adverse events amenable to health care of up to 10.1% [9]. These findings helped steer the inclusion of Patient Safety Indicators (PSI) in the sets of National and Regional Quality Indicators, although health care authorities use them only modestly to assess health care performance.
The Spanish National Health Service (NHS) experience builds on the insight from the Healthcare Cost and Utilization Project by the US Agency for Healthcare Research and Quality [10] and the requirements of the OECD [11]. In spite of the efforts made to build a valid tool, concerns remain about whether PSI are appropriate to inform hospital performance. Beyond the need for local adaptation [12,13], most of the caveats have pointed to flaws in their capacity to attribute excess cases to hospitals by detecting true incident adverse events [14][15][16][17][18][19][20][21][22][23]. Less has been written on their empirical properties, mainly because of their local nature; in particular, on the extent to which PSI show systematic (as opposed to random) variation in adjusted incidence, and on their ability to provide precise estimates and therefore to be sensitive enough to detect providers above the expected. Several works on similar topics have partially addressed some of these issues [24][25][26].
This paper aims to test the empirical properties of five PSI as well as their ability to answer relevant questions for concerned users: a) Are differences in PSI rates across hospitals systematic? b) Do PSI measure differences among hospital-providers as opposed to differences among patients? c) Are measurements precise enough and able to detect providers with a higher (lower) than expected number of cases?

Study design, population and setting
An empirical validation study, based on administrative data, was carried out. All 2005 and 2006 publicly-funded hospital discharges were used to retrieve eligible patients. In order to reduce random noise in the estimates, hospitals with fewer than 30 eligible cases were excluded.
Five PSI were analyzed for the purposes of this study: death in low-mortality DRGs (MLM); decubitus ulcer (DU); postoperative pulmonary embolism or deep vein thrombosis (PE-DVT); infections due to medical care, including catheter-related infections (CRI); and postoperative sepsis (PS). The number of cases (numerators) and eligible admissions (denominators) are shown in Table 1. The selection of these five indicators was based on a previous report on the validity of the AHRQ PSI for the Spanish case [23].
For the purpose of this study, a Spanish version of the AHRQ PSI algorithms was used. The AHRQ PSI definitions (version 4.1) were subjected to a local validation process that accounted for differences with respect to the US healthcare system (i.e., the ICD 9th revision and DRG version in use, as well as some coding characteristics), with a view to improving face validity for the Spanish context. Although the process has been described elsewhere [23,27], it is worth highlighting that a dedicated consultation group of clinicians and coders examined and adapted, as needed, both numerators and denominators for each indicator. In the particular case of MLM, an empirically built indicator, the list of low-mortality DRGs was redefined for the Spanish case. The overall correlation between the original AHRQ and the Spanish PSI definitions in flagging events in the hospitals under study was high across the five indicators, ranging from 0.75 in PE-DVT to 0.95 in PS.

Main endpoints
Three main endpoints were studied: a) systematic variation, defined as an Empirical Bayes value different from zero; b) cluster effect, defined as a rho statistic value different from zero; c) sensitivity, defined as a statistically significant difference between the observed and the expected, as provided by the residual analyses in a multilevel approach.

Analyses
Adjusted incidence (I) was calculated for each PSI (except MLM) and hospital. Crude incidence was used in the case of MLM because of its quasi-sentinel event nature. Variation in incidence was calculated using the ratio of variation between hospitals at the 95th and 5th percentiles ($RV_{95-5}$), and the ratio of variation between hospitals at the 75th and 25th percentiles ($RV_{75-25}$). Once variation in the incidence of adverse events had been calculated, the methods intended to answer the aforementioned questions on PSI empirical properties were applied. These methods are described below.
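As an illustration of the two ratios of variation, the following minimal sketch computes them from a vector of hospital-level incidences. The function name `ratio_of_variation`, the example rates, and the use of NumPy's default (linear interpolation) percentile definition are assumptions made for illustration; the percentile definition used in the study is not stated.

```python
import numpy as np


def ratio_of_variation(rates, upper, lower):
    """Ratio between the `upper` and `lower` percentiles of hospital rates.

    Hypothetical helper: uses NumPy's default percentile interpolation,
    which may differ from the definition used in the study.
    """
    rates = np.asarray(rates, dtype=float)
    high, low = np.percentile(rates, [upper, lower])
    if low == 0:
        raise ValueError("lower percentile is zero; ratio is undefined")
    return high / low


# Made-up adjusted incidences per 1,000 eligible patients, one per hospital.
example_rates = [2.1, 3.4, 4.0, 4.8, 5.5, 6.3, 7.9, 9.2, 12.5, 15.0]
rv_95_5 = ratio_of_variation(example_rates, 95, 5)
rv_75_25 = ratio_of_variation(example_rates, 75, 25)
print(f"RV 95-5 = {rv_95_5:.2f}, RV 75-25 = {rv_75_25:.2f}")
```

The 75-25 ratio is less sensitive to extreme hospitals than the 95-5 ratio, which is why reporting both gives a fuller picture of the spread.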
Are differences across hospitals systematic rather than random?
We used an observed-to-expected approach, the observed being the counts of adverse events in each hospital under study, and the expected being the cases predicted by a logistic regression with the recorded age, sex and comorbidities of each patient as covariates. (An adaptation of the AHRQ version [28] was used to retrieve comorbidities.) Given both observed and expected counts, the Empirical Bayes statistic (EB) was estimated following a two-step hierarchical model. The first step assumes that, conditional on the risk $r_i$, the number of counts $y_i$ follows a Poisson distribution, $y_i \mid r_i \sim \mathrm{Poisson}(e_i r_i)$, whereas in the second step heterogeneity in rates is modelled by adopting a common distribution $\pi$ for the risk $r_i$ (or for its logarithm), $r_i \sim \pi(r \mid \theta)$, with $\theta$ the vector of parameters of the density function. The EB statistic is based on the assumption that the log-relative risks are normally and identically distributed, $\log(r_i) \sim N(\mu, \sigma^2)$.
In order to assess the alternative hypothesis, confidence intervals for the observed statistics were derived. To avoid parametric assumptions on the distribution of observed cases, we used a non-parametric re-sampling method with 2,000 re-samples for each one of the simulated samples. Credibility intervals were obtained from the 2.5th and 97.5th percentiles [29].
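The observed and expected counts entering the hierarchical model can be sketched as follows, assuming that patient-level predicted probabilities from the logistic regression are already available. The function name `observed_and_expected`, the integer hospital codes, and the toy data are hypothetical and serve only to illustrate the aggregation step; this is not the study's code.

```python
import numpy as np


def observed_and_expected(hospital_ids, events, predicted_risk):
    """Aggregate patient-level data into per-hospital counts.

    hospital_ids   : integer code (0, 1, 2, ...) of the treating hospital
    events         : 1 if the patient had the adverse event, else 0
    predicted_risk : probability of the event predicted by a patient-level
                     logistic regression (assumed to be fitted elsewhere)
    """
    hospital_ids = np.asarray(hospital_ids, dtype=int)
    events = np.asarray(events, dtype=float)
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    n_hospitals = hospital_ids.max() + 1
    # Observed count y_i: number of events recorded in each hospital.
    observed = np.bincount(hospital_ids, weights=events, minlength=n_hospitals)
    # Expected count e_i: sum of predicted probabilities in each hospital.
    expected = np.bincount(
        hospital_ids, weights=predicted_risk, minlength=n_hospitals
    )
    return observed, expected


# Toy example with two hospitals and five patients.
obs, exp = observed_and_expected(
    hospital_ids=[0, 0, 1, 1, 1],
    events=[1, 0, 0, 1, 1],
    predicted_risk=[0.5, 0.5, 0.2, 0.3, 0.5],
)
print(obs / exp)  # crude observed-to-expected ratio per hospital
```

The crude ratio printed at the end is only the raw input to the model; the hierarchical step then treats the underlying risks $r_i$ as draws from a common distribution.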
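A percentile interval of this kind can be sketched as below. The sketch assumes that hospitals are the re-sampling unit and, for brevity, uses the variance of the log observed-to-expected ratios as a stand-in statistic; it does not reproduce the EB estimator used in the study, and the names `percentile_interval` and `variance_of_log_ratios` are hypothetical.

```python
import numpy as np


def variance_of_log_ratios(observed, expected):
    """Stand-in statistic (NOT the study's EB statistic)."""
    return float(np.var(np.log(observed / expected)))


def percentile_interval(observed, expected, statistic,
                        n_resamples=2000, seed=0):
    """Non-parametric percentile interval, re-sampling hospitals.

    Assumption: hospitals are drawn with replacement; the study does not
    detail its re-sampling scheme beyond the number of re-samples.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(observed)
    draws = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # hospitals drawn with replacement
        draws[b] = statistic(observed[idx], expected[idx])
    lower, upper = np.percentile(draws, [2.5, 97.5])
    return lower, upper


# Made-up observed and expected counts for eight hospitals.
obs = [12, 30, 8, 21, 40, 5, 17, 26]
exp = [15.0, 22.0, 10.0, 20.0, 28.0, 9.0, 18.0, 19.0]
low, high = percentile_interval(obs, exp, variance_of_log_ratios)
print(f"95% percentile interval: {low:.3f} to {high:.3f}")
```

If the lower limit of such an interval stays away from zero, the variation is unlikely to be explained by chance alone, which is the logic applied to the EB statistic in the text.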
Do PSI measure differences among hospital-providers as opposed to differences among patients?
Classically, risk adjustment has been used to compare providers on the assumption that all patients have a homogeneous propensity to have the outcome of interest, wherever they are treated. We could otherwise hypothesize that this propensity is more similar among patients within a hospital than among patients from different hospitals; this is the so-called cluster effect. If this is true, classical methods ignore the effect and yield misleading estimates of variation. The multilevel approach, by contrast, considers the cluster effect (heterogeneity across hospitals) in the variance estimation, producing sounder estimates and a better understanding of how context (i.e., hospital of treatment) affects event rates [26].
In our study, to answer the above-mentioned question, the existence of a cluster effect (hospital effect) was tested by using a 2-level logistic model, in which patients were nested within hospitals. The outcome variable was the PSI of interest, and the covariates were age, sex and the Elixhauser comorbidities (EC) [28]. A model was tailored for each PSI (except MLM, which is considered a quasi-sentinel event), testing EC as covariates while taking into consideration clinical reasoning (i.e., not all EC were used in all PSI) and the magnitude of the association (OR ≥ 2), to avoid spurious findings due to the massive samples used in the study. The multilevel model was an extension of the previously estimated individual logistic models (the c statistic was used to assess their goodness of fit) [30].
The degree of similarity of PSI events among providers was tested by using the rho statistic and its confidence intervals (type 1 error of 5%). The unobserved individual error followed a logistic distribution with individual variance equal to $\pi^2/3$ [26].
Finally, the Median Odds Ratio (MOR) statistic (and its confidence intervals), a measure of the variation among clusters (hospitals in our study) was estimated by comparing pairs of patients with the same covariates from two, randomly chosen, different clusters [31]. MOR provides information on how heterogeneity across hospitals increases the individual odds of experiencing the outcome of interest.
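Both summary measures can be derived from the hospital-level variance of a 2-level logistic model. The sketch below uses the standard latent-variable formula for rho (individual variance fixed at $\pi^2/3$) and the usual closed form of the MOR, $\exp(\sqrt{2\sigma^2}\,\Phi^{-1}(0.75))$; the function names and the example variance are illustrative assumptions, not values taken from the study.

```python
import math
from statistics import NormalDist


def intraclass_correlation(hospital_variance):
    """Rho: share of total latent variance attributable to the hospital."""
    individual_variance = math.pi ** 2 / 3  # logistic error variance
    return hospital_variance / (hospital_variance + individual_variance)


def median_odds_ratio(hospital_variance):
    """MOR: median OR between two patients with identical covariates
    treated in two randomly chosen, different hospitals."""
    z75 = NormalDist().inv_cdf(0.75)  # 75th percentile of N(0, 1)
    return math.exp(math.sqrt(2 * hospital_variance) * z75)


# Hypothetical hospital-level variance on the log-odds scale.
sigma2 = 0.2
print(f"rho = {intraclass_correlation(sigma2):.3f}")
print(f"MOR = {median_odds_ratio(sigma2):.2f}")
```

With a hypothetical variance of 0.2, these formulas give a rho of roughly 0.06 and a MOR of roughly 1.53, which shows how a modest rho can coexist with a sizeable increase in individual odds.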
Are measurements able to detect hospitals with a higher than expected number of cases?
This is a key question in the study as PSI are infrequent events, and imprecise measures and poor sensitivity are expected.
Given the existence of a cluster effect, the natural way to assess whether each hospital's PSI rate differs significantly from the expected rate is to compute (and plot) shrunken residuals derived from the multilevel method. Shrunken residuals disentangle the true hospital variation from random variation [32]. For the purposes of this study, the residual in each hospital and its standard error were estimated. The residual ($\mu_j$) represents the difference between the observed and the expected rate ($\mu_{oj}$), the expected being the estimated average PSI rate for all the hospitals under study. Residual graphs showing each hospital effect (and its confidence interval) around the average value (a constant for all hospitals, taken as the expected one) were plotted. Residuals were assumed to follow a Gaussian distribution, $N(0, 1)$.
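The flagging rule implied by these residual plots can be sketched as follows, assuming that each hospital's shrunken residual and its standard error have already been extracted from a fitted multilevel model. The function name `flag_hospitals`, the labels, and the example values are hypothetical.

```python
from statistics import NormalDist


def flag_hospitals(residuals, standard_errors, level=0.95):
    """Classify hospitals by whether the confidence interval of their
    shrunken residual excludes zero (the average hospital)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    flags = []
    for residual, se in zip(residuals, standard_errors):
        lower = residual - z * se
        upper = residual + z * se
        if lower > 0:
            flags.append("above expected")
        elif upper < 0:
            flags.append("below expected")
        else:
            flags.append("not different")
    return flags


# Made-up residuals (log-odds scale) and standard errors for three hospitals.
print(flag_hospitals([0.5, -0.5, 0.1], [0.1, 0.1, 0.2]))
```

Because the standard error grows as the number of eligible cases falls, hospitals with few cases rarely leave the "not different" category, which is the precision concern raised for infrequent events.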

Data sources
The 2005 and 2006 hospital discharges dataset (CMBD) was used to obtain numerators and denominators for each indicator, i.e. to apply the PSI inclusion and exclusion criteria. CMBD records the activity performed by all publicly funded hospitals across the country, which are required to provide this information on a yearly basis. The register records, in a systematic and homogeneous way, information from each patient discharge; specifically: age, sex, diagnosis of admission, secondary diagnoses (up to 30), length of stay, nature of the admission, discharge status, and diagnostic and therapeutic procedures performed. The register started its activity in the mid-1990s.

Results
A total of 6.2 million discharges (between 171 and 175 hospitals depending on the indicator) were retrieved, once the new Spanish definitions were implemented. Admissions at risk ranged from 612,590 in postoperative sepsis to 2,954,018 in catheter-related infection. Adjusted incidence ranged from 0.54 deaths per 1,000 patients admitted in low-mortality DRGs to 17.3 cases of postoperative sepsis per 1,000 eligible patients (Table 1).
Are differences among hospitals systematic?
According to the Empirical Bayes statistic, variation was observed to be systematic in all indicators, ranging from 0.19 (CI95%: 0.12 to 0.28) in the case of PE-DVT to 0.34 (CI95%: 0.25 to 0.45) in DU (Table 1).

Do they measure differences among hospital-providers?
Multilevel logistic regressions were fitted to determine the effect of the hospital once patient case-mix was adjusted. Although most of the variance was explained by patient-related factors (ranging from 64% in PS to 79% in DU according to the area under the curve), a significant proportion of the variance was still explained by the hospital: from a small rho value of 6% in the case of MLM (CI95%: 3% to 11%) to a high rho value of 24% (CI95%: 20% to 30%) in CRI (Table 2). In the median case, as expressed by the MOR, the variance among hospitals increased the individual risk, expressed as an OR, by 53% in the case of MLM (MOR = 1.53; CI95%: 1.35 to 1.81), by 79% in the risk of having DU attributable to the care received, by more than 2.6 times in the risk of experiencing a CRI, by 53% in the risk of suffering a PE-DVT after surgery, and by 69% in the risk of having a PS.
Are measurements precise enough and able to detect hospitals with a higher than expected number of cases?
As observed in Figure 2, after risk adjustment a considerable number of hospitals were found to be statistically above the expected, i.e. above the average rate of adverse events predicted for the hospitals under study. Thus, 19 hospitals (11% of the sample) in the case of MLM, 46 hospitals (26%) in DU, 114 hospitals (35%) in CRI, 39 hospitals (22%) in PE-DVT, and 53 hospitals (31%) in PS were flagged as "underperformers".

Discussion
Five PSI have been considered for empirical validation in public acute-care hospitals across Spain. All of them showed systematic variability (variation beyond chance), exhibited a cluster effect, and were able to detect hospitals above the expected. Nevertheless, several questions need to be addressed to provide a nuanced statement on their usefulness.
Is the estimated variation systematic or due to chance?
Except in the case of MLM, since it is considered a quasi-sentinel event, we should know more about the basal distribution of adverse events to properly answer this question; however, we might assume, given the nature and rationale behind the safety indicators, that this distribution is expected to be close to zero.
Our approach was precisely based on testing the alternative hypothesis through the estimation of robust Empirical Bayes confidence intervals against zero as the null value. The precision of the estimated intervals, together with the distance between the lower limit and zero (the closest figure was 0.12, in PE-DVT), supports the hypothesis that the variation observed is systematic rather than random.
Figure 1 Variation in adjusted-incidence by PSI. Each dot represents the adjusted incidence of adverse events in a specific hospital. Incidence is computed as a mean-centred log-incidence to allow comparison among events with different basal incidence. Legend: y axis: log-adjusted-incidence.
Is the observed variation due to hospital-providers, rather than to patients?
If this were not the case, PSI would not be useful for their intended purpose, which is to elicit differences attributable to health care. Our approach sought to elicit the hospital effect by estimating the existence of variation beyond the case-mix of patients treated, through the so-called cluster effect. As mentioned in the results, in the studied PSI a noticeable part of the variation was attributed to the hospital where the patients were treated. However, it might be argued that in a multilevel approach this finding is quite dependent on the quality of the risk adjustment: the worse the adjustment at the patient level, the higher the proportion of variance that could eventually be explained by the hospital level. This is particularly true of studies using administrative data, where the limited information available on specific patient characteristics might reduce the quality of risk-adjustment methods.
A way to mitigate this limitation is to reduce the extra-variance due to differences in case-mix that the model is unable to capture, by modelling the largest hospitals. These are teaching hospitals with more than 450 beds, able to provide high-tech services, and ultimately, homogeneous with regard to the patient casemix, particularly in studies where sample size is as huge as ours.

Are results dependent on the coding practices affecting Elixhauser comorbidities?
A particular phenomenon that could also affect the cluster estimates, and ultimately the reliance on PSI, is the differential coding intensity across hospitals. In fact, the number of secondary diagnoses has been already proven to influence the international comparisons [21]. In theory, if this variation was closely related to coding intensity in hospitals, the cluster effect would suffer an important reduction when the number of secondary diagnoses was considered as a factor in the multilevel models; otherwise, it would be very much related to the patients, thus affecting the risk adjustment estimates.
For the purpose of this exploration, the number of secondary diagnoses was categorized using the median value (4 secondary diagnoses) as a threshold. In general terms, when both models were compared, a clear reduction in the Elixhauser comorbidity β coefficients, together with stable rho-value estimates, was observed (Additional file 1). Given that the number of secondary diagnoses absorbed part of the variance in the new model and the β coefficients changed, variation is also expected in the random effects estimated for each hospital. However, an excellent correlation (Pearson coefficient) between the original random effects and the new ones was found: 0.83 in postoperative sepsis, 0.86 in postoperative PE-DVT, 0.94 in decubitus ulcer and 0.96 in catheter-related infection. On the other hand, except in the case of decubitus ulcer the changes in the statistical nature of the random effect (i.e. hospitals found as statistically different from the average turned
Estimates for hospital (clustering) and individual effects. A rho statistic value different from 0 represents the existence of a cluster effect: the propensity of having an outcome is more similar among the patients within a hospital than among patients from different hospitals; as for the magnitude of rho, the higher the value, the greater the clustering.
Are PSI precise enough to detect hospitals with rates above the expected?
Although PSI are quite infrequent events, shrunken residuals from the multilevel analysis proved precise enough to detect hospitals above the expected, as Figure 2 shows quite plainly. Nevertheless, it is also necessary to determine how the cluster effect might be influenced by either outlier hospitals or the extra variance attributable to the mix of hospitals within the sample. With regard to the former, the estimation barely changed once the outlier values, easily identifiable at the two ends of the distribution in Figure 2, were excluded (data not shown). The latter is more important. To understand this effect, new residuals were estimated and plotted for the centres that are a priori most homogeneous, namely the largest ones as described in previous paragraphs. As observed, except in the case of MLM, where heterogeneity across hospitals was the underlying reason for the results (just 4 out of 47 hospitals were statistically above the expected in this second analysis), this capacity remained noticeably high in the remaining PSI: 23% of the hospitals were flagged above the expected in decubitus ulcer, up to 36% in catheter-related infection, 25% in the case of postoperative pulmonary embolism or deep vein thrombosis, and up to 28% in the case of postoperative sepsis (Additional file 2).

Should policy-makers and managers trust PSI?
Our work aimed to shed light on some empirical properties that PSI are supposed to have in order to be useful for safety measurement and, ultimately, to allow concerned users an informed quality management: namely, representing systematic variation across providers (ruling out randomness as an alternative explanation of the differences), and flagging hospitals as potential underperformers regardless of the mix of patients they treat. However, proper use requires debating two lessons learnt in this study, and reflecting upon other aspects that were not part of our work.
As for the lessons learnt with the studied PSI, first, because of the aforementioned flaws in adjusting patient risks, we need to be aware that hospitals with more complexity might be falsely signalled as bad performers, particularly if they do not properly report secondary diagnoses. Secondly, the hospital effect (cluster effect) does exist, quite consistently across different statistical models; however, its magnitude clearly decreases when studying homogeneous hospital-providers. Although obvious, this message points directly towards comparing like with like, particularly when risk adjustment is expected to be suboptimal.
As for the reflection on other issues not addressed in this exercise, it is worth pointing out that the study of empirical properties gives just a partial view of PSI validity. Further debate on other validity issues ought to be pursued before PSI usage can be fully trusted. For this purpose, we have to be able to answer whether PSI measure what they are supposed to measure. In this work, we have assumed construct validity, since PSI were carefully developed for safety measurement purposes [10,11], and face validity has been granted in advance for the Spanish case by carrying out an ad hoc face-validity project [23]. However, criterion validity (the ability of an indicator to flag true positive cases and true negative cases by comparison with a gold standard) has to be specifically addressed, in context. Fortunately, for the Spanish NHS, a recent piece of research on surgical discharges shed some light on criterion validity [33]. In general terms, the five PSI were shown to perform quite well in terms of positive likelihood ratio (+LR). The most conservative estimation yielded a +LR of 26.8 in decubitus ulcer, a +LR of 406.3 in catheter-related infection, a +LR of 149.3 in PE-DVT and a +LR of 25.32 in postoperative sepsis. These figures seemed high enough to adopt these PSI as a screening tool, except in the case of decubitus ulcer, which is clearly affected by underreporting (false negative cases) and the existence of present-on-admission ulcers (false positive cases).
Some additional effort should be made to evaluate PSI stability over time (outside the scope of this work). In the meantime, taking the studied PSI as screening tools, and wisely assessing in specific contexts the limits pointed out throughout this work, might help to identify those centres from which best-practice lessons can be drawn and those where intervention is clearly needed.
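For readers less familiar with the positive likelihood ratio, the following minimal sketch shows how it is obtained from a 2x2 comparison of the indicator against a gold standard. The function name and the counts are hypothetical and are not taken from the cited validation study [33].

```python
def positive_likelihood_ratio(true_pos, false_neg, false_pos, true_neg):
    """+LR = sensitivity / (1 - specificity)."""
    sensitivity = true_pos / (true_pos + false_neg)
    false_positive_rate = false_pos / (false_pos + true_neg)  # 1 - specificity
    if false_positive_rate == 0:
        raise ValueError("no false positives; +LR is unbounded")
    return sensitivity / false_positive_rate


# Made-up chart-review counts: indicator result versus gold standard.
plr = positive_likelihood_ratio(
    true_pos=80, false_neg=20, false_pos=10, true_neg=990
)
print(f"+LR = {plr:.1f}")
```

Because the denominator is the false positive rate, even a moderately sensitive indicator can reach a very high +LR when false positives are rare, as is typical for infrequent events in large administrative datasets.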

Conclusion
Five PSI showed reasonable empirical properties to screen healthcare performance in Spanish hospitals, particularly in the largest ones. However, the ability to flag hospitals above the expected was limited in mortality in low-mortality DRGs because of its larger standard errors, and the risk of hospital misclassification in decubitus ulcer remained.

Additional material
Additional file 1: The effect of the number of secondary diagnoses. It shows the recalibration of each model using as a factor the number of secondary diagnoses. Tables show both the estimates before and after the adjustment.