We made rankings of 13 IVF clinics in the Netherlands, based on pregnancy rates. We calculated the Expected Rank and the Percentiles based on Expected Ranks, to incorporate both the magnitude and the uncertainty of the differences in pregnancy rates between the clinics.

When we want to measure provider performance with outcome measures, in this case pregnancy rates of IVF clinics, two issues are important: case-mix adjustment and natural variation.[3, 4] Regarding the first issue, the chance of pregnancy is determined not only by the performance of the clinic but also by characteristics of the mother. When clinics treat different patient populations, this alone can cause variation in pregnancy rates that the clinics cannot prevent. Adjustment for case-mix is therefore very important, but it was outside the scope of this study. It is, however, technically straightforward to calculate the case-mix adjusted ER. The only difference is that the pregnancy rate coefficients are obtained from a (random effect) logistic regression model that includes the relevant patient characteristics as explanatory variables.

The second issue is natural variation, which exists just by chance. Random effect models allow imprecisely estimated outcomes from smaller clinics to 'borrow' information from other clinics, causing their estimates to be shrunk toward the overall mean. Each estimate reflects a compromise between the clinic-specific mean and the overall mean, based on the magnitude of the variance within a clinic relative to the total variance (between and within the clinics). Random effect estimates can be considered as the 'true' pregnancy rate coefficients, beyond natural variation.[8] In our study the sample size was large. Hence there were no large differences between the fixed and random effect analyses, and the ranking did not change.
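The shrinkage described above can be illustrated with a minimal sketch. It assumes the usual normal approximation in which the weight on the clinic-specific estimate is τ^{2} / (τ^{2} + s_{i}^{2}); the function name and the example estimates are hypothetical, and in practice the shrunken estimates are taken directly from the fitted random effect model.

```python
def shrink(estimate, overall_mean, tau2, s2):
    """Shrink a clinic-specific estimate toward the overall mean.

    tau2: between-clinic variance (heterogeneity), assumed known here.
    s2:   within-clinic variance of this clinic's estimate.
    """
    weight = tau2 / (tau2 + s2)  # weight given to the clinic's own estimate
    return overall_mean + weight * (estimate - overall_mean)


# Hypothetical clinics with the same raw estimate (log-odds scale)
# but different precision.
large_clinic = shrink(0.30, 0.0, tau2=0.08, s2=0.008)  # precise estimate
small_clinic = shrink(0.30, 0.0, tau2=0.08, s2=0.08)   # imprecise estimate

print(f"large clinic: {large_clinic:.3f}")
print(f"small clinic: {small_clinic:.3f}")
```

The imprecisely estimated clinic is pulled much further toward the overall mean than the precisely estimated one, which is why fixed and random effect analyses hardly differ when all clinics are large.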

We used R to calculate the random effect estimates. It is however not always straightforward to derive these and to select the correct information from the output of the various statistical programs. SPSS for example does not fit random effect logistic regression models. Other packages such as SAS and Stata do.

Simple rankings, whether based on fixed or random effect estimates of the pregnancy rate coefficients, disregard both the magnitude and the uncertainty of the differences between clinics. With the expected rank (ER), however, both are incorporated in the ranking. We see that the best clinic has an ER of 1.4 instead of 1, and the worst clinic has an ER of 11.9 instead of 13. We also see that some clinics perform in a very similar fashion, such as clinics D and E. The magnitude of the difference is disregarded in the simple ranking (clinic D rank 4, clinic E rank 5), but shown in the ER (clinic D ER 5.0, clinic E ER 5.1). We also see that there is more uncertainty about the performance of the smaller clinics, such as H and J. This uncertainty is disregarded in the simple rankings (clinic H rank 8, clinic J rank 10) but included in the ER (clinic H ER 8.5, clinic J ER 9.4). So the ER is much more subtle than the simple ranking. For ease of interpretation we calculated the percentile based on expected rank (PCER), which is independent of the number of clinics in the sample and indicates the probability that a clinic is worse than a randomly selected other clinic.
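As an illustration of how magnitude and uncertainty enter the ER, the following sketch computes the ER and PCER from clinic effects and their variances. The formulas are assumptions not stated in this section: ER_i = 1 + Σ_{j≠i} P(clinic j is better than clinic i), with each probability taken from a normal approximation, and PCER = 100 × (ER − 0.5) / N. A higher effect is assumed to mean a higher pregnancy rate, and rank 1 is the best clinic. All input values are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def expected_ranks(effects, variances):
    """Expected rank per clinic: 1 plus the summed probabilities
    that each other clinic is truly better (has a higher effect)."""
    n = len(effects)
    ers = []
    for i in range(n):
        er = 1.0
        for j in range(n):
            if j != i:
                sd = sqrt(variances[i] + variances[j])
                er += normal_cdf((effects[j] - effects[i]) / sd)
        ers.append(er)
    return ers


def pcer(er, n_clinics):
    """Percentile based on expected rank (assumed definition)."""
    return 100.0 * (er - 0.5) / n_clinics


# Hypothetical effects (log-odds scale) and their variances.
effects = [0.4, 0.0, -0.4]
variances = [0.01, 0.01, 0.01]

ers = expected_ranks(effects, variances)
for er in ers:
    print(f"ER = {er:.2f}, PCER = {pcer(er, len(ers)):.1f}")
```

Under these assumptions the ERs always sum to N(N + 1)/2, as the simple ranks do, but they are pulled toward the middle rank when the clinics are hard to distinguish.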

Approaches similar to the ER have been proposed by others. Lemmers et al. proposed a best and worst case ranking, and Spiegelhalter et al. proposed a 95% confidence interval around the ranks.[5, 6] Both methods provide three numbers: a 'point estimate' and the uncertainty around this estimate. The ER consists of only one number, which is an advantage in our view because it is generally easier to process one number than three. On the other hand, the ER does not show the amount of uncertainty, although it is included in the number. The ER also incorporates the magnitude of the differences between the clinics, which is not directly included in the approaches of Lemmers et al. and Spiegelhalter et al.

Another advantage of the ER is the intermediate step in its calculation: the probability that a certain clinic performs worse than another one, for all pairs of clinics. These individual comparisons can be very useful for couples when they want to choose an IVF clinic. For example, we saw that the probability that clinic G performs worse than clinic F is 54%. In decision making this probability can be weighed against possible advantages of clinic G, e.g. distance or familiarity with the hospital. Presentation in a table such as Table 2, however, is only feasible when the number of clinics to compare is small.
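A table of pairwise comparisons of this kind could be built as in the sketch below. It assumes the same normal approximation for the difference between two clinic effects; the clinic labels X, Y and Z and all numbers are hypothetical and do not correspond to the clinics in this study.

```python
from math import erf, sqrt


def prob_worse(effect_a, var_a, effect_b, var_b):
    """P(clinic A truly performs worse than clinic B), normal approximation.
    A higher effect is assumed to mean a higher pregnancy rate."""
    z = (effect_b - effect_a) / sqrt(var_a + var_b)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Hypothetical clinics: label -> (effect on log-odds scale, variance).
clinics = {"X": (0.10, 0.010), "Y": (0.08, 0.012), "Z": (-0.20, 0.015)}

# One entry per ordered pair of distinct clinics.
table = {
    (a, b): prob_worse(*clinics[a], *clinics[b])
    for a in clinics
    for b in clinics
    if a != b
}

for (a, b), p in table.items():
    print(f"P({a} worse than {b}) = {p:.2f}")
```

Two clinics with nearly equal effects give a probability close to 50%, which signals to the reader that other considerations may reasonably decide the choice.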

The rankability can be used as an indication of whether it makes sense to rank the clinics at all. In this study rankability was 0.9, which is substantial: 90% of the differences between the clinics can be attributed to true differences, and 10% to random variation. It is, however, a value judgment what can be considered a high rankability. We would suggest that a rankability above 0.7 is reasonable. The rankability is a function of the heterogeneity τ^{2} and the uncertainty of the individual center effects s_{i}^{2}. In this case the heterogeneity was considerable (τ^{2} = 0.08) and the uncertainty was small because of the large numbers per clinic (s_{i}^{2} = 0.008).
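The reported value can be checked with a short calculation. The sketch assumes that rankability is defined as τ^{2} / (τ^{2} + s^{2}), with s^{2} a single summary value (for example the median) of the clinic-specific variances; this formula is not given in this section, and the function name is hypothetical.

```python
def rankability(tau2, s2):
    """Share of the total variation attributed to true differences
    between clinics (assumed definition: tau2 / (tau2 + s2))."""
    return tau2 / (tau2 + s2)


# Values reported in this study.
rho = rankability(tau2=0.08, s2=0.008)
print(round(rho, 2))  # 0.91
```

Under this assumed definition the result agrees with the rankability of about 0.9 reported above.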

Because of the relatively small uncertainty in this example, the natural variation is limited and there were no large differences between the standard rankings and the Expected Ranks. This is a result of the data, not of the proposed method. Other datasets might therefore have demonstrated the features of the method better. We nevertheless used this dataset because it has been used before to present novel ranking methods.

For comparison, in a study of 10 centers in the Netherlands treating stroke patients, the unadjusted τ^{2} was 0.38, but rankability was only 0.55, due to small numbers. The ERs ranged from 1.7 to 8.6 instead of the original ranks 1–10, and six of the ten centers had an ER close to the median rank of 5.5. (Lingsma et al, How to compare center based on outcome: results from the Netherlands Stroke Survey, submitted).

Regarding the communication of performance measures, rankings may not be the ideal way to present the data in all their subtleties, but they are attractive to the press and the public. Our proposed measure, the Expected Rank, combines the attractiveness of a ranking (a single number and easy interpretation) with reliable analyses that do justice to the providers, and it also allows individual comparisons.