We have described how the measures of predictive ability, reclassification and clinical utility used to assess a new predictor in a model depend upon the mean risk of the population. We have also demonstrated that the reclassification measures exhibit a different relationship with the mean risk than the clinical utility measures. The continuous NRI increases with increasing mean risk, and the categorical NRI with two cutpoints often peaks at two points, whereas the Net Benefit and EFLY peak once, close to the cutpoint, and then generally decrease to zero as the mean risk increases.

In the Framingham Study the mean risk of CVD was higher for men than for women, and also closer to the upper cutpoint of 20%. Based on this, we might have expected the measures of predictive ability, reclassification and clinical utility to be higher among men. However, the hazard ratio for systolic blood pressure when it was added to the model was higher for women than for men, and this compensated for the lower mean risk among women. In a recent review of several new predictors of cardiovascular disease, Paynter and colleagues have also highlighted that results may differ between men and women due to differences in the effect sizes of new predictors as well as the strength of the baseline model and the mean risk in the study sample [31].

In our simulations we observed that as the mean risk increased, the NRI(>0) and the change in the c-statistic also increased. In the paper that introduced the NRI(>0), Pencina suggested that one of the benefits of this measure was that it was not affected by the event rates in the population [24]. Our simulations, where we assumed a constant hazard rate, indicate that the NRI(>0) increases as the event rate (the mean risk) increases for event rates above the cutpoint. The NRI(>0), as with the change in the c-statistic, is unaffected by event rates only if the odds ratio does not vary. However, as we have demonstrated, if the hazard ratio is assumed to be the same in populations with different event rates (a common assumption in cohort studies of cardiovascular outcomes), then the NRI(>0) will increase with increasing event rate.
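The distinction between a constant hazard ratio and a constant odds ratio can be illustrated with a minimal sketch. It assumes proportional hazards over a fixed follow-up period, so that a baseline risk p becomes 1 - (1 - p)^HR when the hazard is multiplied by HR. The function names and the example baseline risks are hypothetical and are not taken from our simulations.

```python
def risk_with_hazard_ratio(baseline_risk, hazard_ratio):
    """Risk over the same period when the baseline hazard is multiplied
    by hazard_ratio (assumes proportional hazards)."""
    return 1.0 - (1.0 - baseline_risk) ** hazard_ratio


def implied_odds_ratio(baseline_risk, hazard_ratio):
    """Odds ratio implied by a fixed hazard ratio at a given baseline risk."""
    exposed_risk = risk_with_hazard_ratio(baseline_risk, hazard_ratio)
    odds_exposed = exposed_risk / (1.0 - exposed_risk)
    odds_baseline = baseline_risk / (1.0 - baseline_risk)
    return odds_exposed / odds_baseline


# Illustrative baseline risks only: with the hazard ratio held at 2,
# the implied odds ratio grows as the baseline risk grows.
for baseline_risk in (0.05, 0.10, 0.20, 0.30):
    print(baseline_risk, round(implied_odds_ratio(baseline_risk, 2.0), 3))
```

Under these assumptions the implied odds ratio is close to the hazard ratio when events are rare and moves further from it as the event rate rises, which is consistent with measures that depend on the odds ratio changing with the event rate.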

In our simulations, when the mean risk in the population was less than the cutpoint, the measures of reclassification and clinical utility were generally consistent with each other and increased as the mean risk increased. However, beyond this cutpoint the measures diverged. The reclassification measures continued to increase while the clinical utility measures decreased, although the binary NRI and the NRI with two cutpoints did eventually decrease. Similar patterns were also observed by Van Calster and others when they varied the cutpoint and assumed a fixed mean risk; as the cutpoint moved away from the mean risk, the reclassification measures provided a more optimistic view of the new predictor than that provided by the difference in Net Benefit [16].

The clinical utility measures, difference in EFLY and difference in Net Benefit, achieved a maximum value at approximately the point where the threshold for treatment equaled the mean risk in the population, as expected [32]. However, we observed a divergence in the clinical utility measures in our simulations as the mean risk increased. This is attributable to differences between the two measures in terms of how benefits and costs are counted and the weights given to benefits and costs in populations with different mean risks.

When a new predictor is added to a model, the difference in EFLY is measured in terms of event free life years. An event free life year gained has the same value whether it occurs in a high risk or low risk population. In contrast, the difference in Net Benefit is measured in units of true positives, adjusted for false positives, with the weighting of false positives relative to true positives determined by the cutpoint defining high risk. However, the actual value of a true positive will differ in populations with different mean risks since the number of event free life years gained will be greater for an individual from a high risk population compared to a low risk population. Also, a false positive will have a greater cost in a low risk population than a high risk population as the survival time, and hence, treatment time, will be greater.
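The weighting of false positives relative to true positives can be made concrete with a minimal sketch of the standard Net Benefit calculation, in which false positives are weighted by the odds of the cutpoint, t / (1 - t). The sketch assumes a binary outcome with no censoring and treats a predicted risk at or above the cutpoint as high risk; the function names and the toy data are hypothetical and serve only as illustration.

```python
def net_benefit(true_positives, false_positives, n, threshold):
    """Net Benefit: true positives per person minus false positives per
    person, weighted by the odds of the risk threshold."""
    weight = threshold / (1.0 - threshold)
    return true_positives / n - weight * false_positives / n


def classify_counts(risks, events, threshold):
    """Count true and false positives when risk >= threshold is 'high risk'."""
    tp = sum(1 for r, e in zip(risks, events) if r >= threshold and e == 1)
    fp = sum(1 for r, e in zip(risks, events) if r >= threshold and e == 0)
    return tp, fp


def difference_in_net_benefit(old_risks, new_risks, events, threshold):
    """Net Benefit of the model with the new predictor minus that without it."""
    n = len(events)
    tp_old, fp_old = classify_counts(old_risks, events, threshold)
    tp_new, fp_new = classify_counts(new_risks, events, threshold)
    return (net_benefit(tp_new, fp_new, n, threshold)
            - net_benefit(tp_old, fp_old, n, threshold))


# Made-up data for five individuals (1 = event, 0 = no event).
events = [1, 1, 0, 0, 0]
old_risks = [0.25, 0.15, 0.22, 0.05, 0.10]
new_risks = [0.30, 0.21, 0.18, 0.04, 0.12]
print(difference_in_net_benefit(old_risks, new_risks, events, threshold=0.20))
```

The weight depends only on the cutpoint, not on the mean risk of the population, which is why a true positive or a false positive receives the same value in this calculation regardless of how many event free life years it represents.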

Although there are issues in using the Net Benefit when accounting for costs and benefits over a specific time period, there are also issues in the calculation of costs and benefits for the EFLY. Possible heterogeneity in treatment effects across patient subgroups is not accounted for in the EFLY. Also, the calculation of the EFLY assumes that the chosen cutpoint is the ‘optimal’ cutpoint in that costs equal benefits at this point; the cost of treatment, in terms of event free life years, is then calculated based on this assumption. Rapsomaniki and colleagues acknowledge that many factors, other than the costs and benefits they account for in their EFLY calculations, are considered when a particular cutpoint is chosen [6]. However, their assumption avoids the problem of an irrational choice of cutpoint resulting in a poorer model being favoured [6].

In previous papers the relationship between choice of cutpoint and the measures of reclassification and the difference in Net Benefit has been described when the mean risk in the population is fixed [14, 16, 33]. We observed similar results when the mean risk in the population varies but the cutpoint is fixed. The scenario we have described is the one more commonly encountered in the evaluation of new predictors of cardiovascular events. For example, the Emerging Risk Factors Collaboration (ERFC) brings together several cohort studies from the same country which have different mean risks but where the same guidelines and cutpoints for defining high risk would apply. As each of the measures we have examined is in some way affected by the mean risk in the study population, this must be taken into account when comparisons are made between different studies whose mean risk varies, or when the mean risk in the study population differs from that of the population in which a new predictor will ultimately be implemented.

A number of methods have been proposed to allow for these differences. Where the study data arise from a matched case-control study, Pepe has proposed a method for calculating an adjusted c-statistic that takes into account the greater similarity in risk between cases and controls that arises from matching [34]. The ERFC applied age-sex specific measures of reclassification observed in their study population to the standard European population to estimate the amount of reclassification that would occur in this standard population [35, 36]. However, this relies upon having a large enough study population to provide reliable estimates of reclassification in each age-sex stratum. If the data arise from a case-control study, Rousson suggests reweighting the proportions of cases and controls to match the proportions found in the parent population [37].
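The idea of applying stratum-specific measures to a standard population can be sketched as a weighted average, in the spirit of direct standardisation. This is a simplified illustration and not the ERFC's exact procedure; the function name, stratum labels, values and weights are all hypothetical.

```python
def standardised_measure(stratum_values, standard_weights):
    """Weighted average of stratum-specific values, weighted by each
    stratum's share of a standard population."""
    missing = [s for s in stratum_values if s not in standard_weights]
    if missing:
        raise KeyError(f"no standard weight for strata: {missing}")
    total_weight = sum(standard_weights[s] for s in stratum_values)
    weighted_sum = sum(stratum_values[s] * standard_weights[s]
                       for s in stratum_values)
    return weighted_sum / total_weight


# Made-up stratum-specific reclassification values and standard
# population sizes, for illustration only.
stratum_nri = {"men 40-59": 0.02, "women 40-59": 0.04}
standard_population = {"men 40-59": 3000, "women 40-59": 1000}
print(standardised_measure(stratum_nri, standard_population))
```

A sketch of this kind also makes the limitation noted above visible: each stratum-specific value enters the average directly, so an unreliable estimate in a small stratum carries through to the standardised result in proportion to that stratum's weight.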