We first review various approaches to measuring cost-effectiveness, including the ICER, the mean incremental net benefit, and the measure proposed by Willan [1]. We then contrast these measures and argue that Willan's proposal is of only secondary interest to a health care provider. All of the cost-effectiveness measures are in practice unknown parameters that must be estimated from data, and we next consider inference about these measures from both the frequentist and Bayesian approaches. Finally, a fundamental flaw in the estimator proposed by Willan [1] for his probability of cost-effectiveness is exposed.

### Measures of cost-effectiveness

We consider two competing treatments, drugs, or other health technologies, which we refer to as Treatment 1 and Treatment 2. Conventionally, Treatment 1 is the standard treatment, whereas Treatment 2 is a new or comparator treatment. In reality there will usually be far more than two competing treatments for any condition, but for the purpose of this article it is enough to consider, like Willan [1], just two treatments.

A little notation is necessary. Let *C*_{i} be the cost associated with an individual patient when given Treatment *i*, and let *E*_{i} be the value of an appropriate effectiveness measure associated with that patient when given Treatment *i*. Now it is important to recognise variation between patients. One patient will incur different costs and experience different effectiveness from another. Therefore, *C*_{i} and *E*_{i} are random quantities, which we interpret as the cost and effectiveness under Treatment *i* for an individual patient *randomly* drawn from the population of all patients under consideration. The probability distributions of these random quantities describe how they vary over the population.

In order to compare cost-effectiveness between the two treatments, we require a way to link costs to effectiveness, and this is done through a decision-maker's *willingness to pay* coefficient *K.* Formally, the decision-maker is prepared to pay *K* units of money to obtain one unit of effectiveness. Therefore, the *net benefit* of Treatment *i* for an individual (random) patient is

*B*_{i}(*K*) = *K E*_{i} - *C*_{i}.

This expresses net benefit on the monetary scale by converting the *E*_{i} units of effectiveness into *K E*_{i} units of money before subtracting the cost *C*_{i}. (We could equally express net benefit on the effectiveness scale as *E*_{i} - *C*_{i}/*K*, but the two approaches are clearly formally equivalent.) The notation also emphasises the dependence of the net benefit on the decision-maker's willingness to pay coefficient *K*.

Treatment 2 would be clearly more cost-effective than Treatment 1 for an individual (random) patient if *B*_{2}(*K*) > *B*_{1}(*K*). This can be expressed simply in terms of the *individual incremental net benefit* (individual INB)

*D*_{B}(*K*) = *B*_{2}(*K*) - *B*_{1}(*K*) = *K D*_{E} - *D*_{C},

where *D*_{E} = *E*_{2} - *E*_{1} and *D*_{C} = *C*_{2} - *C*_{1} are the increments in effectiveness and cost, respectively.
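As a minimal sketch of these definitions, the following Python snippet computes the net benefit under each treatment and the individual INB for a single hypothetical patient. The function names and all numerical values are illustrative assumptions rather than data from any trial. In practice a patient usually receives only one of the two treatments, so both outcomes are rarely observed together.

```python
def net_benefit(k, effect, cost):
    """Net benefit on the monetary scale: B_i(K) = K * E_i - C_i."""
    return k * effect - cost


def individual_inb(k, e1, c1, e2, c2):
    """Individual INB: D_B(K) = B_2(K) - B_1(K)."""
    return net_benefit(k, e2, c2) - net_benefit(k, e1, c1)


# Hypothetical values (assumptions for illustration only).
K = 20000.0             # willingness to pay per unit of effectiveness
e1, c1 = 0.60, 3000.0   # effectiveness and cost under Treatment 1
e2, c2 = 0.75, 5000.0   # effectiveness and cost under Treatment 2

d_b = individual_inb(K, e1, c1, e2, c2)

# The same quantity written in terms of the increments D_E and D_C.
d_e, d_c = e2 - e1, c2 - c1
d_b_from_increments = K * d_e - d_c

print(d_b, d_b_from_increments)
```

The two printed values agree up to floating-point rounding, which reflects the identity *D*_{B}(*K*) = *K D*_{E} - *D*_{C}.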

If all patients were the same, and experienced the same costs and effectiveness, then the individual INB would be the same for all patients, and could then be called *the* INB. Then the comparison of treatments would become trivial. The INB would quantify the gain (if positive) or loss (if negative) per patient that would result from switching from Treatment 1 to Treatment 2. Treatment 2 would clearly be more cost-effective than Treatment 1 if, and only if, the INB was positive.

However, patients will vary, and the consequence of this is that individual INB will vary between patients, and there is no single value to represent the comparison between the two treatments. Across the population, there is a probability distribution of individual INB.

The measures of cost-effectiveness that are in widespread use in health economics are based on the *mean* of this distribution. We denote the population *mean incremental net benefit* (mean INB) by Δ_{B}(*K*). The standard notation in probability theory for a mean or *expected* value is 𝔼(·), so the mean incremental net benefit is

Δ_{B}(*K*) = 𝔼(*D*_{B}(*K*)) = *K* Δ_{E} - Δ_{C},

where Δ_{E} = 𝔼(*D*_{E}) and Δ_{C} = 𝔼(*D*_{C}) are the population mean increments in effectiveness and cost. Then Treatment 2 is defined to be more cost-effective than Treatment 1, in terms of the population mean, if Δ_{B}(*K*) > 0.

The *incremental cost-effectiveness ratio* (ICER) can be expressed as

ρ = Δ_{C}/Δ_{E},

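and we can see that Δ_{B}(*K*) > 0, i.e. Treatment 2 is more cost-effective than Treatment 1, if ρ < *K* and Δ_{E} > 0, or if ρ > *K* and Δ_{E} < 0. The direction of the comparison reverses in the second case because dividing the inequality *K* Δ_{E} > Δ_{C} by a negative Δ_{E} reverses it.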
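The equivalence between the two decision rules can be checked numerically. The sketch below is illustrative only: the function names and the four (Δ_{E}, Δ_{C}) pairs are hypothetical, and the handling of Δ_{E} = 0 (where the ICER is undefined) is our own assumption, chosen so that it agrees with the sign of the mean INB.

```python
def mean_inb(k, delta_e, delta_c):
    """Mean INB: Delta_B(K) = K * Delta_E - Delta_C."""
    return k * delta_e - delta_c


def treatment_2_preferred_by_icer(k, delta_e, delta_c):
    """ICER rule; the comparison with K flips with the sign of Delta_E."""
    if delta_e == 0:
        # ICER undefined (assumption): the mean INB reduces to -Delta_C.
        return delta_c < 0
    rho = delta_c / delta_e
    return rho < k if delta_e > 0 else rho > k


K = 20000.0
# Hypothetical (Delta_E, Delta_C) pairs covering both signs of Delta_E.
cases = [(0.1, 1500.0), (0.1, 2500.0), (-0.1, -2500.0), (-0.1, -1500.0)]

results = []
for delta_e, delta_c in cases:
    by_inb = mean_inb(K, delta_e, delta_c) > 0
    by_icer = treatment_2_preferred_by_icer(K, delta_e, delta_c)
    results.append((by_inb, by_icer))
    print(delta_e, delta_c, by_inb, by_icer)
```

In every case the two rules agree, but the mean INB rule needs a single sign check whereas the ICER rule needs two.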

The probability of cost-effectiveness as proposed by Willan [1] is the probability that an individual (random) patient will have a positive individual INB. We can denote this by θ(*K*) = *Pr*(*D*_{B}(*K*) > 0). It can also be seen as the proportion of all patients in the population who have positive individual INBs.

Δ_{B}(*K*) and θ(*K*) are just two summary measures of the distribution of net benefit in the population. If the distribution is symmetric about its mean, as shown for instance in Figure 1, then the two measures will be in agreement, in the sense that Δ_{B}(*K*) will be positive if and only if θ(*K*) is greater than 0.5.

Thus, the distribution represented by the solid curve in Figure 1 has mean Δ_{B}(*K*) = 1.3 and θ(*K*) = 0.903, so that Treatment 2 is more cost-effective in terms of having a higher mean INB and the proportion of patients who will achieve a higher individual INB under Treatment 2 is 90.3%. Conversely, the distribution represented by the dashed curve has Δ_{B}(*K*) = -0.7 and θ(*K*) = 0.242, so the mean INB under Treatment 2 is now less than under Treatment 1, and only 24.2% of patients will obtain a higher individual INB under Treatment 2.
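The quoted figures are consistent with normal distributions having unit standard deviation. That is an inference from the numbers themselves, and we treat it here as an assumption rather than as a description of how Figure 1 was drawn. Under that assumption, a short check using only the Python standard library is:

```python
from statistics import NormalDist


def theta_normal(mean, sd):
    """Pr(D_B(K) > 0) when the individual INB is normal(mean, sd)."""
    return 1.0 - NormalDist(mean, sd).cdf(0.0)


# Assumed: both curves are normal with standard deviation 1.
theta_solid = theta_normal(1.3, 1.0)
theta_dashed = theta_normal(-0.7, 1.0)

print(round(theta_solid, 3), round(theta_dashed, 3))  # 0.903 0.242
```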

If, however, the distribution is not symmetric, then it is quite possible for the two measures to give apparently contradictory indications of relative cost-effectiveness. Figure 2 shows another two possible distributions. In the distribution shown as a solid line, Δ_{B}(*K*) = 0.2 and θ(*K*) = 0.414, so the mean INB is positive but only 41.4% of patients actually have a higher individual INB under Treatment 2. This is because those 41.4% include an appreciable proportion who obtain large positive individual INBs of 2 or more, whereas although the other 58.6% have negative individual INBs they never experience a value beyond -1. Conversely, in the distribution shown as a dashed line in Figure 2, Δ_{B}(*K*) = -1 and θ(*K*) = 0.682, so that the mean INB is negative but 68.2% of patients have a positive individual INB.
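The same disagreement can be reproduced with very simple asymmetric populations. The two discrete populations below are hypothetical and are not the distributions plotted in Figure 2; they are chosen only to show that the sign of the mean INB and the position of θ(*K*) relative to 0.5 can point in opposite directions.

```python
def summarise(population):
    """population: list of (proportion, individual INB) pairs summing to 1.

    Returns the mean INB and theta, the proportion with positive INB.
    """
    mean = sum(p * d for p, d in population)
    theta = sum(p for p, d in population if d > 0)
    return mean, theta


# Hypothetical: most patients lose a little, a minority gain a lot.
pop_a = [(0.6, -1.0), (0.4, 2.0)]
# Hypothetical: most patients gain a little, a minority lose a lot.
pop_b = [(0.7, 1.0), (0.3, -5.0)]

mean_a, theta_a = summarise(pop_a)
mean_b, theta_b = summarise(pop_b)

print(round(mean_a, 3), round(theta_a, 3))
print(round(mean_b, 3), round(theta_b, 3))
```

In the first population the mean INB is positive although fewer than half of the patients benefit; in the second the mean INB is negative although most patients benefit.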

### Which measure is best?

It is well-known in health economics that, from the perspective of a health care provider needing to decide which treatment to apply to the population of patients in their care, it is the mean cost and effectiveness over the whole population that matters [2]. This is because the decision is to apply to the whole population. The health care provider will have to pay a cost equal to the total of all the costs for individual patients under the chosen treatment, and when expressed on a per-patient basis this is the population mean cost. For a similar reason, the per-patient mean effectiveness under the chosen treatment measures the benefit that the health care provider obtains for that cost in terms of improved health for the patients in its care. If the health care provider's willingness to pay coefficient is *K*, then the appropriate measure of relative cost-effectiveness is the mean INB Δ_{B}(*K*), and the correct decision is to fund Treatment 2 if Δ_{B}(*K*) > 0 or Treatment 1 if Δ_{B}(*K*) < 0 [3].

As discussed in the previous section, this can be expressed in terms of comparing the ICER ρ with *K*, but that approach is more complex, since the comparison depends on the sign of Δ_{E}.

From the perspective of a health care provider, then, needing to make a decision between two treatments, the decision rests on mean INB, and in fact only on its *sign.* There is no role for Willan's θ(*K*). As we have seen in Figure 2, the wrong decision could be made if it were based on θ(*K*).

Willan [1] says, "The use of θ(*K*) should be helpful to policy-makers". We agree, in the sense that it does give extra information about the distribution of individual INBs in the population, but as such it is of secondary interest only. It should not be used as the basis of the actual decision. Nevertheless, we believe that in general an understanding of the distribution of individual INBs in the population is useful ancillary information that may be helpful to a decision-maker in the subsequent implementation of the decision.

The perspective of a health care provider is not necessarily the only one of interest. An individual clinician wishing to decide how to treat an individual patient may be willing to regard that patient as randomly drawn from a large population, and might be interested in θ(*K*). However, the situations shown in Figure 2 argue for caution. Consider for instance the dashed curve. The patient is substantially more likely to have a positive individual INB than a negative one, and this may seem to suggest prescribing Treatment 2. There is, however, a risk of a large negative INB, corresponding to the patient having a very much worse outcome with Treatment 2 than with Treatment 1. In our opinion, the mean INB is as relevant to an individual decision as to the group decision of a health care provider.

### Inference about cost-effectiveness

The measures of cost-effectiveness described in the preceding section are all unknown in practice because they depend on the unknown distribution of individual INBs for patients in the population. From the statistical point of view they are unknown parameters. In order to learn about them, we will need to obtain some relevant evidence. This might, for instance, as supposed in Willan [1], consist of observations of actual costs and effectiveness for a sample of patients in a clinical trial.

We then need to construct appropriate methods of statistical inference for parameters of interest, based on the data. There is a substantial literature on this topic. Based on data from a clinical trial, various authors have presented estimators and confidence intervals for the ICER [4–12], and comparable inferences for the mean INB [3, 13]. All of these references employ the frequentist approach to statistical inference. Analyses under a Bayesian approach have also been given [14–18]. The fact that the ICER is a ratio, together with the way its interpretation changes as the sign of Δ_{E} changes, means that inference about the mean INB is generally much more straightforward [17, 18].
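To illustrate why inference about the mean INB is comparatively simple, the sketch below estimates Δ_{B}(*K*) from two-arm data and attaches a large-sample normal confidence interval. It is a generic textbook construction under stated assumptions, not the method of any of the cited references. The function name and the data are hypothetical, and with samples this small the normal approximation would be poor in practice.

```python
from statistics import NormalDist, mean, variance


def mean_inb_estimate(k, arm1, arm2, level=0.95):
    """Estimate Delta_B(K) with a large-sample normal interval.

    arm1, arm2: lists of (effect, cost) pairs for patients given
    Treatment 1 and Treatment 2 respectively (independent samples).
    """
    # Per-patient net benefits; working with these directly carries
    # any within-patient cost/effect correlation along automatically.
    b1 = [k * e - c for e, c in arm1]
    b2 = [k * e - c for e, c in arm2]
    est = mean(b2) - mean(b1)
    se = (variance(b1) / len(b1) + variance(b2) / len(b2)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return est, (est - z * se, est + z * se)


# Hypothetical trial data: (effectiveness, cost) per patient.
arm1 = [(0.55, 2800.0), (0.62, 3100.0), (0.58, 2950.0), (0.60, 3300.0), (0.65, 2700.0)]
arm2 = [(0.70, 4800.0), (0.74, 5200.0), (0.66, 4600.0), (0.72, 5100.0), (0.68, 4900.0)]

est, (lower, upper) = mean_inb_estimate(20000.0, arm1, arm2)
print(round(est, 1), round(lower, 1), round(upper, 1))
```

No ratio is involved and no sign of Δ_{E} has to be tracked, in contrast to interval estimation for the ICER.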

Inference about the mean INB is generally presented by means of a *Cost-Effectiveness Acceptability Curve* (CEAC) [16–22]. As introduced by van Hout et al [19], the CEAC plots the probability that mean INB is positive against *K.* The value of such a graph lies partly in the difficulty of specifying *K* in practice. Decision-makers are generally reluctant to commit themselves to an explicit willingness to pay, and plotting against *K* allows them to assess the relative cost-effectiveness of the two treatments over a range of values of *K.* Strictly, the probability that Δ_{B}(*K*) is positive can only be a Bayesian inference, since only in the Bayesian approach is it possible to assign probability distributions to unknown parameters. The frequentist analogue is to consider the P-value of a significance test of the null hypothesis that Δ_{B}(*K*) < 0 [17, 23]. It is important to remember, as always, that the interpretation of a P-value for a null hypothesis is much less direct and meaningful than the Bayesian probability that the hypothesis is true.
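A CEAC in its Bayesian form can be sketched as follows, under the simplifying assumption that the posterior distribution of (Δ_{E}, Δ_{C}) is bivariate normal, so that Δ_{B}(*K*) = *K* Δ_{E} - Δ_{C} is itself normal for each *K*. The posterior means, standard deviations and correlation used here are hypothetical, and the function name `ceac` is our own.

```python
from statistics import NormalDist


def ceac(k, m_e, m_c, s_e, s_c, r=0.0):
    """Posterior Pr(Delta_B(K) > 0) under an assumed bivariate normal
    posterior for (Delta_E, Delta_C) with means m_e, m_c, standard
    deviations s_e, s_c and correlation r."""
    post_mean = k * m_e - m_c
    post_var = (k * s_e) ** 2 + s_c ** 2 - 2.0 * k * r * s_e * s_c
    return NormalDist().cdf(post_mean / post_var ** 0.5)


# Hypothetical posterior summaries (assumptions for illustration).
params = dict(m_e=0.1, m_c=2000.0, s_e=0.05, s_c=500.0, r=0.2)

# Evaluate the curve over a range of willingness-to-pay values.
curve = [(k, ceac(float(k), **params)) for k in range(0, 50001, 5000)]
for k, p in curve:
    print(k, round(p, 3))
```

Plotting the second column against the first gives the curve; a decision-maker can then read off the posterior probability at whatever values of *K* they regard as plausible.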

In its more natural Bayesian form, the CEAC states, for given *K*, the probability, *based on the available evidence*, that the true value of the unknown parameter Δ_{B}(*K*) is positive. It therefore states, for given *K*, the probability, *based on the available evidence*, that Treatment 2 is more cost-effective than Treatment 1, from the perspective of a health care provider needing to make a decision between the two treatments. In presenting the CEAC in practice, authors have tended to assert that the CEAC states, for given *K*, the probability that Treatment 2 is more cost-effective than Treatment 1, omitting to refer to the fact that this probability is based on available evidence, and omitting to state the decision context. Willan [1] objects to this presentation of the CEAC as giving 'the probability of cost-effectiveness'. He writes:

"The interpretation that the acceptability curve is the probability that the intervention is cost-effective is not entirely accurate and could easily be misunderstood by policy makers. Consider the situation in which the observed INB for treatment is very small, but due to a very large sample size the acceptability curve at the value of λ [our *K*] of interest is 0.99. Attaching the label "the probability that the intervention is cost-effective" to this quantity could mislead policy makers into thinking that treatment is highly beneficial compared to the standard. What, in fact, is high is our confidence that the INB, however small, is not zero."

We agree that to refer to the CEAC as simply 'the probability of cost-effectiveness', or 'the probability that Treatment 2 is more cost-effective than Treatment 1', is potentially misleading if its dependence on the available evidence and on the decision context is not clear. We advocate that the phrase 'based on available evidence' should be used to emphasise the first point, or for a technical audience the Bayesian formulation of 'the posterior probability of cost-effectiveness' would be appropriate. It might be helpful also to emphasise that we are judging cost-effectiveness from the perspective of a health care provider needing to decide between two treatments, although this context has been so pervasively adopted in health economics that we believe it can be taken as understood.

Willan proposes that θ(*K*) should more properly be called 'the probability of cost-effectiveness', but to use the phrase for θ(*K*) without further qualification would be *equally* misleading to policy makers. To parallel the above quotation from Willan [1], consider the situation in which the mean INB is positive but small, but due to there being very little between-patient variation we find θ(*K*) = 0.99. Attaching the label 'the probability that the intervention is cost-effective' to this quantity could mislead policy makers into thinking that treatment is highly beneficial compared to the standard. What, in fact, is high is the proportion of patients in the population for whom the individual INB, however small, is positive.

*Neither* measure asserts the degree to which one treatment is 'highly beneficial' compared to the other. *Both* are concerned only with the sign of INB. The CEAC gives the probability, based on available evidence, that the mean INB is positive, while θ(*K*) gives the probability that an individual INB is positive.

Willan [1] further objects to the fact that the CEAC changes as we get more information, because it is a statistical inference. As we get more evidence, our uncertainty about the sign of Δ_{B}(*K*) for a given value of *K* will decrease until we become certain either that Δ_{B}(*K*) is positive (whereupon the CEAC will tend to 1) or that it is negative (in which case the CEAC will tend to 0). This is entirely natural, and we do not understand this objection.
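This behaviour is easy to illustrate in an idealised setting. The sketch below assumes equal sample sizes in the two arms, a known common standard deviation of net benefit, a flat prior, and an observed mean INB that happens to equal a small true value; all of these are simplifying assumptions, and the numbers are hypothetical.

```python
from statistics import NormalDist


def prob_mean_inb_positive(delta, sd, n_per_arm):
    """Posterior Pr(Delta_B(K) > 0) under the idealised assumptions:
    observed mean INB equal to delta, known per-patient sd in each arm,
    n_per_arm patients in each arm, and a flat prior."""
    se = sd * (2.0 / n_per_arm) ** 0.5
    return NormalDist().cdf(delta / se)


sizes = [10, 100, 1000, 10000, 100000]
small_gain = [prob_mean_inb_positive(0.05, 1.0, n) for n in sizes]
small_loss = [prob_mean_inb_positive(-0.05, 1.0, n) for n in sizes]

for n, up, down in zip(sizes, small_gain, small_loss):
    print(n, round(up, 4), round(down, 4))
```

With a small positive mean INB the probability climbs towards 1 as the sample grows, and with a small negative mean INB it falls towards 0. In both cases the size of the mean INB is unchanged; only our certainty about its sign changes.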

Willan's θ(*K*) is a probability in a different sense, because it is a population parameter, not an inference about a population parameter. Inferences change as we get more data, while the true values of the underlying parameters remain fixed, but unknown. This does not make θ(*K*) in any sense a superior kind of probability. It happens that Willan is interested in inference about a parameter that can itself be considered as a probability (although we believe it would be more helpful to call it a proportion, i.e. the proportion of patients in the population with positive individual INBs). To make inference about it, he provides an estimator (although, as we shall see below, that estimator is logically flawed), but he could have considered calculating a P-value for the null hypothesis that θ(*K*) < 0.5. That would be analogous to the CEAC, and would change with the available data in the same way.

### Willan's estimator

Willan proposes an estimator of θ(*K*) based on data comprising observed costs and effectiveness values for a sample of *n*_{S} patients given the standard, treatment 1, and another sample of *n*_{T} patients given the intervention, treatment 2. Now since these data do not include any observations in which the same patient is given both treatments, it is completely impossible to learn the true value of θ(*K*), no matter how large *n*_{S} and *n*_{T} might be.

It is easy to demonstrate this impossibility with a simple example. Suppose that we have enormous samples such that we learn the true distribution in the population of costs and effects for treatment 1 and the true distribution of costs and effects for treatment 2. In particular, we will also learn the true distribution of net benefits *B*_{i}(*K*) for each treatment. Suppose that the value of *K* is given and that the distribution of the net benefit *B*_{1}(*K*) under treatment 1 is *N*(0,1) (i.e. normal with mean 0 and variance 1), while the distribution of net benefit *B*_{2}(*K*) under treatment 2 is *N*(1,1). With all this information we know these distributions, and so we know exactly that Δ_{B}(*K*) = 1. We therefore know with certainty that treatment 2 is more cost-effective than treatment 1 for a health care provider with the given value of *K*.

Even with all this information we do not know θ(*K*), because this depends on how correlated *B*_{1}(*K*) and *B*_{2}(*K*) are in the population. At one extreme, they might be perfectly positively correlated, such that for every individual in the population it is true that *B*_{2}(*K*) = *B*_{1}(*K*) + 1. Then θ(*K*) = 1, because the individual INB is positive for *every* patient. At the other extreme we might have perfect negative correlation, so that for every individual in the population we have *B*_{2}(*K*) = 1 - *B*_{1}(*K*). Then treatment 2 is more cost-effective for all those individuals for whom *B*_{1}(*K*) < 0.5. The proportion of such individuals in the population is 69.15%, and so θ(*K*) = 0.6915.

Willan's estimator effectively assumes that *B*_{1}(*K*) and *B*_{2}(*K*) are independent in the population, and for our example this implies that *D*_{B}(*K*) is distributed as *N*(1,2), with the result that θ(*K*) = 0.7602. The assumption is arbitrary and completely unsupported. Indeed one might imagine that in practice there would be quite strong correlations, on the basis that a patient who responds well to one treatment might respond relatively well to the other, and similarly for costs. But we reiterate that there is absolutely no evidence about this correlation in the data which Willan supposes are available. Indeed for most kinds of intervention it is impractical to test two treatments on the same patient, and even when this is possible we must expect the picture to be complicated by cross-over effects.
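The dependence of θ(*K*) on the unobservable correlation can be made explicit. The sketch below adds one assumption that the example itself does not require, namely that *B*_{1}(*K*) and *B*_{2}(*K*) are jointly normal with correlation `rho`; the individual INB is then normal with mean 1 and variance 2 - 2·`rho`. The function name and the grid of correlations are our own choices.

```python
from statistics import NormalDist


def theta_given_correlation(rho, mean1=0.0, mean2=1.0, sd1=1.0, sd2=1.0):
    """theta(K) = Pr(B_2 - B_1 > 0), assuming (B_1, B_2) is bivariate
    normal with the given marginals and correlation rho."""
    diff_mean = mean2 - mean1
    diff_var = sd1 ** 2 + sd2 ** 2 - 2.0 * rho * sd1 * sd2
    if diff_var <= 0.0:
        # Degenerate case (e.g. rho = 1 with equal sds): the individual
        # INB is the same constant for every patient.
        return 1.0 if diff_mean > 0 else 0.0
    return NormalDist().cdf(diff_mean / diff_var ** 0.5)


# Same marginal distributions throughout; only the correlation changes.
thetas = {rho: theta_given_correlation(rho) for rho in (-1.0, -0.5, 0.0, 0.5, 0.9, 1.0)}
for rho, theta in thetas.items():
    print(rho, round(theta, 4))
```

Every row is compatible with exactly the same two marginal distributions, and hence with exactly the same two-arm trial data, however large the trial.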

What Willan [1] actually estimates is the probability that a randomly chosen patient given treatment 2 will obtain a higher net benefit than *another* randomly chosen patient given treatment 1. This is an entirely different measure from θ(*K*) and we cannot imagine that it is of fundamental interest to any policy maker.
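A small simulation makes the distinction concrete. It uses a hypothetical population in which every patient gains exactly 1 from treatment 2, so that the true θ(*K*) is 1, and then compares patients across the two arms of a trial. The cross-arm proportion computed here is a generic stand-in for a quantity of this between-patient kind; it is not a reproduction of Willan's estimator, and the seed and sample size are arbitrary.

```python
import random
from bisect import bisect_left

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000               # patients per arm (arbitrary)

# Hypothetical population: B_2 = B_1 + 1 for every individual,
# so every individual INB is positive and the true theta(K) is 1.
population_b1 = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
true_theta = sum((b1 + 1.0) > b1 for b1 in population_b1) / len(population_b1)

# A two-arm trial sees B_1 for one half and B_2 for the other half,
# never both for the same patient.
arm1 = sorted(population_b1[:n])                # B_1 observed in arm 1
arm2 = [b1 + 1.0 for b1 in population_b1[n:]]   # B_2 observed in arm 2

# Proportion of cross-arm pairs in which the arm 2 patient's B_2
# exceeds the arm 1 patient's B_1 (bisect counts arm 1 values below b2).
wins = sum(bisect_left(arm1, b2) for b2 in arm2)
cross_arm_prob = wins / (n * n)

print(true_theta, round(cross_arm_prob, 3))
```

The cross-arm proportion stays well below 1 even though every single patient in this population would do better on treatment 2, because it compares different patients rather than the same patient under both treatments.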