### Plausibility of Figure 1

We showed graphically that perfect correlation does not guarantee correct inference when a potential surrogate endpoint replaces a true endpoint. The underlying reason is that the line predicting the true endpoint from the potential surrogate endpoint can have a sufficiently different slope in each randomization group to make a substantial difference in the conclusion. In one possible scenario, the intervention reduces the observed value of the surrogate endpoint without affecting the true endpoint, thereby increasing the slope.

With binary outcomes, different slopes can readily arise because of unobserved heterogeneity in the potential surrogate endpoint. As an example, consider adenoma (yes or no) as a potential surrogate endpoint for the true outcome of colorectal cancer (yes or no). Two recent randomized trials [4, 5] showed that aspirin versus placebo lowers the risk of adenomas. Can one conclude that aspirin lowers the risk of colorectal cancer? An editorial on these trials [6] states "given the belief that the development of most colorectal cancers follows a sequence leading from adenoma to carcinoma, a clinical trial in which aspirin reduced the rate of recurrence of adenomas might make a compelling case for its effectiveness." However, we disagree (and the editorial later comes to a similar conclusion). Under the single pathway hypothesis, if the probability of adenoma is zero, the probability of colorectal cancer is zero regardless of the intervention (as there is no other way to get colorectal cancer). Thus, in terms of our graphic, the single pathway hypothesis implies that the intercepts of the lines for each group are 0, as in Figure 1. However, the slopes can differ substantially due to heterogeneity of adenomas, for example in a spectrum of histological types and sizes [7].

To better understand the role of heterogeneity, we follow Schatzkin and Gail [8], and suppose that there are two types of adenomas: "bad" adenomas that have the potential to develop into colorectal cancer and "innocent" adenomas that do not. Let π_{z} denote the probability an adenoma in randomization group *z* is "bad." Let φ_{z} denote the probability of colorectal cancer arising from a "bad" adenoma in randomization group *z*. Also let ω_{z} denote the probability of any adenoma in group *z*. The larger slope in the experimental than the control group in Figure 1 would occur if φ_{E}π_{E} > φ_{C}π_{C}. The leftward shift of the vertical line in Figure 1 would occur if ω_{E} < ω_{C}.
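The two conditions can be checked numerically. The sketch below assumes, as our reading of the zero-intercept lines under the single pathway hypothesis, that the probability of colorectal cancer in group *z* is the product ω_{z}π_{z}φ_{z}. All numeric values and the function name `true_endpoint_probability` are hypothetical and serve only to illustrate the geometry of Figure 1.

```python
# Two-type adenoma model under the single pathway hypothesis.
# All numeric values are hypothetical, chosen only for illustration.

def true_endpoint_probability(omega, pi, phi):
    """P(cancer) = P(any adenoma) * P(bad | adenoma) * P(cancer | bad adenoma)."""
    return omega * pi * phi

# Hypothetical parameters for the control (C) and experimental (E) groups.
control = {"omega": 0.40, "pi": 0.20, "phi": 0.25}
experimental = {"omega": 0.30, "pi": 0.40, "phi": 0.25}

# Slope of each line through the origin: phi_z * pi_z.
slope_c = control["phi"] * control["pi"]
slope_e = experimental["phi"] * experimental["pi"]

cancer_c = true_endpoint_probability(**control)
cancer_e = true_endpoint_probability(**experimental)

print(f"P(adenoma): C={control['omega']:.2f}  E={experimental['omega']:.2f}")
print(f"slope:      C={slope_c:.3f} E={slope_e:.3f}")
print(f"P(cancer):  C={cancer_c:.3f} E={cancer_e:.3f}")
```

With these assumed values the experimental group has fewer adenomas (ω_{E} < ω_{C}) but a steeper line (φ_{E}π_{E} > φ_{C}π_{C}), so the probability of cancer is higher in the experimental group even though the surrogate moved in the favorable direction.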

Such a situation is quite plausible, as illustrated in a randomized trial of finasteride versus placebo [9] where the potential surrogate endpoint was the probability of prostate cancer. A true definitive endpoint would be the probability of death from prostate cancer. In this trial, heterogeneity was observed in the form of high-grade prostate cancer versus other histological grades of prostate cancer. Relative to the placebo group, the finasteride group consisted of a greater fraction of men with high-grade prostate cancer (π_{E} > π_{C}) but a smaller fraction with any prostate cancer (ω_{E} < ω_{C}). Because individuals with high-grade prostate cancer generally have a greater risk of prostate cancer mortality, we have φ_{E} > φ_{C}. If the risk of prostate cancer mortality with other histological grades of prostate cancer is minimal, the situation is mathematically similar to the aforementioned hypothetical example with "bad" and "innocent" adenomas, except that the fraction "bad" is observed. There is a greater slope in the finasteride group (φ_{E}π_{E} > φ_{C}π_{C}) and a smaller fraction with the surrogate endpoint in the finasteride group (ω_{E} < ω_{C}), which corresponds to Figure 1.

### Graphical Representation of the Prentice Criterion

For valid hypothesis testing based on a surrogate endpoint that *replaces* a true endpoint, Prentice developed three criteria [10]. The major one, subsequently called the Prentice Criterion, is that the distribution of true endpoint given the potential surrogate endpoint does not depend on treatment group [10]. Our graphic shows that if the potential surrogate endpoint is a perfect correlate for a true endpoint (even if the slopes and intercepts of the lines were unknown) *and* if the Prentice Criterion holds, one would obtain the correct inference about the true endpoint based on the potential surrogate endpoint. Graphically, the Prentice Criterion implies that the lines for groups E and C coincide, so a decrease in the mean potential surrogate endpoint would always translate into a decrease in the mean true endpoint. Wang and Taylor [11] developed a similar graphic to help explain their proposed statistic, the proportion of treatment effect summarized by the potential surrogate, which indicates the appropriateness of the Prentice Criterion.
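The graphical statement can be written compactly. The notation here is an assumption introduced for illustration and does not appear in the original text: *T* is the true endpoint, *S* the potential surrogate endpoint, *Z* the randomization group, and α_{z}, β_{z} the intercept and slope of the line for group *z*, with the mean of *T* taken to be linear in *S*.

```latex
% Prentice Criterion: T is independent of Z given S
f(T \mid S, Z) = f(T \mid S)
\;\Longrightarrow\;
\alpha_E = \alpha_C = \alpha, \qquad \beta_E = \beta_C = \beta .

% Consequence for the difference in mean true endpoints
E[T \mid Z = E] - E[T \mid Z = C]
  = \beta \,\bigl( E[S \mid Z = E] - E[S \mid Z = C] \bigr).
```

Under these assumptions, when β > 0 the sign of the difference in the mean true endpoint matches the sign of the difference in the mean surrogate endpoint, whatever the unknown values of α and β.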

### Inference Without the Prentice Criterion

Other approaches to inference with surrogate endpoints involve *predicting* the true endpoint conditional on the surrogate endpoint (and using estimates based on data from previous studies). This use of potential surrogate endpoints to predict true endpoints differs from the use of auxiliary variables to predict the true endpoint. An auxiliary variable is a variable that occurs after randomization and before a true endpoint that is missing in *some but not all* subjects. (See [12] and references therein, which discuss the role of auxiliary variables in increasing efficiency or reducing bias.) In contrast, a potential surrogate endpoint occurs before a true endpoint that is missing in *all* subjects.

If one could accurately predict the slopes and intercepts of the lines in Figure 1 based on data from previous studies, one could obtain the correct inference even if the Prentice Criterion did not hold (i.e., even if the lines did not coincide). For example, in Figure 1, if the slopes and intercepts of the lines were accurately predicted, one could correctly predict that the experimental intervention increases the mean value of the true endpoint despite the decrease in the mean value of the potential surrogate endpoint (and in fact obtain estimates and confidence intervals for the predicted increase in the true endpoint). Unfortunately, this situation is rare. In practice, sufficiently accurate prediction of the lines based on previous data is difficult for two reasons: the estimated intercepts and slopes of each line are subject to sampling variability, and each previous study will likely generate a different line, even without sampling variability, due to differences in interventions. (In practice, however, the only relevant part of each line is its value at the mean of the surrogate endpoint in the new study.)
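To illustrate how sampling variability in the estimated lines can undermine such predictions, the following sketch simulates the predicted difference in the mean true endpoint when the slopes are known only up to a standard error. Everything in it is assumed for illustration: the slope estimates, their standard errors, the zero intercepts, the independent normal sampling distributions, and the function name `predicted_difference`.

```python
import random

def predicted_difference(slope_e, slope_c, mean_s_e, mean_s_c,
                         intercept_e=0.0, intercept_c=0.0):
    """Predicted mean true endpoint in group E minus that in group C."""
    return (intercept_e + slope_e * mean_s_e) - (intercept_c + slope_c * mean_s_c)

# Hypothetical slope estimates from previous studies, with standard errors.
slope_e_hat, se_e = 0.10, 0.04
slope_c_hat, se_c = 0.05, 0.02

# Hypothetical mean surrogate endpoints observed in the new trial.
mean_s_e, mean_s_c = 0.30, 0.40

point_estimate = predicted_difference(slope_e_hat, slope_c_hat, mean_s_e, mean_s_c)

# Propagate slope uncertainty by drawing from assumed independent normal
# sampling distributions (a simplification, not a recommended analysis).
rng = random.Random(0)
draws = [
    predicted_difference(rng.gauss(slope_e_hat, se_e),
                         rng.gauss(slope_c_hat, se_c),
                         mean_s_e, mean_s_c)
    for _ in range(10_000)
]
share_positive = sum(d > 0 for d in draws) / len(draws)

print(f"point estimate of predicted difference: {point_estimate:.3f}")
print(f"share of draws with a positive difference: {share_positive:.2f}")
```

In this hypothetical setting the point estimate indicates an increase in the true endpoint, yet a sizable share of the simulated draws has the opposite sign, so the direction of the predicted effect is not well determined.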

Another approach for predicting the true endpoint from a potential surrogate endpoint is the "meta-analytic" approach [13, 14]. The meta-analytic approach is not reflected in Figure 1 because it does not involve the distribution of the true outcome conditional on the potential surrogate endpoint (i.e., the slanted lines). Instead, each previous trial generates two regressions: one for the effect of intervention on the potential surrogate endpoint and one for the effect of intervention on the true endpoint. The coefficients for these two regressions are treated as random variables with a joint distribution over all trials. The estimated parameters from this joint distribution are used to predict the difference in mean true endpoints in a new trial given the mean values of the potential surrogate endpoints in the new trial. Unfortunately, sufficient data are rarely available to use this method routinely, and confidence intervals can be very wide due to between-study variation [13].
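The trial-level idea can be sketched in a deliberately simplified form. The code below is not the method of [13, 14], which models a joint random-effects distribution: it simply fits an ordinary least squares line relating the estimated effect on the true endpoint to the estimated effect on the surrogate across previous trials, ignoring within-trial estimation error. The five pairs of effect estimates, the new-trial value, and the helper `fit_line` are hypothetical.

```python
# Hypothetical trial-level effect estimates (experimental minus control)
# from five previous trials; values are illustrative only.
surrogate_effects = [-0.10, -0.05, -0.12, -0.02, -0.08]
true_effects = [-0.020, -0.004, -0.030, 0.006, -0.010]

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(surrogate_effects, true_effects)

# Residual spread around the fitted line: a crude stand-in for
# between-trial variation in the surrogate-to-true relationship.
residuals = [t - (intercept + slope * s)
             for s, t in zip(surrogate_effects, true_effects)]
residual_sd = (sum(r ** 2 for r in residuals) / (len(residuals) - 2)) ** 0.5

# Hypothetical surrogate effect observed in a new trial.
new_surrogate_effect = -0.07
predicted_true_effect = intercept + slope * new_surrogate_effect

print(f"intercept={intercept:.4f} slope={slope:.3f} residual SD={residual_sd:.4f}")
print(f"predicted effect on true endpoint: {predicted_true_effect:.4f}")
```

Even in this simplified form, the residual spread shows why prediction intervals for a new trial can be wide when only a handful of previous trials is available.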

A third approach for predicting true endpoints from a potential surrogate endpoint in a randomized trial is a counterfactual approach based on the potential surrogates that would occur if, contrary to fact, individuals were randomized to a different group [15]. Estimates come from a single previous study, but the approach could be extended to multiple previous studies. Because counterfactual outcomes are not observed, additional assumptions are needed for inference.

In all these approaches there is a fundamental assumption that the relationship between the potential surrogate and true endpoints in previous studies is very similar to the relationship in the new study under investigation. Besides accounting for the variability in this relationship (in addition to sampling variability), one needs to restrict the previous studies to those involving similar interventions although, as discussed below, that is not a guarantee of valid inference.

### Additional Caveats with Potential Surrogate Endpoints

The use of surrogate endpoints is particularly attractive for studies of complex chronic disease since occurrence of the true endpoint may take years. However, it is precisely because of the complexity of the diseases that assessment of potential surrogate endpoints is so difficult. There are likely to be multiple causal pathways to the true disease endpoint. Different interventions may exert their biologic effects on different pathways.

This is why it is particularly hazardous to use even an "established" surrogate endpoint (or a potential surrogate endpoint "validated" via multiple previous studies) for one class of drugs to assess another class of drugs. For example, the statin class of drugs lowers serum cholesterol and lowers cardiovascular event rates, including mortality. However, HRT with combined estrogen plus progestin lowers serum cholesterol but *increases* cardiovascular event rates. Presumably HRT exerts its harm via another (dominant) causal pathway, such as the induction of a hypercoagulable state in the coronary arteries. New molecular insights into pathogenesis suggest that cancer pathogenesis is at least as complex as this situation, involving numerous causal pathways.

Another cautionary note is important. If an intervention induces harmful side effects, it is risky to draw conclusions based only on inference about the true endpoint derived from the potential surrogate endpoint. There may be harms that occur after the time the potential surrogate endpoint is observed and that are not well predicted by the potential surrogate endpoint. Under these circumstances, even if the two lines in Figure 1 were superimposed, acceptance of the potential surrogate endpoint could still lead to harm.