Royston & Sauerbrei D measure
There are various discrimination based measures of prognostic ability available for models of time-to-event data. The measure we have chosen to develop our calculations is Royston and Sauerbrei’s D measure [19], which has been shown to have many good properties which are described below [20]. The most commonly used measure of prognostic ability is probably Harrell’s c index [21], however this measure has some disadvantages: it is affected by censoring [22] and has a scale which can be difficult to interpret. Acknowledging the popularity and prevalence of the c index in the literature, we do consider the relationship between c and D to ensure our methods are more widely usable (see Section Appendix: simulation studies to test sample size calculations).
D measures prognostic ability by quantifying the separation in observed survival curves between subgroups of patients with differing predicted risks. D was developed in the Cox model framework and is based on risk ordering; thus D can be calculated whether the prognostic tool outputs a continuous prognostic index, prognostic groups, or is even a subjective rule. However, it is assumed that the prognostic index resulting from the model is Normally distributed (although this is an approximation in the case of a non-continuous prognostic index). The full derivation of D can be found in Royston and Sauerbrei’s original paper [19], but briefly:
$$D=\kappa \sigma^{\ast }, $$
where σ∗ is an estimate of the standard deviation of the prognostic index values (under the assumption of Normality) and \(\kappa = \sqrt {8/\pi }\simeq 1.60\), a constant used to give a direct interpretation to D, as follows.
D has an intuitively appealing interpretation as the log hazard ratio between two equal-sized prognostic groups formed by dichotomising the prognostic index at its median. D’s interpretation as a log hazard ratio means that it can be translated to a hazard ratio between equally sized prognostic groups; so a D of 1 corresponds to a hazard ratio of e1=2.7 and D=2 to e2=7.4. This allows researchers familiar with hazard ratios of treatment effects (for example) to have some idea of the increase in risk across the prognostic index of the model for a particular value of D. As a log hazard ratio, D can theoretically take any value in the range (−∞,∞), but in real situations it is likely to be much closer to zero. A literature search for published values of D in a wide range of disease areas found that the highest value out of 101 reported was 3.44; the second highest was 2.8 [23]. D=0 implies that the selected model is useless for prediction, and D<0 may arise when a model fitted to one dataset is validated on another, indicating that the original model was flawed in some way. Additionally, D has a functional relationship with a measure of explained variation \({R_{D}^{2}}\) [19]. This relationship is important as most researchers will be more familiar with the 0–100 % range of R2 in linear regression.
As well as its interpretability and applicability to many types of prognostic model, D has many other properties which make it suitable for practical use. These include robustness to outliers, sensitivity to risk ordering, independence from censoring (provided the prognostic model has been correctly specified and the PI is approximately normally distributed), and an easily calculated standard error [19]. Also, since it takes into account the fit of the model to the outcome data, it can be used in a model validation context; a vital part of a good prognostic study. Working with \({R_{D}^{2}}\), Choodari-Oskooei et al. [20] found that it was sensitive to marked non-normality of the prognostic index, but despite this concluded that overall it was one of two best explained variation measures for quantifying predictive ability (along with Kent and O’Quigley’s \(R_{\textit {PM}}^{2}\) [24]). D and \({R_{D}^{2}}\) can be calculated in Stata using the user-written str2d command [25].
Sample size calculations
Introduction
To develop the required calculations we start from the results in Armitage and Berry’s book [26] (p186) for comparison of the means of two independent groups, with equal within-group variance. In this normal errors case, we consider two means \(\overline {x}_{1}\) and \(\overline {x}_{2}\) measured in populations of size n1 and n2 respectively, where s2 is the within-group variance of the response variable in both populations. The standard error of the difference in \(\overline {x}_{1}\) and \(\overline {x}_{2}\) is given by
$$SE(\overline{x}_{1}-\overline{x}_{2})=\sqrt{s^{2}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right) }. $$
From this, various sample sizes can be calculated. If n1, \(\overline {x}_{1}\) and s2 are known, and it is desired that a difference of \(\overline {x}_{1}-\overline {x}_{2}=\delta \) will be just significant at the required two-sided α level with power 1−β, then the sample size required in the second population is
$$\begin{array}{*{20}l} n_{2}=s^{2}\left[\left(\frac{\delta}{z_{1-\alpha/2 }+z_{1-\beta}}\right)^{2} -\frac{s^{2}}{n_{1}}\right]^{-1}, \end{array} $$
((1))
where z
x
is the x-quantile of the standard normal distribution.
We can also calculate sample size in a different way, basing it instead on the confidence interval of the estimated quantity δ. In order that the new estimate of \(\overline {x}_{2}\) will have a 100(1−α) % confidence interval of half width w, the sample size required is
$$\begin{array}{*{20}l} n_{2}=s^{2}\left[\left(\frac{w}{z_{1-\alpha }}\right)^{2} -\frac{s^{2}}{n_{1}}\right]^{-1}. \end{array} $$
((2))
We can work from the same ideas to develop sample size calculations based on D, as this quantity is also normally distributed [23]. Consider the scenario where estimates of D and SE(D) are available from a previous study using the same model, and researchers wish to validate the estimate of D for the model in a new study. Let D1 be the value of D in the first study, \({\sigma _{1}^{2}}\) the variance of D1, and e1 the number of events in the first study. Let D2 be the D value in the (proposed) second study with e2 events, and \({\sigma _{2}^{2}}=var(D_{2})\). The standard error of D1−D2 is thus \(\sqrt {{\sigma _{1}^{2}}+{\sigma _{2}^{2}}}\). As this does not explicitly include e1 and e2 we must make an assumption about the relationship between the variance of D and the number of events in the study in order to obtain sample size calculations.
The quantity λ
To develop the calculations required, we make a proportionality assumption. This is that for a given model with a certain ‘true’ value of D, the ratio of the variances \({\sigma _{1}^{2}}\), \({\sigma _{2}^{2}}\) of D in two datasets with differing numbers e1, e2 of events (but sampled from the same distribution of covariates) equals the reciprocal of the ratio of the corresponding numbers of events:
$$\frac{{\sigma_{1}^{2}}}{{\sigma_{2}^{2}}}=\frac{e_{2}}{e_{1}^{{}}}. $$
This is reasonable, since the variance of a statistic is inversely related to the information in the data, which in a censored time-to-event sample is plausibly represented by the number of events [27]. We have shown through simulation and resampling that this assumption does hold reasonably well; and the larger the dataset, the better it holds (see [23], Tables 4.1 – 4.2).
Under the proportionality assumption we can write \(e_{1}{\sigma _{1}^{2}}=e_{2}{\sigma _{2}^{2}}=\lambda \), where λ is a model- and disease-specific structural constant which is incorporated in our calculations. We can either estimate λ by its value in a previous study (termed λ
s
), or use an approximation incorporating a value of D and the proportion of censoring (cens) in the dataset:
$$ \lambda_{m}=c_{0}+c_{1}D^{1.9}+c_{2}(D\cdot\text{cens})^{1.3}, $$
((3))
where c0=2.66, c1=1.26, and c2=−1.65. This model was developed from simulated data and found to be reasonably accurate (see [23], Section 4.7.5).
Although our findings regarding λ are approximations, this seems a reasonable price to pay when first constructing a new method of planning prognostic studies. Prospective sample size calculations are by definition based on ‘guesstimated’ parameters, and these are not always checked post hoc, so in this respect we feel that the approximations made above are not inappropriate.
A note on the standard error of D
We have found that the default estimate of the standard error of D output by the str2d Stata command tends to underestimate the true value (see [23], Section 3.3 for full details). The negative bias increases the higher D is; for example, when D=0.8 simulation studies using different combinations of dataset size and proportion of censoring showed that the relative bias varied between 0 and –8 %, whereas when D=3.2, it varied from –17 % to –24 %. As an estimate of the standard error of D is required to obtain λ
s
, a downward bias in this quantity could reduce the required sample size and lead to underpowering.
We have found that using bootstrapping with 500 replications to obtain the standard error reduces the bias greatly; we observed a relative bias of –2 % (on average) with the bootstrap estimator when D=3.2 compared to -20 % using the default method [23]. The str2d command has a bootstrap standard error option and we recommend researchers use this method instead of the default estimate when calculating λ
s
, particularly when D≥2.
Obtaining the sample size calculations
By applying this proportionality assumption, we can now write the standard error of D1−D2 as \(\sqrt {{\sigma _{1}^{2}}+\lambda /e_{2}}\) which, using the same rearrangement as above, leads us to the following two calculations. Firstly, to detect a difference in D between the first and second studies of δ with significance level α and power 1−β:
$$\begin{array}{*{20}l} e_{2}& =\lambda\left[ \left(\frac{\delta }{zz}\right)^{2}-{\sigma_{1}^{2}}\right]^{-1}, \end{array} $$
((A))
where zz=z1−α/2+z1−β for a two-sided (superiority) α and zz=z1−α+z1−β for a one-sided (non-inferiority) test. Secondly, in order that the estimate of D1−D2 has a 100(1−α) % confidence interval of half width w
$$\begin{array}{*{20}l} e_{2}& =\lambda\left[ \left(\frac{w }{z_{1-\alpha }}\right)^{2}-{\sigma_{1}^{2}}\right]^{-1}. \end{array} $$
((C))
By comparing (A) and (C) with (1) and (2) we can see there is an analogy between the common within-sample variance s2 and the quantity λ.
Note that unlike in typical sample size calculations, here the value of \({\sigma _{1}^{2}}\) is available from the first study. Since e2 must be positive, this places a lower limit on δ and w for these calculations: δ>σ1zz, and w>σ1z1−α. Having calculated minimum δ for various datasets, we feel that in general (A) and (C) are not very useful in practice and so do not consider them further. Instead we develop slightly different calculations which are described below.
Significance based calculations
Instead of estimating a value of D1 and its standard error from a previous study, we pick a fixed target value of D that we call D∗ and assume this has zero uncertainty; so \({\sigma _{1}^{2}}=0\). Thus (A) becomes
$$ e_{2}=\lambda \left(\frac{\delta }{zz}\right)^{-2} $$
((B))
We further obtain two calculations from (B) which are defined by how λ is estimated. Substituting λ
s
into (B) gives us (B1), while substituting λ
m
gives us (B2):
$$ e_{2}=\lambda_{s}\left(\frac{\delta }{zz}\right)^{-2} $$
((B1))
$$ e_{2}=\lambda_{m}\left(\frac{\delta }{zz}\right)^{-2}. $$
((B2))
For a one-sided test H0:D∗−D2≥δ and H
A
:D∗−D2≤δ. For a two-sided test H0:D∗−D2=δ and H
A
:D∗−D2≠δ. If a previous study does exist, then either (B1) or (B2) can be used. If no previous study exists, then (B1) cannot be used as λ
s
cannot be calculated. When using (B1) and (B2) δ has a lower bound of zero.
One major benefit of using λ
m
is that using this approximation, different values of D and cens can be input which enables calculation of a range of sample sizes. This may be helpful in study planning where the value of D and likely censoring proportion in the new study is uncertain.
Confidence interval based calculations
We can alter calculation (C) under the same assumption of a fixed target D which we call D∗, as for (B1) and (B2). The confidence interval is thus around the quantity δ=D∗−D2. However, as D∗ is assumed to have zero variance, var(D∗−D2)=var(D2); so the width of CI for δ=D∗−D2 is equivalent to the width of CI for D2 only.
Thus to estimate D in a new study with a confidence interval of half width w, we replace \({\sigma _{1}^{2}}\) with 0 in calculation (C), so the number of events required is
$$ e_{2}=\lambda\left(\frac{w}{z_{1-\alpha /2}}\right)^{-2} $$
((D))
Again substituting either λ
s
or λ
m
we get
$$ e_{2}=\lambda_{s}\left(\frac{w}{z_{1-\alpha /2}}\right)^{-2} $$
((D1))
$$ e_{2}=\lambda_{m}\left(\frac{w}{z_{1-\alpha /2}}\right)^{-2} $$
((D2))
The only limit on w when using calculations (D1) and (D2) is that it must be >0.
Note that Eqs. (D) and (B) are equivalent if the power in (B) is 50 % and α is two-sided. So, for example, a study designed to estimate D with a 95 % confidence interval of half width 0.2 requires the same number of patients as a study designed such that a difference in D of δ=0.2 from the target value is significant at the (two-sided) 5 % level with 50 % power.