### The proportional hazards measure

Numerous summary measures for a pair of specificity and sensitivity have been suggested: we mention here the Youden index, {J}_{i}={p}_{i}+{q}_{i}-1 [10], and the squared Euclidean distance to the upper left corner in the SROC diagram, {E}_{i}={(1-{p}_{i})}^{2}+{(1-{q}_{i})}^{2}. A review of summary measures is given in Liu [20]. Using an average over any of these measures might be problematic: not only might sensitivities and specificities be heterogeneous, this might also be true for the associated summary measures such as the Youden index or the Euclidean distance (as demonstrated in Figure 2 using the data of the meta-analysis of BNP and heart failure).
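As a quick numerical sketch (in Python, with hypothetical sensitivity/specificity pairs), the two summary measures can be computed as follows:

```python
# Youden index J = p + q - 1 and squared Euclidean distance to the
# upper-left ROC corner E = (1-p)^2 + (1-q)^2, for sensitivity p and
# specificity q.  The pairs below are hypothetical illustration values.
def youden(p, q):
    return p + q - 1

def euclid2(p, q):
    return (1 - p) ** 2 + (1 - q) ** 2

for p, q in [(0.90, 0.80), (0.95, 0.60), (0.70, 0.95)]:
    print(f"p={p:.2f} q={q:.2f}  J={youden(p, q):.2f}  E={euclid2(p, q):.4f}")
```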

We suggest using the measure \theta =\frac{logp}{log(1-q)}, which relates the log-sensitivity to the log-false positive rate; we call it the *proportional hazards (PH)* measure. In Figure 3 we see that this measure shows reduced variability for the meta-analysis of BNP and heart failure, making it more suitable as an overall measure in the meta-analysis of diagnostic studies or diagnostic problems. While the measure may appear to be just another summary measure of the pair of sensitivity and specificity, it has a specific SROC-modelling background and motivation. We have mentioned previously the cut-off value problem: observed heterogeneity might be induced by cut-off value variation, which can lead to different sensitivities and specificities – despite the accuracy of the diagnostic test itself being unchanged – and might also induce heterogeneity in the summary measure. Hence, it is unclear whether the observed heterogeneity is due to heterogeneity in the diagnostic accuracy (authentic heterogeneity) or whether it has occurred due to cut-off value variation (artificial heterogeneity). This second form of heterogeneity can also occur when the background population changes with the study.

One of the features of the SROC approach is that it incorporates the cut-off value variation in a natural way; hence a measure modelling an ROC curve is favorable. We suggest the PH measure based upon the Lehmann family in the following way:

p={(1-q)}^{\theta}

(1)

This model was suggested by Le [21] for the ROC curve. It is an appropriate model since, for feasible *q*, (1−*q*)^{θ} is also feasible as long as *θ* is positive. Note that (1) is defined for all values of *p*∈[0,1] and *q*∈[0,1], whereas \theta =\frac{logp}{log(1-q)} is only defined for *p*∈(0,1) and *q*∈(0,1). Population values of sensitivity and specificity of 1 are rarely realistic, although observed values of 1 for sensitivity and specificity do occur in samples. This can be coped with by using an appropriate smoothing constant, such as estimating specificity as ({n}_{i}-1)/{n}_{i} when {x}_{i}={n}_{i} and sensitivity as ({m}_{i}-1)/{m}_{i} when {y}_{i}={m}_{i}.
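As a minimal sketch (in Python, with hypothetical 2×2 counts), the PH measure with the smoothing rule just described can be computed as:

```python
import math

def ph_measure(y, m, x, n):
    """PH measure theta-hat = log(p-hat) / log(1 - q-hat), where
    p-hat = y/m is the observed sensitivity and q-hat = x/n the
    observed specificity.  Observed values of 1 are smoothed as
    described above: y/m -> (m-1)/m when y == m, and x/n -> (n-1)/n
    when x == n, so the logs stay finite."""
    p = (m - 1) / m if y == m else y / m   # smoothed sensitivity
    q = (n - 1) / n if x == n else x / n   # smoothed specificity
    return math.log(p) / math.log(1 - q)

# hypothetical counts: 85/100 diseased test positive,
# 90/100 non-diseased test negative
print(ph_measure(85, 100, 90, 100))
```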

In Figure 4 we see a number of examples of the proportional hazards family. It becomes clear now why *θ* is called the proportional hazards measure. Taking logarithms on both sides of (1) we obtain

\theta =\frac{logp(t)}{log[1-q(t)]},

(2)

meaning that, if model (1) holds, the ratio of log-sensitivity to log-false positive rate is constant across the range of possible cut-off value choices *t*. Hence the name proportional hazards model, which was suggested in a paper by Le [21] and used again in Gönen and Heller [22]. The idea of representing an entire ROC curve by a *single* measure is illustrated in Figure 5. While sensitivity and specificity vary over the entire interval (0,1), the value of *θ* remains constant. Hence, log-sensitivity is *proportional* to the log-false positive rate. This assumption is similar to the proportional hazards assumption in survival analysis, where the hazard rate of interest is assumed to be proportional to a baseline hazard rate; this might have motivated the choice of name used by Le [21] and Gönen and Heller [22] in this context.

However, it is not our intention to make the assumption that an entire SROC curve can be represented by model (1); the explanations above are instead meant as a motivation that the PH-measure is not just another summary measure, but can be derived from a ROC modelling perspective. We envisage that each study, with associated pair of sensitivity and specificity, can be represented by a specific PH-model, as illustrated in Figure 6.

We see indeed that each pair of sensitivity and specificity can be associated with its own ROC curve provided by

p={(1-q)}^{{\widehat{\theta}}_{i}}

(3)

where {\widehat{\theta}}_{i}=log{\widehat{p}}_{i}/log[1-{\widehat{q}}_{i}], so that the curve (3) passes exactly through the point (1-{\widehat{q}}_{i},{\widehat{p}}_{i}).
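The construction in (3) can be verified numerically: the per-study curve recovers the observed sensitivity exactly at the observed false positive rate. A minimal sketch (hypothetical study estimates):

```python
import math

p_hat, q_hat = 0.85, 0.90   # hypothetical observed sensitivity / specificity
theta_i = math.log(p_hat) / math.log(1 - q_hat)   # study-specific PH measure

def roc(q, theta):
    # per-study PH curve (3): p = (1 - q)^theta
    return (1 - q) ** theta

# evaluating the curve at the observed specificity returns the
# observed sensitivity (up to floating-point rounding)
print(roc(q_hat, theta_i))
```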

*Comparison to other approaches.* It remains to be seen how appropriate the suggested proportional hazards model is and how it compares to other existing approaches. We emphasize that in our situation we have assumed that there is only *one* pair of sensitivity and false positive rate ({\widehat{p}}_{i},1-{\widehat{q}}_{i}) per study *i*. Situations where several pairs per study are observed (such as in Aertgeerts *et al.* [23]) are rare. Hence, on the log-scale for sensitivity and false positive rate, we are not able to identify any straight-line model *within a study* with *more than one* parameter, since this would require at least two pairs of sensitivity and specificity per study; see also Rücker and Schumacher [24, 25]. However, any one-parameter straight-line model, such as the proposed proportional hazards model, is estimable within each study, although within-study model diagnostics are limited since we are fitting the full within-study model. Given that sample sizes within each diagnostic study are typically at least moderately large, it seems reasonable to assume a bivariate normal distribution for log\widehat{p} and log(1-\widehat{q}) with means log*p* and log(1−*q*), variances {\sigma}_{p}^{2} and {\sigma}_{q}^{2}, respectively, and covariance *σ* with correlation \rho =\sigma /({\sigma}_{p}{\sigma}_{q}). This is very similar to the assumptions in the approach taken by Reitsma *et al.* [17] (see also Harbord *et al.* [19]), with the difference that we use the log-transformation whereas Reitsma *et al.* [17] apply logit-transformations. Then, it is a well-known result that the mean of the random variable log\widehat{p} (with unconditional mean log*p*), conditional upon the value of the random variable log(1-\widehat{q}) (with unconditional mean log(1−*q*)), is given by

E(log\widehat{p}|log(1-\widehat{q}))=logp+\rho \frac{{\sigma}_{p}}{{\sigma}_{q}}[log(1-\widehat{q})-log(1-q)],

(4)

which can be written as \alpha +\theta [log(1-\widehat{q})] where *α*=log(*p*)−*θ* log(1−*q*) and \theta =\rho \frac{{\sigma}_{p}}{{\sigma}_{q}}. This is an *important* result since it means that, in the log-space, sensitivity and false positive rate are linearly related. Furthermore, if *α* is zero, the proportional hazards model arises.
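The conditional-mean result (4) can be checked by simulation under the bivariate normal assumption above. In the sketch below (all numbers hypothetical), Y plays the role of log\widehat{p} and X the role of log(1-\widehat{q}); the empirical regression slope of Y on X should approach \rho {\sigma}_{p}/{\sigma}_{q}:

```python
import math
import random

# Monte-Carlo check: for jointly normal (X, Y) with correlation rho,
# E(Y | X = x) is linear in x with slope rho * sd_y / sd_x, as in (4).
random.seed(1)
rho, sd_y, sd_x = 0.5, 0.3, 0.2          # hypothetical parameters
mu_y, mu_x = math.log(0.85), math.log(0.10)
slope = rho * sd_y / sd_x                # theoretical slope theta

xs, ys = [], []
for _ in range(200_000):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    x = mu_x + sd_x * z1
    y = mu_y + sd_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    xs.append(x)
    ys.append(y)

# empirical least-squares slope of y on x
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
s_xy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
s_xx = sum((a - mx) ** 2 for a in xs)
emp_slope = s_xy / s_xx
print(emp_slope, slope)  # empirical vs theoretical slope, should be close
```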

The question then arises why we should not work with the straight-line model

log{p}_{|log(1-q)}=\alpha +\theta log(1-q).

(5)

The answer is that such a model is *not identifiable* since we observe only one pair of sensitivity and specificity in each study, and a straight line cannot be uniquely determined by a single point: infinitely many lines pass through a given point in the log*p* – log(1−*q*) space. However, the proportional hazards model, as a slope-only model, *is* identifiable, and it is more plausible than other identifiable models such as the intercept-only model. Clearly, a logit-transformation would be more consistent with the existing literature [14, 15] than the log-transformation. However, both models would give a perfect fit within each study since there are no degrees of freedom left for testing the model fit. The situation changes when repeated observations of sensitivity and specificity *per study* are available; however, meta-analyses with repeated observations of sensitivity and specificity according to cut-off value variation are extremely rare.

### A mixed model approach

With the motivation of the previous sections in mind, we assume that *k* diagnostic studies are available with diagnostic accuracies {\widehat{\theta}}_{1},\dots ,{\widehat{\theta}}_{k} where

{\widehat{\theta}}_{i}=\frac{log{\widehat{p}}_{i}}{log(1-{\widehat{q}}_{i})}.

(6)

We assume the following linear mixed model for log{\widehat{\theta}}_{i}:

log{\widehat{\theta}}_{i}={\beta}^{T}{x}_{i}+{\delta}_{i}+{\epsilon}_{i}

(7)

where {x}_{i} is a known covariate vector in study *i*, {\delta}_{i}\sim N(0,{\tau}^{2}) is a normally distributed random effect with {\tau}^{2} being an unknown variance parameter, and {\epsilon}_{i}\sim N(0,{\sigma}_{i}^{2}) is a normally distributed random error with variance {\sigma}_{i}^{2} known from the *i*-th study.

There are several noteworthy points about the mixed model (7). The response is measured on the log-scale; this transformation improves the normal approximation and also brings the diagnostic accuracy into a well-known link function family, the complementary log-log function. The difference of the probability for a positive test between the groups with and without the condition is measured on the complementary log-log scale. The fixed effect part involves a covariate vector **x** which could contain study-level information such as gold standard variation, diagnostic test variation, or sample size information. It should be noted that there are two variance components, {\tau}^{2} and {\sigma}_{i}^{2}. It is important to have information on the second variance component: if it is unknown, even under the homogeneity assumption {\sigma}_{1}^{2}=\cdots ={\sigma}_{k}^{2}, the variance component model would *not* be identifiable. Hence, we need to devote some effort to deriving expressions for the within-study variances; this can be accomplished using the *δ*-method as discussed in the next section.

*Within study variance.* Let us consider (ignoring the study index *i* for the sake of simplicity)

log\widehat{\theta}=log(-log\widehat{p})-log[-log(1-\widehat{q})]

(8)

and apply the *δ*-method. Recall that the variance \mathit{\text{Var}}(T(X)) of a transformed random variable *T*(*X*) can be approximated as {[{T}^{\prime}(E(X))]}^{2}\mathit{\text{Var}}(X), assuming that the variance \mathit{\text{Var}}(X) of *X* is known. Applying this *δ*-method twice gives

\mathit{\text{Var}}log(-log\widehat{p})\approx \frac{\widehat{p}(1-\widehat{p})/m}{{\widehat{p}}^{2}{(log\widehat{p})}^{2}}

(9)

and

\mathit{\text{Var}}log(-log(1-\widehat{q}))\approx \frac{\widehat{q}(1-\widehat{q})/n}{{(1-\widehat{q})}^{2}{(log(1-\widehat{q}))}^{2}}

(10)

so that the within study variance for the *i*-th study is provided as

{\sigma}_{i}^{2}=\frac{{m}_{i}-{y}_{i}}{{m}_{i}{y}_{i}{(log({y}_{i}/{m}_{i}))}^{2}}+\frac{{x}_{i}}{{n}_{i}({n}_{i}-{x}_{i}){(log(1-{x}_{i}/{n}_{i}))}^{2}}.

(11)

We acknowledge that the above are estimates of the variances of the diagnostic accuracy estimates, but are used as if they were the true variances.
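The within-study variance formula (11) is straightforward to evaluate. A minimal sketch (in Python, with the same hypothetical counts as before; no smoothing is applied here, so the counts must avoid 0 and the totals):

```python
import math

def within_var(y, m, x, n):
    """Delta-method within-study variance (11) of log theta-hat,
    with p-hat = y/m (y of m diseased test positive) and
    q-hat = x/n (x of n non-diseased test negative)."""
    term_p = (m - y) / (m * y * math.log(y / m) ** 2)
    term_q = x / (n * (n - x) * math.log(1 - x / n) ** 2)
    return term_p + term_q

# hypothetical counts: 85/100 diseased positive, 90/100 healthy negative
print(within_var(85, 100, 90, 100))
```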

*Some important cases.* If there are no further covariates, *two* important models are easily identified as special cases of (7). One is the *fixed* effects model

log{\widehat{\theta}}_{i}={\beta}_{0}+{\epsilon}_{i}

(12)

and the other is the *random* effects model

log{\widehat{\theta}}_{i}={\beta}_{0}+{\delta}_{i}+{\epsilon}_{i}

(13)

which have gained some popularity in the meta-analytic literature.
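Models (12) and (13) can be fitted with standard inverse-variance meta-analysis machinery once the log-accuracies and their known within-study variances are available. The sketch below (hypothetical inputs) uses the DerSimonian-Laird moment estimator for *τ*^{2}; this is an assumed, conventional choice and not necessarily the estimation method used for the mixed model (7):

```python
import math

def pool_fixed(thetas_log, variances):
    """Inverse-variance pooled estimate of beta_0 on the log scale,
    corresponding to the fixed effects model (12)."""
    w = [1.0 / v for v in variances]
    return sum(wi * t for wi, t in zip(w, thetas_log)) / sum(w)

def dl_tau2(thetas_log, variances):
    """DerSimonian-Laird moment estimator of tau^2 for the random
    effects model (13) -- an assumed standard choice for this sketch."""
    w = [1.0 / v for v in variances]
    beta0 = pool_fixed(thetas_log, variances)
    q_stat = sum(wi * (t - beta0) ** 2 for wi, t in zip(w, thetas_log))
    df = len(thetas_log) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q_stat - df) / c)

# hypothetical log-accuracies log(theta-hat_i) and known variances sigma_i^2
lt = [math.log(0.07), math.log(0.05), math.log(0.09)]
v = [0.08, 0.10, 0.06]
tau2 = dl_tau2(lt, v)
print(pool_fixed(lt, v), tau2)
```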