Assessing discriminative ability of risk models in clustered data
 David van Klaveren^{1},
 Ewout W Steyerberg^{1},
 Pablo Perel^{2} and
 Yvonne Vergouwe^{1}
DOI: 10.1186/1471-2288-14-5
© van Klaveren et al.; licensee BioMed Central Ltd. 2014
Received: 7 October 2013
Accepted: 8 January 2014
Published: 15 January 2014
Abstract
Background
The discriminative ability of a risk model is often measured by Harrell’s concordance-index (c-index). The c-index estimates, for two randomly chosen subjects, the probability that the model predicts a higher risk for the subject with the poorer outcome (concordance probability). When data are clustered, as in multicenter data, two types of concordance are distinguished: concordance in subjects from the same cluster (within-cluster concordance probability) and concordance in subjects from different clusters (between-cluster concordance probability). We argue that the within-cluster concordance probability is most relevant when a risk model supports decisions within clusters (e.g. who should be treated in a particular center). We aimed to explore different approaches to estimate the within-cluster concordance probability in clustered data.
Methods
We used data from the CRASH trial (2,081 patients clustered in 35 centers) to develop a risk model for mortality after traumatic brain injury. To assess the discriminative ability of the risk model within centers we first calculated cluster-specific c-indexes. We then pooled the cluster-specific c-indexes into a summary estimate with different meta-analytical techniques. We considered fixed effect meta-analysis with different weights (equal; inverse variance; number of subjects, events or pairs) and random effects meta-analysis. We reflected on pooling the estimates on the log-odds scale rather than the probability scale.
Results
The cluster-specific c-index varied substantially across centers (IQR = 0.70–0.81; I^{2} = 0.76 with 95% confidence interval 0.66 to 0.82). Summary estimates resulting from fixed effect meta-analysis ranged from 0.75 (equal weights) to 0.84 (inverse variance weights). With random effects meta-analysis – accounting for the observed heterogeneity in c-indexes across clusters – we estimated a mean of 0.77, a between-cluster variance of 0.0072 and a 95% prediction interval of 0.60 to 0.95. The normality assumptions for derivation of a prediction interval were better met on the probability than on the log-odds scale.
Conclusion
When assessing the discriminative ability of risk models used to support decisions at cluster level we recommend meta-analysis of cluster-specific c-indexes. Particularly, random effects meta-analysis should be considered.
Keywords
Clustered data; Concordance; Discrimination; Meta-analysis; Prediction; Risk model
Background
Assessing the performance of a risk model is of great practical importance. An essential aspect of model performance is separating subjects with good outcome from subjects with poor outcome (discrimination) [1]. The concordance probability is a commonly used measure of discrimination reflecting the association between model predictions and true outcomes [2, 3]. For binary outcome data it is the probability that a randomly chosen subject from the event group has a higher predicted probability of having an event than a randomly chosen subject from the non-event group. For time-to-event outcome data it is the probability that, for a randomly chosen pair of subjects, the subject who experiences the event of interest earlier in time has a lower predicted value of the time to the occurrence of the event. For both kinds of outcome data the concordance probability is often estimated with Harrell’s concordance (c)-index [2].
In risk modelling, clustered data are frequently used. A typical example is multicenter patient data, i.e. data of patients who are treated in different centers with similar inclusion criteria across the centers. Patients treated in the same center are nevertheless more alike than patients from different centers. A comparable type of clustering may occur in patients treated in different countries or in patients treated by different caregivers in the same center. Similarly, in public health research the study population is often clustered in geographical regions like countries, municipalities or neighbourhoods. It has been suggested that clustering should be taken into account in the development of risk models to obtain unbiased estimates of predictor effects [4]. This can be done by using a multilevel logistic regression model for binary outcomes or a frailty model for time-to-event outcomes [5, 6].
It would be natural to also take clustering into account when measuring the performance of a risk model. For multilevel models, it has been proposed to consider the concordance probability of subjects within the same cluster (within-cluster concordance probability) separately from the concordance probability of subjects in different clusters (between-cluster concordance probability) [7, 8]. We propose using the within-cluster concordance probability when risk models are used to support decisions within clusters, e.g. in clinical practice where decisions on interventions are commonly taken within centers. A valuable risk model should then be able to separate subjects within the same cluster into those with good outcome and poor outcome. We consider the within-cluster concordance probability more relevant in this context than the between-cluster or overall concordance probability.
Here, we aimed to estimate the within-cluster concordance probability from clustered data. We explored different meta-analytic methods for pooling cluster-specific concordance probability estimates, with an illustration in predicting mortality among patients suffering from traumatic brain injury.
Methods
Mortality in traumatic brain injury patients
We present a case study of predicting mortality after Traumatic Brain Injury (TBI). Risk models using baseline characteristics provide adequate discrimination between patients with good and poor 6-month outcomes after TBI [9, 10]. We used data from patients enrolled in the Medical Research Council Corticosteroid Randomisation after Significant Head Injury (CRASH) trial [11] (registration ISRCTN74459797, http://www.controlled-trials.com/), who were recruited between 1999 and 2004. This was a large international double-blind, randomized placebo-controlled trial of the effect of early administration of a 48-h infusion of methylprednisolone on outcome after head injury. The trial included 10,008 adults clustered in 239 centers with Glasgow Coma Scale (GCS) [12] Total Score ≤ 14, who were enrolled within 8 hours after injury. By design the patient inclusion criteria were equal in all 239 centers.
We considered patients with moderate or severe brain injury (GCS Total Score ≤ 12) and observed 6-month Glasgow Outcome Scale (GOS) [13]. Patients who were treated in one of 35 European centers with more than 5 patients experiencing the event (n = 2,081) were used to assess the discriminative ability of a prediction model developed with data from the same 35 centers. Patients who were treated in one of 21 Asian centers with more than 5 patients experiencing the event (n = 1,421) were used to assess the discriminative ability at external validation.
We used a Cox proportional hazards model with age, GCS Motor Score and pupil reactivity as covariates similar to previously developed risk models [9, 10]. We modelled center with a Gamma frailty (random effect) to account for heterogeneity in mortality among centers. We estimated parameters on the European selection of patients with the R package survival [14, 15]. As center effect estimates are unavailable when using a risk model in new centers, we calculated individual risk predictions applying the Gamma frailty mean of 1 for each patient.
Cluster-specific concordance probabilities
We estimated the concordance probability within each cluster by Harrell’s c-index [2], i.e. the proportion of all usable pairs of subjects in which the predictions are concordant with the outcomes. A pair of subjects is usable if we can determine the ordering of their outcomes. For binary outcomes, pairs of subjects are usable if one of the subjects had an event and the other did not. For time-to-event outcomes, pairs of subjects are usable if their failure times are not equal and at least the smallest failure time is uncensored. For a usable subject pair the predictions are concordant with the outcomes if the ordering of the predictions is equal to the ordering of the outcomes. Values of the c-index close to 0.5 indicate that the model does not perform much better than a coin flip in predicting which subject of a randomly chosen pair will have a better outcome. Values of the c-index near 1 indicate that the model is almost perfectly able to predict which subject of a randomly chosen pair will have a favourable outcome. We estimated the variances of the cluster-specific c-indexes with a method proposed by Quade [16]. Formulas are provided in Appendix 1.
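The definition of usable and concordant pairs for time-to-event outcomes can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `harrell_c_index` and its arguments are hypothetical, predictions are assumed to be risk scores where a higher score means an earlier expected event, and counting tied predictions as one half is an assumption that the text above does not specify.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Return (c_index, n_usable_pairs) for one cluster.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (assumed: higher = earlier event)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # equal times: the ordering cannot be determined
        # Identify the subject with the shorter observed time.
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # the shorter time is censored: pair is not usable
        usable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # assumption: tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs in this cluster")
    return concordant / usable, usable
```

Calling the function once per cluster yields the cluster-specific estimates and the numbers of usable pairs that the pooling methods below take as input.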
Pooling cluster-specific concordance probability estimates
The within-cluster concordance probability C_{w} can be estimated by pooling the cluster-specific concordance probability estimates into a weighted average. Previously, the cluster-specific concordance probability estimates were pooled with the number of usable subject pairs as weights [7, 8]. Here, we define eight different ways for pooling of cluster-specific estimates – both on the probability scale and on the log-odds scale – based on fixed effect meta-analysis and random effects meta-analysis.
We consider a dataset with subjects in K clusters. Let m_{k} be the number of subjects and e_{k} be the number of events in cluster k. We denote the number of usable subject pairs – pairs of subjects for whom we can determine the ordering of their outcomes – in cluster k by n_{k}. The cluster-specific concordance probability estimate for cluster k is denoted by ${\widehat{C}}_{k}$ with sampling variance estimate ${\widehat{\sigma}}_{k}^{2}$.
Fixed effect meta-analysis
The simplest approach would be to apply equal weights, w_{k} = 1/K for each cluster (method 1). This estimator is quite naive when the cluster size varies, because small clusters are given the same weight as large clusters and information about the precision of the cluster-specific estimates is ignored. Heuristic choices of weights taking the cluster size into account are the number of subjects, w_{k} = m_{k} (method 2), or the number of events, w_{k} = e_{k} (method 3). Analogous to the definition of the c-index, a fourth option is the number of usable subject pairs as weights, w_{k} = n_{k} (method 4). The pooled estimate is then equal to the proportion of all usable within-cluster subject pairs in which the predictions and outcomes are concordant. Another choice of meta-analysis weights is the inverse variances, ${w}_{k}=1/{\widehat{\sigma}}_{k}^{2}$ (method 5). These weights express the precision of the cluster-specific estimates and are commonly used in meta-analysis of study-specific treatment effects.
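As an illustration of methods 1 to 5, the following sketch computes the weighted average of cluster-specific c-indexes under each weighting scheme. The function names and the three-cluster input are made up for this example and do not come from the CRASH data.

```python
def fixed_effect_weights(method, m, e, n, var):
    """Return the weights w_k for fixed effect methods 1 to 5.

    m, e, n, var: per-cluster numbers of subjects, events and usable
    pairs, and sampling variance estimates of the c-index.
    """
    K = len(m)
    if method == 1:
        return [1.0 / K] * K            # equal weights
    if method == 2:
        return [float(x) for x in m]    # number of subjects
    if method == 3:
        return [float(x) for x in e]    # number of events
    if method == 4:
        return [float(x) for x in n]    # number of usable pairs
    if method == 5:
        return [1.0 / v for v in var]   # inverse variance
    raise ValueError("method must be 1, 2, 3, 4 or 5")


def pooled_estimate(c_hat, weights):
    """Weighted average of the cluster-specific c-indexes."""
    return sum(w * c for w, c in zip(weights, c_hat)) / sum(weights)


# Made-up numbers for three clusters, for illustration only.
c_hat = [0.70, 0.80, 0.75]
m = [20, 60, 40]
e = [6, 15, 12]
n = [84, 675, 336]          # here e_k * (m_k - e_k), as for a binary outcome
var = [0.004, 0.001, 0.002]

for method in range(1, 6):
    w = fixed_effect_weights(method, m, e, n, var)
    print(method, round(pooled_estimate(c_hat, w), 3))
```

Because the weights are normalised by their sum, only their relative size matters; the schemes differ in how strongly large or precisely estimated clusters dominate the summary.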
Random effects meta-analysis
For estimation of the between-cluster variance τ^{2} we used the DerSimonian and Laird [18] method. Alternative estimators for τ^{2} can be found in DerSimonian and Kacker [19].
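A minimal sketch of a random effects meta-analysis with the DerSimonian and Laird moment estimator is given below. It is an illustration under stated assumptions rather than the code used for the reported results: the function name and return keys are hypothetical, I^{2} is computed as (Q − (K − 1))/Q, and the prediction interval is assumed to take the form mean ± t × sqrt(τ^{2} + SE^{2}) with the t quantile on K − 2 degrees of freedom supplied by the caller (roughly 2.03 for K = 35).

```python
import math


def dersimonian_laird(c_hat, var, t_quantile):
    """Random effects pooling of cluster-specific c-indexes.

    c_hat      : cluster-specific c-index estimates
    var        : their sampling variance estimates
    t_quantile : 0.975 quantile of a t distribution with K - 2 degrees
                 of freedom (assumed form of the prediction interval)
    """
    K = len(c_hat)
    w = [1.0 / v for v in var]
    sw = sum(w)
    mu_fixed = sum(wi * ci for wi, ci in zip(w, c_hat)) / sw

    # Cochran's Q and the moment estimator of the between-cluster variance.
    q = sum(wi * (ci - mu_fixed) ** 2 for wi, ci in zip(w, c_hat))
    denom = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (K - 1)) / denom)
    i2 = max(0.0, (q - (K - 1)) / q) if q > 0 else 0.0

    # Random effects weights add tau2 to each sampling variance (method 6).
    w_star = [1.0 / (v + tau2) for v in var]
    mu = sum(wi * ci for wi, ci in zip(w_star, c_hat)) / sum(w_star)
    se_mu = math.sqrt(1.0 / sum(w_star))

    half_width = t_quantile * math.sqrt(tau2 + se_mu ** 2)
    return {
        "mean": mu,
        "se": se_mu,
        "tau2": tau2,
        "I2": i2,
        "prediction_interval": (mu - half_width, mu + half_width),
    }
```

When the estimated τ^{2} is zero, the random effects weights reduce to the inverse variance weights of method 5 and the two summaries coincide.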
Meta-analysis scale
To consider if the normality assumption is valid we used a normal probability plot of the standardized residuals z_{k} and applied the Shapiro-Wilk test to z_{k} [22]. In a normal probability plot z_{k} is plotted against a theoretical normal distribution in such a way that the points should form an approximately straight line. Departures from this straight line indicate departures from normality. The Shapiro-Wilk test returns the probability of obtaining a test statistic at least as extreme as the observed one, under the null hypothesis that the z_{k} are normally distributed (p-value). When the p-value is above the significance level α, say 5%, the null hypothesis that the z_{k} are normally distributed is not rejected.
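The coordinates of such a normal probability plot can be sketched with the Python standard library. The definition of z_{k} used here, (Ĉ_{k} − mean)/sqrt(σ̂_{k}^{2} + τ^{2}), is an assumption, as are the function names and the plotting positions (i − 0.5)/K. The Shapiro-Wilk test itself is not implemented in this sketch.

```python
import math
from statistics import NormalDist


def standardized_residuals(c_hat, var, mu, tau2):
    """Assumed definition: z_k = (C_k - mu) / sqrt(var_k + tau2)."""
    return [(c - mu) / math.sqrt(v + tau2) for c, v in zip(c_hat, var)]


def normal_plot_points(z):
    """Pair each ordered residual with a theoretical normal quantile.

    Uses plotting positions (i - 0.5) / K for i = 1..K (an assumption;
    other conventions exist).
    """
    K = len(z)
    ordered = sorted(z)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / K) for i in range(K)]
    return list(zip(theoretical, ordered))
```

Plotting the second coordinate against the first should give points close to a straight line when the residuals are approximately normal.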
Since the concordance probability is restricted to [0, 1] the normality assumption of random effects meta-analysis may be violated. We considered inverse variance weighted meta-analysis on the log-odds scale as an alternative approach (methods 7 and 8 for fixed effect and random effects meta-analysis respectively). The resulting estimators for the within-cluster concordance probability are defined in Appendix 2. The normality assumption on the log-odds scale was again assessed by the normal probability plot and the Shapiro-Wilk test.
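Pooling on the log-odds scale can be sketched as follows for the fixed effect case (method 7). The sketch assumes that the sampling variance on the log-odds scale is obtained with the delta method, i.e. the variance of the c-index divided by (Ĉ_{k}(1 − Ĉ_{k}))^{2}; the function names are hypothetical and the confidence interval uses a normal quantile.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Back-transformation from the log-odds to the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))


def pool_on_log_odds_scale(c_hat, var):
    """Inverse variance fixed effect pooling on the log-odds scale.

    Returns the pooled estimate and a 95% confidence interval, all
    transformed back to the probability scale.
    """
    y = [logit(c) for c in c_hat]
    # Delta method (assumption): var(logit C) ~ var(C) / (C (1 - C))^2
    v = [s / (c * (1.0 - c)) ** 2 for c, s in zip(c_hat, var)]
    w = [1.0 / vi for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    z = 1.959963984540054  # 0.975 quantile of the standard normal
    return inv_logit(mu), inv_logit(mu - z * se), inv_logit(mu + z * se)
```

A cluster-specific c-index of exactly 0 or 1 cannot be transformed this way, so such clusters would need separate handling. The back-transformed interval always stays within [0, 1], which is one motivation for considering this scale.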
Table 1. Overview of the 8 methods for pooling of cluster-specific concordance probability estimates
Scale  Fixed effect meta-analysis (assuming the same true (logit) concordance probability within each cluster)  Random effects meta-analysis (assuming variation in true (logit) concordance probabilities across clusters)
Probability scale (meta-analysis of cluster-specific estimates of the concordance probability)  1. Equal weight for each cluster; 2. Number of subjects in the cluster; 3. Number of subjects in the cluster with an event; 4. Number of usable subject pairs within the cluster; 5. Inverse of the cluster-specific sampling variance estimate  6. Inverse of the sum of the cluster-specific sampling variance estimate and the between-cluster variance estimate
Log-odds scale (meta-analysis of cluster-specific estimates of the logit concordance probability)  7. Inverse of the cluster-specific sampling variance estimate on log-odds scale  8. Inverse of the sum of the cluster-specific sampling variance estimate on log-odds scale and the between-cluster variance estimate on log-odds scale
Results
Patient characteristics in selected European and Asian centers
Characteristic  Measure or Category  Europe  Asia
Age (years)  Median (25–75 percentile)  36 (24–53)  31 (22–43)
GCS Motor score  No response (1)  445 (21%)  55 (4%)
GCS Motor score  Extension (2)  134 (6%)  96 (7%)
GCS Motor score  Abnormal flexion (3)  176 (8%)  124 (9%)
GCS Motor score  Normal flexion (4)  321 (15%)  261 (18%)
GCS Motor score  Localizes/obeys (5/6)  1,005 (48%)  885 (62%)
Pupil reactivity  No pupil reacted  291 (14%)  129 (9%)
Pupil reactivity  One pupil reacted  123 (6%)  117 (8%)
Pupil reactivity  Both pupils reacted  1,667 (80%)  1,175 (83%)
Six-month mortality  Dead  553 (27%)  495 (35%)
Patients  Total  2,081  1,421
Centers  Total  35  21
Patients per center  Median (25–75 percentile)  33 (21–64)  34 (20–66)
Associations between predictors and 6-month mortality in European centers
Characteristic  Level  HR (95% CI)
Age (years)  47 versus 23*  2.1 (1.9–2.4)
GCS Motor score  No response (1)  3.1 (2.4–4.0)
GCS Motor score  Extension (2)  2.8 (2.0–3.8)
GCS Motor score  Abnormal flexion (3)  2.4 (1.7–3.2)
GCS Motor score  Normal flexion (4)  1.5 (1.1–2.0)
GCS Motor score  Localizes/obeys (5/6)  1.0 (ref)
Pupil reactivity  No pupil reacted  2.8 (2.3–3.5)
Pupil reactivity  One pupil reacted  1.7 (1.2–2.3)
Pupil reactivity  Both pupils reacted  1.0 (ref)
Center random effect  75 versus 25 percentile  1.7
Discussion
We studied how to assess the discriminative ability of risk models in clustered data. The within-cluster concordance probability is an important measure for risk models when these models are used to support decisions on interventions within the clusters. The within-cluster concordance probability can be estimated by pooling cluster-specific concordance probability estimates (e.g. c-indexes) with a meta-analysis, similar to pooling of study-specific treatment effect estimates. We considered different pooling strategies (Table 1) and recommend random effects meta-analysis in case of substantial variability – beyond chance – of the concordance probability across clusters [20, 21]. To decide if the meta-analysis should be undertaken on the probability scale or the log-odds scale we suggest considering the normality assumptions on both scales by normal probability plots and Shapiro-Wilk tests of the standardized residuals.
The illustration of predicting 6-month mortality after TBI prompted the use of random effects meta-analysis because of the strong difference – beyond chance – in concordance probability among centers. This was clearly visualized by the forest plot in Figure 2. Random effects meta-analysis results can be summarized by the mean concordance probability and a 95% prediction interval for possible values of the concordance probability. By definition, these results give insight into the variation of the discriminative ability among centers, as opposed to fixed effect meta-analysis results [20, 21]. By comparing normal probability plots and Shapiro-Wilk test results based on the standardized residuals we concluded the random effects meta-analysis results on the probability scale to be the most appropriate (Figure 4). Although the methodology is illustrated with time-to-event outcomes of traumatic brain injury patients, it is also applicable to binary outcomes.
Even if a risk model contains regression coefficients that are optimal for the data in each cluster, differences in case mix may lead to different concordance probabilities across clusters [24]. Furthermore, predictor effects may vary because of cluster-specific circumstances, also leading to different cluster-specific concordance probabilities. Given the variability beyond chance in our case study, we consider a random effects meta-analysis of the cluster-specific c-indexes as most appropriate.
The assumption of random effects meta-analysis is that underlying concordance probabilities among clusters are exchangeable, i.e. cluster-specific concordance probabilities are expected to be non-identical, yet identically distributed [20]. If part of the variation can be explained by cluster characteristics, a meta-regression – assuming partial exchangeability – of the concordance probability estimates with cluster characteristics as covariates is preferable.
We chose to analyse the concordance probability as it is the most commonly used measure of discriminative ability of a risk model. However, the same logic of pooling cluster-specific performance measure estimates can be applied to any other performance measure, like the discrimination slope, the explained variation (R^{2}) or the Brier score [25].
We used Harrell’s c-index to estimate cluster-specific concordance probabilities together with Quade’s formula for the cluster-specific variances of the c-index [2, 16]. The same methodology of pooling cluster-specific performance measure estimates can be applied to other concordance probability estimators and their variances. Other estimators for the concordance probability in time-to-event data can be found in Gönen and Heller [26] and Uno et al. [27]. These estimators are especially favourable when censoring varies by cluster as they are shown to be less sensitive to censoring distributions. Other variance estimators are described by Hanley and McNeil [28] and DeLong et al. [29] for binary outcome data, and by Nam and D'Agostino [30] and Pencina and D'Agostino [3] for time-to-event outcome data. The variance of the concordance probability estimate can also be estimated with a bootstrap procedure [31].
Conclusion
We recommend meta-analysis of cluster-specific c-indexes when assessing discriminative ability of risk models used to support decisions at cluster level. Particularly, random effects meta-analysis should be considered as it allows for and provides insight into the variability of the concordance probability among clusters.
Appendix 1
Appendix 2
The resulting pooled estimates together with confidence and prediction intervals are transformed back to probability scale.
Abbreviations
c-index: Concordance-index
CRASH: Corticosteroid randomisation after significant head injury
GCS: Glasgow coma scale
GOS: Glasgow outcome scale
IQR: Interquartile range.
Declarations
Funding
This work was supported by the Netherlands Organisation for Scientific Research (grant 917.11.383).
Acknowledgements
The authors express their gratitude to all of the principal investigators of the CRASH trial for providing the data. We thank Prof. Emmanuel Lesaffre (Department of Biostatistics, Erasmus MC, Rotterdam, The Netherlands) for helpful comments.
Authors’ Affiliations
References
1. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW: Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010, 21 (1): 128-138. 10.1097/EDE.0b013e3181c30fb2.
2. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests. JAMA. 1982, 247 (18): 2543-2546. 10.1001/jama.1982.03320430047030.
3. Pencina MJ, D'Agostino RB: Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004, 23 (13): 2109-2123. 10.1002/sim.1802.
4. Bouwmeester W, Twisk JW, Kappen TH, van Klei WA, Moons KG, Vergouwe Y: Prediction models for clustered data: comparison of a random intercept and standard regression model. BMC Med Res Methodol. 2013, 13: 19. 10.1186/1471-2288-13-19.
5. Gelman A, Hill J: Data analysis using regression and multilevel/hierarchical models. 2007, Cambridge: Cambridge University Press
6. Duchateau L, Janssen P: The Frailty Model. 2008, New York: Springer
7. Van Oirbeek R, Lesaffre E: An application of Harrell's C-index to PH frailty models. Stat Med. 2010, 29 (30): 3160-3171. 10.1002/sim.4058.
8. Van Oirbeek R, Lesaffre E: Assessing the predictive ability of a multilevel binary regression model. Comput Stat Data Anal. 2012, 56 (6): 1966-1980. 10.1016/j.csda.2011.11.023.
9. Collaborators MCT, Perel P, Arango M, Clayton T, Edwards P, Komolafe E, Poccock S, Roberts I, Shakur H, Steyerberg E, et al: Predicting outcome after traumatic brain injury: practical prognostic models based on large cohort of international patients. BMJ. 2008, 336 (7641): 425-429.
10. Steyerberg EW, Mushkudiani N, Perel P, Butcher I, Lu J, McHugh GS, Murray GD, Marmarou A, Roberts I, Habbema JD, et al: Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics. PLoS Med. 2008, 5 (8): e165. 10.1371/journal.pmed.0050165.
11. Edwards P, Arango M, Balica L, Cottingham R, El-Sayed H, Farrell B, Fernandes J, Gogichaisvili T, Golden N, Hartzenberg B, et al: Final results of MRC CRASH, a randomised placebo-controlled trial of intravenous corticosteroid in adults with head injury-outcomes at 6 months. Lancet. 2005, 365 (9475): 1957-1959.
12. Teasdale G, Jennett B: Assessment of coma and impaired consciousness. A practical scale. Lancet. 1974, 2 (7872): 81-84.
13. Jennett B, Bond M: Assessment of outcome after severe brain damage. Lancet. 1975, 1 (7905): 480-484.
14. R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna, Austria: R Foundation for Statistical Computing, ISBN 3-900051-07-0, URL http://www.R-project.org/
15. Therneau T, original S-plus->R port by Lumley T: survival: Survival analysis, including penalised likelihood. R package version 2.36-9. 2011, http://CRAN.R-project.org/package=survival
16. Quade D: Nonparametric partial correlation. Volume No. 526. 1967, North Carolina: Institute of Statistics Mimeo, Volume 526
17. Higgins JP, Thompson SG: Quantifying heterogeneity in a meta-analysis. Stat Med. 2002, 21 (11): 1539-1558. 10.1002/sim.1186.
18. DerSimonian R, Laird N: Meta-analysis in clinical trials. Control Clin Trials. 1986, 7 (3): 177-188. 10.1016/0197-2456(86)90046-2.
19. DerSimonian R, Kacker R: Random-effects model for meta-analysis of clinical trials: an update. Contemp Clin Trials. 2007, 28 (2): 105-114. 10.1016/j.cct.2006.04.004.
20. Higgins JPT, Thompson SG, Spiegelhalter DJ: A re-evaluation of random-effects meta-analysis. J R Soc Health Series A. 2009, 172 (1): 137-159.
21. Riley RD, Higgins JPT, Deeks JJ: Interpretation of random effects meta-analyses. BMJ. 2011, 342: d549. 10.1136/bmj.d549.
22. Hardy RJ, Thompson SG: Detecting and describing heterogeneity in meta-analysis. Stat Med. 1998, 17 (8): 841-856. 10.1002/(SICI)1097-0258(19980430)17:8<841::AID-SIM781>3.0.CO;2-D.
23. Lumley T: rmeta: Meta-analysis. R package version 2.16. 2009, http://CRAN.R-project.org/package=rmeta
24. Vergouwe Y, Moons KG, Steyerberg EW: External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am J Epidemiol. 2010, 172 (8): 971-980. 10.1093/aje/kwq223.
25. Steyerberg EW: Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2009, New York: Springer
26. Gönen M, Heller G: Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005, 92 (4): 965-970. 10.1093/biomet/92.4.965.
27. Uno H, Cai T, Pencina MJ, D'Agostino RB, Wei LJ: On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med. 2011, 30 (10): 1105-1117.
28. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143 (1): 29-36.
29. DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988, 44 (3): 837-845. 10.2307/2531595.
30. Nam BH, D'Agostino RB: Discrimination Index, the Area under the ROC Curve. Goodness-of-Fit Tests and Model Validity. 2002, Boston: Birkhauser, 267-279.
31. Efron B, Tibshirani R: An Introduction to the Bootstrap. 1993, Boca Raton, FL: CRC press
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/5/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.