Prognostic models for intracerebral hemorrhage: systematic review and meta-analysis

Background: Prognostic tools for intracerebral hemorrhage (ICH) patients are potentially useful for ascertaining prognosis and are recommended in guidelines to streamline assessment and facilitate communication between providers. In this systematic review with meta-analysis we identified and characterized all existing prognostic tools for this population, performed a methodological evaluation of the conduct and reporting of such studies, and compared different methods of prognostic tool derivation in terms of discrimination for mortality and functional outcome prediction.
Methods: PubMed, ISI, Scopus and CENTRAL were searched up to 15th September 2016, with additional studies identified by reference checking. Two reviewers independently extracted data regarding the population studied, process of tool derivation, included predictors and discrimination (c statistic) using a predesignated spreadsheet based on the CHARMS checklist. Disagreements were resolved by consensus. C statistics were pooled using robust variance estimation, and meta-regression with random effects models was applied for group comparisons.
Results: Fifty-nine studies were retrieved, including 48,133 patients and reporting on the derivation of 72 prognostic tools. Data on discrimination (c statistic) were available for 53 tools, 38 focusing on mortality and 15 focusing on functional outcome. Discrimination was high for both outcomes, with a pooled c statistic of 0.88 for mortality and 0.87 for functional outcome. Forty-three tools were regression based and nine tools were derived using machine learning algorithms, with no difference found between the two methods in terms of discrimination (p = 0.490). Several methodological issues were identified, however, relating to handling of missing data, low number of events per variable, insufficient length of follow-up, absence of blinding, infrequent use of internal validation, and underreporting of important model performance measures.
Conclusions: Prognostic tools for ICH discriminated well for mortality and functional outcome in derivation studies, but methodological issues mean these findings require confirmation in validation studies. Logistic regression based risk scores are particularly promising given their good performance and ease of application.
Electronic supplementary material: The online version of this article (10.1186/s12874-018-0613-8) contains supplementary material, which is available to authorized users.


Background
Intracerebral hemorrhage (ICH) is a major cause of death and disability, with an incidence rate of 24.6 per 100,000 person-years and a fatality rate of 40%. After such an event, only 12-39% of patients regain independence [1]. In contrast to ischemic stroke, medical care for ICH remains mostly supportive, and few interventions have clearly demonstrated benefit in this population [2,3]. Several prognostic tools have been proposed for mortality and functional outcome prediction in ICH. These tools are potentially useful for ascertaining prognosis, facilitating communication between clinicians, characterizing and selecting patients for interventions, and for benchmarking purposes in healthcare delivery [2,4].
The aim of this study was to systematically identify, assess and review the methodological conduct and reporting of studies deriving prognostic tools for the risk of death and/or functional recovery after ICH and to evaluate their overall discrimination according to the method of derivation and type of outcome.

Methods
We designed, conducted and reported our systematic review and meta-analysis in accordance with recommendations from the Cochrane Prognosis Methods Group [5] and the PRISMA [6] and MOOSE [7] guidelines. We searched PubMed, ISI Web of Knowledge, Scopus, and CENTRAL for all studies reporting the derivation of prognostic tools for predicting death and/or functional recovery after non-traumatic ICH, using the broad and sensitive search query reported in Additional file 1. The search included articles from database inception to 15th September 2016, with additional articles identified from reference checking. No language restrictions were applied. There is no protocol available.

Study selection and inclusion criteria
Articles were included if they met the following criteria: 1) were human studies; 2) were original articles; 3) were adult studies (≥ 18 years); 4) did not consist of case reports/case series; 5) enrolled non-traumatic ICH patients; 6) were prognostic studies; 7) described the application of a prognostic tool; and 8) were derivation studies. Studies involving traumatic and/or extra-axial bleeds were excluded. Study selection was performed using a two-step process. In the first step (screening), all abstracts were reviewed by two authors independently applying the inclusion criteria. In the second step, two authors again worked independently, applying the same criteria to the full text of the remaining studies. Disagreements were resolved by consensus.

Quality assessment, data extraction, analysis and reporting
To inform quality assessment and data extraction from individual studies, two reviewers independently applied a spreadsheet based on the CHARMS checklist [5] to the included studies, gathering information on the following aspects of prognostic tool derivation: 1) population, sampling and source of data; 2) outcome timing and definition; 3) number and type of predictors; 4) number of patients and events; 5) handling of missing data; 6) method for tool derivation; and 7) prognostic tool performance.
Prognostic tool performance was evaluated by determining its discriminatory capacity, i.e., its ability to determine which patients will suffer the outcome of interest. As a measure of this, we retrieved the c-statistic along with its 95% confidence interval (CI). For studies not reporting any of these parameters, we obtained them by recreating the receiver operating characteristic (ROC) curve from reported probability distributions; for studies reporting the c-statistic but not its confidence interval, we calculated the latter using the method reported by Hanley and McNeil [8], where the number of outcomes was available. Standard errors were derived from the respective CIs.
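As a rough illustration of the two calculations described above, the following Python sketch computes a standard error for a c-statistic with the Hanley and McNeil approximation and converts between a standard error and a 95% CI. It is a minimal sketch and not the authors' code: the function names, the normal-approximation multiplier of 1.96, and the example numbers are assumptions made for illustration.

```python
import math


def hanley_mcneil_se(auc, n_events, n_nonevents):
    """Approximate standard error of a c-statistic (Hanley and McNeil).

    auc         -- reported c-statistic (area under the ROC curve)
    n_events    -- number of patients with the outcome
    n_nonevents -- number of patients without the outcome
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_events - 1) * (q1 - auc ** 2)
        + (n_nonevents - 1) * (q2 - auc ** 2)
    ) / (n_events * n_nonevents)
    return math.sqrt(variance)


def ci_from_se(auc, se, z=1.96):
    """Normal-approximation CI, truncated to the valid range [0, 1]."""
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


def se_from_ci(lower, upper, z=1.96):
    """Recover a standard error from a reported symmetric CI."""
    return (upper - lower) / (2.0 * z)


# Hypothetical example: c-statistic of 0.85 with 40 events and 60 non-events.
se = hanley_mcneil_se(0.85, 40, 60)
low, high = ci_from_se(0.85, se)
print(f"SE = {se:.4f}, 95% CI = {low:.3f} to {high:.3f}")
```

Note that recovering a standard error from a CI in this way assumes the interval was symmetric and not truncated at 0 or 1.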
Given that some authors derived more than one tool from the same sample population, we pooled c-statistics using robust variance estimation (RVE) to account for dependent effects, according to Tanner-Smith et al. [9]. Specifically, we assumed correlated effect sizes and used a random effects model with inverse variance weights to estimate the overall mean c-statistic and mean c-statistics for mortality prediction tools, functional outcome prediction tools, logistic regression based tools, and machine learning algorithms. Univariate meta-regression was used to compare these groups and p values < 0.05 were considered significant. Due to the nature of the meta-analytical technique used, heterogeneity statistics such as the Q statistic and I² are not recommended, according to Tanner-Smith et al. [9]. However, the I² statistic is reported for illustrative purposes. Statistical analysis was performed using specific macros [9] designed for R and SPSS® Statistics v24.0.

Results
Figure 1 depicts the study selection procedure. The search query retrieved 15,613 references; after the screening step, 263 references were left for full text review. The second step removed an additional 207 references, leaving 56 studies reporting the derivation of at least one prognostic tool. Three additional studies were identified through reference checking, which led to the final number of 59 studies involving 48,133 patients. Nine studies reported the derivation of more than one prognostic tool, so the total number of prognostic tools analyzed was 72. The summary description of these tools is given in Table 1.

Number and type of predictors
The number of predictors for each prognostic tool ranged from two to 20, with the mode being three (Table 2). The five most frequently included predictors were consciousness (n = 57), hematoma size (n = 43), age (n = 38), intraventricular blood (n = 32), and the presence of comorbidities (n = 16). Figure 2 stratifies the ten most frequently used variables for mortality and functional outcome prediction.

Number of patients and events
The number of included patients varied between 38 [15] and 29,775 [59] and the number of outcomes ranged from 9 [22] to 6765 [59] (Table 2), with four studies not reporting this item [14,15,34,66]. The event per variable (EPV) rate ranged from 1.4 [28] Table 2). Among studies reporting this item, all except two used a complete case analysis, with the exceptions using a missing category [37,59]. Two studies failed to report the number of patients lost to follow-up [10,15]. Of the others, the majority showed 100% complete follow-up, but five studies showed a loss < 5% [30,44,46,53,56], two studies showed a loss of 5-20% [51,57] and two studies showed a loss > 20% [37,43].

Prognostic tool performance
C-statistics and respective 95% confidence intervals were retrieved from 38 mortality prediction tools and 15 functional outcome prediction tools (Table 1). Forest plots are depicted in Figs. 3 and 4. The lowest reported value was 0.745 [49] and the highest reported value was 0.984 [28]. Table 3 depicts robust variance estimates of pooled c-statistics for all tools combined and subgroup analysis for mortality prediction tools, functional outcome prediction tools, logistic regression based tools, and machine learning algorithms, along with comparisons using meta-regression. All subgroups showed pooled c-statistics > 0.80. Mortality prediction tools and machine learning algorithms showed higher pooled c-statistics, but the differences were not statistically significant.
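To make the pooling approach described in the Methods more concrete, the sketch below implements a simplified intercept-only robust variance estimator with correlated-effects weights in Python. It is an illustration under stated assumptions, not the macros used in this review: the between-study variance `tau2` is supplied by the user rather than estimated, no small-sample correction is applied, c-statistics are pooled on their raw scale, and the function name and input data are hypothetical.

```python
import math
from collections import defaultdict


def rve_pooled_mean(effects, tau2=0.0):
    """Pooled mean and robust standard error for dependent effect sizes.

    effects -- list of (study_id, estimate, standard_error) tuples; several
               tuples may share a study_id when one sample yields many tools
    tau2    -- assumed between-study variance (an input here, not estimated)
    """
    by_study = defaultdict(list)
    for study, estimate, se in effects:
        by_study[study].append((estimate, se ** 2))

    # Correlated-effects weights: every effect in a study gets the same
    # weight, 1 / (k * (mean within-study variance + tau2)).
    weighted = []
    for study, rows in by_study.items():
        k = len(rows)
        mean_var = sum(v for _, v in rows) / k
        weight = 1.0 / (k * (mean_var + tau2))
        weighted.extend((study, estimate, weight) for estimate, _ in rows)

    total_weight = sum(w for _, _, w in weighted)
    pooled = sum(w * y for _, y, w in weighted) / total_weight

    # Robust variance: weighted residuals are summed within each study
    # before squaring, so within-study dependence is not ignored.
    residual_sums = defaultdict(float)
    for study, estimate, weight in weighted:
        residual_sums[study] += weight * (estimate - pooled)
    robust_var = sum(r ** 2 for r in residual_sums.values()) / total_weight ** 2
    return pooled, math.sqrt(robust_var)


# Hypothetical data: study "A" contributes two tools from the same sample.
example = [("A", 0.90, 0.02), ("A", 0.86, 0.02), ("B", 0.80, 0.03), ("C", 0.88, 0.025)]
pooled, se = rve_pooled_mean(example, tau2=0.001)
print(f"pooled c-statistic = {pooled:.3f} (robust SE {se:.3f})")
```

Dividing the weight by the number of effects per study keeps a study that reports many tools from dominating the pooled estimate.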

Discussion
Prognostic models for ICH patients have demonstrated good discrimination in derivation studies, regardless of the outcome in question (mortality or functional outcome). These tools have been derived in different ICH populations, ranging from "general" ICH (i.e., primary or spontaneous) to more specific populations (e.g., arteriovenous malformation related bleeds, dialysis patients, comatose patients). Cohort studies are the predominant study design: this design is well suited for prognostic tool derivation because it allows optimal measurement of predictors and outcome [69]. Other sources of data used included registries, case-control studies, randomized clinical trial data and administrative databases. Of these, the last two raise concerns about representativeness and quality of data: on the one hand, clinical trials usually have the highest quality of data, but restrictive inclusion and exclusion criteria might hamper generalizability [70]; on the other hand, administrative databases might allow for easy access to a large quantity of patient data, but they are prone to coding errors, data discrepancies, and missing data [71]. A considerable number of studies (n = 11) were multicentric, conferring a theoretical advantage in terms of generalizability. The sampling method was frequently not reported (n = 15) but was consecutive for most studies, which supports the representativeness of the population and conveniently minimizes the risk of bias due to selective sampling. Most mortality prediction tools focused on death at discharge or 1 month: this timing seems appropriate, since most deaths due to ICH occur early in the disease [1]. However, the same cannot be said for functional outcome prediction: significant changes in functional status have been described in ICH patients up to 1 year [72], rendering outcome predictions at 1 month or discharge less useful. Notably, 12 derivation procedures focused on functional outcome at discharge or 1 month.
A reasonable compromise would be prediction at 3 to 6 months, allowing enough time for patient recovery without excessive loss to follow-up or occurrence of competing events. Another important issue is that studies with longer follow-ups did not report on outpatient care interventions (e.g., rehabilitation), making generalizability of their results less straightforward. Functional outcome prediction was mostly binary and used different scales and cut-off values: whereas the optimal method of functional outcome measurement in ICH patients is debatable [73], the use of different scales and cut-offs between tool derivation studies makes comparisons between these instruments more difficult. Only five studies reported blinded outcome assessment: whereas mortality is a rather "hard outcome", functional outcome evaluation is inherently more subjective and thus more prone to evaluation bias. Derivation studies were rather heterogeneous in the number of patients and events analyzed. Interestingly, the four most frequently included variables for mortality prediction were also the four most frequently included variables for functional outcome prediction (Fig. 2). This overlap suggests that mortality prediction tools should, at least to some extent, predict functional outcome and vice versa. The number of events per variable is a simple rule of thumb to assess the adequacy of sample size: it is suggested that a minimum of ten events per variable is required to prevent overfitting during statistical modelling [73], but a lower rate was found for 21 tools, although admittedly not all of them were regression based.
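The events-per-variable rule of thumb is simple arithmetic, as the short Python sketch below shows. The function names and the example counts are hypothetical; the threshold of ten is the one cited in the text, and counting every candidate predictor considered (rather than only those retained in the final model) is an assumption made here.

```python
def events_per_variable(n_events, n_candidate_predictors):
    """Events per variable: outcome events divided by candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("at least one candidate predictor is required")
    return n_events / n_candidate_predictors


def meets_epv_rule(n_events, n_candidate_predictors, minimum=10.0):
    """True when the EPV reaches the suggested minimum (ten by default)."""
    return events_per_variable(n_events, n_candidate_predictors) >= minimum


# Hypothetical derivation cohort: 45 deaths and 5 candidate predictors.
print(events_per_variable(45, 5), meets_epv_rule(45, 5))
```

With 45 events and 5 candidate predictors the EPV is 9, just below the suggested minimum; the same cohort would support at most 4 candidate predictors under this rule.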
Missing data, whether pertaining to missing predictors or loss to follow-up, is also a potential source of bias for derivation studies, with the risk of bias relating to the amount of missing data and the extent to which it is missing at random. Handling of missing predictors was frequently not reported (22 studies). Where it was reported, complete case analysis was the method most frequently used, which potentially creates non-random, non-representative samples of the source population. For this reason, guidelines for prediction modelling studies have suggested preferential use of other methods such as multiple imputation, noting however that if the number of missing predictors is extensive this technique will not be sufficient to handle the problem [69]. The same argument regarding risk of bias may be made for loss to follow-up: 4 studies reported a loss to follow-up > 10%.
Discrimination and calibration are important properties of predictive models and should be reported. Discrimination relates to the extent to which a model distinguishes those who will suffer the outcome of interest from those who will not, whereas calibration refers to the agreement between observed and predicted outcome rates [74]. The c statistic is the most commonly used performance measure for discrimination [75], but it was retrieved for only 38 derivations focusing on mortality and 15 derivations focusing on functional outcome. Taken together, these studies demonstrated good discriminatory ability for both predictions. The pooled c statistic for mortality prediction was 0.880 and the pooled c statistic for functional outcome prediction was 0.872, but these results must be interpreted with caution, due to the heterogeneity of the included studies in terms of population studied, selected predictors, method of model development and choice of outcome. Other reported measures of discriminatory ability include accuracy, sensitivity/specificity, and positive/negative predictive values, but the interpretation of these measurements is less straightforward: the first two require the use of cut-off points for predicted probabilities, therefore not allowing the full use of model information, whereas the last depends on the overall probability of the event in the studied sample, hampering extrapolation to other populations with different event rates. Calibration was only reported for 14 tools, using either the Hosmer-Lemeshow test, the le Cessie-van Houwelingen test, or a calibration plot.
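The distinction between discrimination and calibration can be illustrated with a small Python sketch. The c statistic is computed here as the proportion of event/non-event pairs in which the patient with the event received the higher predicted risk (ties count as half), and calibration is summarized crudely by the ratio of observed to expected events. The function names and the five-patient data set are hypothetical, and the observed/expected ratio is a simple stand-in for the formal calibration tests named above, not an implementation of them.

```python
def c_statistic(predicted, outcomes):
    """Concordance probability over all event/non-event pairs."""
    events = [p for p, y in zip(predicted, outcomes) if y == 1]
    nonevents = [p for p, y in zip(predicted, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1.0
            elif p_event == p_nonevent:
                concordant += 0.5  # ties contribute half a concordant pair
    return concordant / (len(events) * len(nonevents))


def observed_expected_ratio(predicted, outcomes):
    """Observed events divided by the sum of predicted risks."""
    return sum(outcomes) / sum(predicted)


# Hypothetical predicted risks of death and observed outcomes (1 = died).
risks = [0.9, 0.8, 0.4, 0.3, 0.2]
died = [1, 1, 0, 1, 0]
print(c_statistic(risks, died), observed_expected_ratio(risks, died))
```

A model can rank patients well (high c statistic) while still over- or underestimating absolute risk, which is why both properties need to be reported.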
The most frequently used method for model derivation was logistic regression. There seems to be no consensus about the best method for variable selection during multivariate logistic regression modelling, but most studies used automatic methods. These methods allow for a more efficient use of data but come with an added risk of model overfit and possible exclusion of important predictor variables due to chance, especially when sample sizes are small [76]. Nearly half of the regression based tools were simplified in the form of risk scores, allowing for an easier application. Machine learning algorithms found in our systematic review included decision trees (four), artificial neural networks (four), support vector machines (one), random forests (one) and a hybrid approach (one). These methods are an alternative to logistic regression that require less formal statistical training and offer more efficient use of data and a higher ability to detect non-linear relations. However, they are prone to overfitting, extremely sensitive to small perturbations in data and empirical in the nature of model development [77,78]. Despite being regarded as more statistically efficient, these methods were not superior to logistic regression for discrimination in our review. When models are tested in the same sample on which they were derived, their results tend to be biased due to overfitting: to minimize this problem, internal validation (resampling) techniques can be used. Only 19 derivations used resampling techniques for overfit adjustment. Bootstrapping is recommended as the preferred method of internal validation [74], but was performed for only two.
Other methods encountered included cross-validation and split sample. The latter, used in three tools, is regarded as the least effective method since it reduces statistical power for the derivation procedure and does not validate the results in a new population.
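As an illustration of bootstrap internal validation, the Python sketch below estimates an optimism-corrected c statistic: the model is refitted in each bootstrap sample, its performance in that sample is compared with its performance in the original data, and the average difference (the optimism) is subtracted from the apparent c statistic. This is a minimal sketch of the general technique, not a procedure taken from any included study. The "model" is a deliberately simple hypothetical score that weights each predictor by its mean difference between patients with and without the event; in practice the fitting function would be the full modelling procedure, including any variable selection.

```python
import random


def c_statistic(scores, outcomes):
    """Concordance probability over all event/non-event pairs."""
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not nonevents:
        raise ValueError("both outcome classes are required")
    total = 0.0
    for e in events:
        for n in nonevents:
            total += 1.0 if e > n else 0.5 if e == n else 0.0
    return total / (len(events) * len(nonevents))


def fit_mean_difference_score(rows, outcomes):
    """Toy model (hypothetical): weight = mean in events minus mean in non-events."""
    ev = [r for r, y in zip(rows, outcomes) if y == 1]
    ne = [r for r, y in zip(rows, outcomes) if y == 0]
    weights = [
        sum(r[j] for r in ev) / len(ev) - sum(r[j] for r in ne) / len(ne)
        for j in range(len(rows[0]))
    ]
    return lambda row: sum(w * x for w, x in zip(weights, row))


def optimism_corrected_c(rows, outcomes, fit, n_boot=200, seed=0):
    """Return (apparent c statistic, optimism-corrected c statistic)."""
    rng = random.Random(seed)
    model = fit(rows, outcomes)
    apparent = c_statistic([model(r) for r in rows], outcomes)
    n = len(rows)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        b_rows = [rows[i] for i in idx]
        b_out = [outcomes[i] for i in idx]
        if len(set(b_out)) < 2:
            continue  # skip samples in which only one outcome class occurs
        b_model = fit(b_rows, b_out)
        c_boot = c_statistic([b_model(r) for r in b_rows], b_out)
        c_orig = c_statistic([b_model(r) for r in rows], outcomes)
        optimism.append(c_boot - c_orig)
    mean_optimism = sum(optimism) / len(optimism) if optimism else 0.0
    return apparent, apparent - mean_optimism
```

Unlike a split-sample approach, every patient contributes to model derivation, which is one reason bootstrapping is generally preferred when sample sizes are modest.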
In summary, the results from our review suggest that the most promising prognostic tools are i) logistic regression based risk scores, which combine the high discrimination shown by logistic regression with the ease of application typical of prognostic scores; ii) derived from general cohorts (i.e., spontaneous or primary ICH) to maximize generalizability; iii) without significant loss to follow-up, to minimize risk of bias; iv) with early outcome measurement for mortality (i.e., discharge or 1 month) and later outcome measurement for functional outcome (i.e., 3 months or more); and v) showing high discrimination with an appropriate EPV rate. Examples of such scores include the scores by Chen [45], Hemphill [48], Ho [64], Romano [56] and Ruiz-Sandoval [58] for mortality, Ji [51] and Rost [57] for functional outcome prediction and Godoy [62] for a combined outcome. Not surprisingly, several validation studies have been published for these tools. Other factors to take into account are internal validation and blinded outcome assessment, the latter being particularly important for functional status.
Our review has limitations. First, there were no clear guidelines on conducting and reporting studies for prognostic tool derivation at the time most of these studies were performed. This led to frequent underreporting and greater difficulty in retrieving information about important methodological aspects and performance measures, which is reflected in the results of our review. As an example, we were only able to retrieve c-statistics for 53 derivations, which means that several tools could not be evaluated for this important discrimination measure. Guidelines have recently been published to give guidance on this issue [69]. Second, studies have demonstrated that healthcare professionals are frequently pessimistic in the face of neurological emergencies [79]. This negative perception can result in a "self-fulfilling prophecy", whereby the physician's perception leads to early withdrawal of care which, by itself, facilitates a negative outcome [79]. Most studies assessing the effect of early care limitation on the performance of prognostic models have focused on validation studies [47,80,81]. According to these studies, models underestimate adverse outcomes in patients with early care limitation and overestimate them in patients without. However, care limitation has also been demonstrated to be an independent predictor of poor outcome [34,82]. Hence, one should expect withdrawal of care to affect model performance in derivation studies as well, but this factor was not taken into account in the majority of studies included in this review. A possible solution to this problem is to derive prognostic models from patient populations receiving the maximum level of care. Such an approach was more recently used by Sembill and collaborators to derive the max-ICH score [83].
Third, the previously discussed aspects of prognostic tool derivation are useful to assess the risk of bias and external validity of these instruments, but they do not necessarily determine the way these tools will behave in clinical practice. Risk of bias does not necessarily imply existing bias, and the ultimate issue is how they behave in an independent external dataset [84]. At the time of our search we identified external validation studies for only 27 prognostic tools [14, 16, 20, 22, 26, 27, 29-31, 37, 40, 41, 43-46, 48, 54, 56-59, 62, 63]. Nevertheless, derivation studies less prone to bias are more likely to perform well in validation studies. The issues discussed in this systematic review should therefore be taken as guidance for future studies seeking to validate existing prognostic tools or to derive new ones in ICH patients as well as in other populations.

Conclusions
Prognostic models showed high discrimination in derivation studies for mortality and functional outcome prediction in ICH patients, but numerous methodological and reporting deficiencies were present, namely insufficient length of follow-up for functional outcome, absence of blinding, inadequate reporting and handling of missing data, low EPV rate, infrequent use of appropriate internal validation procedures and underreporting of important model performance measures. Machine learning methods have not proven to be superior to regression based models, and a significant number of these tools have not undergone external validation. Guidelines have been published to support authors in developing and reporting studies both for prognostic model derivation and validation [69].