We believe this is the first study that has systematically appraised the methodological conduct and reporting of studies evaluating the performance of multivariable prediction models (diagnostic and prognostic). Evaluating the performance of a prediction model in datasets not used in its derivation (external validation) is a crucial step before a new prediction model should be considered for routine clinical practice [12, 13, 26, 35]. External or independent evaluation is predicated on the full reporting of the prediction model in the article describing its development, including reporting eligibility criteria (i.e. ranges of continuous predictors, such as age). A good example of a prediction model that has been inadequately reported, making evaluations by independent investigators impossible [36, 37], yet appears in numerous clinical guidelines [4, 38], is the FRAX model for predicting the risk of osteoporotic fracture.
We assessed the methodological conduct and reporting of studies published in the 119 core clinical journals listed in Abridged Index Medicus. Our review identified that 40% of external validation studies were reported in the same article that described the development of the prediction model. Of the 60% of articles that were solely evaluating the performance of an existing published prediction model, 40% were conducted by authors involved in the development of the model. Whilst evaluating one’s own prediction model is a useful first step, this is less desirable than an independent evaluation conducted by authors not involved in its development. Authors evaluating the performance of their own model are naturally likely to err on the side of being overly optimistic in interpreting results, or to report selectively (possibly choosing to publish external validation from datasets with good performance and omitting any poorly performing data).
The quality of reporting in external validation studies included in this review was, unsurprisingly, very poor. Important details needed to objectively judge the quality of the study were generally inadequately reported or not reported at all. Little attention was given to sample size. Whilst formal sample size calculations for external validation studies are not necessary, there was little acknowledgement that the number of events is the effective sample size; 46% of datasets had fewer than 100 events, which has been suggested, albeit from a single simulation study, as the minimum effective sample size for external validation. Around half of the studies made no explicit mention of missing data. The majority (64%) of studies were assumed to have conducted complete-case analyses to handle missing values, despite methodological guidance to the contrary [40–44]. Multiple imputation was conducted and reported in very few studies, and the amount of and reasons for any missing data were poorly described. The analyses in many of these studies were often confusingly conducted and reported, with numerous unclear and unnecessary analyses done and key analyses (e.g. calibration) not carried out. Some aspects identified in this review are not specific to prediction modelling studies (e.g. sample size, study design, dates); it is therefore disappointing that such basic details of the study are also often poorly reported.
Key characteristics, such as calibration and discrimination, are widely recommended aspects to evaluate [9, 12–15, 26, 45, 46]. Both components are extremely important and should be reported for all studies evaluating the performance of a prediction model, yet calibration, which assesses how close the prediction for an individual is to their true risk, is inexplicably rarely reported, as observed in this and other reviews [1, 23, 47]. With regard to calibration, preference should be given to presenting a calibration plot, possibly with the calibration slope and intercept, rather than the Hosmer-Lemeshow test, which has a number of known weaknesses related to sample size. For example, a model evaluated on a large dataset with good calibration can fail the Hosmer-Lemeshow test, whilst a model validated on a small dataset with poor calibration can pass it. Arguably more important than calibration or discrimination is clinical usefulness. Whilst a formal evaluation of clinical usefulness in terms of improving patient outcomes or changing clinician behavior [26, 49] is not part of external validation, an indication of the potential clinical utility can be obtained. New methods based on decision curve analysis (net benefit) and relative utility have recently been introduced. Only one study in our review attempted to evaluate the impact of using a model, and it included an author who developed the particular methodology. However, since this review, interest in and uptake of these methods have slowly started to increase. In instances where the validation seeks to evaluate clinical utility, assessing issues such as calibration (which can often be observed in a decision curve analysis) separately may not be necessary. However, most studies in our review were attempting to evaluate the statistical properties and thus, as a minimum, we expect calibration and discrimination to be reported.
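As a rough illustration of the performance measures discussed above, the following Python sketch computes the c-statistic (discrimination), the ratio of observed to expected events (a crude summary of calibration-in-the-large) and the net benefit at a single risk threshold, as used in decision curve analysis. The function names, the toy data and the 30% threshold are hypothetical and chosen purely for illustration; they are not taken from any study in this review, and the sketch does not replace a calibration plot or a full decision curve.

```python
from typing import Sequence


def c_statistic(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; tied pairs count as one half."""
    event_risks = [p for y, p in zip(outcomes, risks) if y == 1]
    nonevent_risks = [p for y, p in zip(outcomes, risks) if y == 0]
    if not event_risks or not nonevent_risks:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for pe in event_risks:
        for pn in nonevent_risks:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(event_risks) * len(nonevent_risks))


def observed_expected_ratio(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Observed number of events divided by the sum of predicted risks.
    Values above 1 suggest under-prediction, below 1 over-prediction."""
    return sum(outcomes) / sum(risks)


def net_benefit(outcomes: Sequence[int], risks: Sequence[float], threshold: float) -> float:
    """Net benefit of treating everyone with predicted risk >= threshold:
    TP/n - FP/n * threshold / (1 - threshold)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    true_pos = sum(1 for y, p in zip(outcomes, risks) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(outcomes, risks) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of the reference strategy of treating every individual."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy validation data (hypothetical): observed outcomes and predicted risks.
toy_outcomes = [0, 0, 1, 1]
toy_risks = [0.10, 0.40, 0.35, 0.80]

if __name__ == "__main__":
    print("c-statistic:", c_statistic(toy_outcomes, toy_risks))
    print("O/E ratio:", observed_expected_ratio(toy_outcomes, toy_risks))
    print("net benefit (model):", net_benefit(toy_outcomes, toy_risks, 0.30))
    print("net benefit (treat all):", net_benefit_treat_all(toy_outcomes, 0.30))
```

A model has potential clinical utility at a given threshold only if its net benefit exceeds both that of treating everyone and that of treating no one (which is zero); repeating the calculation over a range of thresholds gives a decision curve.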
Many of the prediction models were developed and presented as simplified scoring systems, whereby the regression coefficients were rounded to integers and then summed to obtain an overall integer score for a particular individual. These scores are often then used to create risk groups, by partitioning the score into 2 or more groups. However, these groups are often merely labelled low, medium or high risk groups (in the case of 3 groups), with no indication of how low, medium or high was quantified. Occasionally, these risk groups are described by reporting the observed risk for each group; however, they should be labelled with the predicted risks, typically by reporting the range or mean predicted risk. Authors of a few of the scoring systems presented lookup tables or plots that directly translated the total integer score to a predicted risk, making the model much more usable.
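To make the point about lookup tables concrete, the sketch below translates a total integer score into a predicted risk and labels each risk group with its range of predicted risks. It assumes a logistic model in which every point adds the same amount to the log odds; the intercept, the per-point coefficient and the group boundaries are hypothetical values chosen for illustration and do not correspond to any scoring system in this review.

```python
import math
from typing import Dict, Tuple

# Hypothetical values for illustration only; a real scoring system would
# report its own intercept and per-point coefficient.
INTERCEPT = -4.0
LOG_ODDS_PER_POINT = 0.5


def predicted_risk(score: int) -> float:
    """Predicted risk for a total integer score under a logistic model."""
    linear_predictor = INTERCEPT + LOG_ODDS_PER_POINT * score
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def risk_lookup_table(max_score: int) -> Dict[int, float]:
    """Lookup table mapping every score from 0 to max_score to a predicted risk."""
    return {score: predicted_risk(score) for score in range(max_score + 1)}


def label_risk_groups(
    groups: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[float, float]]:
    """For each group, given as an inclusive (lowest, highest) score range,
    return the (lowest, highest) predicted risk within that group."""
    return {
        name: (predicted_risk(low), predicted_risk(high))
        for name, (low, high) in groups.items()
    }


# Hypothetical partition of a 0 to 12 point score into three risk groups.
example_groups = {"low": (0, 3), "medium": (4, 7), "high": (8, 12)}

if __name__ == "__main__":
    for name, (lowest, highest) in label_risk_groups(example_groups).items():
        print(f"{name}: predicted risk {lowest:.1%} to {highest:.1%}")
```

Reporting the groups in this way tells the reader what low, medium and high mean in terms of predicted risk, rather than leaving the labels unquantified.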
Terminology surrounding prediction modelling studies is inconsistent and identifying these studies is difficult. Search strings developed to identify prediction modelling studies [53–55] inevitably result in a large number of false-positives, as demonstrated in this review. For example, whilst the term validation may be semantically debatable, it is synonymous in prediction modelling studies with evaluating performance; yet, in the studies included in this review, only 43 papers (55%) included the term in the abstract or title (24% in the title alone). To improve the retrieval of these studies, we recommend that authors clearly state in the title whether the article describes the development or validation (or both) of a prediction model.
Our study has the limitation that we only examined articles published in the subset of PubMed core clinical journals. We chose to examine this subset of journals as it includes 119 of the most widely read journals published in English, covering all specialties of clinical medicine and public-health sciences, and including all major medical journals. Our review also included studies published in 2010; however, since no initiative to improve the quality of reporting of prediction modelling studies has been put in place, we feel that, whilst methodology may have evolved, there is no reason to believe that reporting will have improved.
Systematic reviews of studies developing prediction models have identified numerous models for predicting the same or similar outcome [1, 56–59]. Instead of developing yet another new prediction model for an outcome for which several already exist, authors should direct their efforts towards evaluating and comparing existing models and, where necessary, updating or recalibrating them, rather than disregarding and ultimately wasting information from existing studies. Journal editors and peer reviewers can also play a role by demanding a clear rationale and evidence for the need for a new prediction model, and by placing more emphasis on studies evaluating prediction models. Recently, methods have been developed that combine existing prediction models, thereby improving generalisability and, importantly, not wasting existing research [60, 61].