BMC Medical Research Methodology, Open Access
Empirical Evaluation of Prediction Intervals for Cancer Incidence

Background: Prediction intervals can be calculated for predicting cancer incidence on the basis of a statistical model. These intervals include the uncertainty of the parameter estimates and variations in future rates but do not include the uncertainty of assumptions, such as continuation of current trends. In this study we evaluated whether prediction intervals are useful in practice.


Background
Prediction of cancer incidence is of great interest for both health authorities and the scientific community [1]. Estimates of the future cancer burden should indicate the appropriate amounts of resources that will be needed for diagnosis, treatment and rehabilitation. Since quantification of future cancer incidence is inherently uncertain [2], some measurement of the uncertainty would be useful. It has been suggested that, similar to the confidence intervals calculated in standard statistical modeling, prediction intervals should be presented with predictions of cancer incidence [3,4].
When future cancer incidence is predicted in a statistical model, three sources of uncertainty are associated with the predicted numbers. The first is the variance of the parameters estimated in the model; the second is the random variation of the future number of cases; and the third is the adequacy of the model. The last includes the uncertainty in both the mathematical structure of the model and the choice of projected components, such as whether to assume that current trends will continue into the future. The first two sources of variance can be included in a statistical model. Intervals that are based only on the variance of the parameters estimated in the model are called 'confidence intervals', while intervals that also encompass variation in the future number of cases are called 'prediction intervals'.
The third source of uncertainty, variations caused by deviations from the model assumptions, is difficult to formalize. Engeland et al. [5] argued that "The large uncertainties associated with the specification of the models used in the predictions obviate the construction of confidence intervals." One way of investigating the extent of the uncertainty in specification of a model is to calculate prediction intervals from historical data and then calculate the proportion of those intervals that actually cover the observed number of cancer cases. If the proportion is close to the nominal level, e.g. 95 %, the prediction intervals could be taken to give a fair description of the range of likely values to be expected. A low proportion would indicate that there was great uncertainty in the model assumptions, and that prediction intervals should be interpreted with caution. The aim of this study was to evaluate the extent to which prediction intervals derived from incidence rates can be expected to cover the actual numbers observed 10 years later.

Material
The material consisted of new cases of cancer reported to the cancer registries of Denmark, Finland, Iceland, Norway and Sweden between 1958 and 1997, and population figures covering the same calendar period from the central statistical offices in these countries. Sweden has the largest population, with 8.9 million inhabitants, followed by Denmark, Finland, Norway and Iceland with 5.3, 5.2, 4.5 and 0.3 million, respectively. The Nordic cancer registries receive reports from physicians, hospitals, pathological and cytological laboratories and (except in Sweden) death certificates [6]. Compulsory reporting and information from multiple sources ensure almost 100% completeness of all the cancer registries [7]. Table 1 lists the cancer sites included in the study. Detailed descriptions of the types of tumors included at each site were given by Engeland et al. [5]. As there were few cases of cancers of the lip and larynx among women and of the breast among men, these cancers were included in 'other sites'.

Statistical model
The age-period-cohort (APC) model [8] has been widely used for predicting cancer incidence and mortality [9-13]. This Poisson regression model is based on tables with five-year age groups and five-year calendar periods. Birth cohorts were constructed synthetically by subtracting age from period. The model can be written as

$$R_{ap} = \exp(A_a + D \cdot p + P_p + C_c),$$

where $R_{ap}$ is the incidence rate in age group $a$ in calendar period $p$, $D$ is the common drift parameter [14], $A_a$ is the age component for age group $a$, $P_p$ is the non-linear period component of period $p$ and $C_c$ is the non-linear cohort component of cohort $c$, $c = p - a$. We used a slightly modified model [5], substituting a power link for the log link:

$$R_{ap} = (A_a + D \cdot p + P_p + C_c)^5,$$

in order to level off the exponential growth in the multiplicative model. An empirical study of these two models showed that the power model gave predictions that were closer to the observed rates [2]. For some cancer sites for which there were only a few cases in Iceland, a model without the cohort component was used; see Engeland et al. [5] for details, including the lower limits of the age groups included for each site in all the countries.
To ensure a reasonable fit of each data set to the model, the number of five-year periods on which the predictions were based was chosen as follows. First, a model including the last six 5-year periods (1958-87) was fitted. If the model was rejected by a test for goodness-of-fit (5% level), a model including the last five periods was fitted. If this model was also rejected, only the last four periods were used.
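The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `choose_periods` and the callable `gof_p_value` (which would fit the model to the given periods and return the goodness-of-fit p-value) are hypothetical and stand in for the actual model fitting, which is not shown.

```python
from typing import Callable, Sequence


def choose_periods(
    periods: Sequence[str],
    gof_p_value: Callable[[Sequence[str]], float],
    alpha: float = 0.05,
) -> Sequence[str]:
    """Return the most recent periods on which to base the prediction.

    Tries the last six and then the last five periods; if both fits are
    rejected at level `alpha`, the last four periods are used.
    `gof_p_value` is a hypothetical stand-in for fitting the model and
    testing its goodness of fit.
    """
    for n_periods in (6, 5):
        candidate = periods[-n_periods:]
        if gof_p_value(candidate) >= alpha:  # model not rejected
            return candidate
    # Both larger models were rejected: fall back to the last four periods.
    return periods[-4:]
```

Note that the last four periods are used without a further test, which mirrors the wording of the procedure in the text.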
Future non-linear effects of cohort and period were assumed to be equal to the last estimated effect in the model, and predictions were made by projecting the drift. Numbers of cancer cases were predicted by multiplying the predicted rates by the person-years at risk in a given age group and time period.
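As a rough sketch of this projection step under the power-link model, the following Python functions compute a predicted rate from already estimated components and convert rates to case numbers. All names and numerical values are hypothetical illustrations, not estimates from the study; the exponent 5 is taken from the power model described in the text.

```python
def predicted_rate(age_effect, drift, period_index,
                   last_period_effect, last_cohort_effect, power=5):
    """Rate under the power-link model, projecting the drift to a future
    period while holding the non-linear period and cohort effects at
    their last estimated values."""
    linear = (age_effect + drift * period_index
              + last_period_effect + last_cohort_effect)
    return linear ** power


def predicted_cases(rates, person_years):
    """Predicted number of cases: sum over age groups of rate times
    person-years at risk."""
    return sum(r * n for r, n in zip(rates, person_years))


# Hypothetical components for two age groups, future period index f = 8.
rates = [
    predicted_rate(0.100, 0.002, 8, 0.001, -0.002),
    predicted_rate(0.120, 0.002, 8, 0.001, 0.003),
]
cases = predicted_cases(rates, person_years=[500_000, 300_000])
```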

Prediction intervals, coverage level and discrepancy ratio
In an article on the precision of cancer incidence predictions, Hakulinen and Dyba [3] derived prediction intervals for Poisson distributed variables. Following their paper, let

$$k_{af} \sim \mathrm{Poisson}(n_{af} R_{af}),$$

where $a$ is the age group, $f$ is the future calendar period for which predictions are to be made ($f = 8$, which corresponds to the period 1993-97), $R_{af}$ is the incidence rate in age group $a$ in period $f$, and $k_{af}$ and $n_{af}$ are the corresponding number of cases and person-years, respectively. Further, let the expected number of cases, $E(k_{af})$, be defined as

$$\lambda_{af} = n_{af} (A_a + D \cdot f + P_f + C_c)^5,$$

and let the variance be $\mathrm{Var}(k_{af}) = \sigma^2 \lambda_{af}$, i.e., $\sigma^2$ measures the degree of over-dispersion.
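An estimator for the variance of the future number of cases can be found by using Taylor series expansion of non-linear functions (see formulas in the appendix). A 95% prediction interval for the number of cases in period $f$, $k_f = \sum_a k_{af}$, was then calculated as

$$\hat{k}_f \pm 1.96 \sqrt{\widehat{\mathrm{Var}}(k_f)},$$

where $k_f$ was estimated by $\hat{k}_f = \sum_a \hat{\lambda}_{af}$.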
On the basis of up to six 5-year periods between 1958 and 1987, prediction intervals for the numbers of cases in the period 1993-97 were calculated for all 200 combinations of 20 sites for each sex in each of the five countries. The coverage level was defined as the proportion of the prediction intervals that covered the observed number of cases in 1993-97.
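The coverage level is a simple proportion. A minimal Python sketch, with a hypothetical function name and made-up observed numbers, could look as follows.

```python
def coverage_level(intervals, observed):
    """Proportion of (lower, upper) prediction intervals that contain
    the corresponding observed number of cases."""
    covered = sum(
        lower <= k <= upper
        for (lower, upper), k in zip(intervals, observed)
    )
    return covered / len(observed)


# Hypothetical example: two intervals, one of which covers the observation.
example = coverage_level([(6, 46), (100, 120)], [26, 130])
```

With 104 of 200 intervals covering the observed number, as reported in the Results, the same calculation gives 104 / 200 = 0.52.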
The coverage level only indicates whether the observed number of cases was inside the prediction interval or not. Additional information can be gained by looking at how far outside of the interval the observations fall. The discrepancy ratio was defined as the absolute distance between the observed and predicted numbers of cases, measured in half prediction interval widths:

$$\mathrm{DR} = \frac{|\hat{k}_f - k_f|}{d_f},$$

where $\hat{k}_f$ and $k_f$ are the predicted and observed number of cases, respectively, and $d_f$ is the distance between the predicted number of cases and the limit of the prediction interval. Figure 1 illustrates the discrepancy ratio. When the discrepancy ratio is larger than 1, it measures how much wider the prediction interval would have had to be to cover the observed number of cases. In Figure 1, the observed number of cases is about twice as far from the predicted number as the lower limit of the prediction interval is, giving a discrepancy ratio of 2.
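To make the definition concrete, here is a small Python sketch of the discrepancy ratio. The function name and the numbers are hypothetical; the example is chosen to mimic the situation described for Figure 1, where the observation lies twice as far from the prediction as the lower limit does.

```python
def discrepancy_ratio(predicted, observed, limit):
    """Absolute distance between observed and predicted cases, measured
    in units of the distance from the prediction to the interval limit
    (the half-width of a symmetric prediction interval)."""
    half_width = abs(predicted - limit)
    return abs(predicted - observed) / half_width


# Hypothetical numbers: prediction 100, lower limit 90, observation 80.
ratio = discrepancy_ratio(predicted=100, observed=80, limit=90)
```

A ratio of at most 1 means that the observation is covered by the interval; a ratio above 1 means that it is not.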
Fisher's exact test was used to evaluate differences between sites, countries and quartiles of number of cases. A binomial regression model was used to study the effect of country and frequency simultaneously.

Results
Prediction intervals were calculated from data up to 1987 for the 200 combinations of sex, site and country. After observing the number of cases 10 years later, in the period 1993-97, 104 (52%) of the observed numbers were covered by the prediction intervals. Coverage levels for specific sites varied from 20% to 90%, but only five or ten intervals were calculated for each site and the difference between the sites was not statistically significant (Table 1).
The coverage levels for the five Nordic countries varied widely. For the country with the smallest population, Iceland, the coverage level was 88%, which is relatively close to the nominal level of 95%. The levels for Denmark, Finland and Norway were around 50%, while that for Sweden, the most populous country, was only 25%. When the 200 different predictions were ranked according to the annual number of cases in the period 1983-87 (frequency), the cut-offs for the four quartiles were 70, 230 and 555. When the predictions were subdivided according to these quartiles, the coverage level decreased markedly with the annual number of cases, being 84% for the first quartile and 52%, 46% and 26% for the next three, respectively (Table 1).

Table 2 shows the associations between country and coverage level as odds ratios (ORs), where the odds of covering the observed number of cases with the prediction intervals in each country were calculated relative to the odds in Denmark. The crude numbers reflect the pattern seen in Table 1: the chance of the observed number of cases being within the prediction interval was similar in Denmark, Finland and Norway, higher in Iceland and lower in Sweden. In the binomial regression model, the logarithm of the annual number of cases in 1983-87 was used instead of the frequency itself or the quartiles of the numbers. The reason for this was that the fit to the model was worse when the frequency variable was entered on a

Figure 1. The discrepancy ratio. Illustration of the components of the discrepancy ratio. The discrepancy ratio compares the distance between predicted and observed number of cases with the distance between predicted number and the limit of the prediction interval.

The distribution of the discrepancy ratio for each country is plotted in Figure 2. For Iceland, most of the discrepancy ratios were below 1, corresponding to a coverage level of 88%. Of the 12% with a higher discrepancy ratio, none was more than 50% outside of the prediction interval. In Finland and Norway, observed numbers of cases fell up to three times further from the predicted numbers than the limits of the prediction intervals, while in Denmark and Sweden some predictions were 5-6 times outside of the intervals.

Discussion
The coverage levels in the five Nordic countries differed mainly as a function of the numbers of cases that formed the basis for the predictions. In Iceland, where there were generally few cases, the coverage level calculated from 95% prediction intervals was 88%, while that in the other Nordic countries, which had many more cases, was only 25-53%. The problem associated with interpretation of prediction intervals is illustrated in Figure 3, where the observed age-standardized (World standard [15]) incidence rates for cancer of the lung in women in Iceland and Denmark are plotted against the predicted rates, with 95% prediction intervals. Although the difference between the observed and predicted rates was smaller in Denmark than in Iceland, the wide prediction interval for Iceland meant that the observed rate in 1993-97 was covered by the interval constructed for Iceland, but the narrower interval for Denmark failed to cover the observed rate for that country.
A statistical model is only a simplification of true underlying associations between variables. The finding that the coverage level decreases as the sample size increases can be explained by considering the difference between the modeled (simplified) and true (complex) relationship between the cancer rate and the explanatory variables age, period and cohort. When the sample size increases, the random variation becomes small relative to the deviations between the modeled and true relationship, so these deviations will dominate and non-coverage will become a problem.

It is useful to distinguish between calculations of prediction intervals for values within the observed range of values of the explanatory variables (interpolation), and outside the range (extrapolation). The non-coverage problem is even larger for extrapolations, because they also rely on the assumption of continuation of current trends. Another difference between interpolation and extrapolation is that when the sample size increases, the possibility to improve the model increases. This would then reduce the non-coverage problem for interpolations, by reducing the distance between the true and the modeled relationship between the variables. Extrapolations, on the other hand, consist of making predictions for values of the covariates outside the range of observed values. Thus, we have to make assumptions that cannot be evaluated from the observed data, and the problem with non-coverage for extrapolations is not necessarily reduced by improvements in the model.

It could be argued that a low coverage level indicates that the model is not appropriate, rather than that calculation of prediction intervals is per se misleading. With the possible exception of tobacco and lung cancer, few associations are strong enough to be modelled directly. Instead, we used calendar period and birth cohort as proxy variables for changes in underlying risk factors in the model. Møller et al. [2] found that the method used in this study performed fairly well in comparison with other methods currently in use for predicting cancer incidence, and that all the methods evaluated missed the observed number of cases by 10-15% on average for 10-year predictions. Prediction methods could therefore be improved to increase the overall coverage level, although it would be unreasonable to expect that the correct model, or close to it, could be specified for all cancer sites. As long as predictions are based on some type of extrapolation from a statistical model, the coverage level will generally decrease as a function of sample size. The problem is that at the time when the prediction intervals are calculated, the appropriateness of the model with respect to extrapolation into the future usually cannot be evaluated. Prediction intervals can thus be misleading if they are interpreted as the range of likely values for the number of cases to be expected.
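The dependence of coverage on sample size can be illustrated with a small calculation. The Python sketch below is not the method used in this study: it assumes, purely for illustration, that the interval reflects only Poisson variation (predicted number ± 1.96 times its square root, ignoring parameter uncertainty and over-dispersion), that the true expected number differs from the prediction by a fixed relative bias, and that a normal approximation to the Poisson distribution holds. The function name and the 10% bias are hypothetical choices, the latter loosely motivated by the 10-15% average error mentioned above.

```python
from math import sqrt
from statistics import NormalDist


def approx_coverage(predicted, relative_bias, z=1.96):
    """Approximate probability that predicted +/- z*sqrt(predicted)
    covers a count whose true mean is predicted * (1 + relative_bias).

    Assumes a normal approximation to the Poisson distribution and an
    interval that ignores parameter uncertainty (illustration only).
    """
    true_mean = predicted * (1 + relative_bias)
    half_width = z * sqrt(predicted)
    dist = NormalDist(mu=true_mean, sigma=sqrt(true_mean))
    return dist.cdf(predicted + half_width) - dist.cdf(predicted - half_width)


# Same 10% relative bias, increasing predicted numbers of cases.
coverage_by_size = {k: approx_coverage(k, 0.10) for k in (20, 200, 2000)}
```

Under these assumptions the interval width grows only with the square root of the number of cases while the bias grows linearly, so the approximate coverage falls steadily as the predicted number increases.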
The number of cases from which the predictions for the different cancer sites were made varied widely. Cancers at some sites are very common, like those of the prostate, lung and breast, while others occur less frequently. Because the coverage levels vary with frequency, we would also have expected them to vary by site. The differences were not, however, statistically significant, probably because of the small number of prediction intervals calculated for each site.
A relatively large sample size was associated with a narrow prediction interval, as seen for lung cancer among women in Denmark (Figure 3). The prediction intervals are based on asymptotic theory, which can result in underestimates of variance. Bootstrapping is a suitable method for investigating this problem [16]. We constructed a 95% prediction interval by bootstrapping the data for lung cancer among women in Denmark, assuming that each cell followed a binomial distribution in which the incidence rate and the number of person-years at risk were used as probability of success and number of trials, respectively. We resampled the data 1000 times, calculating the predicted world-standardized incidence rate each time. The 95% bootstrap interval, calculated by selecting the 2.5 and 97.5 percentiles of the 1000 predictions, was 32.7-36.7, which is fairly close to the asymptotic interval of 32.6-36.8. This indicates that the asymptotic intervals calculated in this study describe the uncertainty in the predicted number of cases well, given a correctly specified model.
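A simplified sketch of such a parametric bootstrap with a percentile interval is given below in Python. It departs from the procedure described above in two labeled ways: the binomial draw for each cell is replaced by a normal approximation (an assumption made to keep the example fast with the standard library), and the statistic is a weighted rate computed directly from the resampled cells rather than a prediction from a refitted model. All names and input values are hypothetical.

```python
import random
from math import sqrt


def resample_cases(rate, person_years, rng):
    """Draw a case count for one cell, using a normal approximation to
    Binomial(person_years, rate) (assumption for illustration)."""
    mean = person_years * rate
    sd = sqrt(person_years * rate * (1 - rate))
    return max(0.0, rng.gauss(mean, sd))


def bootstrap_interval(rates, person_years, weights, n_boot=1000, seed=1):
    """95% percentile interval for a weighted (standardized) rate."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample every cell and recompute the weighted rate.
        resampled = [
            resample_cases(r, n, rng) / n
            for r, n in zip(rates, person_years)
        ]
        stats.append(sum(w * r for w, r in zip(weights, resampled)))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical two-cell example with standard weights summing to 1.
lower, upper = bootstrap_interval(
    rates=[0.0002, 0.0005],
    person_years=[100_000, 80_000],
    weights=[0.6, 0.4],
)
```

In a fuller version, the model would be refitted to each resampled data set and the predicted standardized rate recorded, as described in the text.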
Figure 2. Empirical distribution of the discrepancy ratio by country. Predictions for the 40 combinations of 20 sites for each sex constitute the distribution in each country.
Population forecasts are needed for predicting the number of cancer cases. In this paper, we assumed that these numbers were known, but in reality they constitute a separate source of uncertainty. Population forecasts are themselves extrapolations, relying on assumptions about future migration patterns, birth rates and death rates. If our projections for 1993-97 had been calculated from population forecasts in 1987, the coverage levels would probably have been even lower.
Most of the differences in coverage level among the five countries disappeared when the number of cases was controlled for. The remaining difference was not significant, but the probability of covering the rates with prediction intervals for Sweden continued to be lower after adjustment for sample size. Møller et al. [2] showed that the predictions for Sweden deviated more from the observed number of cases in 1993-97 than those for the other countries, measured as the median of the absolute value of the relative difference between the predicted and observed numbers of cases. This explains the lower coverage level for Sweden, and indicates that the trends current in 1987 continued to a lesser extent in Sweden than in the other countries.
Engeland and co-workers [5] were reluctant to include prediction intervals with their predictions of cancer incidence in the Nordic countries. A similar view was expressed both with regard to an update of predictions for the Nordic countries [17], and to a prediction of cancer incidence in New South Wales, Australia [18]. There are, however, some instances where prediction intervals can be of value. In cancer surveillance, inclusion of prediction intervals in a routine comparison of the latest observed rates with rates predicted from previous trends can help to identify changes in the rates beyond random variation. The potential reasons for any discrepancy between the observed and predicted rates can then be studied, including changes in risk factors, diagnostic methods or interventions such as screening programs. In Finland, predicted values with prediction intervals for 1980 were calculated based on rates up to 1968 and compared with the observed number of cases in 1980 [19]. Of 33 prediction intervals, 22 (67%) covered the observed values, and the authors discussed possible reasons for those cancers where the prediction intervals failed to cover the observed number of cases. Prediction intervals can also be used to identify highly uncertain predictions. For instance, 26 cases of cancer of the lip among men were predicted in Iceland in the period 1993-97, and the 95% prediction interval was 6-46 cases.

Conclusion
We do not recommend use of prediction intervals when the predictions are used for administrative purposes, like planning appropriate amounts of resources for diagnosis,

Figure 3. Illustration of prediction interval. Age-standardized (World population) incidence rates of lung cancer among women in Iceland and Denmark. Predicted rates based on observed rates up to 1987, with corresponding 95% prediction interval for the period 1993-97 for each country.
