An R package for an integrated evaluation of statistical approaches to cancer incidence projection

Knoll, Maximilian; Furkel, Jennifer; Debus, Jürgen; Abdollahi, Amir; Karch, André; Stock, Christian

doi:10.1186/s12874-020-01133-5

Research article
Open access
Published: 15 October 2020

Maximilian Knoll ORCID: orcid.org/0000-0002-9037-3980^1,2,3,4,
Jennifer Furkel^1,2,3,4,
Jürgen Debus^1,3,4,
Amir Abdollahi^1,3,4,
André Karch⁵ &
…
Christian Stock^6,7

BMC Medical Research Methodology volume 20, Article number: 257 (2020)

Abstract

Background

Projection of future cancer incidence is an important task in cancer epidemiology. The results are of interest also for biomedical research and public health policy. Age-Period-Cohort (APC) models, usually based on long-term cancer registry data (> 20 yrs), are established for such projections. In many countries (including Germany), however, nationwide long-term data are not yet available. General guidance on statistical approaches for projections using rather short-term data is challenging and software to enable researchers to easily compare approaches is lacking.

Methods

To enable a comparative analysis of the performance of statistical approaches to cancer incidence projection, we developed an R package (incAnalysis), supporting in particular Bayesian models fitted by Integrated Nested Laplace Approximations (INLA). Its use is demonstrated by an extensive empirical evaluation of operating characteristics (bias, coverage and precision) of potentially applicable models differing in complexity. Observed long-term data from three cancer registries (SEER-9, NORDCAN, Saarland) were used for benchmarking.

Results

Overall, coverage was high (mostly > 90%) for Bayesian APC models (BAPC), whereas less complex models showed differences in coverage dependent on projection-period. Intercept-only models yielded values below 20% for coverage. Bias increased and precision decreased for longer projection periods (> 15 years) for all except intercept-only models. Precision was lowest for complex models such as BAPC models, generalized additive models with multivariate smoothers and generalized linear models with age x period interaction effects.

Conclusion

The incAnalysis R package allows a straightforward comparison of cancer incidence rate projection approaches. Further detailed and targeted investigations into model performance in addition to the presented empirical results are recommended to derive guidance on appropriate statistical projection methods in a given setting.

Background

Projection of future cancer incidence is an important task in cancer epidemiology. The results are also of interest for biomedical research and public health policy. In particular, cancer prevention and screening programs require reliable estimates of future cancer incidence to allow informed decisions on their design and to facilitate evaluations [1, 2]. Projections are often performed using long-term data (> 20 yrs) from population-based cancer registries [3]. For short-term data, there appears to be a lack of guidance on which statistical approach to choose. The need to base projection models on relatively short-term data is relevant e.g. for Germany, where aggregated cancer incidence data at the national level are only available from 1999 onwards, as well as for many countries with newly established cancer registries. Even though it might be challenging to give general guidance on which projection approach to choose, software enabling comparison of multiple competing methods for a given research question might prove useful; however, flexible, extensive and easy-to-use tools are missing.

A selection of previously applied projection models is outlined in [4]. Relatively simple approaches assuming constant rates have been utilized [5, 6], as well as more complex age-period (AP) models formulated as generalized linear models (GLMs) with or without interaction effects [7,8,9]. Clements et al. use generalized additive models (GAMs) [10]. GAMs can include uni- or multivariate smoothers in their linear predictors. An established model class for incidence projections based on long-term observation data are age-period-cohort (APC) models, which additionally incorporate a cohort effect [11, 12]. Even though projections from APC models usually yield robust results, the APC identification problem impairs direct interpretability of single effects [13, 14].

Projection models are often fitted within a classical maximum likelihood (ML) or restricted maximum likelihood (REML) framework [15,16,17]. Alternatively, a Bayesian framework may be used [18, 19]. Bayesian model estimation can be implemented using Markov-Chain Monte Carlo (MCMC) methods, which are computationally intensive. A recently developed computationally far less demanding alternative is Integrated Nested Laplace Approximation (INLA) [20, 21].

GAMs usually incorporate splines to fit univariate trends or tensor product smoothers for multivariate trends (i.e. interactions between functions of continuous variables). In the classical frequentist framework, such models can be fitted e.g. using the mgcv package in R [22]. Uni- and multivariate smoothers can be incorporated directly into the model formula, e.g. as splines or tensor product smoothers.

Recently, a highly flexible Bayesian APC (BAPC) model based on the INLA approach has been proposed for future cancer incidence projections which assumes a Poisson distribution of incidence counts [19]. Havulinna et al. demonstrate that interactions between effects can be modeled by specifying appropriate priors [18].

Given the lack of guidance on statistical modeling approaches to cancer incidence projection and the increasing understanding across sciences that neutral comparisons of statistical methods are needed [23,24,25], we developed an R package which allows an integrated comparison of model performance metrics in the above described context. We thereby aim to facilitate an informed choice of statistical models and the development of methodological guidance. Due to the desirable flexibility in modeling options and the probabilistic interpretation of results in a Bayesian framework as well as the computationally efficient implementation, we emphasize the INLA approach. To demonstrate the functionality of the new package we provide an extensive empirical benchmarking analysis of a selection of potentially applicable modeling approaches using observed long-term data from three population-based cancer registries.

Methods

Cancer registry data

Three low-incidence tumor sites/entities (brain tumors, kidney cancer, melanoma) and four high-incidence entities (lung, breast, colorectal, prostate) were selected from three population-based cancer registries: SEER-9 [26], NORDCAN [27] and Saarland [28]. Incidence data of patients younger than 20 years and older than 84 years (available only as aggregated data) were excluded from analysis. Specific selection criteria are shown in Table 1. Data were analyzed separately for males and females with few exceptions (prostate cancer: only male; breast cancer: only female in the Saarland data, males and females in SEER-9 and NORDCAN data). A representative data structure of incidence and population data, as also used in the incAnalysis package, is shown in Suppl.-Tbl. 1. Cancer cases (incidence) for a given year are stored in rows (most recent year in the bottom row), and each row is split by age (group) into columns, in increasing order from left to right.
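The layout described above (years in rows, age groups in columns, with a population matrix of the same shape) can be illustrated with a minimal Python sketch. All numbers, age-group labels and the helper name rates_per_100k are hypothetical and are not taken from the registries or from the incAnalysis package; the sketch only shows how matching incidence and population matrices yield age-specific rates.

```python
# Rows: calendar years in increasing order (most recent year last).
# Columns: age groups in increasing order from left to right.
years = [2012, 2013, 2014]
age_groups = ["20-24", "25-29", "30-34"]

# Hypothetical case counts and population sizes with identical shape.
incidence = [
    [3, 5, 8],
    [4, 6, 9],
    [2, 7, 11],
]
population = [
    [50_000, 52_000, 51_000],
    [50_500, 52_500, 51_500],
    [51_000, 53_000, 52_000],
]


def rates_per_100k(incidence, population):
    """Age-specific incidence rates per 100,000, cell by cell."""
    return [
        [cases / pop * 100_000 for cases, pop in zip(inc_row, pop_row)]
        for inc_row, pop_row in zip(incidence, population)
    ]


rates = rates_per_100k(incidence, population)
for year, row in zip(years, rates):
    print(year, [round(r, 2) for r in row])
```

Keeping both matrices in the same orientation means that each cell of the incidence matrix has a matching denominator at the same position.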

Table 1 Selection details for analyzed tumor sites/entities for the three cancer registries and selected incidence data. ⁻low, ⁺high incidence. ¹: male/female; age 60 for SEER-9 and age group 60–64 otherwise

From the Surveillance, Epidemiology, and End Results (SEER) Program in the United States, SEER-9 cancer incidence data (1973–2014) were accompanied by population data, available in 1 year age groups.

NORDCAN data (1960–2015) comprise cancer incidence data from Denmark, Finland, Iceland, Norway, Sweden, Faroe Islands and Greenland. The data were retrieved from the NORDCAN website on 2018-08-01. Incidence data were available in 5 year age groups. Population matrices were calculated from the person-years at risk information.

Cancer incidence data from Saarland (1970–2014), a German federal state with a long-established cancer registry, were obtained from the Saarland cancer registry website on 2018-08-01 (5 yr age groups). Population data were retrieved from the health report system of the federal government (up to 2012) and from the website of Saarland for the years 2013/14 [29, 30].

Projection models

The incAnalysis R package (see the section "R package incAnalysis" below) was used to evaluate a number of increasingly complex models (GLMs, GAMs, BAPC) using the INLA framework. To describe the evaluated models, we introduce the following notation: Y denotes observed cancer incidence counts, N denotes population size, and AGE and PERIOD are the respective covariates. This notation also corresponds to the variable names used in the R package. Age or age group, respectively, is indexed by i. Selected projections are shown in Suppl.-Figs. 1 and 2.

Generalized linear models (GLMs)

GLMs are formulated using three components: (1) a probability distribution from the exponential family, (2) a linear predictor η = Xβ and (3) a link function g with E(y) = μ = g⁻¹(η). In all models except the BAPC models, negative-binomially distributed counts of tumor cases were assumed.

The most simply structured GLM includes only an intercept, η = β₀. In R, this intercept-only model was formulated as Y ~ offset(log(N)), which is equivalent to Y ~ 1 + offset(log(N)).

Next, a GLM with age and period as covariates together with their interaction term was assessed: η = β₀ + β₁ age + β₂ period + β₃ age:period, corresponding to the R formula Y ~ offset(log(N)) + AGE*PERIOD.
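The role of the offset term in these formulas can be made concrete with a short Python sketch. With a log link and offset log(N), the expected count is μ = N · exp(η), so exp(η) acts as a rate. The counts and populations below are hypothetical, and the pooled estimate shown is the Poisson maximum likelihood solution for the intercept-only case; it is used here purely as a simplification and is not the negative-binomial INLA fit used in the analysis.

```python
import math


def expected_count(population, eta):
    """Expected count under a log link with offset log(N): mu = N * exp(eta)."""
    return population * math.exp(eta)


def pooled_log_rate(counts, populations):
    """Poisson ML estimate of beta_0 for an intercept-only model with offset.

    Assumption: Poisson likelihood, chosen only to keep the sketch simple.
    """
    return math.log(sum(counts) / sum(populations))


# Hypothetical counts and populations for three age groups in one year.
counts = [10, 40, 150]
populations = [100_000, 100_000, 100_000]

beta0 = pooled_log_rate(counts, populations)
fitted = [expected_count(n, beta0) for n in populations]
print([round(f, 2) for f in fitted])
```

Because the intercept-only model assigns the same rate to every age group, the fitted counts are identical across groups here although the observed counts differ by more than an order of magnitude.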

Generalized additive models (GAMs)

GAMs have a structure similar to GLMs, with the difference that smooth functions f₁, f₂, … of covariates can be included in the linear predictor (A: model matrix, θ: parameter vector): g(μ) = A θ + f₁(x₁) + f₂(x₂) + ….

Splines might be used as smooth functions or, in the case of INLA, specific Gaussian Markov Random Fields. In the present analysis, B-splines were used as the univariate smoother for the age covariate; bs() from the splines package can be included directly in the model formula: Y ~ offset(log(N)) + PERIOD + bs(AGE). Alternatively, a second-order random walk (rw2) model might be specified as Y ~ offset(log(N)) + PERIOD + f(AGE, model = 'rw2').
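To give some intuition for the rw2 alternative: a second-order random walk places a Gaussian prior on the second differences x_i − 2x_{i+1} + x_{i+2} of the effect, so deviations from a straight line are penalized. The following Python sketch, with hypothetical effect values, only computes these second differences; it does not fit any model.

```python
def second_differences(x):
    """Second-order differences x[i] - 2*x[i+1] + x[i+2] of a sequence."""
    return [x[i] - 2 * x[i + 1] + x[i + 2] for i in range(len(x) - 2)]


# Hypothetical age effects: one exactly linear, one curved.
linear = [1.0, 1.5, 2.0, 2.5, 3.0]
curved = [0, 1, 4, 9]

print(second_differences(linear))  # all zero: no penalty under rw2
print(second_differences(curved))  # non-zero: curvature is penalized
```

A perfectly linear effect has zero second differences, so the rw2 prior shrinks the fitted effect towards a locally linear shape.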

To allow evaluation of models with multivariate tensor product smoothers for age and period with INLA, we used an ad hoc solution applying a z-model (we acknowledge that this is a non-standard approach and that a more detailed outline than is possible within the scope of this article would be useful before more widespread application). Tensor product interactions can be specified, e.g. by using the function mgcv::te() for the classical model fitting approach (Y ~ offset(log(N)) + te(AGE, PERIOD)). In R-INLA, te() is not directly usable in model formulas. The z-model we used instead is an implementation of the classical random effects part of a mixed model (η = … + Z z). The random effects design matrix is \( \mathbf{Z} = \begin{pmatrix} \mathbf{Z}_1 & \cdots & \mathbf{0} \\ \vdots & \ddots & \vdots \\ \mathbf{0} & \cdots & \mathbf{Z}_i \end{pmatrix} \) for each cluster i which has q ∈ ℕ⁺ random effects. Z was calculated as the tensor product smooth model matrix from the marginal bases for age and period (model matrices A and P) using mgcv::tensor.prod.model.matrix() [31]. The ith row of the resulting tensor product model matrix is calculated as the Kronecker product of the ith rows of A and P. Marginal bases were calculated as M-splines, using splines2::mSpline() [32]. M-splines are non-negative splines, which can be considered a normalized version of B-splines. A loggamma prior was specified for this model, with parameter values (a = 1, b = 0.005), the same values as used in [33]. The corresponding R code is shown in the package vignette, vignette('incidence').
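The row-wise Kronecker construction described above can be sketched in a few lines of Python. The marginal matrices A and P below are small hypothetical examples rather than M-spline bases, and the function names are illustrative. The column order follows the standard Kronecker product of the two rows; the column order produced by mgcv was not checked here.

```python
def row_kronecker(a_row, p_row):
    """Kronecker product of two row vectors, returned as a flat list."""
    return [a * p for a in a_row for p in p_row]


def tensor_product_model_matrix(A, P):
    """Row-wise Kronecker product of two marginal model matrices.

    A and P must have the same number of rows (one row per observation).
    """
    if len(A) != len(P):
        raise ValueError("A and P need the same number of rows")
    return [row_kronecker(a_row, p_row) for a_row, p_row in zip(A, P)]


# Hypothetical marginal bases: 2 observations, 2 age and 3 period columns.
A = [[1, 2],
     [0, 1]]
P = [[1, 0, 1],
     [2, 1, 0]]

Z = tensor_product_model_matrix(A, P)
for row in Z:
    print(row)
```

The resulting matrix has one row per observation and as many columns as the product of the marginal basis dimensions (here 2 × 3 = 6).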

Bayesian age-period-cohort models (BAPC)

APC models estimate the effects of the individuals' age, birth cohort and the period in which the event occurred [19]: η_ij = log(λ_ij) = μ + α_i + β_j + γ_k with intercept μ, and age, period and cohort effects α_i, β_j and γ_k. Here i (1 ≤ i ≤ I) denotes the age group and j (1 ≤ j ≤ J) the time point (period); the cohort index k depends on the age and period index as well as on the age group and period interval width: k = M (I − i) + j. M encodes the width of age groups relative to period intervals, e.g. for 5 yr age groups and yearly data, M is 5. The model implemented in the BAPC package assumes Poisson distributed data and includes the three random effects age, period and cohort (second-order random walk, rw2) as well as an additional random effect (independent and identically distributed, iid) to adjust for overdispersion. Separate age, period and cohort effects are not identifiable due to the exact linear dependence of effects [19].
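The cohort index formula can be checked with a small Python sketch. The setting below (13 five-year age groups covering ages 20 to 84, yearly periods, a 15-year observation window) is chosen to resemble the analysis, but the function name and the specific values are illustrative assumptions.

```python
def cohort_index(i, j, n_age_groups, m):
    """Cohort index k = M * (I - i) + j for age group i and period j (1-based)."""
    return m * (n_age_groups - i) + j


I = 13   # number of age groups (20-24, ..., 80-84)
J = 15   # number of yearly periods
M = 5    # age-group width relative to period width

oldest_first_period = cohort_index(I, 1, I, M)    # earliest birth cohort
youngest_last_period = cohort_index(1, J, I, M)   # most recent birth cohort

print(oldest_first_period, youngest_last_period)
```

The oldest age group in the first period belongs to cohort 1, and the youngest age group in the last period belongs to cohort M (I − 1) + J, which is also the total number of cohort indices.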

Performance metrics

Model performance was evaluated using three metrics: coverage, bias and precision. Metrics were calculated per age/age group, sex and entity, and averaged (arithmetic mean), yielding one aggregated value per entity, sex, projection interval and projection model as a summary statistic.

Coverage was calculated as the fraction of observed values lying within the 95% (equal-tailed) credible band of the projections. Bias was set to 0 if the observed incidence count was equal to the predicted count; otherwise the ratio (observed − predicted)/observed was computed. Posterior standard deviations were used as a measure of precision.
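A minimal Python sketch of the three metrics follows. It assumes that coverage counts observed values falling inside the credible band, and it takes posterior standard deviations as given inputs instead of deriving them from a fitted model. All numbers and function names are hypothetical. The case of an observed count of zero with a non-zero prediction is not covered by the definition above and is left unhandled (it raises ZeroDivisionError).

```python
def coverage(observed, lower, upper):
    """Fraction of observed values inside [lower, upper]."""
    inside = [lo <= y <= hi for y, lo, hi in zip(observed, lower, upper)]
    return sum(inside) / len(inside)


def bias(observed, predicted):
    """0 if observed equals predicted, else (observed - predicted) / observed."""
    if observed == predicted:
        return 0.0
    return (observed - predicted) / observed


def mean(values):
    """Arithmetic mean used to aggregate per-age-group metrics."""
    return sum(values) / len(values)


# Hypothetical values for four age groups.
observed = [100, 50, 0, 30]
predicted = [110, 50, 0, 40]
lower = [95, 40, 0, 35]
upper = [125, 60, 2, 45]
posterior_sd = [4.0, 2.0, 0.5, 3.5]

cov = coverage(observed, lower, upper)
mean_bias = mean([bias(o, p) for o, p in zip(observed, predicted)])
mean_sd = mean(posterior_sd)
print(cov, round(mean_bias, 4), mean_sd)
```

Negative bias values indicate predictions above the observed counts, and a larger mean posterior standard deviation corresponds to lower precision.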

Model performance

Evaluation of the predictive performance of models with increasing complexity was performed as follows (see also Fig. 1): the most recent observed incidence data were predicted, with the projection period starting n years prior to this timepoint (n ∈ {2, 5, 10, 15, 20}). The observation period for model training preceded the projection period. In the presented analysis, 15 yrs. were chosen as the observation period. For the evaluation of a 2 yr projection, e.g. in the SEER-9 dataset, data of the year 2014 would be predicted, using data from the 15 yrs. prior to 2012 for model fitting.
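The evaluation scheme can be expressed as a small helper that returns the training years and the target year. This Python sketch assumes that "the 15 yrs. prior to 2012" excludes 2012 itself, so that training covers 1997 to 2011 in the SEER-9 example; this reading is an assumption, and the function name and arguments are illustrative.

```python
def evaluation_window(last_year, projection_years, observation_years=15):
    """Return (training_years, target_year) for one evaluation run.

    Assumption: the training window ends in the year before the projection
    period starts, i.e. before last_year - projection_years.
    """
    projection_start = last_year - projection_years
    first_training_year = projection_start - observation_years
    training_years = list(range(first_training_year, projection_start))
    return training_years, last_year


# SEER-9 example from the text: most recent year 2014, 2 yr projection.
training, target = evaluation_window(2014, 2)
print(training[0], training[-1], target)

# All projection intervals used in the evaluation.
for n in (2, 5, 10, 15, 20):
    years, _ = evaluation_window(2014, n)
    print(n, years[0], years[-1])
```

For the 20 yr projection this reading requires data back to 1979, which lies within the 1973–2014 range of the SEER-9 data.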

Data were available in different aggregation types: as age groups for NORDCAN and Saarland data, and for each single year of age for the SEER-9 data. In the latter case, individual age-years were used, i.e. no further aggregation was applied.

R package incAnalysis

To facilitate further application and reproducibility, the R package 'incAnalysis' was developed. It is publicly available at http://github.com/mknoll/incAnalysis. The package mainly builds on methods in the R packages BAPC [19], mgcv [22] and R-INLA (http://www.r-inla.org/). Representative analyses with stepwise explanations of how to use the package are outlined in more detail in the accompanying vignette, which can be opened with vignette('incidence') in R. An overview of the functionality and structure of the package is given in Fig. 2.

A wide variety of approaches to project future cancer incidence can be comparatively assessed using this package. Constant rates or counts simply projected into the future, as well as GLMs and GAMs (both in the INLA and ML/REML framework, selected via the method parameter) and BAPC models might be specified.

The package provides a class called incClass which is instantiated with population and incidence data (data.frame with years in rows, the earliest available year in the first row and age/age-group as columns with increasing values from left to right) as well as the period used for model training and the fitting period of interest (and additional parameters). Different models are then added to the newly created object with the following functions which usually expect additional parameters, e.g. model formulas and the respective class object: runFwProj() for forward projection of constant rates or constant counts, runGLM() for generalized linear models (using INLA or an ML approach, specified by the method parameter), runGAM() for GAMs, runInla() for any INLA model and runBAPC() to run the BAPC model [19]. evaluate() calculates the performance metrics, which can be extracted as data.frame via metrics(); additionally, projections are plotted. pitHist() plots Probability Integral Transform (PIT) histograms for all INLA fitted models.

Results

Coverage

Coverages for the evaluated models are shown in Fig. 3 for an observation period of 15 yrs. and projection periods of 2, 5, 10, 15 and 20 yrs.

Importantly, most models yielded coverages below 95%, with smallest (< 25%) coverages for intercept only models and highest coverages (> 75%) for BAPC models, irrespective of the projection period. Variability of coverages of BAPC projections is smaller in the SEER-9 dataset as compared to NORDCAN and Saarland data.

Coverage increased for AP models with linear age, period and interaction effects for longer projection intervals in all datasets. Models incorporating a univariate smoother for age showed no clear increase in median coverage for longer periods; variability, however, increased.

Multivariate smoother models showed a decrease in median coverage for longer projection intervals in the SEER-9 data, an increase in the Saarland data and high variability with no clear trend in the NORDCAN data.

Bias

Results of bias analyses are shown in Fig. 4. Negative values correspond to higher predicted than observed incidence counts (overestimation). For visualization purposes, values < −200 were set to −200.

Several models show negative values. Absolute bias increases with longer projection intervals for most models in the SEER-9 and Saarland datasets. Intercept-only models mostly show median bias values below −100, except for 15 and 20 yr projections in the Saarland data. Univariate smoother models show in most cases lower absolute bias than GLMs with linear age, period and interaction effects. Median absolute bias is smallest for the multivariate smoother models in SEER-9 data for longer projection intervals. Differences in median absolute bias between all except intercept-only models are highest in the SEER-9 dataset.

Precision

Precision is depicted in Fig. 5; median model values range mostly between 0.5 and 5 for the SEER-9 data, 2 and 6 for the NORDCAN data and 0 and 4 in the Saarland dataset. Longer projection intervals yield lower precision for all but the intercept-only model. Univariate smoother models show higher precision than most of the other evaluated models. Variability in precision increases for longer projection intervals for the BAPC models and, for the SEER-9 data, for univariate smoother GAMs. For the other models, no clear trend can be observed.

Discussion

Population-based cancer registry data are routinely used to monitor cancer incidence at the population level, to evaluate screening and prevention programs, and to identify areas where intensified medical research is needed [4]. However, no consensus appears to exist on which models to use for projections based on short-term observational rate data in cancer epidemiology. Systematic empirical evaluations of potentially applicable approaches using existing cancer registry data for benchmarking appear sensible to obtain a better understanding of their operating characteristics and to ultimately make informed methodological recommendations. To facilitate this, we introduced an R package (incAnalysis) for an integrated evaluation of the adequacy of different statistical approaches in this context. We note that the package could in principle also be used for projections of other types of rates than incidence rates. We demonstrated its use in an extensive and systematic evaluation. While the presented results may already be informative for methodological guidance, we believe that further detailed and targeted applications would be helpful for the derivation of methodological guidance by expert panels. Consensus on desirable (or acceptable) operating characteristics would be a sensible prerequisite for the appraisal of individual statistical modeling options.

In the reported empirical analysis only ages (age groups) between 20 and 84 were analyzed, as childhood tumors constitute a biologically distinct group, are in general rare and require reliable projections of birth rates. This exclusion might impair the ability of models to obtain reliable projections; indeed, it has been reported [34] that this approach might decrease accuracy. Cancers in the age group ≥ 85 were excluded to ensure comparability between cancer registries (fixed age-group width required for BAPC [gridFactor]).

Model performance was assessed by evaluating coverage, bias and precision of projections. Alternative metrics for model evaluation include the Continuous Ranked Probability Score (CRPS), as used e.g. in [19], and the evaluation of PIT histograms. The latter can be easily obtained from INLA fitted objects, and further metrics such as the CRPS can be easily calculated using the data provided by the incAnalysis package.

As the least complex model, intercept-only models were evaluated. As expected, only small coverages (< 25%) were observed, as cancer occurrence is usually highly dependent on age. An intercept-only model does not take age into account (change in the distribution of age over calendar time), and thus these models can hardly be recommended for cancer incidence projection, especially over a longer period.

GLMs with linear age, period and their interaction effect were evaluated next as more complex model types. Performance, however, was generally poor. To achieve a potentially better fit, a model with a univariate smoother for age was analyzed, as age is a biologically highly relevant covariate for cancer incidence. B-splines, created with splines::bs(), were incorporated into the model formula. An alternative would be the specification of a Gaussian Markov Random Field structure for smoothing, e.g. a second-order random walk.

Next, multivariate smoothers (tensor product smoothers) for age and period were included in the model, using a z-model in INLA. For classical ML/REML models, such effects can easily be included by using the mgcv::te() function. The latter cannot be directly fitted with INLA::inla(). Even though the mgcv::ginla() function was made available recently (which allows posterior distributions of effects to be obtained directly from GAMs fitted with mgcv), the INLA package is not directly utilized by mgcv, and thus projections are not as straightforward as with the z-model. Coverage is higher than for univariate smoother models, but is less stable for long-term projections than for BAPC models.

Finally, the BAPC model was evaluated and turned out to be among the best performing for all evaluated parameter combinations. The additional two effects (cohort and overdispersion adjustment effect) seem to be especially important for short-term projections, as differences to most other models except multivariate smoother models decrease for longer intervals.

Conclusions

The incAnalysis R package allows a straightforward comparison of key operating characteristics of statistical approaches to cancer incidence projection. Our empirical analyses of a selection of potentially applicable approaches suggest that (i) projections of rate data using short-term data yield robust high coverage at the cost of low precision for BAPC, (ii) age-period GLMs with interaction term mostly yield better results for longer projection intervals (> 10 yrs), (iii) GAMs using tensor product smooth models (age, period) constitute a reasonable alternative to classical GLMs, and (iv) intercept-only models may at best be useful only for short-term projections (< 5 yrs). Further detailed and targeted investigations into model performance seem advisable to make recommendations about appropriate statistical projection methods in a given setting.

Availability of data and materials

The datasets generated and/or analysed during the current study are included in the incAnalysis package on GitHub, https://github.com/mknoll/incAnalysis.

Abbreviations

APC: Age-Period-Cohort
BAPC: Bayesian APC models
CRPS: Continuous Ranked Probability Score
GAM: Generalized Additive Model
GLM: Generalized Linear Model
INLA: Integrated Nested Laplace Approximations
MCMC: Markov-Chain Monte Carlo
ML: Maximum Likelihood
PIT: Probability Integral Transform
REML: Restricted Maximum Likelihood

References

1. Brown LD, Cai TT, DasGupta A, Agresti A, Coull BA, Casella G, Corcoran C, Mehta C, Ghosh M, Santner TJ, et al. Interval estimation for a binomial proportion - comment - rejoinder. Stat Sci. 2001;16(2):101–33.
2. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34.
3. Moller B, Fekjaer H, Hakulinen T, Sigvaldason H, Storm HH, Talback M, Haldorsen T. Prediction of cancer incidence in the Nordic countries: empirical comparison of different approaches. Stat Med. 2003;22(17):2751–66.
4. Bray F, Moller B. Predicting the future burden of cancer. Nat Rev Cancer. 2006;6(1):63–74.
5. Moller H, Fairley L, Coupland V, Okello C, Green M, Forman D, Moller B, Bray F. The future burden of cancer in England: incidence and numbers of new patients in 2020. Br J Cancer. 2007;96(9):1484–8.
6. Nowatzki J, Moller B, Demers A. Projection of future cancer incidence and new cancer cases in Manitoba, 2006-2025. Chronic Dis Can. 2011;31(2):71–8.
7. Dyba T, Hakulinen T, Paivarinta L. A simple non-linear model in incidence prediction. Stat Med. 1997;16(20):2297–309.
8. Hakulinen T, Dyba T. Precision of incidence predictions based on Poisson distributed observations. Stat Med. 1994;13(15):1513–23.
9. Stock C, Mons U, Brenner H. Projection of cancer incidence rates and case numbers until 2030: A probabilistic approach applied to German cancer registry data (1999-2013). Cancer Epidemiol. 2018;57:110–9.
10. Clements MS, Armstrong BK, Moolgavkar SH. Lung cancer rate predictions using generalized additive models. Biostatistics. 2005;6(4):576–89.
11. Engeland A, Haldorsen T, Tretli S, Hakulinen T, Horte LG, Luostarinen T, Schou G, Sigvaldason H, Storm HH, Tulinius H, et al. Prediction of cancer mortality in the Nordic countries up to the years 2000 and 2010, on the basis of relative survival analysis. A collaborative study of the five Nordic Cancer registries. APMIS Suppl. 1995;49:1–161.
12. Smith TR, Wakefield J. A review and comparison of age-period-cohort models for Cancer incidence. Stat Sci. 2016;31(4):591–610.
13. Kupper LL, Janis JM, Salama IA, Yoshizawa CN, Greenberg BG. Age-period-cohort analysis - an illustration of the problems in assessing interaction in one observation per cell Data. Commun Stat-Theor M. 1983;12(23):2779–807.
14. O’Brien RM. Constrained estimators and age-period-cohort models. Sociol Methods Res. 2011;40(3):419–52.
15. Mistry M, Parkin DM, Ahmad AS, Sasieni P. Cancer incidence in the United Kingdom: projections to the year 2030. Br J Cancer. 2011;105(11):1795–803.
16. Moller B, Fekjaer H, Hakulinen T, Tryggvadottir L, Storm HH, Talback M, Haldorsen T. Prediction of cancer incidence in the Nordic countries up to the year 2020. Eur J Cancer Prev. 2002;11(Suppl 1):S1–96.
17. Whiteman DC, Green AC, Olsen CM. The growing burden of invasive melanoma: projections of incidence rates and numbers of new cases in six susceptible populations through 2031. J Invest Dermatol. 2016;136(6):1161–71.
18. Havulinna AS. Bayesian age-period-cohort models with versatile interactions and long-term predictions: mortality and population in Finland 1878-2050. Stat Med. 2014;33(5):845–56.
19. Riebler A, Held L. Projecting the future burden of cancer: Bayesian age-period-cohort analysis with integrated nested Laplace approximations. Biom J. 2017;59(3):531–49.
20. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Series B Stat Methodol. 2009;71(2):319–92.
21. Rue H, Riebler A, Sorbye SH, Illian JB, Simpson DP, Lindgren FK. Bayesian computing with INLA: a review. Annu Rev Stat Appl. 2017;4:395–421.
22. Wood SN. Generalized additive models: an introduction with R. 2nd ed. Boca Raton: Chapman and Hall/CRC Texts in Statistical Science; 2017.
23. Boulesteix AL, Binder H, Abrahamowicz M, Sauerbrei W. Simulation panel of the SI: on the necessity and design of studies comparing statistical methods. Biom J. 2018;60(1):216–8.
24. Crüwell S, Stefan AM, Evans NJ. Robust standards in cognitive science. Computational Brain & Behavior. 2019;2(3):255–65.
25. Mangul S, Martin LS, Hill BL, Lam AK, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun. 2019;10(1):1393.
26. Research Data (1973-2014), National Cancer Institute, DCCPS, Surveillance Research Program, based on the November 2016 submission. [https://seer.cancer.gov].
27. Engholm G, Ferlay J, Christensen N, Bray F, Gjerstorff M, Klint A, Kotlum J, Olafsdottir E, Pukkala E, Storm H. NORDCAN--a Nordic tool for cancer information, planning, quality control and research. Acta Oncol. 2010;49(5):725–36.
28. Krebsregister Saarland [http://www.krebsregister.saarland.de/].
29. Tabellen und Grafiken aus dem Bereich "Gebiet und Bevölkerung" [https://www.saarland.de/6772.htm].
30. Bevölkerung im Jahresdurchschnitt 1980–2012 (Grundlage Zensus BRD 1987, DDR 1990) [http://www.gbe-bund.de/gbe10/trecherche.prc_them_rech?tk=700&tk2=906&p_uid=gast&p_aid=66019368&p_sprache=D&cnt_ut=1&ut=906].
31. Wood SN. Low-rank scale-invariant tensor product smooths for generalized additive mixed models. Biometrics. 2006;62(4):1025–36.
32. Ramsay JO. Monotone regression splines in action. Stat Sci. 1988;3(4):425–41.
33. Bauer C, Wakefield J, Rue H, Self S, Feng ZJ, Wang Y. Bayesian penalized spline models for the analysis of spatio-temporal count data. Stat Med. 2016;35(11):1848–65.
34. Baker A, Bray I. Bayesian projections: what are the effects of excluding data from younger age groups? Am J Epidemiol. 2005;162(8):798–805.

Acknowledgements

MK and JF are members of the MD/PhD program at Heidelberg University and are funded by Heidelberg Medical Faculty.

The authors would like to thank two anonymous reviewers and the handling editor for their helpful comments and suggestions on the initial submission. In the authors' view, these helped to significantly improve the quality and clarity of the manuscript.

Funding

National Center for Tumor Diseases Heidelberg (NCT PRO-2015.21), German Research Foundation (DFG UNITE SFB-1389s), German Cancer Research Center (iMed). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Department of Radiation Oncology, Heidelberg University Hospital, Im Neuenheimer Feld 400, 69120, Heidelberg, Germany
Maximilian Knoll, Jennifer Furkel, Jürgen Debus & Amir Abdollahi
Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
Maximilian Knoll & Jennifer Furkel
Clinical Cooperation Unit Radiation Oncology, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120, Heidelberg, Germany
Maximilian Knoll, Jennifer Furkel, Jürgen Debus & Amir Abdollahi
German Cancer Consortium (DKTK) Core Center Heidelberg, Heidelberg, Germany
Maximilian Knoll, Jennifer Furkel, Jürgen Debus & Amir Abdollahi
Institute of Epidemiology and Social Medicine, University of Muenster, Albert-Schweitzer-Campus 1, 48149, Muenster, Germany
André Karch
Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Im Neuenheimer Feld 130.3, 69120, Heidelberg, Germany
Christian Stock
Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany
Christian Stock

Contributions

CS designed the study; MK designed and created the R package. MK and CS wrote the manuscript with input from JF, AA, JD and AK. All authors provided critical feedback and helped shape the research, analysis and manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Maximilian Knoll.

Ethics declarations

Ethics approval and consent to participate

Access to NORDCAN and SEER data was provided upon request to the respective offices. Saarland data was publicly available. No additional approval was required.

Consent for publication

Not applicable.

Competing interests

CS is now a full-time employee of Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim, Germany. The company had no role in the design, analysis or interpretation of the presented work.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Additional file 2.

Additional file 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Knoll, M., Furkel, J., Debus, J. et al. An R package for an integrated evaluation of statistical approaches to cancer incidence projection. BMC Med Res Methodol 20, 257 (2020). https://doi.org/10.1186/s12874-020-01133-5

Received: 07 June 2020
Accepted: 24 September 2020
Published: 15 October 2020
DOI: https://doi.org/10.1186/s12874-020-01133-5
