 Research article
 Open Access
Bayesian estimation of a cancer population by capture-recapture with individual capture heterogeneity and small sample
BMC Medical Research Methodology volume 15, Article number: 39 (2015)
Abstract
Background
Cancer incidence and prevalence estimates are necessary to inform health policy, to predict public health impact and to identify etiological factors. Registers have been used to estimate the number of cancer cases. To be reliable and useful, cancer registry data should be complete. Capture-recapture is a method for estimating the number of cases missed, originally developed in ecology to estimate the size of animal populations. Capture-recapture methods in cancer epidemiology involve modelling the overlap between lists of individuals using log-linear models. These models rely on the assumptions of independence of sources and equal catchability between individuals, which are unlikely to be satisfied in a cancer population, as severe cases are more likely to be captured than simple cases.
Methods
To estimate the cancer population and the completeness of the cancer registry, we applied M_{th} models, which rely on parameters that influence capture such as time of capture (t) and individual heterogeneity (h), and compared the results to those obtained with classical log-linear models and the sample coverage approach. For three sources collecting breast and colorectal cancer cases (Histopathological cancer registry, hospital Multidisciplinary Team Meetings, and cancer screening programmes), individual heterogeneity is suspected in the cancer population due to age, gender, screening history or presence of metastases. Individual heterogeneity is rarely analysed, as classical log-linear models usually pool it with between-“list” dependence. We applied Bayesian Model Averaging, which can be used with small samples without the asymptotic assumption, contrary to the maximum likelihood estimation procedure.
Results
Cancer population estimates were based on the results of the M_{h} model, with an averaged estimate of 803 cases of breast cancer and 521 cases of colorectal cancer. With the log-linear approach, the estimates were 791 cases of breast cancer and 527 cases of colorectal cancer according to the retained models (729 and 481 histological cases, respectively).
Conclusions
We applied M_{th} models and Bayesian population estimation to a small sample of a cancer population. The advantage of M_{th} models applied to cancer datasets is the ability to explore individual factors associated with capture heterogeneity, as the assumption of equal capture probability is unlikely to hold. M_{th} models and Bayesian population estimation are well suited for capture-recapture in a heterogeneous cancer population.
Background
Cancer is the leading cause of death in Western countries and particularly in France [1]. In view of this situation, cancer control programmes have been implemented. Evaluating the effectiveness of these policies, aiming for improved prevention and management, is essential. However to conduct such an evaluation, a baseline reference requiring ongoing, reliable and complete data collection, such as a cancer registry, is necessary. Besides providing descriptive epidemiology, cancer registries are currently used for epidemiological research, assessment of screening programmes and treatment innovations [2].
To verify the completeness of cancer cases recorded within a specific geographic area, the capture-recapture method is usually applied. The capture-recapture method is a sampling technique originally devised by ecologists to study fauna [3,4], and subsequently adapted to epidemiological studies [5-7]. Since the early nineties, this method has been extended to many demographic and epidemiologic studies [8,9]. It has thus been used to confirm the completeness of the data recorded in cancer registries [10-12].
The capture-recapture procedure in epidemiology consists in cross-checking data from at least two independent sources that collect cases in the same area, in order to estimate the total number of cases and to assess the completeness of each data source [13]. In brief, this method involves modelling the overlap between two or more lists of individuals (data “sources”) from the target population, and using this model to predict how many additional individuals were unseen, and hence the total population size. To avoid bias in the estimate, the sources of data collection must be independent and homogeneity of capture must be ensured [14,15] (i.e. the probability of capture does not depend on case characteristics). Capture heterogeneity can result in positive dependence (underestimation of the population) or negative dependence (overestimation of the population) between sources.
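The two-source case described above can be illustrated with a minimal sketch. It assumes two independent sources and homogeneous capture, and uses the classical Lincoln-Petersen estimator together with Chapman's variant; the function names and the counts are hypothetical and are not taken from the study.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source estimate of the total number of cases.

    n1, n2: number of cases recorded in source 1 and in source 2
    m:      number of cases found in both sources (must be > 0)
    """
    if m <= 0:
        raise ValueError("the two sources must share at least one case")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's variant of the estimator, defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts, NOT taken from the study
n1, n2, m = 200, 150, 120
total = lincoln_petersen(n1, n2, m)
print(total)       # estimated total number of cases
print(n1 / total)  # estimated completeness of source 1
```

If the two sources are positively dependent, the overlap m is inflated and the estimate is too low, which is the underestimation mentioned above.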
The standard approach to capture-recapture in epidemiology is to fit log-linear models [16,17], in which the inclusion of source-by-source interaction terms may account for dependencies between the data sources. These parameters are subsequently estimated by the maximum likelihood procedure. With a sufficiently large population, this procedure is acceptable under the asymptotic assumption. However, the asymptotic assumption cannot be verified with few cases, a frequent situation when capture-recapture concerns a specific target population, such as a cancer population within a small area.
Many authors have presented capture-recapture methods in epidemiology that take into account dependence and individual heterogeneity, including source-by-source interaction terms in the log-linear model, log-linear models stratified on several covariates, or source-by-covariate interactions in a single log-linear model [6,9,18].
In the log-linear method, capture heterogeneity is pooled with between-“list” dependence within the between-source interaction terms [19]. To test capture heterogeneity, cases are stratified on potential variables related to capture heterogeneity. For example, this method has been applied by King et al. [20] to estimate the number of current injectors in Scotland and the drug-related death rate by sex, region and age group. This consists in constructing a single contingency table in which cells correspond to the numbers of individuals belonging to each distinct combination of covariates and sources.
Conversely, incorporating one or more potential variables is more complex when numbers of cases are limited. Stratification of cases, whether common or not to both sources, on several covariates leads to a contingency table containing several missing cells or too few cases within certain cells to provide robust results.
Moreover, several authors, e.g. Schmidtmann [21] and Silcocks and Robinson [22], compared several methods to estimate the completeness of cancer registration, including log-linear models: according to both, log-linear modelling does not always yield the best estimation. Confirming the results obtained via the classical log-linear model therefore appeared essential to us. Capture heterogeneity, which had not previously been tested with log-linear models, needed to be taken into account. In recent years, much theoretical research has been conducted to develop capture-recapture methods, such as that by Chao, Pan and Chiang [23], who propose a Lincoln-Petersen estimator including dependence effects resulting from local lists and heterogeneous capture probabilities. Other authors have proposed mixed models: Mao [24] focused on a nonparametric maximum likelihood estimate for two classes of mixed models with a binomial and geometric distribution. The classical modelling approach consists in estimating the parameters of a model, which are then considered as fixed quantities. To confirm our results with a totally different approach, we therefore wished to implement a Bayesian procedure. Several authors have focused on this method, among them Manrique-Vallier and Fienberg [25], who postulated the existence of a homogeneous population class to overcome the problems related to heterogeneity of closed populations in capture-recapture.
However, capture-recapture was first developed in ecology for estimating the size of animal populations: as a result, methods in ecology are somewhat more developed and there is probably much for epidemiologists to learn from the ecology literature.
In this paper, we borrow tools from the ecological capture-recapture literature: M_{tbh} models [26], which simultaneously allow for the effects of time, behaviour and individual heterogeneity in capture probabilities. King and Brooks [27] proposed a Bayesian estimate for the size of a closed population in the context of heterogeneity and model uncertainty, using Bayesian Model Averaging (BMA). This approach overcomes the difficulties related to capture heterogeneity and model selection, provided the ecological models can be adapted to capture-recapture procedures in epidemiology.
We applied these tools to a capture-recapture study concerning a histopathological cancer registry [28]. This study cross-checked newly diagnosed cases of breast and colorectal cancer in the Alpes-Maritimes (Southeastern France), among patients aged 50 to 75 years, recorded in the Histopathological Registry (HR), those discussed in hospital Multidisciplinary Team Meetings (MTM) and those diagnosed through the coordinated Cancer Screening Programmes (CSP). We compared the results to those obtained with log-linear models and the sample coverage approach [19] for the same data [28]. We then applied ecological models and the BMA method to well-known capture-recapture data sets: an outbreak of the hepatitis A virus in a college in northern Taiwan [29], a data set on diabetes in a community in Italy based on four records [30] and, finally, five lists of infants born with a specific congenital anomaly in Massachusetts [31,32].
Methods
Capture-recapture ecological models
When estimating a population size using the capture-recapture method in an ecological study, the underlying assumptions concerning case capture have a direct impact. The model selected to estimate the total number of cases rests on these assumptions and on its fit to the observed data. Otis et al. [26] defined three effects influencing capture: time, behaviour and individual heterogeneity. All interactions between these three effects are possible. In other words, models differ according to whether the capture probability changes only with time of capture (M_{t}), or changes between individuals according to their behaviour (M_{b}) or their characteristics (M_{h}).
The use of behavioural models (M_{b} and more complex models including the behavioural effect) does not appear appropriate in epidemiology, as capture probability should not change after a previous capture. Indeed, they are based on the assumption of a natural sequence of capture episodes, whereas there is no time sequence in our sources, so these models do not appear useful. This has also been pointed out by Chao et al. [19], who went as far as stating that only M_{t}, M_{h}, and M_{th} are potentially useful for applications in epidemiology. For our study, we will therefore focus on these three models applied to our data, implying that capture probability changes only with time of capture (M_{t}), or according to individuals’ characteristics (M_{h}), or both (M_{th}).
Let p_{iτ} denote the capture probability for individual i = 1, 2, 3, …, N at time τ = 1, 2, 3, …, T, and let F(i) denote the first time that individual i is observed. Then p_{iτ} is the initial capture probability of individual i for times τ = 1, …, F(i), and the recapture probability for τ = F(i) + 1, …, T, assuming F(i) < T. Thus individual i is not captured at times τ = 1, …, F(i) – 1, is captured at time τ = F(i), and can be recaptured at any time between τ = F(i) + 1 and τ = T, where T is the total number of capture times.
The saturated M_{th} model integrating time (t) and heterogeneity (h) has capture probabilities of the form:
logit(p_{iτ}) = μ + α_{τ} + γ_{i}
where γ_{i} denotes independent and identically distributed (i.i.d.) individual random effects, γ_{i} ~ N (0, σ^{2}_{γ}).
In this model the estimated parameters are μ (mean capture rate expressed as logit), α_{τ} (year effect for capture), and σ^{2}_{γ} (variance of individual random effects). Submodels of the saturated model are obtained by setting certain parameter values equal to zero.
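To make the role of each parameter concrete, the following minimal sketch simulates capture histories under a logit-linear M_{th} model with a mean μ, an occasion effect α_{τ} and a normal individual effect γ_{i}. The function names and all parameter values are hypothetical, chosen only for illustration; this is not the authors' WinBUGS code.

```python
import math
import random


def inv_logit(z):
    """Map a logit-scale value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_mth(n_individuals, mu, alpha, sigma_gamma, seed=0):
    """Simulate capture histories with logit(p) = mu + alpha_t + gamma_i."""
    rng = random.Random(seed)
    histories = []
    for _ in range(n_individuals):
        gamma_i = rng.gauss(0.0, sigma_gamma)  # individual random effect
        row = []
        for alpha_t in alpha:                  # one entry per capture occasion
            p = inv_logit(mu + alpha_t + gamma_i)
            row.append(1 if rng.random() < p else 0)
        histories.append(row)
    return histories


# Hypothetical parameter values, for illustration only
hist = simulate_mth(n_individuals=800, mu=0.5,
                    alpha=[1.5, 0.0, -1.0], sigma_gamma=1.0)
observed = [h for h in hist if sum(h) > 0]  # seen by at least one source
print(len(observed), "of", len(hist), "simulated individuals were captured")
```

Setting `sigma_gamma=0.0` removes the individual effect and gives an M_{t}-type model, while setting every entry of `alpha` to zero gives an M_{h}-type model, mirroring the way submodels are obtained from the saturated model.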
It is assumed that capture probabilities are independent, given the parameter values. Let θ = {μ, α, σ^{2}_{γ}, N} with α = {α_{1},…,α_{T}}, and γ = {γ_{1},…, γ_{N}}. The capture history of all individuals is described by the vector x = (x_{11}, …, x_{1T}, x_{21}, …, x_{2T}, …, x_{N1}, …, x_{NT}), where x_{it} = 1 if individual i is captured at time t and x_{it} = 0 otherwise. Here n (n < N) denotes the total number of individuals captured in the study, i.e. the n unique individuals captured at least once.
For models with no heterogeneity, i.e. M_{t}, the individual random effect γ_{i} = 0, so the likelihood f(x | θ, γ) reduces to f(x | θ) and θ can be estimated by maximum likelihood (MLE). In the presence of heterogeneity (models M_{h}, M_{th}), calculating the MLE and selecting the model is more complex [33] because of the individual random effects.
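The likelihood referred to above is not displayed in this text. Under the stated assumption that captures are independent given the parameters, a standard closed-population form can be sketched as follows; this is a reconstruction offered as an assumption, not necessarily the authors' exact expression, with the time index written τ as in p_{iτ}.

```latex
f(x \mid \theta, \gamma) \;\propto\; \frac{N!}{(N-n)!}
  \prod_{i=1}^{N} \prod_{\tau=1}^{T}
  p_{i\tau}^{\,x_{i\tau}} \, \bigl(1 - p_{i\tau}\bigr)^{\,1 - x_{i\tau}}
```

The product over the N − n individuals never captured contributes only factors (1 − p_{iτ}), which is how the unobserved part of the population enters the estimation of N.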
Bayesian population estimation
In the Bayesian approach, individual heterogeneity is estimated using Monte Carlo Markov Chain algorithms. All the parameters can thus be estimated for all possible models, with or without individual heterogeneity. In the Bayesian approach, the model parameters are treated as random quantities, and their distribution therefore carries the information available about them. Before the data are collected, this distribution is the prior distribution. After data collection, the parameters have a posterior distribution.
In this analysis, the model itself is considered as an unknown parameter to be estimated. According to Bayes’ theorem applied to continuous distributions, the joint posterior distribution over both parameter and model space is obtained by multiplying the likelihood by the prior distribution, with m denoting the model and θ_{ m } the parameters in model m:
To introduce individual heterogeneity, random effects are included as expressed by the variables γ = {γ_{1},…, γ_{N}} in Eq. 1.
The term h(γ | θ_{m}) corresponds to the model assumption of the random effect γ_{i} ~ N (0, σ^{2}_{γ}).
Finally, the posterior distribution of the parameters and the model is given by:
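The displayed equations of this section are missing from the text. Assuming they follow the usual form (posterior proportional to likelihood times priors, with the random-effects density h(γ | θ_{m}) entering as an extra factor), they can be sketched as below. The symbols π for the posterior and p(θ_{m} | m) for the parameter prior are assumed notation, not confirmed by the source.

```latex
% Without random effects: joint posterior over parameters and model
\pi(\theta_m, m \mid x) \;\propto\; f(x \mid \theta_m, m)\, p(\theta_m \mid m)\, p(m)

% With individual random effects gamma
\pi(\theta_m, \gamma, m \mid x) \;\propto\;
  f(x \mid \theta_m, \gamma, m)\, h(\gamma \mid \theta_m)\, p(\theta_m \mid m)\, p(m)
```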
Models are compared via posterior model probabilities, and an estimate of the total population, based on all plausible models, may be obtained using the posterior distribution. In other words, each estimate is an average of the posterior distributions under each of the models considered, weighted by their posterior model probability. This procedure, detailed in Additional file 1: Appendix A, is called Bayesian Model Averaging (BMA). Usually, a single model is selected because it best fits the observed data. However, these data come from a random sample. As Hoeting pointed out [34], selecting a single model ignores the uncertainty in model selection, leading to overconfident inferences and decisions that are more risky than one thinks they are. BMA provides a coherent mechanism for accounting for model uncertainty.
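The weighting step can be sketched in a few lines. The example below averages per-model posterior means of the population size using posterior model probabilities; the function name, the model labels and the numbers are hypothetical and do not reproduce any table of the study.

```python
def model_average(estimates, posterior_probs):
    """Average per-model estimates, weighted by posterior model probability.

    estimates:       dict mapping model name -> posterior mean of N under that model
    posterior_probs: dict mapping model name -> posterior model probability
    """
    if set(estimates) != set(posterior_probs):
        raise ValueError("both inputs must cover the same models")
    if abs(sum(posterior_probs.values()) - 1.0) > 1e-9:
        raise ValueError("posterior model probabilities must sum to 1")
    return sum(posterior_probs[m] * estimates[m] for m in estimates)


# Hypothetical values, for illustration only
estimates = {"M_t": 790.0, "M_h": 810.0}
probs = {"M_t": 0.25, "M_h": 0.75}
print(model_average(estimates, probs))  # 0.25 * 790 + 0.75 * 810
```

A model with posterior probability close to 1 dominates the average, in which case the averaged estimate is essentially that model's estimate.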
Bayes’ theorem is used to estimate the joint posterior distribution of all the parameters included in the model. Ultimately, the posterior marginal distribution of the parameters of interest is estimated, which requires integrating the joint posterior distribution; this is not always possible analytically. In the modern Bayesian approach, the joint posterior distribution is not integrated; instead, simulations are performed to draw samples from the posterior distribution and thus obtain simulated values for the posterior marginal distributions of the parameters of interest. As the posterior distribution is multidimensional, a Monte Carlo Markov Chain algorithm is applied to obtain a random sample from it. Consider first the two components of this method: Monte Carlo integration and Markov chains. Monte Carlo integration is a simulation technique for estimating a given integral. It is based on drawing observations from the distribution of the variable of interest and then calculating the sample mean [35]. To obtain a potentially large sample from the posterior distribution we use a Markov chain, which is a stochastic sequence of values in which each value depends only on the previous one. Provided that the chain is aperiodic and irreducible, its distribution converges to a stationary distribution. Monte Carlo Markov Chain methods combine Markov chains with Monte Carlo integration: they construct a sequence of values whose distribution converges to the posterior distribution of interest. Once the chain has converged, the sequence of values can be used to estimate any posterior summary of interest (Monte Carlo). To ensure that the Markov chain has reached the stationary distribution before Monte Carlo estimates are computed, we discard observations from the start of the chain, a period called the burn-in.
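The combination of a Markov chain, a burn-in period and a Monte Carlo average can be illustrated with a toy random-walk Metropolis sampler. This is a minimal sketch on a stand-in target (a standard normal density), not the sampler used in the study; the function names, the step size, the starting value and the seed are assumptions made for the example.

```python
import math
import random


def log_target(x):
    """Log density, up to a constant, of a standard normal (toy target)."""
    return -0.5 * x * x


def metropolis(n_iter, burn_in, step=2.0, start=5.0, seed=1):
    """Random-walk Metropolis sampler; returns the draws kept after burn-in."""
    rng = random.Random(seed)
    x = start
    kept = []
    for it in range(n_iter):
        proposal = x + rng.gauss(0.0, step)  # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        if it >= burn_in:                    # discard the burn-in draws
            kept.append(x)
    return kept


# 30 000 iterations, the first 5 000 discarded as burn-in
draws = metropolis(n_iter=30000, burn_in=5000)
print(sum(draws) / len(draws))  # Monte Carlo estimate of the target mean
```

The chain deliberately starts far from the bulk of the target (at 5.0), so the early draws are not representative; discarding them is what the burn-in achieves.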
Two Monte Carlo Markov Chain algorithms are used, depending on the parameters and models. The Metropolis-Hastings algorithm is used when the update does not involve a change in dimension, and the reversible jump algorithm when calculations involve a change in dimension (due to the model and the population size in the presence of individual effects). The reversible jump algorithm is detailed in Additional file 1: Appendix B.
Lastly, prior probabilities must be defined for the models and parameters. In the absence of prior assumptions concerning their influence on the estimate of the total population, we chose non-informative priors. The priors for each parameter are detailed in Additional file 1: Appendix C. The prior probability of selecting a model follows a uniform distribution p(m) = 1/k, where k denotes the number of models. For each potential effect (time, heterogeneity) the prior probability is 0.5. The parameters are assumed to be independent and to follow the same prior distribution in each model (if present).
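The link between a prior of 0.5 per effect and a uniform prior over models can be checked by enumeration. The sketch below assumes that the model space is formed by independently switching the time and heterogeneity effects on or off; the label `M_0` for the model with neither effect is a hypothetical name used only here.

```python
from itertools import product

effects = ("t", "h")  # time and individual heterogeneity

models = {}
for included in product((False, True), repeat=len(effects)):
    label = "".join(e for e, on in zip(effects, included) if on)
    name = "M_" + label if label else "M_0"  # "M_0" is a hypothetical label
    # Each effect is present or absent with prior probability 0.5
    models[name] = 0.5 ** len(effects)

print(models)
```

With two effects this gives k = 4 models, each with prior probability 1/k = 0.25, which is consistent with the uniform prior p(m) = 1/k.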
Software for Bayesian analysis
We applied the above Bayesian methods using the WinBUGS [36] software package, which performs complex Bayesian analyses. The codes used for the M_{th} models were drawn from the models developed by Link and Barker [37] in WinBUGS for Bayesian inference applied to ecological surveys, namely capture-recapture. Briefly, two vectors are used: the number of captures of each individual, and an indicator of whether the individual is considered as caught, not caught or unknown during a specific episode. To provide an inference for the total population, we consider that the individual capture probabilities of the subjects that were not caught are linked to the capture probabilities of those that were captured. The total number of subjects is estimated by the method of data augmentation developed by Link and Barker. The WinBUGS codes are detailed in Additional file 1: Appendix D.
Individual covariates
The three sources studied are the Histopathological Registry, the hospital Multidisciplinary Team Meetings (MTM) and the Cancer Screening Programmes. Firstly, since 2005, histopathology laboratories have been transmitting all invasive cancers with a histopathology diagnosis to the Nice University Hospital Public Health Department, which coordinates the Histopathological Registry. Cancer cases from hospital Multidisciplinary Team Meetings came from the regional cancer network, which has been systematically collecting data from hospitals since 2007. The third source is the local coordinating centre for cancer screening, which collects data concerning patients aged 50 to 75 years with a positive result following screening for breast or colorectal cancer, since 2002 and 2005 respectively.
An estimate was first obtained from the three available sources, each of them considered as a capture episode. Secondly, an estimate of the total population was obtained by considering each covariate as a capture episode. The selected parameters considered as potentially accounting for different capture probabilities included age and presence of metastases at the time of diagnosis (TNM staging), according to the recommendations concerning capture-recapture applied to cancer registries [38,39]. We also introduced past history of screening mammography [21] and gender for cases of breast and colorectal cancer, respectively, as potential capture heterogeneity parameters.
Results
Estimate of the number of incident cases of breast cancer
Capture-recapture M_{th} models were initially applied to the three sources, each considered as a capture episode, i.e. Histopathological cancer Registry (HR), hospital Multidisciplinary Team Meetings (MTM) and cancer screened cases (CSP). In these 3 sources, 787 cases aged 50 to 75 years were notified in 2008 as presented in Figure 1, of which 729 were recorded by the Histopathological cancer registry, 470 were identified through at least 2 sources and 108 were common to all three sources. After averaging over all the models according to their posterior probability, the estimated mean number of breast cancer cases was 790.6 (median = 790.2, 95% HPDI: 790.2-792.2), as presented in Table 1. The average estimate was obtained over all models following 25 000 iterations after discarding the initial 5 000 iterations. The Markov chains converged early: a stationary distribution was observed after only 500 iterations.
With a total number of 791 breast cancer cases, the completeness of the Histological Register was estimated to be 92.2% (92.1%; 92.3%). The posterior probability for the M_{t} model, in which capture probability changes for each source, was 100%. As there is no time sequence in our sources, we permuted their order and compared the estimates obtained for each possible time sequence. Estimates were exactly the same for all possible time sequences. Models for which the capture probability differs for each individual (M_{h} and M_{th}) provide a different estimate: 814.1 (median = 814.2, 95% HPDI: 805.2-824.2) and 806.3 (median = 806.2, 95% HPDI: 798.2-815.2) cases, respectively.
Averaging over the M_{th} models, applied to the three sources, provides an estimate in line with the result obtained by the classical method [28]. With the selected log-linear model, the estimate was 791 breast cancer cases, corresponding to the model including interaction between the Multidisciplinary Team Meeting source and the two other sources, as presented in Table 2. The choice of the most appropriate log-linear model is based on the likelihood ratio statistics. The selected model is the one with the fewest interaction terms and the best fit with the observed data, i.e. a non-significant value for the likelihood ratio statistics. The model was selected using a stepwise descending procedure starting from the saturated model, taking all interactions into account, until the most parsimonious model with the best fit was retained. The total number of cases was estimated to be 791 cases (95% CI: 784-797), i.e. the completeness of the histological register was estimated to be 92.2% (95% CI: 91.5%; 93.0%). Taking into account the coefficient of covariation as developed by Chao et al. [19], the sample coverage approach gives an estimate of 794 cases (95% CI: 788-824) with dependent sources and an estimate of 802 cases (95% CI: 795-815) with three independent sources, close to the estimates obtained previously.
Dependence between sources
For breast cancer cases, the Lincoln-Petersen estimate [28] obtained via two-by-two record linkage for the MTM and CSP sources (N_{MTM-CSP} = 958) differed from the estimates obtained from the other sources (N_{HR-MTM} = 814 and N_{HR-CSP} = 766). Interdependency between these two sources was suspected and confirmed by a test for independence [6] on the basis of whether cases were recorded or not in the third source (OR_{MTM-CSP} = 0.52; 95% CI: 0.37-0.73). If interdependence is shown between two of at least three data sources, these must be pooled in order to apply the capture-recapture procedure to two independent sources. The resulting Lincoln-Petersen estimate, by cross-linkage of Histopathological Registry cases and cases discussed during MTM pooled with screened cases, was N = 803.2 (95% CI: 793.8-812.5).
Capture heterogeneity
To investigate capture heterogeneity between individuals with breast cancer, we created 21 capture episodes based on potential heterogeneity covariates: age, expressed as 5-year intervals, i.e. five capture episodes for each of the three sources, presence of metastases at the time of diagnosis and, finally, history of screening mammography by source, i.e. 6 capture episodes. The estimate averaged over all the models was 803 cases (median = 802.6, 95% HPDI: 798.6-809.6), based on the results of the M_{h} and M_{t} models with 80% and 20% of posterior probability, respectively, as presented in Table 3. The average estimate was obtained over all models following 25 000 iterations after discarding the initial 10 000 iterations. Convergence of the Markov chain to a stationary distribution was observed after 10 000 iterations.
This result is in line with the Lincoln-Petersen estimate for pooled MTM and cancer screening sources. Therefore, the estimated completeness of the Histopathological Registry for breast cancers would be 90.8% (90.0%-91.3%).
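The completeness figures can be traced back to the counts reported above. The sketch below assumes that completeness is the number of registry cases divided by the estimated total, and that the interval is obtained by dividing by the two bounds of the 95% HPDI; the second point is an assumption, although it reproduces the reported values. The function name is hypothetical.

```python
def completeness(registry_cases, estimated_total):
    """Share of the estimated total recorded by the registry, in percent."""
    return 100.0 * registry_cases / estimated_total


hr_cases = 729                               # breast cancer cases in the registry
estimate, lower, upper = 803.0, 798.6, 809.6  # averaged estimate and 95% HPDI

point = completeness(hr_cases, estimate)
low = completeness(hr_cases, upper)   # larger population -> lower completeness
high = completeness(hr_cases, lower)  # smaller population -> higher completeness
print(f"{point:.1f}% ({low:.1f}%-{high:.1f}%)")  # 90.8% (90.0%-91.3%)
```

Note that the bounds swap roles: the upper bound of the population estimate gives the lower bound of completeness.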
Estimate of the number of incident cases of colorectal cancer
Results from the three sources showed 512 cases of colorectal cancer in 2008 (HR, MTM, CSP), of which 481 were recorded in the Histopathological Registry, 337 were identified through at least 2 sources and 41 were common to all three sources as shown in Figure 2.
Using the BMA method, ecological M_{th} models were first applied to the three sources collecting incident colorectal cancer cases, each considered as a capture episode. The estimate averaged over all the models yielded 513 cases (median = 513.0; 95% HPDI: 512.0-515.0) based on the results of the M_{t} model, as presented in Table 4. Estimates were exactly the same for all possible time sequences. The average estimate was obtained over all models following 25 000 iterations after discarding the initial 5 000 iterations. As for breast cancer, convergence of the Markov chains to a stationary distribution was observed from the beginning, i.e. before 1 000 iterations.
With a total number of 513 colorectal cancer cases, the completeness of the Histological Registry was estimated to be 93.8% (93.4%; 94.0%). With the selected log-linear model [28], the estimate was 527 colorectal cancer cases, corresponding to the model without interaction, as presented in Table 5. With the sample coverage approach, the estimates are 527 cases (95% CI: 519-542) with three independent sources, as with the selected log-linear model without interaction, and 529 cases (95% CI: 519-557) with dependent sources.
Dependence between sources
As for breast cancer cases, two-by-two record linkage using the Lincoln-Petersen estimator for MTM and CSP sources provided a different result from the two other estimates (N_{MTM-CSP} = 618 versus N_{HR-MTM} = 526 and N_{HR-CSP} = 513). However, testing for independence gave a statistically non-significant result (OR = 0.66 [0.40-1.08]). The Lincoln-Petersen estimator [28], for the Histopathological Registry and pooled MTM and CSP sources, yielded an estimate of 525 cases (95% CI: 516.5-534.5).
Capture heterogeneity
Capture heterogeneity among the colorectal cancer cases was also investigated. Retained covariates potentially responsible for heterogeneity included age in 5-year intervals, gender and metastases present at the time of diagnosis, i.e. 15, 6 and 3 capture episodes respectively, and 24 in total for the three sources. Contrary to the results obtained with 3 sources, the estimate averaged over all the models was 521 cases (median = 520.6; 95% HPDI: 517.6-526.6), resulting from the M_{h} model including individual capture heterogeneity, with a posterior probability of 99%, as presented in Table 6. The average estimate was obtained over all models following 25 000 iterations after discarding the initial 5 000 iterations. Unlike for breast cancer with potential heterogeneity covariates, for colorectal cancer convergence of the Markov chains to a stationary distribution was observed rapidly, after 1 000 iterations. The estimated completeness of the Histopathological Registry would be 92.3% (91.3%-92.9%).
Application to data set from other fields
We then applied our method to an outbreak of the hepatitis A virus (HAV) in a college in northern Taiwan [29] with 271 cases ascertained by three sources, to a data set on diabetes in a community in Italy based on four records [30] with 2069 cases identified and, finally, to five lists of 537 infants born with a specific congenital anomaly in Massachusetts [31,32]. For the HAV data, our method gives an estimate of 515 cases [465.5-567.5], whereas the one-step estimator of the sample coverage approach gives 508 cases [442-600], the Petersen estimator 336 cases and log-linear models 1300 cases. The number of HAV-infected students was eventually determined by a screening serum test for HAV antibody performed on all students, and was about 545.
For the data set on diabetes, the authors found that the selected log-linear model that fitted the data gave an estimate of 2 771 cases, but they further stratified for heterogeneity terms and obtained an estimate of 2 586 cases [2341-2830]. With the sample coverage approach, the estimate of Chao et al. was 2 559 cases [2472-2792], and with our method the estimate is 2 589 cases [2534-2645].
For the data set on infants’ congenital anomaly, Wittes and Fienberg [31,32] obtained close estimates of 638 cases under the independence assumption and 634 cases with the log-linear model approach, respectively. For the sample coverage approach, the retained estimator with dependencies was 659 cases [607-750]. With our method the estimate is close to the previous ones, at 654 cases [631-680].
Discussion
Loglinear models provide a useful method to estimate population size but some authors have pointed out [19,38], the need to pursue additional methods because of assumptions, as independence of sources and equality of capture probability, which are rarely satisfied. Capturerecapture M_{th} models are interesting for cancer population as individual capture heterogeneity is taken into account. Bayesian population estimation allows small sample as it does not rely on the asymptotic assumption.
Firstly, we applied capturerecapture M_{th} models to epidemiological data. Secondly, we applied a Bayesian Model Averaging (BMA) method to present a result averaged over all the models. The BMA method was of interest to us because it takes into account all possible models instead of selecting only the result of the best model. However, we chose to apply capturerecapture M_{th} models specifically for this study because individual heterogeneity was suspected between severe cancer cases collected via Multidisciplinary Team Meetings (MTM), and simple cancer cases screened in a Cancer Screening Program (CSP). These methods can be used separately and this is what we have done in our study. The results for each model were analysed and the BMA method was then applied to obtain a result weighted for all models according to their posterior probability.
For both types of capturerecapture M_{th} models, the samples are independent only for the M_{t} model, while heterogeneity arises for the M_{h} model. The Raschlike model is the M_{th} model which extends the M_{h} model by allowing time effects. Heterogeneity between individuals means that even if two lists are independent within individuals, the two sources may become dependent if the capture probabilities are heterogeneous among individuals. Model M_{h} assumes that each individual has its own unique probability that remains constant over samples. Capturerecapture M_{th} models have already been used in the context of lists. Chao [19] for example has shown that results of models M_{h} and M_{th} were very close to those obtained with loglinear models by Wittes [31] for 5 lists of “Infants’ congenital anomaly data”. Chao’s conclusion was that although heterogeneous models did not consider possible local dependence, the estimates were close to the proposed estimate that does. We came to the same conclusion in our study comparing estimates yielded by capturerecapture M_{th} models with results obtained by the “source pooling” method. When two of at least three data sources are dependent, these must be pooled in order to apply the capture recapture procedure to two independent sources. “Source pooling” is a method proposed by Wittes [6] and adopted by many authors thereafter. It was applied in this study for comparison with previously published results on these data [28]. We presented this method here to emphasize that the results are in line with the classical methods (loglinear model on three sources or pooled if dependent) and capturerecapture M_{th} models. The interest of these models is the ability to decide to “capture’ subjects aged 60 to 65 years, or diagnosed with metastases, or any others covariates suspected for heterogeneity, whereas with the loglinear approach, the number of potentially adequate models increases and model selection is more difficult.
Finally, capture-recapture M_{th} models allowed us to compare and confirm the results obtained with loglinear models, thereby making them more reliable. This last point was particularly important, as heterogeneity in three-list data could mean that our estimates were unreliable [14]. Moreover, the results retained with the loglinear approach were the estimates of the selected model, whereas other fitted models could have yielded quite different estimates [40].
The objective of this article was to propose an alternative to the method most often used, i.e. a loglinear model stratified on several covariates. Tilling [18], for example, did not propose a loglinear model but a logit model, which has the advantages over the stratified loglinear model of limiting the number of parameters studied, of incorporating continuous covariates and, above all, of being applicable to two sources. However, model fitting relied on the maximum likelihood ratio, which is based on an asymptotic assumption that cannot be verified in our case because of the small number of cases, a frequent situation in epidemiology.
Considering each of the three sources (HR, MTM and CSP) as a capture episode, the estimated mean number of breast cancer cases was 790 and the number of colorectal cancer cases 513, according to the M_{t} model. When considering each covariate as a capture episode, the model retained in BMA corresponds to the model with heterogeneity. For breast cancer, the estimated total number of cases was 803 according to the M_{h} model, against 791 for the loglinear model [28]. The Lincoln-Petersen estimate from the source-pooling method was also 803 cases, and the sample coverage approach gave estimates of 794 and 802 cases with dependent and independent sources, respectively. From our point of view, the estimate from the M_{h} model could be considered more representative than the results of all the loglinear models considered for breast cancer. Finally, the discrepancy between estimates that do and do not consider heterogeneity may not seem apparent. However, we have shown with some covariates, corresponding to our heterogeneous population, that heterogeneity exists and has an impact. Indeed, as there were very few cases missing, the difference is equal to two points for histological cancer registry completeness.
The only difference between the M_{t} and M_{h} models lies in the selected covariates, each considered as a capture episode, which could influence the probability of capture for each individual case. Size effects due to smaller samples cannot be held responsible for a higher posterior probability for model M_{h}, because the total number of cases and the gap between samples are the same with three episodes of capture. Furthermore, we modified the time sequence of our sources, as there is no sequential time order in lists of individuals, and the estimates were the same. For colorectal cancer, the estimate of 521 cases for the capture-recapture M_{h} model was in line with the estimated total number of 527 cases retained by the selected loglinear model and by the sample coverage approach. The Lincoln-Petersen estimator, for HR and pooled MTM and CSP sources, yielded an estimate of 525 cases.
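The pooled two-source estimate mentioned above can be sketched as follows. The case identifiers and list sizes are invented for illustration and are unrelated to the study data; the Chapman variant is shown alongside as a commonly used small-sample correction, not as the estimator used in this study.

```python
def pool_sources(*sources):
    """Pool dependent sources: a case is captured if it appears in any of them."""
    return set().union(*sources)


def lincoln_petersen(list_a, list_b):
    """Lincoln-Petersen estimate from two sets of case identifiers."""
    overlap = len(list_a & list_b)
    if overlap == 0:
        raise ValueError("no case is common to both lists; the estimate is undefined")
    return len(list_a) * len(list_b) / overlap


def chapman(list_a, list_b):
    """Chapman's nearly unbiased variant, defined even when the overlap is zero."""
    overlap = len(list_a & list_b)
    return (len(list_a) + 1) * (len(list_b) + 1) / (overlap + 1) - 1


# Hypothetical case identifiers for three lists.
hr = set(range(0, 80))      # 80 cases in the registry
mtm = set(range(60, 90))    # 30 cases seen in team meetings
csp = set(range(70, 100))   # 30 cases from screening

pooled = pool_sources(mtm, csp)          # 40 distinct cases, 20 of them also in hr
lp_estimate = lincoln_petersen(hr, pooled)
chapman_estimate = chapman(hr, pooled)

print(f"Lincoln-Petersen: {lp_estimate:.1f}")
print(f"Chapman:          {chapman_estimate:.1f}")
```

Pooling replaces two dependent lists by a single list, so the two-source estimator is applied to the registry on one side and the union of the other two sources on the other.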
Application of capture-recapture M_{th} models has confirmed the estimates obtained via the loglinear models retained according to the traditional procedure. The traditional model selection procedure and the use of capture-recapture M_{th} models thus yield concordant results.
For our study, a major advantage of this Bayesian population estimation was the possibility of easily taking into account several covariates potentially responsible for capture heterogeneity, even with few cancer cases collected by some sources. Considering some potential heterogeneity covariates (i.e. age, presence of metastases, screening history or gender) as capture episodes has shown that capture probability was not homogeneous between individuals. This can be easily understood for our heterogeneous cancer population: a case with metastases at the time of diagnosis will not have the same capture probability as other cases, since multidisciplinary team meetings are more concerned with such situations of advanced disease, whereas there are fewer cases with metastases in a cancer screening programme.
However, the loglinear method, the sample coverage approach and capture-recapture M_{th} models each have their advantages and limitations, which is in favour of applying them together to make estimates more reliable. The main advantage of the loglinear method is that it is particularly well suited to the so-called “list” method. All models have the same framework, the selected model can be tested easily, between-source dependencies are included in interaction terms and inference is easily available in statistical software.
Applying the Bayesian procedure to the M_{th} capture-recapture models has the advantage of taking case-linked capture heterogeneity into account and of providing a result that incorporates all the possible models. Furthermore, with the Bayesian method, considering a potential heterogeneity covariate as a capture episode may be easily applied to small samples, which can be particularly useful in cancer epidemiological studies. On the other hand, covariate selection may seem arbitrary, which is in favour of selecting variables that have already been shown to have an impact on capture probability [18,21,39]. Bayesian inference is nowadays easily available with the WinBUGS software package [36]. Moreover, code for M_{th} models and Bayesian Model Averaging has been developed by ecological researchers [37] and can be adapted to epidemiological data.
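To illustrate why a Bayesian estimate does not need an asymptotic argument, the sketch below computes the exact posterior distribution of the population size N on a grid. It is deliberately restricted to the simpler M_{t} model and is not the WinBUGS implementation used in this study. It assumes independent Uniform(0,1) priors on the capture probabilities (integrated out analytically), a prior on N proportional to 1/N, and an upper bound `n_max` on N; these choices, the function names and the counts are all assumptions made for illustration.

```python
import math


def log_posterior_mt(n_total, counts, n_distinct):
    """Unnormalised log posterior of N under model M_t.

    counts      -- number of cases captured by each source
    n_distinct  -- number of distinct cases captured at least once
    Capture probabilities are integrated out under Uniform(0,1) priors;
    the prior on N is taken proportional to 1/N (an assumption).
    """
    log_prior = -math.log(n_total)
    # Combinatorial term N! / (N - D)! for the D observed individuals.
    log_histories = math.lgamma(n_total + 1) - math.lgamma(n_total - n_distinct + 1)
    # Beta integral for each source, dropping factors that do not depend on N.
    log_capture = sum(
        math.lgamma(n_total - n_j + 1) - math.lgamma(n_total + 2) for n_j in counts
    )
    return log_prior + log_histories + log_capture


def posterior_mt(counts, n_distinct, n_max):
    """Normalised posterior over N = n_distinct, ..., n_max."""
    support = list(range(n_distinct, n_max + 1))
    logs = [log_posterior_mt(n, counts, n_distinct) for n in support]
    peak = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return support, [w / total for w in weights]


# Hypothetical small data set: two sources with 80 and 40 cases, 100 distinct cases.
support, probs = posterior_mt(counts=[80, 40], n_distinct=100, n_max=1000)
posterior_mean = sum(n * p for n, p in zip(support, probs))
posterior_mode = support[probs.index(max(probs))]

print(f"Posterior mean of N: {posterior_mean:.1f}")
print(f"Posterior mode of N: {posterior_mode}")
```

Because the posterior is enumerated exactly over the grid, summaries such as the mean or a credible interval follow directly from `probs`, whatever the sample size.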
Conclusions
Our analysis shows that capture-recapture M_{th} models can be applied to data usually available as “lists” in epidemiology. The advantage of these models resides in their capacity to independently assess heterogeneity of individual capture probability, which is useful for a heterogeneous cancer population. Moreover, Bayesian population estimation allows several covariates potentially accounting for heterogeneity to be included, even with a small sample. Thus, capture-recapture M_{th} models and Bayesian population estimation should be considered in addition to the classical methods usually implemented in cancer epidemiology, to confirm results and enhance the reliability of estimates.
Availability and requirements
WinBUGS software is available at: http://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs.
Abbreviations
 HR: Histopathological Registry
 MTM: Multidisciplinary Team Meetings
 CSP: Cancer screening programmes
References
1. Belot A, Grosclaude P, Bossard N, et al. Cancer incidence and mortality in France over the period 1980-2005. Rev Epidemiol Sante Publique. 2008;56(3):159–75.
2. Bray F, Parkin DM. Evaluation of data quality in the cancer registry: principles and methods. Part I: comparability, validity and timeliness. Eur J Cancer. 2009;45:747–55.
3. Chapman DG. The estimation of biological populations. Ann Math Stat. 1954;25:1–15.
4. Cormack RM. The statistics of capture-recapture methods. Oceanogr Mar Biol Ann Rev. 1968;6:455–506.
5. Wittes JT, Sidel VW. A generalization of the simple capture-recapture model with applications to epidemiological research. J Chronic Dis. 1968;21:287–301.
6. Wittes JT. Applications of a multinomial capture-recapture model to epidemiological data. J Am Stat Assoc. 1974;69:93–7.
7. Sekar CC, Deming WE. On a method of estimating birth and death rates and the extent of registration. J Am Stat Assoc. 1949;44:101–15.
8. Himes CL, Clogg CC. An overview of demographic analysis as a method for evaluating census coverage in the US. Population Index. 1992;58:587–607.
9. Hook EB, Regal RR. Internal validity analysis: a method for adjusting capture-recapture estimates of prevalence. Am J Epidemiol. 1995;142(9):48–52.
10. Crocetti E, Miccinesi G, Paci E, Zappa M. An application of the two-source capture-recapture method to estimate the completeness of the Tuscany Cancer Registry, Italy. Eur J Cancer Prev. 2001;10(5):417–23.
11. Ballivet S, Rachid Salmi L, Dubourdieu D. Capture-recapture method to determine the best design of a surveillance system. Application to a thyroid cancer registry. Eur J Epidemiol. 2000;16:147–53.
12. Seddon DJ, Williams EM. Data quality in population-based cancer registration: an assessment of the Merseyside and Cheshire Cancer Registry. Br J Cancer. 1997;76(5):667–74.
13. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation I: history and development. Am J Epidemiol. 1995;142(10):1047–58.
14. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple-record systems estimation II: applications in human diseases. Am J Epidemiol. 1995;142(10):1059–68.
15. Ledberg A, Wennberg A. Estimating the size of hidden populations from register data. BMC Med Res Methodol. 2014;14:58.
16. Goodman LA. A general model for the analysis of surveys. Am J Sociol. 1972;77(6):1035–86.
17. Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge: MIT Press; 1975. Chapters 5-6. Reprinted by Springer Science+Business Media, LLC; 2007. ISBN 978-0-387-72805-6.
18. Tilling K, Sterne JAC. Capture-recapture models including covariate effects. Am J Epidemiol. 1999;149(4):392–400.
19. Chao A, Tsay PK, Lin SH, Shau WY, Chao DY. The applications of capture-recapture models to epidemiological data. Stat Med. 2001;20:3123–57.
20. King R, Bird SM, Hay G, Hutchinson SJ. Estimating current injectors in Scotland and their drug-related death rate by sex, region and age-group via Bayesian capture-recapture methods. Stat Methods Med Res. 2009;18(4):341–59.
21. Schmidtmann I. Estimating completeness in cancer registries: comparing capture-recapture methods in a simulation study. Biom J. 2008;50(6):1077–92.
22. Silcocks PB, Robinson D. Completeness of ascertainment by cancer registries: putting bounds on the number of missing cases. J Public Health (Oxf). 2004;26(2):161–7.
23. Chao A, Pan HY, Chiang SC. The Petersen–Lincoln estimator and its extension to estimate the size of a shared population. Biom J. 2008;50(6):957–70.
24. Mao CX. Computing an NPMLE for a mixing distribution in two closed heterogeneous population size models. Biom J. 2008;50(6):983–92.
25. Manrique-Vallier D, Fienberg SE. Population size estimation using individual level mixture models. Biom J. 2008;50(6):1051–63.
26. Otis DL, Burnham KP, White GC, Anderson DR. Statistical inference from capture data on closed animal populations. Wildlife Monographs. 1978;62:1–135.
27. King R, Brooks SP. On the Bayesian estimation of a closed population size in the presence of heterogeneity and model uncertainty. Biometrics. 2008;64(3):816–24.
28. Bailly L, Daurès JP, Pradier C. Investigating the completeness of a histopathological cancer registry: estimation by capture-recapture analysis in a French geographical unit, Alpes-Maritimes, 2008. Cancer Epidemiol. 2011;35(6):62–8.
29. Chao DY, Shau WY, Lu CWK, Chen KT, Chu CL, Shu HM, et al. A large outbreak of hepatitis A in a college school in Taiwan: associated with contaminated food and water dissemination. Taiwan Government: Epidemiology Bulletin, Department of Health, Executive Yuan; 1997.
30. Bruno GB, Biggeri A, LaPorte RE, McCarty D, Merletti F, Pagono G. Application of capture-recapture to count diabetes. Diabetes Care. 1994;17:548–56.
31. Wittes JT, Colton T, Sidel VW. Capture-recapture methods for assessing the completeness of case ascertainment when using multiple information sources. J Chronic Dis. 1974;27:25–36.
32. Fienberg SE. The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika. 1972;59:591–603.
33. Pledger S. Unified maximum likelihood estimates for closed capture-recapture models using mixtures. Biometrics. 2000;56(2):434–42.
34. Hoeting JA, Madigan D, Raftery AE, Kronmal RA. Bayesian model averaging: a tutorial. Stat Sci. 1999;14(4):382–417.
35. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. J Am Stat Assoc. 1990;85(410):398–409.
36. Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput. 2000;10:325–37.
37. Link WA, Barker RJ. Bayesian inference with ecological applications. London: Academic Press (Elsevier); 2010. p. 201–24.
38. Tilling K. Capture-recapture methods: useful or misleading? Int J Epidemiol. 2001;30(1):12–4.
39. Brenner H, Stegmaier C, Ziegler H. Estimating completeness of cancer registration: an empirical evaluation of the two-source capture-recapture approach in Germany. J Epidemiol Community Health. 1995;49(4):426–30.
40. Coull BA, Agresti A. The use of mixed logit models to reflect heterogeneity in capture-recapture studies. Biometrics. 1999;55:294–301.
Acknowledgements
We thank the reviewers for their careful reading and constructive advice. We thank Dr. Eugènia Mariné Barjoan, Dr. Damien Ambrosetti, Dr. Bernard Giusiano, and Miss Agnès Viot, who made substantial contributions to the conception, design and acquisition of data. This study is supported by the Institut National du Cancer, Paris, France. The authors thank Dr Béatrice Jacqueme, ARS PACA and the Conseil Général des Alpes-Maritimes for their support. Special thanks to the Department of Public Health’s team, to all CRISAP PACA’s pathologists, and to APREMAS’s and OncoPACA’s teams.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LB participated in the design of the study, performed the statistical analysis and wrote the paper. JPD conceived of the study and implemented the Bayesian estimation of a population. CP and BD participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
Additional file
Additional file 1:
Appendices. A) Bayesian Model Averaging. B) Reversible jump MCMC model. C) Prior for the total population. D) WinBUGS Codes.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bailly, L., Daurès, J.P., Dunais, B. et al. Bayesian estimation of a cancer population by capture-recapture with individual capture heterogeneity and small sample. BMC Med Res Methodol 15, 39 (2015). https://doi.org/10.1186/s12874-015-0029-7
DOI: https://doi.org/10.1186/s12874-015-0029-7
Keywords
 Capture-recapture
 Cancer population
 Capture-recapture models
 Bayesian model averaging
 Capture heterogeneity
 Completeness of cancer registries