Correspondence: Some general points regarding Ledberg and Wennberg, BMC Medical Research Methodology 2014 April 27;14:58
- Dankmar Böhning^{1} and
- Peter G.M. van der Heijden^{2, 3}
https://doi.org/10.1186/s12874-015-0043-9
© Böhning and van der Heijden. 2015
Received: 16 August 2014
Accepted: 16 June 2015
Published: 7 July 2015
The Erratum to this article has been published in BMC Medical Research Methodology 2015 15:76
Abstract
The purpose of this note is to contribute some general points on a recent paper by Ledberg and Wennberg (BMC Med Res Meth 14:58, 2014) which need to be rectified. They advocate the capture-removal estimator. First, we will discuss drawbacks of this estimator in comparison to the Lincoln-Petersen estimator. Second, we show that their evaluation of the Chao estimator is flawed. We conclude that some statements in Ledberg and Wennberg with respect to Chao’s estimator and removal estimation need to be taken with great caution.
Main text
In a recent paper, Ledberg and Wennberg [1] propose to use the capture-removal estimator (Otis et al. [2]; Seber [3]; Borchers et al. [4]; ch. 5) for estimating the size of a hidden population from register data. It is assumed that a register has registrations from M occasions with M>1. These occasions refer to different points in time, so they are chronologically ordered. At any occasion, the approach considers only new registrations and ignores individuals that have been registered before. Under the assumption that registration is independent across occasions and that the probability of registration is homogeneous, a likelihood function can be determined and maximized in the two parameters involved, the probability of registration and the size N of the population. Below we will first show, for two occasions, that the capture-removal estimator can have drawbacks in comparison with the Lincoln-Petersen estimator. Then we will show that the evaluation of the Chao estimator given by Ledberg and Wennberg is flawed.
M=2 occasions
We consider the case of two occasions, M=2. This is the simplest possible case and it also allows an easy comparison with the Lincoln-Petersen estimator and the bias-corrected Chapman estimator (Borchers et al. [4]). As in Ledberg and Wennberg [1], let n_{1} denote the number of registrations at occasion 1 (at occasion 1 every registration is new) and n_{2} the number of registrations at occasion 2 of individuals not yet registered at occasion 1. In addition, let m denote the number of individuals registered at both occasions, so that m+n_{2} is the total number of registrations at occasion 2. For M=2 occasions the maximum likelihood estimate of N can be derived in closed form: \( \hat N_{R} = \frac {{n_{1}^{2}}}{n_{1}-n_{2}}\), assuming that n_{1}>n_{2}, which may or may not be met in practice. We denote this estimator by \(\hat N_{R}\), with index R for removal. For comparison, we consider the Lincoln-Petersen estimator given as \( \hat N_{\textit {LP}} = \frac {n_{1}(m+n_{2})}{m}\) and the Chapman estimator given as \( \hat N_{\textit {Ch}} = \frac {(n_{1}+1)(m+n_{2}+1)}{(m+1)}\).
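The three formulas above can be written down directly. The following is a minimal sketch; the function names and the example counts are hypothetical and chosen only for illustration, and the Chapman formula is coded exactly as it is written in the text.

```python
def removal_estimate(n1, n2):
    """Capture-removal estimate for M=2: n1^2 / (n1 - n2).

    Returns None when n1 <= n2, where the estimate is not defined.
    """
    if n1 <= n2:
        return None
    return n1 * n1 / (n1 - n2)


def lincoln_petersen_estimate(n1, n2, m):
    """Lincoln-Petersen estimate n1 * (m + n2) / m; None if m == 0."""
    if m == 0:
        return None
    return n1 * (m + n2) / m


def chapman_estimate(n1, n2, m):
    """Chapman estimate (n1 + 1)(m + n2 + 1) / (m + 1), as given in the text."""
    return (n1 + 1) * (m + n2 + 1) / (m + 1)


# Hypothetical counts: 500 registered at occasion 1, 250 of them again at
# occasion 2, and 250 registered for the first time at occasion 2.
n1, n2, m = 500, 250, 250
print(removal_estimate(n1, n2))
print(lincoln_petersen_estimate(n1, n2, m))
print(chapman_estimate(n1, n2, m))
```

Note that the removal estimate uses only n_{1} and n_{2}, whereas the other two also use the number of re-registrations m.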
We will show that the latter two estimators are the better choice in the following two situations. First, when the assumption of constant and occasion-independent inclusion probabilities underlying the capture-removal estimator is met, the Lincoln-Petersen and the Chapman estimators are generally more efficient. Second, when the assumption of homogeneous inclusion probabilities that underlies the capture-removal estimator is not met, the capture-removal estimator is biased whereas the Lincoln-Petersen and the Chapman estimators are not. However, when there is behavioral response, i.e. after an inclusion the probability of the next inclusion increases, the Lincoln-Petersen and the Chapman estimators are biased downwards, whereas the capture-removal estimator might be less biased, depending on the constellation of marginal distributions and occasion dependency. We note that in the biological literature the first condition is known as M_{0}, for the inclusion probability being constant over time (under which the removal estimator is derived), and the second condition is known as M_{ t }, for the inclusion probability varying with occasions (under which the Lincoln-Petersen and Chapman estimators are derived), whereas behavioral response is M_{ b }.
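As a rough first-order illustration of these points (a sketch, not part of the original argument), one can replace the counts by their expectations and ignore sampling variability. Write p_{11} for the probability of registration at both occasions and p_{01} for the probability of registration at occasion 2 only, with marginal probabilities p_{1} and p_{2}. Then E(n_{1})=Np_{1}, E(n_{2})=Np_{01} and E(m)=Np_{11}, so that

\[ \hat N_{R} \approx N\,\frac{p_{1}^{2}}{p_{1}-p_{01}}, \qquad \hat N_{\textit{LP}} \approx N\,\frac{p_{1}\,p_{2}}{p_{11}}. \]

The first ratio equals 1 exactly when p_{01}/(1−p_{1})=p_{1}, i.e. when the probability of a first registration at occasion 2 equals that at occasion 1. The second ratio equals 1 exactly when p_{11}=p_{1}p_{2}, i.e. under independence of the occasions. For example, under independence with p_{1}=0.5 and p_{2}=0.3 we have p_{01}=0.15, so \(\hat N_{R} \approx N \cdot 0.25/0.35 \approx 0.71\,N\), whereas \(\hat N_{\textit{LP}} \approx N\).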
Simulation results for a registration system with two occasions. p_{11} is the probability of capture at both occasion 1 and occasion 2, p_{10} is the probability of capture at occasion 1 but not at occasion 2, and so forth. The marginal probabilities of capture at occasions 1 and 2 are p_{1}=p_{11}+p_{10} and p_{2}=p_{11}+p_{01}, respectively. In settings 1 to 6, inclusion on occasion 1 is independent of inclusion on occasion 2. In settings 7 and 8, the occasions become dependent (odds ratio larger than 1), but because the conditional probability of capture at occasion 2 given no capture at occasion 1 is identical to the unconditional probability of capture at occasion 1, the capture-removal estimator works fine. In settings 9 and 10, those conditional and unconditional probabilities are different and the capture-removal estimator breaks down.
Setting | p _{1} | p _{2} | p _{11} | p _{10} | p _{01} | p _{00} | LP: \(\bar N_{\textit {LP}}\) | LP: SD | Chapman: \(\bar N_{\textit {Ch}}\) | Chapman: SD | Removal: \(\bar N_{R}\) | Removal: SD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.5 | 0.5 | 0.25 | 0.25 | 0.25 | 0.25 | 1001 | 31 | 1000 | 31 | 1007 | 59 |
2 | 0.3 | 0.3 | 0.09 | 0.21 | 0.21 | 0.49 | 1006 | 77 | 1000 | 76 | 1064 | 262 |
3 | 0.5 | 0.6 | 0.30 | 0.20 | 0.30 | 0.20 | 1002 | 27 | 1001 | 27 | 1271 | 113 |
4 | 0.3 | 0.35 | 0.105 | 0.195 | 0.245 | 0.455 | 1004 | 66 | 1000 | 65 | 1825 | 6475 |
5 | 0.5 | 0.3 | 0.15 | 0.35 | 0.15 | 0.35 | 1000 | 48 | 998 | 47 | 714 | 21 |
6 | 0.3 | 0.1 | 0.03 | 0.27 | 0.07 | 0.63 | 1021 | 155 | 999 | 146 | 392 | 17 |
7 | 0.5 | 0.55 | 0.30 | 0.20 | 0.25 | 0.25 | 955 | 28 | 955 | 28 | 1003 | 57 |
8 | 0.5 | 0.625 | 0.375 | 0.125 | 0.25 | 0.25 | 834 | 18 | 834 | 18 | 1009 | 56 |
9 | 0.3 | 0.1 | 0.065 | 0.235 | 0.035 | 0.665 | 464 | 34 | 462 | 34 | 340 | 15 |
10 | 0.5 | 0.5 | 0.4 | 0.1 | 0.1 | 0.4 | 625 | 17 | 626 | 17 | 625 | 17 |
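A simulation of this kind can be sketched as follows. This is only an illustration under stated assumptions: the population size of 1000, the number of replications, the seed and all function names are hypothetical choices and are not taken from the original study. Each individual is assigned to one of the four capture histories with probabilities p_{11}, p_{10}, p_{01} and p_{00}, and replications in which an estimator is undefined (m=0, or n_{1} not larger than n_{2}) are skipped for that estimator.

```python
import random
import statistics


def simulate_two_occasions(p11, p10, p01, n_pop=1000, n_rep=500, seed=1):
    """Return {estimator: (mean, SD)} over n_rep simulated registers."""
    rng = random.Random(seed)
    results = {"lincoln_petersen": [], "chapman": [], "removal": []}
    for _ in range(n_rep):
        m = only_first = n2 = 0
        for _ in range(n_pop):
            u = rng.random()
            if u < p11:
                m += 1            # registered at both occasions
            elif u < p11 + p10:
                only_first += 1   # registered at occasion 1 only
            elif u < p11 + p10 + p01:
                n2 += 1           # first registered at occasion 2
            # otherwise the individual is never registered
        n1 = m + only_first
        results["chapman"].append((n1 + 1) * (m + n2 + 1) / (m + 1))
        if m > 0:
            results["lincoln_petersen"].append(n1 * (m + n2) / m)
        if n1 > n2:
            results["removal"].append(n1 * n1 / (n1 - n2))
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in results.items()
    }


# Cell probabilities of setting 5: independent occasions, p1 = 0.5, p2 = 0.3.
for name, (mean, sd) in simulate_two_occasions(0.15, 0.35, 0.15).items():
    print(f"{name}: mean={mean:.0f}, SD={sd:.0f}")
```

Skipping undefined replications is one of several possible conventions, and it affects the reported mean and SD of the removal estimator most strongly in settings where n_{1} and n_{2} are close.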
Multiple occasions
Assume that registrations are followed over a period of time. Since estimates obtained by Chao’s estimator should not strongly depend on the duration of the time period used, similar estimates should be obtained if the first half of the time period is used compared to if the whole time period is used.
Chao’s estimator for registration system with M occasions and true N=200
Setting | p _{1} | p _{2} | M=5: \(\bar s\) | M=5: \(\bar f_{0}\) | M=5: \(\bar N\) | M=5: SD(\(\hat f_{0}\)) | M=10: \(\bar s\) | M=10: \(\bar f_{0}\) | M=10: \(\bar N\) | M=10: SD(\(\hat f_{0}\)) |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0.1 | 0.1 | 82 | 120 | 202 | 52 | 130 | 70 | 200 | 20 |
2 | 0.3 | 0.05 | 106 | 40 | 156 | 13 | 137 | 29 | 166 | 10 |
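The dependence of Chao's estimator on the number of occasions can be explored with a small simulation. The sketch below rests on several assumptions that are not stated in the text above: it uses the binomial form of Chao's lower bound, \(\hat N = s + \frac{M-1}{M}\frac{f_{1}^{2}}{2f_{2}}\), where s is the number of individuals registered at least once and f_{1}, f_{2} are the numbers registered exactly once and exactly twice; it assumes that half of the population has registration probability p_{1} and the other half p_{2}; and it skips replications with f_{2}=0. Function names, number of replications and seed are hypothetical.

```python
import random
import statistics


def chao_estimate(counts, n_occasions):
    """Chao-type lower bound from per-individual registration counts."""
    s = sum(1 for c in counts if c > 0)    # registered at least once
    f1 = sum(1 for c in counts if c == 1)  # registered exactly once
    f2 = sum(1 for c in counts if c == 2)  # registered exactly twice
    if f2 == 0:
        return None                        # estimate undefined
    f0_hat = (n_occasions - 1) / n_occasions * f1 * f1 / (2 * f2)
    return s + f0_hat


def simulate_chao(p_a, p_b, n_occasions, n_pop=200, n_rep=300, seed=1):
    """Mean and SD of the estimate for a two-class population (assumed 50/50)."""
    rng = random.Random(seed)
    probs = [p_a] * (n_pop // 2) + [p_b] * (n_pop - n_pop // 2)
    estimates = []
    for _ in range(n_rep):
        counts = [
            sum(rng.random() < p for _ in range(n_occasions)) for p in probs
        ]
        estimate = chao_estimate(counts, n_occasions)
        if estimate is not None:
            estimates.append(estimate)
    return statistics.mean(estimates), statistics.stdev(estimates)


for occasions in (5, 10):
    print(occasions, simulate_chao(0.1, 0.1, occasions))   # homogeneous
    print(occasions, simulate_chao(0.3, 0.05, occasions))  # heterogeneous
```

Under these assumptions the homogeneous population should give estimates centred near the true size for both values of M, with a clearly smaller spread for M=10, while the heterogeneous population gives a lower bound that moves with M.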
Registration system with two occasions
 | Occasion 2: 1 | Occasion 2: 0 | Total |
---|---|---|---|
Occasion 1: 1 | m | n_{1}−m | n_{1} |
Occasion 1: 0 | n_{2} | x | |
Total | m+n_{2} | | N |
It has been seen that some statements in Ledberg and Wennberg on Chao’s estimator, in particular on its independence of the number of occasions, need to be revised, especially if there is population heterogeneity. Even if there is homogeneity, the variation for the entire period will be considerably smaller than for the first half-period. It might be better to compare the Chao estimator for different periods of equal size.
From our perspective, Chao’s estimator remains one of the most useful estimators in the area. We recently proposed a generalization of Chao’s estimator that can take covariates into account (Böhning et al. [7]). Thus observed population heterogeneity can be modelled, and the lower bound provided by this covariate-adjusted estimator will be closer to the true population size than that of the unadjusted estimator.
Response
by Anders Ledberg and Peter Wennberg
Corresponding author: Anders Ledberg
Email: anders.ledberg@sorad.su.se
Address: Centre for Social Research on Alcohol and Drugs, SoRAD, Stockholm University, SE-10691 Stockholm, Sweden
Introduction
We are happy about the attention our publication “Estimating the size of hidden populations from register data” [1] has received and would like to use this opportunity to clarify what our paper is about and what it is not about.
What our paper is about
In our paper we are considering the problem of estimating the size of an incompletely sampled population. The particular case we have in mind is that when a given individual in the population has constant probability, per unit time, of being first registered, but once registered the probability of future registrations might change, perhaps radically. (We use ’registered’ in a general sense here; the analogous concept in the ecological literature would be ’captured’, or ’trapped’). This case is of interest to us since we believe that it could serve as an approximate model for epidemiological data. As an example, consider the “population” of heavy drug users. Assume that there is a constant probability that heavy drug use leads to contact with the health care system for the first time (and a registration). One possible outcome of such a contact is that the client enters a treatment program that implies regular contacts with the health care system (for example methadone maintenance treatment). Consequently, the probability that this particular individual is registered again is very high (close to one). Indeed, that the probability of registration is history dependent seems to us a generic feature of this type of data. In the literature on population estimation in ecology this history dependence is often called behavioral response [e.g. [2]]. In keeping with this terminology (of [2]) we call this scenario Model M_{ b }. In other words, our paper suggests modeling (some types of) epidemiological data using Model M_{ b }, and to use the maximum likelihood estimator derived under this model [3].
In our paper we evaluate the performance of this maximum likelihood estimator under the scenario we consider, and show when it is applicable, and when it is not (Figure 2 in [1]). In particular, we show that for the estimator to be useful a certain fraction of the population should be sampled, and this fraction depends on the total size of the population (Figure 2 in [1]). An important result is that the estimator is robust under moderate heterogeneity with respect to the probabilities of first registration of different individuals, i.e. they need not be identical for the estimator to be useful (see Figure 3 in [1]). Another contribution is that we show that some other estimators, that have been used on data that could be reasonably modeled using Model M_{ b }, can have a substantial bias when applied to data from Model M_{ b }. In particular, we show that an estimator that can be derived assuming that the data follow a truncated Poisson distribution, can have a substantial bias, and that this bias can be positive, i.e. it might lead to an overestimation of the population size (see Figure 6 in [1]).
What our paper is not about
Estimating the size of hidden populations is a problem that has been treated by many authors and there are many different methods in use. The basic idea in deriving a measure (an estimator) is to start with a particular scenario (model) for the registrations, and from this model derive an estimator. Thus, key aspects of a real situation (e.g. drug users interacting with the health care system) are captured in an idealized model (Model M_{ b } in our case), and given this model an estimator is derived (maximum likelihood estimator in our case). The estimator is then strictly valid only under the model considered. We certainly do not suggest that the maximum likelihood estimator should be used if the data at hand are better described by other models (such as Models M_{0} or M_{ t }, for example). Indeed, that an estimator derived under model A does not perform well when applied to data generated under model B is neither surprising nor informative for its performance under model A.
Our paper does not provide an evaluation of other estimators, and our evaluation of the maximum likelihood estimator is done only under some particular scenarios. We have no particular attachment to the estimator we propose, but for the type of data we are interested in it still seems a most reasonable choice (given, of course, that a sufficient fraction of the population is sampled). Böhning and van der Heijden do not suggest another estimator that works better in this case, something we interpret as them being in tacit agreement with us. Perhaps contrary to these workers, we do not believe in a “universal estimator” that should always be used. Rather, as we suggest in our paper, application of several estimators, relying on different assumptions, might provide complementary information about the data at hand and might help in getting more reliable estimates.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This research has been financed by the Swedish Council for Working Life and Social Research (FAS 2006-1523).
References
1. Ledberg A, Wennberg P. Estimating the size of hidden populations from register data. BMC Med Res Methodol 2014;14(58):58.
2. Otis D, Burnham K, White G, Anderson D. Statistical-Inference From Capture Data On Closed Animal Populations. Wildlife Monogr 1978;(62):7–135.
3. Moran P. A Mathematical Theory Of Animal Trapping. Biometrika 1951;38(3-4):307–311.
Notes
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Ledberg A, Wennberg P. Estimating the size of hidden populations from register data. BMC Med Res Methodol. 2014; 14:58.
- Otis DL, Burnham KP, White GC, Anderson DR. Statistical inference from capture data on closed animal populations. New York: Wiley; 1978.
- Seber GAF. The estimation of animal abundance, 2nd Ed. London: Charles Griffin; 1982.
- Borchers DL, Buckland ST, Zucchini W. Estimating animal abundance: closed populations. London: Springer; 2002.
- Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1987; 43:783–91.
- Chao A. Estimating population size for sparse data in capture-recapture experiments. Biometrics. 1989; 45:427–38.
- Böhning D, Lerdsuwansri R, Vidal-Diez A, Viwatwongkasem C, Arnold M. A generalization of Chao’s estimator for covariate information. Biometrics. 2013; 69:1033–42.