### Outline and notation

We consider the following situation (see Figure 1 for illustration): The target population (consisting of *N* individuals) is followed over time during a time interval of total duration *T*. At each time-point a given individual may or may not be registered. We use ‘registered’ in a general sense here, it can for example refer to an instance of care at a medical facility; an arrest by the police; the purchase of a medically prescribed drug, etc. We assume that each individual in the target population, if not previously registered, has a fixed probability *π* of being registered at each time-point. We will refer to *π* as registration probability in the following. A registration (independent of the nature of the registry), must contain an identifier of the person (PIN) and the time-point of registration. We note that there might be a delay between the occurrence of the event we would like to track, and the time point of registration of this event. Consider for example an attempt to estimate the prevalence of a certain disease in the population. If health care records are used to estimate this, there might be a delay between the time of acquiring the disease and the first contact with the health care system. We will assume that such delays, if they occur, do not introduce systematic errors in the analysis and hence can be ignored.

Next we will detail how such register data can be used to estimate the total size *N* of the population. First we will describe the novel ML approach and then an approach that has been previously used with data of thistype.

### Maximum likelihood estimation of *N*

Here we describe how the data in the register can be used to estimate *N* using ML (see Figure 1 for illustration). First the observation interval is divided into *M* non-overlapping epochs with equal duration *d*. That is, the *j*-th epoch, *I*
_{
j
} is given by *I*
_{
j
}=[*t*
_{0}+*d*(*j*−1),*t*
_{0}+*d* *j*) where *t*
_{0} denotes the time point when the observations start and *j*=1,2,…,*M*. We will assume that time is discrete and that there are *d* opportunities to be registered in an interval of duration *d*. For each of the *M* epochs we count the number of new persons being registered in that epoch. That is, we count the persons registered during epoch *I*
_{
j
} that were not registered in any of the previous epochs. These numbers will be called *n*
_{1},*n*
_{2},…,*n*
_{
M
} and are used in the following to represent both random variables and samples of these random variables. It is *n*
_{1},*n*
_{2},…,*n*
_{
M
} that constitute the data we will use to estimate the total number of persons in the population (i.e. *N*). The number of intervals *M*>1 is a parameter in the suggested method and we will study how the estimate of *N* depend on *M* below.

We proceed by deriving the ML estimator under rather idealized conditions (a closed and homogeneous population) but will later study numerically how the estimator performs when we make the population heterogeneous. Assuming that the target population does not change in the time window during which we make the observations, and that each member of the population has the same registration probability, *π*, it is straight-forward to write down the probability distribution of the sample *n*
_{1},*n*
_{2},…,*n*
_{
M
}. Indeed, the number of persons registered in the first epoch, i.e. *n*
_{1}, is distributed as a Binomial random variable with parameters *N* and *p*=def1−(1−*π*)^{d}. Note that *p* depends on the duration of the epoch (through *d*). In the next epoch there are *N*−*n*
_{1} unregistered individuals in the population and conditionally on *n*
_{1}, the random variable *n*
_{2} is distributed as a Binomial variable with parameters *N*−*n*
_{1} and *p*. Continuing this argument it follows that

{n}_{j}|{n}_{1},{n}_{2},\dots {n}_{j-1}\sim \text{Bin}\left(N-\sum _{i=1}^{j-1}{n}_{i},p\right).

(1)

The log-likelihood function of *N* and *p* given the data *n*
_{1},*n*
_{2},…,*n*
_{
M
} is given by (e.g. [15])

\begin{array}{ll}{L}_{M}=& log\left(p\right){s}_{M}+Nlog\left(N\right)\\ +\sum _{i=1}^{M}\left(N-\sum _{j=1}^{i}{n}_{j}\right)log(1-p)\\ -(N-{s}_{M})log(N-{s}_{M})+C\end{array}

(2)

Here *s*
_{
M
} denotes the sum of the observations:

{s}_{M}\stackrel{\text{def}}{=}\sum _{i=1}^{M}{n}_{j},

and *C* is represent terms that do not depend on *N* or *p*. To derive the likelihood equation (Eq. 2) the following standard approximation, valid for large *n*, was used:

n!\simeq nlog\left(n\right)-\mathrm{n.}

Given the log-likelihood function (Eq. 2) we take as our estimate of population size the value of *N* that maximizes that expression (henceforth denoted *N*
_{
M
L
}). Note that we simultaneously estimate the registration probability *p*, but since this is not the parameter of main interest, we will focus on the estimator of *N*.

For the case *M*=2 the maximum of the likelihood function is given by {N}_{\mathit{\text{ML}}2}={n}_{1}^{2}/({n}_{1}-{n}_{2}). However, in the general case, the maximum likelihood estimates are easiest found by maximizing Eq. 2 numerically, for example using Newton’s method. It should be noted that the ML estimator is not applicable for all samples. For the case *M*=2, for example, *n*
_{1} must be greater than *n*
_{2}. General conditions for when the ML estimator is applicable have been derived before [18]. As we will show later, in cases where the ML estimator give good estimates of *N*, it is very unlikely to get a sample for which the ML estimator does not exist.

### Truncated poisson estimate of *N*

Given a single source of data with possible multiple registrations per person there are alternative estimates of populations size that can be (and have been) used. The simplest of these estimates can be derived if one assumes that the sampled data follow a zero-truncated Poisson distribution (this is just a re-weighted standard Poisson distribution in which the zero bin is not observed). We note that if the probability *π* of being registered at each time point is independent of previous registrations, and there are *T* time points in total, then the number of registrations for each subject is distributed as a binomial variable with parameters *T* and *π*. Given reasonable values of these parameters, the distribution of the number of registrations will closely follow a Poisson distribution with parameter *λ*=def*T* *π* (e.g. [19]). Consider for example that observations are made daily for a year, then *T*=365, and if the probability of being registered at least once in a year is 0.3 then *π*≃0.001 which gives *λ*≃0.36. These are values for which the Poisson approximation of the Binomial distribution is very good and, consequently, a zero truncated Poisson distribution is a natural choice to model the distribution of the registrations.

To derive estimators from this distribution we may proceed as follows. With *λ* being the parameter in the Poisson distribution (the rate), the probability mass function of the untruncated distribution has the following form

f(k;\lambda )=\text{Pr}(X=k)\stackrel{\text{def}}{=}\frac{{\lambda}^{k}{e}^{-\lambda}}{k!}.

Here *X* denotes the random variable standing for the number of registrations made (for one person). Now, draw *N* independent samples, *X*
_{
i
}, *i*=1,2,…*N* following this distribution (here *N* is of course the (unknown) size of the population) and let *h*
_{
j
} denote the count of cases where *X*
_{
i
}=*j* for *j*=1,2,…*T*. That is, *h*
_{1} is the number of observed cases that was registered exactly once, *h*
_{2} the number of cases observed exactly two times, etc. Note that in applications *h*
_{0} is not observed; it represents the “hidden” part of the population. The *h*
_{
j
}s are random variables with the following expected values

\begin{array}{l}\mathrm{E}\left({h}_{0}\right)=N{e}^{-\lambda},\phantom{\rule{1em}{0ex}}\mathrm{E}\left({h}_{1}\right)=\mathrm{N\lambda}{e}^{-\lambda},\\ \mathrm{E}\left({h}_{2}\right)=\frac{N{\lambda}^{2}{e}^{-\lambda}}{2}.\end{array}

From this it follows that

\mathrm{E}\left({h}_{0}\right)=\frac{{\left(\mathrm{E}\right({h}_{1}\left)\right)}^{2}}{2\mathrm{E}\left({h}_{2}\right)}.

(3)

If we replace the expected values with the sample values we get the following simple estimator of *N*:

{N}_{C}=\sum _{i=1}^{T}{h}_{i}+\frac{{h}_{1}^{2}}{2{h}_{2}}.

(4)

By the law of large numbers we know that *N*
_{
C
} will be close to *N* when *N* is large. We will investigate the convergence further below. This estimator (*N*
_{
C
}) was first derived in [17], using a more general formalism and we will refer to *N*
_{
C
} also as Chao’s estimator.

An alternative estimator can be obtained by first estimating *λ*, and then using that

N(1-{e}^{-\lambda})=\mathrm{E}\left(\sum _{i=1}^{T}{h}_{i}\right).

The parameter *λ* can, for example, be estimated as

{\lambda}_{e}=\frac{2{h}_{2}}{{h}_{1}},

which leads to the following estimator of *N*

{N}_{Z}=\frac{\sum _{i=1}^{T}{h}_{i}}{1-{e}^{-2\frac{{h}_{2}}{{h}_{1}}}}.

This estimate was derived in [10] using a different approach.

### Monte Carlo simulations

To evaluate the performance of the estimators we performed Monte Carlo simulations. In these simulations we generated data for 512 ‘time steps’ (e.g. days) where each of *N* individuals has a probability *p* of being registered. Simulations where run for different values of *N* and different values of *p* (described in the Results). All simulation results were based on 5000 realizations. For the ML estimator the maximum of the likelihood (Eq. 2) was found numerically (for *M*>2) using Newtons method. In some cases the method did not converge, something that often was due to a failure of automatically specifying initial conditions within the region of convergence. In such cases a small change to the initial conditions is often sufficient for convergence. Since there is an explicit expression for the case *M*=2, the corresponding estimate is a good starting point for the algorithm. Although it is possible (in particular for small samples) to get data for which the ML estimator does not exists [18], for our purposes problems of convergence were very rare in all cases where the estimator will be useful in practice (see Results). An indication that a sample is not suited for the ML estimator is if *n*
_{1}≤*n*
_{2} (for the case of *M*=2), and if so other methods of estimation must be used.

The ML estimator was derived assuming that all the individuals in the population have the same registration probability, something that translates to *p* in the simulations is the same for individuals. To model the effect of heterogeneity in the target population we run simulations where we allowed *p* to vary from individual to individual. In particular, the *p*s were randomly sampled from a normal distribution with a fixed mean value. The parameter used to control population heterogeneity was the standard deviation of the normal distribution. This way the effect of increasing population heterogeneity can be studied systematically from a constant *p* (zero standard deviation) to *p*s that vary a lot between individuals (large standard deviation). We also run simulations where we sampled *p*s from uniform distributions but the results were similar to those obtained using a normal distribution and are therefore not shown.

### Empirical data

Apart from the evaluation done on simulated data, we also wanted to use the suggested method on ‘real’ data. Here it was deemed important to use a data set where *N* could be assumed known, since then we can evaluate how close to the true value the different estimates are. We approached this by using as our target ‘population’ the accidentally deceased opiate misusers (fatal overdoses) in Sweden during 2006–2011. Most such cases occur outside of hospitals and the causes of death are therefore investigated in detailed forensic autopsies (including a toxicological analysis). We can therefore use the central Swedish Causes of Death registry held by The National Board of Health and Welfare (Socialstyrelsen) to count these cases and get a reasonably precise value of *N*.

To identify the deceased persons making up the target population we proceeded as follows. We selected all persons between 23 and 60 years of age at the time of death who were listed in the causes of death registry with heroin or methadone poisoning (ICD-10 codes T40.1 and T40.3) as contributing cause of death. This choice of substances should be obvious. However, there are also a substantial number of cases with morphine poisoning (T40.2) but these were not included as they were judged to constitute a mixed group of individuals, where only some are opiate misusers in the sense relevant here. The upper age limit was to minimize the number of cases where methadone had been medically prescribed and used in pain management therapy. The lower age limit was to enable a search of five year medical history for all individuals (we are tacitly assuming that most younger persons in this population had started their opiate misuse by the time they were 18 years old). We furthermore excluded all cases that were judged to be suicides. After applying these criteria there were 486 persons remaining who can be assumed to have misused opiates on a regular basis and these 486 will constitute our target population. To evaluate the population size estimators we will now see how well we can estimate the size of this population from another set of data. For this we will use health care data as described next.

For each of the *N*=486 individuals in the population we extracted information on previous contacts with the Swedish health care system by using the National Patient Registry (held by The National Board of Health and Welfare). This registry contains all instances of inpatient care in Sweden with good coverage and contains the main diagnosis (classified according to the WHO ICD-10 classification system) and time points for care. For the target population we extracted for all occurrences of inpatient care under diagnoses indicative of substance misuse. In particular, we used the following ICD-10 diagnoses: F11-F16, F18, and F19 and restricted the search to five years immediately preceding the time of death. During these time intervals 262 of the 485 persons (54%) had been receiving care under these diagnoses. We note that this might be taken as support for the assumption that the 485 persons constitute a relatively homogeneous population with respect to substance misuse. There were a total of 1165 hospital records from these 262 persons and most individuals had more than one care occasion. These health-care records will be used below to estimate *N*.

Since the individuals were all deceased at the time of project initiation, ethics committee approval was not necessary according to the regional research ethics board (Etikprövningsnä mnden i Stockholm).