Research article  Open  Open Peer Review  Published:
Estimating the standardized incidence ratio (SIR) with incomplete followup data
BMC Medical Research Methodologyvolume 17, Article number: 55 (2017)
Abstract
Background
A standard parameter to compare the disease incidence of a cohort relative to the population is the standardized incidence ratio (SIR). For statistical inference is commonly assumed that the denominator, the expected number of cases, is fixed. If a disease registry is available, incident cases can sometimes be identified by linkage with the registry, however, registries may not contain information on migration or death from other causes. A complete followup with a population registry may not be possible. In that case, endoffollowup date and therefore, exact personyears of observation are unknown.
Methods
We have developed a method to estimate the observation times and to derive the expected number of cases using population data on mortality and migration rates. We investigate the impact of the underlying assumptions with a sensitivity analysis.
Results
The method provides a useful estimate of the SIR. We illustrate the method with a numerical example, a simulation study and with a study on standardized cancer incidence ratios in a cohort of migrants relative to the German population. We show that the additional variance induced by the estimation method is small, so that standard methods for inference can be applied.
Conclusions
Estimation of the observation time is possible for cohort studies with incomplete followup.
Background
The analysis of epidemiologic cohort studies usually requires personyears (py) calculation to measure the exact time at risk. Personyears are not only used to compute different rates within a cohort resulting in e.g. rate ratios (RR), but also to calculate the commonly applied standardized incidence or mortality ratio (SIR, SMR) to compare the observed number of certain events with the expected number given the rates in the underlying population [1]. A SIR or a SMR estimates the occurrence of an event in a population relative to what might be expected if the population had the same experience as some larger comparison population designated as ‘normal’ or average or reference.
To be able to calculate the exact accumulated number of personyears within a cohort, the dates (i) begin of followup and (ii) end of followup must be known for each individual. For estimating the SIR, the date of disease onset and reference incidence rates for comparison (to calculate the expected number of cases) must also be available. The latter may come from disease registries, e.g. a cancer registry. The end of the observation period is defined as the minimum of date of death, date of disease onset and endoffollowup/losttofollowup date.
Statistical inference for the SIR and the SMR is usually based on the assumption that the denominator is fixed and that the numerator is Poisson distributed. Both approximate and exact solutions have been developed [1]. The analysis is implemented in common software programs like SAS, R or STATA. Few authors, however, have considered the case when the denominator must be considered as random. Silcocks showed that the beta distribution can be used for statistical inference when the expected number of events is subject to random variation [2]. More recently, Beyene and Moineddin compare methods for measuring the “location quotient”, which is the relative contribution of a rate in a subset (size n_{i}) to the whole population (size n) [3]. In both of these cases the variance of the denominator cannot be neglected because the underlying population is large enough relative to the study population.
In the present case, a different situation is considered which has not been paid much attention. While the linkage of a cohort with a disease registry may be straight forward, the linkage with the population registry may be hampered by bureaucracy, data protection issues, costs, or other reason. In that case the exact personyears are unknown. This was the case for a cohort study conducted by us and motivated the present work.
In this paper we present a method to estimate the SIR and to perform statistical inference under the following situation: (i) The cohort is fully defined and information on basic covariates (date of birth, date of entry into cohort, sex, covariates where appropriate) is complete (ii) incident cases of the disease(s) of interest have been identified through linkage with a disease registry, with date of disease onset known. This registry may cover the geographical area from which the cohort originated, but not necessarily the whole country (iii) a mortality followup is not available, i.e. it is unknown if a person died of a disease other than the disease of interest, and it is unknown whether and when a person moved out of the catchment area of the disease registry.
This situation is not uncommon. In the second section, we provide the estimate, a procedure to implement this in SAS, an estimate for the variance, and give recommendations for a sensitivity analysis. In the third section, we illustrate the procedure with a simple artificial example.
In the fourth section we investigate properties of the estimation procedure with a simulation study. In the fifth section, we present data from a cohort study in which this situation occurred [4]. It is a study on immigrants from the Former Soviet Union in a German state (Saarland) where we calculated the SIR for cancer, based on linkage with the Saarland cancer registry. This study was performed with 18 621 migrants from the Former Soviet Union who immigrated between 1990 and 2005 to the German federal state of Saarland. For every individual, name, sex, date of birth, date of immigration (start of the observation period), and first city of residence was available. A complete incidence followup was done by linking the cohort with the Saarland cancer registry which is a high quality register for this federal state [5]. 470 incident cancer (except nonmelanoma skin cancer) cases were identified. Mortality followup and tracing of individuals who relocated from the state, however, was not performed. We present a sensitivity analysis to assess the effect of the assumptions made in the method. Finally, we discuss the relevance of our findings.
Methods
Outline and notations
We consider data of a cohort with N individuals numbered as n = 1, …, N where date of birth, date of beginning of followup, sex and possibly other covariates are known. We assume further that a disease registry exists which covers the geographical area from which the cohort was selected. Linkage with this disease registry yields the date of diagnosis for those diseased before a fixed followup date. Then, for each individual the date of diagnosis for those diagnosed within the catchment area of the registry (in the following: study area) or no information on the end of followup is given. Thus, four events are possible for each individual as given in Fig. 1.
Event 1) person was identified as incident case (person D)
Event 2) person died from a cause other than the disease of interest at date d_{i} (person B)
Event 3) person moved out of the study area at date d_{i} (person C)
Event 4) person is alive at the end of followup date and residing in the study area (person A)
Events of person B and C are not observed. An assumption that all these individuals contribute personyears until the endoffollowup date is incorrect and would overestimate the personyears and hence underestimate the SIR. The SIR for a disease D is estimated by the ratio of observed O and expected cases E. E is calculated as
where py _{ ij } are the observed personyears in the full cohort for age group i and calendarperiod j, and λ_{ij} the corresponding incidence rates for disease D in the reference population.
For the estimation according to the later proposed method we need the mortality rates for all other diseases except D, and the outmigration rates. Both may be age and perioddependent, and we denote these as μ _{ ij } and ν _{ ij } , respectively.
To introduce the method for personyears estimation, we also need the estimated personyear contribution of an individual n for year 1, 2 … . We denote this as \( {\hat{py}}_{i1 n} \), \( {\hat{py}}_{i2 n} \) etc. Year 1 denotes the first calendar year in which cohort members contribute observation time.
Personyears estimation with incomplete followup for a single individual
In order to estimate py_{ij} from the available data, we assume that mortality rates for causes other than the disease of interest μ_{ij} which can be applied to the cohort (by age and calendar period) (event 2) and the migration rates ν_{ij} (event 3) are available. Note that the following applies to the cohort members who did not become incident cases during the followup. For these incident cases, persontimes are exactly known and can be assigned to age groups and calendar periods as usual.
For sake of simpler presentation we assume for a moment that followup start and birthday is January 1^{st}. We assume further for a moment that age groups and calendar periods of one year are considered, that the expected personyear contribution for an individual n for which one of the events occurred in a year is 0.5 years, and that the events 1,2,3 are independent. Then, the estimated personyear contribution for individual n which is not an incident case belonging to age i at calendar year 1 after entry into the study, \( {\hat{py}}_{i1 n} \), is
With
We can assume small rates, so γ _{ i1} can be approximated by γ _{ i1} = μ _{ i1} + ν _{ i1}. So if, for example, μ _{ i1} = 0.002 and ν _{ i1}= 0.013, then γ _{ i1} = 0.002 + 0.013 − 0.002 × 0.013 = 0.015 − 0.000026 = 0.014974 ≈ 0.015, and an estimate for the personyear contribution in the first year is (1–0.015) + 0.5 × 0.015 = 0.9925 personyears instead of a full personyear. In other words, from 1000 individuals in that age group who did not get the disease of interest, an expected number of 15 individuals either die from another disease or migrate within a year. Each of these 15 individuals contribute an expected half personyear (since a uniform distribution of the time of occurrence of the events within a year can be assumed), so the expected total number of personyears for these 1000 individuals is 985 + 15x0.5 = 992.5 personyears, i.e. 0.9925 per person. For year 2 calculation is similar: From 1000 individuals in that age group who did not get the disease of interest, an expected number of 15 either died from another disease or migrated in the first year and therefore do not contribute persontime for the second year. From the remaining 985 individuals, 985 × 0.015 = 14.775 are expected to die or migrate in the second year, i.e. 985–14.775 = 970.225 contribute a full year of observation time in the second year. So the expected total number of personyears for the initial 1000 individuals for the second year is (970.225 + 14.775 × 0.5) = 977.612 personyears, i.e. 0.977612 per person. The general equation for the second year is thus
This procedure continues for an arbitrary year k until the endoffollowupdate as
This can be rewritten as
Since \( \frac{\left(10.5{\gamma}_{i k}\right)\left(1{\gamma}_{i, k1}\right)}{\left(10.5{\gamma}_{i, k1}\right)} \) is numerically close to (1 − γ _{ i,k }) unless the rates change strongly from one year to the next, we use the simpler equation
A full numerical example is given in Table 1. In a cohort individual entry dates and birth dates are arbitrary,, and the calendar period changes over time and so do the corresponding rates μ_{ij} and ν_{ij}.. We now use the common notation for personyears which are given according to age groups. In praxis, 5year age groups i = 1, .., I are commonly used. We use this grouping in the following. The expected persontime that an individual n contributes to the age class i × calendar year j, j = 1, …, J , \( {\hat{py}}_{ijn} \) is then given as the probability that the class is entered multiplied by the expected observation time within that class. The probability that the class is entered is the product of the failure probabilities in all previous age × calendar year classes (i’,j’), i.e.
\( {\displaystyle \prod_{\begin{array}{l}\kern1.5em i\hbox{'}, j\hbox{'},\\ {}\kern1em i\hbox{'}< i, j\hbox{'}< j\\ {}\left( i\hbox{'}, j\hbox{'}\right)\ne \left( i, j\right)\\ {}\end{array}}\left(1{\gamma}_{i\hbox{'} j\hbox{'}} p{y}_{i\hbox{'} j\hbox{'} n}^{*}\right)} \), where γ _{ i ' j '}is the sum of the mortality and migration rate in class (i,j) and py _{ i ' j ' n } ^{*} is the maximal persontime (raw personyears) in that class (which is 1 year under this categorization), and thus we get
Equation (6) is the general form of Eq. (3) which allows arbitrary entry times, birth dates, and class length. Summing up the expected personyears for each cohort member over all agegroups and calendar years then gives an estimate for the expected personyears in all categories, \( {\hat{py}}_{ij}={\displaystyle \sum_{n=1}^N{\hat{py}}_{ij n}} \). The personyears from the cases py _{ ij,cases } have to be added. An estimate for the SIR is then given by
Computational aspects
In principle, the personyears contributions for each individual for each calendar year and age class can be calculated separately, and thus the estimate in Eq. (6) can be used directly. Computationally, it is simpler to first calculate the personyears by age group and calendar period under the assumption μ _{ ij } = ν _{ ij } = 0, i.e. for all individuals not diseased the study endpoint is used as end of followup date for all individuals. Common software to calculate personyears is applied here and gives the ‘raw’ personyears which we have denoted as py _{ ij } ^{*} . This also is the upper bound for the total personyears and can be used to calculate the minimal SIR. We apply the above procedure iteratively from the first up to the last year of followup as
similar as in Eq. (5). Here i is the index for age group and 1 the first calendar year to which individual(s) contribute persontimes.
If the population did not undergo aging, the adjusted personyears py in the second calendar year would be
since to the raw py in year 2 contribute individuals who enter the cohort in year 2 and individuals who entered in year 1. For the latter group, the expected loss of py in the first year, \( p{y}_{i1}^{*}{\widehat{py}}_{i1} \), must be taken into account and subtracted from the raw py in year 2. On this adjusted term, \( \left( p{y}_{i2}^{*}\left( p{y}_{i1}^{*}{\widehat{py}}_{i1}\right)\right) \), the rates for the year 2 (1 − γ _{ i2}) must be applied. This procedure continues until the last calendar year is reached for which observation times are recorded.
In the iterative process we must take aging into account, i.e. people move through the underlying age groups. Since mortality rates are commonly provided in five year age groups we use this categorization as well. For a given year, the observation times of a fiveyear age group undergo aging such in expectation that one fifth will move to the next oldest fiveyear age group in the next year. The estimated personyears for the second calendar year j2 of age group i, \( {\widehat{py}}_{i2} \) is therefore the raw personyears, py _{ i2} ^{*} , minus 4/5 of the expected loss in the same age group, minus 1/5 of the expected loss in the next youngest age group, yielding
Finally, for an arbitrary year j we get
We have written an SAS Macro to perform the required calculations (available as Additional file 1). As mentioned before, the exact solution requires the procedure for the nondiseased only and adding the (known) personyears from the diseased (the cases).
Variance estimation and confidence intervals
Standard procedures for SIR confidence intervals consider the observed number of cases O as random variable which is Poisson distributed and the denominator as fixed. Our procedure to estimate the accumulated personyears adds a variance component to the expected number of deaths E, and therefore an additional variance component must be taken into account. In the following we show that this component is so small compared to the component in the numerator that it can be neglected.
The deltamethod asymptotically gives \( V a r\left(\frac{O}{E}\right)=\frac{1}{E^2} V a r\kern0.5em O+\frac{O^2}{E^4} V a r\kern0.5em E \). Since O is Poisson distributed, its variance is equal to the expectation, i.e. Var(O) = E(O). For Var (E) it holds:
The evaluation of \( V a r\Big(\widehat{p{ y}_{ij}\Big)} \) is done under the same assumptions on entry times and ages as before. For the first year (1) and for a single individual n, the variance of its personyear contribution is \( V a r{\overset{\wedge }{\Big( py}}_{i1 n}\Big)=\left(\frac{1}{3}\frac{\gamma_{i1}}{4}\right){\gamma}_{i1} \). This is shown as follows: Let Y denote the random variable denoting the (unknown) persontime for an individual in the first year of observation. Its density function is \( {f}_Y(y)=\left\{\begin{array}{c}\hfill 1{\gamma}_{i1}\hfill \\ {}\hfill {\gamma}_{i1}\hfill \end{array}\right.\kern1em if\kern1em \begin{array}{c}\hfill Y=1\hfill \\ {}\hfill Y\in \left(0,1\right)\hfill \end{array} \) , assuming a uniform distribution given an event (migration or death from other cause) occurred. Then we get expectation E(Y) = 1 × (1 − γ _{ i1}) + 0.5 × γ _{ i1} = 1 − 0.5γ _{ i1}. We are interested in the variance of Y. We use the relation Var(Y) = E(Y ^{2}) − [E(Y)]^{2}. For f _{ Z }(y), Z := Y ^{2} we get \( {f}_Z(z)=\left\{\begin{array}{c}\hfill 1{\gamma}_{i1}\hfill \\ {}\hfill 1/\left(2\sqrt{z}\right)\hfill \end{array}\right.\kern1em if\kern1em \begin{array}{c}\hfill Z=1\hfill \\ {}\hfill Z\in \left(0,1\right)\hfill \end{array} \) and then after basic probability theory E(Z) = (1 − γ _{ i1}) + γ _{ i1}/3 and thus Var(Y) = (1 − γ _{ i1}) + γ _{ i1}/3 − (1 − γ _{ i1}/2)^{2} = γ _{ i1}/3 − γ _{ i1} ^{2}/4 = γ _{ i1}(1/3 − γ _{ i1}/4).
For small rates, this is smaller than the binomial variance γ _{ i1}(1 − γ _{ i1})by a factor of about 3. This follows intuitively since individuals with an event contribute between zero and one personyears in the year of the event, not zero. Then, for all individuals in this year and age group i, we have \( V a r{\overset{\wedge }{\Big( py}}_{i1}\left)={\overset{\wedge }{\Big( py}}_{i1}\right)\left(\frac{1}{3}\frac{\gamma_{i1}}{4}\right){\gamma}_{i1} \) . For the second year, we have for a single individual \( V a r{\overset{\wedge }{\Big( py}}_{i2 n}\Big)=\left(1{\gamma}_{i2}\right){\gamma}_{i2}+\left(\frac{1}{3}\frac{\gamma_{i2}}{4}\right){\gamma}_{i2} \) . In order to show that the variance component from the estimated personyears is negligible it is sufficient to consider an upper bound of the total variance of the expected number of cases, E. and for all individuals in the second year and age group i, we have as an upper bound \( V a r{\overset{\wedge }{\Big( py}}_{i2}\left)={\overset{\wedge }{\Big( py}}_{i2}\right)\left(1+{0.5}^2\right)\left(1{\gamma}_{i2}\right){\gamma}_{i2} \)
Summing up, we have as an upper bound
since \( \overset{\wedge }{\lambda_{ij} p{y}_{ij}}={E}_{ij} \),the expected number of cases in age group × calendar period (i, j), we write
Therefore under the rare disease assumption λ _{ ij } and γ _{ ij }small, the variance of the denominator is several magnitudes smaller than the variance of the numerator. Additional aspects on variance estimation can be found in Additional file 2. The appropriateness of this method has additionally been checked by simulation using the numerical example below. The evaluation shows that the value resulting from Eq. (13) is very low, and that the variance is very small compared to the variance of the numerator. Therefore, it can be neglected and standard methods to calculate the confidence interval can be used.
Sensitivity analysis
The method developed provides an estimate for the personyears and hence for the SIR if the mortality rates and the migration rates are applicable to the cohort. Since the question may arise whether the mortality rates from the underlying population are applicable to the cohort and whether the migration rates are reliably estimated, the question of bias may be more relevant than the additional variance induced by the personyears estimation. If γ _{ ij } is underestimated, the personyears are overestimated, up to the maximum when γ _{ ij } is zero. Conversely, if γ _{ ij } is overestimated, the personyears are underestimated. We recommend to perform a detailed sensitivity analysis by estimating the SIR with reasonably modified migration and mortality rates (see example below). As will be seen in the study example (section 4), the SIR estimate is relatively robust against violations of the assumptions. If migration rates are reliably estimated, however subject to random variation, an additional variance component can easily be incorporated (see numerical example below).
Results
A numerical example
A numerical example is given in Table 1 to show the effect of the estimation procedure on the personyears allocation in the (calendar year) × (age) crossclassification. We consider one single individual n which entered the cohort on January 1st, 1990. This person did not become an incident case during the followup period. Therefore, death from another cause or outmigration out of the catchment area of the registry could have occurred at any date during the followup period with death rates μ _{ ij } and migration rates ν _{ ij }, depending on calendar year and age. We assume different migration and mortality rates in different age groups. According to this example, this person with an unknown followup status contributes an expected value of py _{ n1} =0.992 personyears to the first year of followup, py _{ n2} =0.978 personyears to the second, and py _{ n15} =0.816 personyears to the last year of followup instead of a maximal value of one personyears per year. The expected total number of personyears for this person is 13.487 instead of a maximal possible value of 15 years. These are realistic rates and yield a reduction of personyears of 10.1%. If γ _{ ij } are assumed fixed, the Variance of py _{(n)(1)} is approximately 0.5^{2} γ _{ ij }(1 − γ _{ ij })= 0.004, the variance of py _{ n2} and py _{ n3} is approximately (1 + 0.5^{2})γ _{ ij }=0.019, and (2 + 0.5^{2})γ _{ ij }=0.035, and so on. The variance of the total personyears estimate for this individual is 1.335. If γ _{ ij } are random with variance var(γ _{ ij }) = σ _{ γ } ^{2} , we get var(py _{ n1})= γ _{ ij }(1 − γ _{ ij }) + σ _{ γ } ^{2} . The variance component from the rates is here assumed as σ _{ γ } ^{2} = 0.05^{2}. Then the variance of the total personyears estimate for this individual is 1.667, an increase of 22%, however, still so small that it can be neglected.
It would be incorrect to assign to this individual 13.429 personyears from the begin of followup since this would wrongly yield to a fixed losstofollowup date in the year 2003, and hence full years of observation for all previous years, and zero personyears for the year 2004.
In Additional file 3 we expand this numerical example by assuming that we have a cohort of 1000 individuals with the same characteristics as the individual above. It shows numerically that the variance component induced by the personyears estimate in the denominator of the SIR is small compared to the variance component in the numerator so that it can be neglected. The sensitivity analysis for this dataset used rates with factors 1.5 and 0.75 yielded a bias in the SIR of only 7%. It shows that the procedure is relatively robust against violations in the rate estimates.
Simulation study
A simulation study has been performed to investigate the performance of the estimation procedure. The setup of the simulation is as follows: A cohort of size N = 1000 was simulated with parameters as in the preceding numerical example. For each dataset, the number of observable events (incident cases) and unobservable losses (deaths from other causes or outmigration) and the corresponding failure time of the events and time of loss was simulated. The exact number of personyears was recorded and compared with the estimated number of personyears according to the above procedure. The main parameter for comparison was the relative difference of observed and exact personyears, i.e. \( {\varDelta}_{p y\_ rel}=\frac{p{y}_{exact} p{y}_{estimated}}{p{y}_{exact}} \).
According to the chosen parameters we observed between 3 and 23 events (median 11, mean 10.75) and between 145 and 218 losses from deaths or outmigration within the 15 year observation period (median 179, mean 179.7). The simulation results regarding the observed and estimated py as well as the absolute and relative difference and number of events are given in Table 2. We observe a slight underestimation of the true personyears of 0.4% with an empirical 90% confidence interval as (−1.0% – 1.9%). We conclude that the procedure is sufficiently accurate. In comparison we would find a mean overestimation of the personyears of 10.7% if all individuals are wrongly assumed to remain under risk.
The simulation program was written in SAS and is listed in the supplemental material.
Application
A cohort study was performed with 18621 migrants from the Former Soviet Union who immigrated between 1990 and 2005 to the German federal state of Saarland. For every individual name, sex, date of birth, date of immigration (start of the observation period), and first city of residence was available. A complete incidence followup was done by linking the cohort with the Saarland cancer registry which is a high quality register for this federal state [5]. 470 incident cancer cases (except nonmelanoma skin cancer) were identified. Mortality followup and tracing of individuals who relocated from the state was not performed.
Age, sex and calendaryear specific “raw” personyears were calculated at the end of the observation period (31^{st} December 2005) by calendar year and 5year age groups. For individuals with cancer diagnosis, the date of diagnoses was set as the endpoint of their observation time. A summarized result is given in Table 3.
Adjustment of the personyears was done according to Eqs. (1) and (2). The German mortality rates for all causes of death except cancer within the cohort (μ_{ij}) were used. These were obtained from WHO mortality database [6]. The migration rates (ν_{ij}) were taken from a partial followup and from a similar study on migrants in another federal state [7]. This is considered the most appropriate model. The rates are given in Table 4 (model 1). The cancer incidence rates λ_{ij} were taken from the Saarland cancer registry.
Table 5 gives the “raw” and adjusted personyears for the 5year age groups and calendar periods for the cohort. The overall cancer SIR is 1.11 in males and 1.01 in females. Using the “raw” personyears for calculating E, the SIR would be 0.90 and 0.86. The difference between “raw” and estimated total personyears was 18959.7 personyears, or 11.3%.
The sensitivity analysis is done by estimating the number of personyears with different assumptions on migration rates. Model 2 uses halved and model 3 doubled rates (see Table 4). Results of the estimation procedure in terms of personyears and expected number of cancer cases are shown in Table 5 for selected cancer sites based on Saarland incidence rates [8]. Relatively little variation between the different models is observed. The biggest difference to model 1 for the expected number of cases is seen for model 2 in all female cancers with 22.3 cases. For other cancer sites the maximum differences (found between models 1 and 2) in incident cancer cases are 4.3 and 6.8, respectively. Sensitivity analysis demonstrates that estimated expected number of cancer cases and therefore corresponding SIR are quite robust with regard to the underlying assumptions. SIR for different cancer sites according to different models are also shown in Table 5 with the respective standard 95% confidence intervals. Model 4 gives the lower limit of the SIR estimates using the raw personyears for calculating the expected number of cases. On average, the expected number of cases are about 25% higher in this model.
Discussion
We have presented a method to estimate the standardized incidence ratio in a cohort study for which a linkage with a disease registry has been performed, however, for which a mortality followup including tracing of individuals which moved out was not directly possible. This situation is not a rare one: cancer registries exist in many parts of the world, and disease registries for other diseases, such as diabetes, stroke, and others have been developed, e.g. [9, 10]. A straightforward mortality followup is possible only in countries with a death register or comparable data resources, such as for example the US, Great Britain and the Scandinavian countries. For many other countries, such as Germany, Switzerland, France, Italy, Spain and others the method presented here is an alternative to the sometimes heavily bureaucratic mortality followup procedures. Epidemiologists use these registries to assess disease incidence, and a linkage of cohorts with these registries is technically straightforward. We showed that under quite robust assumptions it is possible to estimate the accumulated number of personyears, and provided a method to calculate the SIR and its 95% confidence intervals taking into account the estimation procedure.
In our example, we present the analysis of cancer incidence in a cohort by making assumptions on mortality from other causes and on migration. First, the same mortality as the German population within the cohort was taken. Results from a similar cohort show that this assumption may be reasonable [4]. A slight underestimate of the number of personyears is possible, though. The second assumption refers to migration. There are good estimates on migration by age group from other cohort studies in Germany, and also on migrants from the former Soviet Union who migrated to another state in Germany. We think these rates can be applied here, however, some uncertainty will remain. A small subset of the cohort (about 20%) has been followedup by now, and results support our assumption.
Here, we have considered cancer as the disease of interest which can be regarded as a rare disease in terms of the yearly incidence rates. If the disease is common, then the time from begin of observation until disease onset is shorter, and then for more individuals the raw persontime is the correct persontime. Consequently, the bias when using the raw personyears becomes smaller.
Our sensitivity analyses were reassuring in that respect. SIR estimates showed relatively little variation with modified rate assumptions. Still, a bias in the estimated expected value of cases is possible, and this appears to play a bigger role than the increase of variance by the estimation procedure. We showed that the additional variance induced by variation in the denominator is small relative to the variation in the number of observed cases. This allows using standard method of inference, in particular the application of exact confidence intervals if the observed number of cases is small.
The completeness of the registry is another relevant issue. In the case of the cancer registry in the Saarland, the percentage of DCO (death certificate only) cases in the registry is very low, and all those cases within the cohort are likely to have been identified with the linkage procedure.
Conclusions
We showed that valid estimation of the observation time is possible for cohort studies with incomplete followup data. This is especially relevant for cohort studies where a complete assessment of the individual vital status is not easily possible but which can be linked to disease registries in order to assess the number of incident cases within the cohort.
Abbreviations
 py:

personyears
 RR:

Rate ratio
 SIR:

Standardized incidence ratio
 SMR:

Standardized mortality ratio
References
 1.
International Agency for Research on Cancer. Statistical methods in cancer research Vol II. The Design and Analysis of Cohort Studies. Lyon: International Agency for Research on Cancer; 1980.
 2.
Silcocks P. Estimating confidence limits on a standardised mortality ratio when the expected number is not error free. J Epidemiol Community Health. 1994;48(3):313–7.
 3.
Beyene J, Moineddin R. Methods for confidence interval estimation of a ratio parameter with application to location quotients. BMC Med Res Methodol. 2005;5:32.
 4.
Winkler V, Ott JJ, Holleczek B, Stegmaier C, Becher H. Cancer profile of migrants from the Former Soviet Union in Germany: incidence and mortality. Cancer Causes Control. 2009;20(10):1873–9.
 5.
Ziegler H, Strabenow R, Holleczek B, Stegmaier C. Krebs im Saarland. Saarbrücken: Saarland cancer registry; 2009.
 6.
WHO: WHO Mortality Database. http://www.who.int/healthinfo/mortality_data/en/. Accessed 4 Mar 2017
 7.
Becher H, Razum O, Kyobutungi C, Laki J, Ott J, Ronellenfitsch U, Winkler V. Mortalität von Aussiedlern aus der ehemaligen Sowjetunion  Ergebnisse einer Kohortenstudie. Dtsch Arztebl. 2007;23(104):1655–62.
 8.
Cancer Registry Saarland (Krebsregister Saarland) http://www.krebsregister.saarland.de/datenbank/datenbank.html. Accessed 4 Mar 2017.
 9.
Benatru I, Rouaud O, Durier J, Contegal F, Couvreur G, Bejot Y, Osseby GV, Ben Salem D, Ricolfi F, Moreau T, et al. Stable stroke incidence rates but improved casefatality in Dijon, France, from 1985 to 2004. Stroke. 2006;37(7):1674–9.
 10.
Udayaraj U, Tomson CR, Gilg J, Ansell D, Fogarty D. UK Renal Registry 11th Annual Report (December 2008): Chapter 6 Comorbidities and current smoking status amongst patients starting renal replacement therapy in England, Wales and Northern Ireland: national and centrespecific analyses. Nephron Clin Pract. 2009;111(Suppl 1):c97–111.
Acknowledgements
We acknowledge financial support by Deutsche Forschungsgemeinschaft and RuprechtKarlsUniversität Heidelberg within the funding programme Open Access Publishing.
Funding
The study was funded by a grant of the German Research Foundation, Graduate College 793: “Epidemiology of communicable and chronic noncommunicable diseases and their interrelationships”.
Availability of data and materials
The authors are owners of the data of the cohort study. The data is available upon request from the corresponding author of this publication.
Authors’ contributions
HB conceived the study, developed the method, performed the simulation study and drafted the manuscript. VW performed some of the calculations, developed the figure and contributed to writing. Both authors read and approved the final manuscript.
Authors’ information
HB is the head of Institute for Medical Biometry and Epidemiology at the University Medical Center HamburgEppendorf. VW is the deputy head of the Unit Epidemiology and Biostatistics at the Institute of Public Health of the University of Heidelberg.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
All analyses were entirely based on secondary data only.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Additional files
Additional file 1:
py estimation macro for SAS. (DOCX 17 kb)
Additional file 2:
Addition aspects on variance estimation. (DOCX 21 kb)
Additional file 3:
Numerical example showing py and variance estimation. (DOCX 37 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Epidemiological methods
 Cohort studies
 Incidence
 Missing data
 Standardized incidence ratio