Adaptive list sequential sampling method for population-based observational studies
© Hof et al.; licensee BioMed Central Ltd. 2014
Received: 14 March 2014
Accepted: 12 June 2014
Published: 25 June 2014
In population-based observational studies, non-participation and delayed response to the invitation to participate are complications that often arise during the recruitment of a sample. If they are not properly dealt with, the composition of the sample can differ from the desired composition. Inviting too many or too few individuals from a particular subgroup could lead to unnecessary costs or decreased precision. Another problem is that there is frequently no or only partial information available about the willingness to participate. In this situation, we cannot adjust the recruitment procedure for non-participation before the recruitment period starts.
We have developed an adaptive list sequential sampling method that can deal with unknown participation probabilities and delayed responses to the invitation to participate in the study. In a sequential way, we evaluate whether we should invite a person from the population or not. During this evaluation, we correct for the fact that this person could decline to participate using an estimated participation probability. We use the information from all previously invited persons to estimate the participation probabilities for the non-evaluated individuals.
The simulations showed that the adaptive list sequential sampling method can be used to estimate the participation probability during the recruitment period, and that it can successfully recruit a sample with a specific composition.
The adaptive list sequential sampling method can successfully recruit a sample with a specific desired composition when we have partial or no information about the willingness to participate before we start the recruitment period and when individuals may have a delayed response to the invitation.
Keywords: List sequential sampling; Sample representativeness; πps sample; Population-based observational studies
Population-based observational studies are frequently used to measure the prevalence of characteristics such as diseases by means of a sample from a population. Two important problems that arise when a sample is recruited are that (i) not everyone in the population has the same willingness to participate in the study [2–6], and (ii) after inviting an individual, it might take some time before we receive a response.
Variation in the willingness to participate may bias the results of the study. To deal with this problem, we could invite more individuals from groups with a low willingness to participate. However, this approach requires that the participation probability per person or group is known before the sampling procedure starts. Unfortunately, this detailed knowledge of the willingness to participate among subgroups in the population is often not available. If the willingness to participate is lower than assumed, we will invite too few individuals, which leads to a sample that is too small and to decreased precision. On the other hand, inviting too many individuals will lead to extra costs. Generally, we invite too many individuals when we underestimate the willingness to participate and there is delayed response to the invitation. In general, not accounting for delayed response will lead to an unexpected number of extra individuals in the sample at the end of the recruitment period.
An example of a complex sampling problem is observed in the HELIUS study. One objective of the HELIUS study is to measure ethnic inequalities in the incidence and prognosis of major diseases in the population of Amsterdam. The desired sample should have approximately 5000 individuals in each ethnic group, and should be representative of the population of Amsterdam. This is achieved by stratifying on the auxiliary variables place of residence (spatial), age (continuous), gender (categorical), and socioeconomic status (categorical), which are available from municipal registries.
Unfortunately, it is not straightforward to implement stratification when we have a large number of auxiliary variables of mixed types. In this case, strata that are too small or even empty might be obtained when we cross all strata from all variables. An alternative variance reduction technique, proposed by Grafström et al., is to obtain a well spread set of participants [10, 11]. Basically, a set of participants is well spread when the number of participants is close to what is expected on average, for every set of auxiliary variables. Grafström et al. showed that the variance of commonly used estimators is usually low with a well spread set of participants.
In this paper, we use the list sequential method developed by Bondesson and Thorburn to obtain a well spread set of participants without replacement from a finite population. Instead of trying to cross all strata from all auxiliary variables, our approach is based on a distance function between individuals. Similar or almost similar individuals should seldom both be invited to participate in the study. In its current form, the list sequential sampling method cannot be used to recruit sets of participants for population-based observational studies because it assumes that (i) everyone participates in the study and that (ii) there is no delayed response to the invitation.
We developed approaches to correct for non-participation and delayed response to the invitation when we use a list sequential sampling method. The list sequential sampling method evaluates individuals from the population in a sequential order, and uses a random process to decide whether or not an individual should be invited to participate in the study. In this decision we have to correct for any non-participation. One approach is to weight the probability of being invited with the (estimated) participation probability. When there is no or only partial a priori knowledge of the participation probability, we can estimate this probability during the recruitment period using the information from already invited individuals. To combine prior information with information that is generated during the recruitment period, we developed a Bayesian approach to estimate the participation probabilities. Moreover, to deal with the delayed response to the invitation, we use the expected response of an individual when we have no answer yet.
We performed a simulation study to illustrate the performance of the adapted list sequential sampling method, when we have unknown heterogeneous participation probabilities and delayed response to the invitation.
We consider a finite population $D$ containing $n$ individuals, where each individual $i$ is described by a vector $x_i$ of auxiliary variables. The auxiliary variables $x_i$ are known for each individual before the recruitment period starts. Usually $x_i$ is available from municipal or national person registries. Examples of these variables are gender, age, place of residence, and socioeconomic status. In addition to $x_i$, each individual $i$ has an unobserved outcome of interest $y_i$. The goal of this paper is to obtain a sample of size $m$ ($m<n$) from $D$, in which we can observe $y_i$.
A sample is described by the vector $s=(s_1,\ldots,s_n)$, where $s_i$ takes the value 1 if individual $i$ is in the sample and 0 otherwise [13, 14]. With this representation there are $2^n$ possible samples. Before the recruitment period starts we need to determine $\pi_i$, which is the probability that individual $i$ is included in $s$ (i.e. $p(s_i=1)=\pi_i$). We want to recruit a sample of $m$ individuals and therefore $\sum_{i=1}^{n}\pi_i=m$, where $m$ is a positive integer.
Different choices can be made for the inclusion probabilities $\pi_i$. For instance, we can assign equal inclusion probabilities to all individuals, i.e. $\pi_i=m/n$. In this case, the sample $s$ is expected to be a 'miniature' version of the population $D$, because we expect $s$ to have approximately the same composition of auxiliary characteristics as $D$. In this case, the sample is referred to as a representative sample. However, $\pi_i$ is frequently chosen to be proportional to $x_i$. For example, by oversampling a rare subgroup we could increase the precision of the result for that particular subgroup.
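As a small illustration of the two choices of inclusion probabilities described above, the following sketch computes equal probabilities $\pi_i=m/n$ and probabilities proportional to a positive size variable, both scaled to sum to $m$. The function names and the size variable are hypothetical and only serve the example; the proportional version simply refuses inputs that would give a probability above one rather than redistributing the excess.

```python
import numpy as np

def equal_inclusion_probabilities(n, m):
    """Equal probabilities pi_i = m / n for all n individuals."""
    return np.full(n, m / n)

def proportional_inclusion_probabilities(size, m):
    """Probabilities proportional to a positive size variable, summing to m.

    `size` is a hypothetical auxiliary variable used for illustration.
    """
    size = np.asarray(size, dtype=float)
    pi = m * size / size.sum()
    if np.any(pi > 1.0):
        raise ValueError("some inclusion probabilities exceed 1; lower m or cap sizes")
    return pi

pi_equal = equal_inclusion_probabilities(4000, 400)
pi_prop = proportional_inclusion_probabilities([1, 2, 3, 4], 2)
print(pi_equal.sum(), pi_prop)
```

In both cases the probabilities sum to the desired sample size, which is the constraint stated in the text.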
List sequential sampling method
To obtain the sample we use the list sequential method based on sampling without replacement developed by Bondesson and Thorburn. To illustrate the list sequential method, we first consider the situation in which all invited individuals will participate in the study.
Within these bounds, we can impose different restrictions on the weights, resulting in samples with certain characteristics. Generally, with a positive weight we have $\mathrm{corr}(s_i=1,s_j=1)<0$ (i.e. a negative correlation between the sampling indicators of individuals $i$ and $j$), whereas with a negative weight we have $\mathrm{corr}(s_i=1,s_j=1)>0$. For more detail about the list sequential method, we refer the reader to theorem 1 and remark 1, respectively, from Bondesson and Thorburn.
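The mechanics of a list sequential draw can be sketched as follows. The sketch assumes the updating rule usually attributed to Bondesson and Thorburn, in which the probability of a later individual $j$ is changed by $-(s_i-\pi_i)\,w$ after individual $i$ is evaluated, and it assumes that the bounds on $w$ are exactly those that keep the updated probability inside $[0,1]$. The function names and the `preliminary_weight` callback are hypothetical, and the double loop is written for clarity rather than speed.

```python
import numpy as np

def weight_bounds(pi_i, pi_j):
    """Weights that keep pi_j - (s_i - pi_i) * w in [0, 1] for s_i in {0, 1}."""
    lower = -min(pi_j / pi_i, (1.0 - pi_j) / (1.0 - pi_i))
    upper = min(pi_j / (1.0 - pi_i), (1.0 - pi_j) / pi_i)
    return lower, upper

def list_sequential_sample(pi, preliminary_weight, rng):
    """Evaluate individuals in list order and update later probabilities.

    preliminary_weight(i, j) is a user-supplied (hypothetical) function; its
    value is clipped to the admissible range before it is used.
    """
    pi = np.array(pi, dtype=float)
    n = len(pi)
    s = np.zeros(n, dtype=int)
    for i in range(n):
        p_i = min(max(pi[i], 0.0), 1.0)
        s[i] = int(rng.random() < p_i)
        if p_i <= 0.0 or p_i >= 1.0:
            continue  # outcome was certain, so nothing needs to be updated
        for j in range(i + 1, n):
            lower, upper = weight_bounds(p_i, pi[j])
            w = min(max(preliminary_weight(i, j), lower), upper)
            # guard against rounding error pushing the value outside [0, 1]
            pi[j] = min(max(pi[j] - (s[i] - p_i) * w, 0.0), 1.0)
    return s

rng = np.random.default_rng(0)
# all weight on the next individual in the list: neighbours repel each other
s = list_sequential_sample(np.full(20, 0.25), lambda i, j: 1.0 if j == i + 1 else 0.0, rng)
print(s)
```

With a positive weight on the next individual only, selecting individual $i$ lowers the probability of individual $i+1$, which is the negative correlation between sampling indicators described above.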
Well spread samples
We are interested in recruiting a well spread sample with the list sequential sampling method. Usually, a well spread sample leads to parameter estimates with low variances. Before we can introduce the definition of a well spread sample, we require the concept of coherent subsets. Let $d(i,k)$ be the distance between individuals $i$ and $k$. A subset $D'$ from the population $D$ is coherent if the following holds. First, let some individual $i\in D'$. Individual $k$ is included in $D'$ if and only if $d(i,k)\le r$, where $r\ge 0$. Consequently, $D'$ can be constructed by including all individuals within a ball of radius $r$ around individual $i$.
A smaller distance to individual $i$ increases the probability of being included in the coherent subset $D'$. To satisfy (3), it is clear that the inclusion probability of individual $i$ should be more influenced by the sampling indicators $s$ of individuals at a smaller distance. We propose to measure the distance between individuals with the auxiliary variables $x$, where $d(x_i,x_k)$ is the distance between individuals $i$ and $k$. Based on the types of auxiliary variables, we can choose, for instance, the Mahalanobis or the Manhattan distance.
To obtain a well spread sample with the list sequential sampling method, we will use preliminary weights which are specified before the recruitment period starts. The preliminary weight reflects the effect of $s_k$ from individual $k$ on the inclusion probability of individual $i$. The weights are referred to as preliminary because the upper bound from (2) has an effect on the conditional inclusion probabilities.
where $\mu$ and $\lambda\le 0$ are arbitrarily chosen weights. The sampling indicator $s_k$ of individual $k$ has a larger effect on individuals at a smaller distance and a smaller effect on individuals further away. To recruit a set of approximately $m$ individuals, we restrict the weights to satisfy .
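The two distance measures named above can be computed as in the following minimal sketch. The helper names are hypothetical, and the covariance matrix passed to the Mahalanobis distance is assumed to be estimated from the auxiliary variables of the population (the text does not say how it is obtained).

```python
import numpy as np

def manhattan_distance(x_i, x_k):
    """Sum of absolute coordinate differences between two individuals."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_k, dtype=float)
    return float(np.sum(np.abs(diff)))

def mahalanobis_distance(x_i, x_k, cov):
    """Distance that rescales the differences by a covariance matrix."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_k, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

print(manhattan_distance([0, 0], [1, 2]))
print(mahalanobis_distance([0, 0], [3, 4], np.eye(2)))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check.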
Heterogeneous participation probabilities
A problem of sampling from population $D$ is that individuals who are invited to participate in the study can decline the invitation. Let $b=(b_1,\ldots,b_n)$ be the vector that indicates whether an individual $i$ is invited to participate ($b_i=1$) or not ($b_i=0$). When individual $i$ refuses to participate in the study, we have $s_i=0$ and we do not observe $y_i$. Let $\phi=(\phi_1,\ldots,\phi_n)$ be the vector that contains the participation probability per person in the population, where $\phi_i=p(s_i=1\mid b_i=1)$. Note that when every invitee participates (i.e. $\phi_i=1$ for $i=1,\ldots,n$), we have $s=b$.
Let $\pi_i/\phi_i$ be the inclusion probability corrected for non-participation, i.e. the probability of being invited to participate in the study for individual $i$ from $D$. When $\phi_i$ is known before the recruitment period starts, non-participation can be dealt with by using $\pi_i/\phi_i$ as the probability to invite individual $i$. Moreover, we can use the updating rule from (1) to update the inclusion probabilities of the non-evaluated individuals $j>i$ after individual $i$ has responded to the invitation. This will give us a sample that approximately satisfies the inclusion probabilities $\pi$.
The following small sampling problem illustrates this modification. Consider that, for the first individual, we have $\pi_1=0.25$ and $\phi_1=0.5$. The probability to invite this individual is therefore $\pi_1/\phi_1=0.5$. Using this strategy there might be some individuals $i$ with $\pi_i/\phi_i>1$. This means that the participation probability of individual $i$ is too low with respect to $\pi_i$; the desired probability to be included in $s$ for individual $i$ cannot be reached. For instance, this would happen in the example above for individual 1 when $\phi_1=0.1$ and consequently $\pi_1/\phi_1=2.5$. This means that we would have to invite individual 1 two and a half times to satisfy $\pi_1$. Because we can only invite an individual once, we restrict all values $\pi_i/\phi_i$ to be one or lower.
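The arithmetic of this example can be written as a one-line rule. The sketch below assumes a desired inclusion probability of 0.25 for the first individual, a value chosen because it agrees with the "two and a half times" in the text when the participation probability is 0.1; the function name is hypothetical.

```python
def invitation_probability(pi_i, phi_i):
    """Probability of inviting individual i: pi_i / phi_i, capped at one."""
    if not 0.0 < phi_i <= 1.0:
        raise ValueError("participation probability must lie in (0, 1]")
    return min(pi_i / phi_i, 1.0)

# participation probability 0.5: invite with probability 0.25 / 0.5 = 0.5
print(invitation_probability(0.25, 0.5))
# participation probability 0.1: 0.25 / 0.1 = 2.5 is not a probability, so cap at 1
print(invitation_probability(0.25, 0.1))
```

When the cap is active, the individual is invited with certainty and the desired inclusion probability is not reached for that individual.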
Adaptive list sequential sampling method
Usually, $\phi_i$ is not known before the recruitment period starts. In this section we suggest how $\phi_i$ can be estimated adaptively during the recruitment period. In addition, we consider delayed response to the invitation.
For each individual, we have some knowledge about the willingness to participate before the recruitment period starts. For example, we might have participation estimates from a small pilot study or from previously performed studies. In addition, information from the invited individuals becomes available during the recruitment period. Therefore, we propose to use a Bayesian method to estimate the participation probability of individual i during the recruitment period, in which we use both the available prior knowledge and the information that becomes available during the recruitment period.
where $\alpha$ is the intercept term, and $f()$ is a function of the observed characteristics $z_i$ and the regression weights $\beta$. Because more information becomes available during the recruitment period, the participation probability estimates become more accurate. After the evaluation of individual $i$, we obtain a vector of estimated participation probabilities for all $n$ individuals. We then adapt the inclusion probabilities by dividing them by these estimated participation probabilities.
After an invitation has been sent to an individual, it might take some time to get a response. We use an indicator of whether individual $j$ has responded to the invitation before individual $i$ is evaluated: it equals 1 when we observe $s_j$ and 0 when we do not observe the participation indicator $s_j$ during the evaluation of individual $i$. Note that when individual $j$ has not been invited (i.e. $b_j=0$), $s_j=0$ since individual $j$ is not included in the set of participants. A problem of delayed response is that we cannot use the update rule from (1) to determine the updated inclusion probabilities when the participation indicator of a previous individual is not observed. Consequently, the inclusion probabilities cannot be fully updated, which means that our sampling method is less successful in recruiting a well spread sample. As a solution, we propose to use the data from all previously invited individuals, and replace the non-observed participation indicators with their estimated expected value. We use this approach in step 1 of the adaptive list sequential sampling method listed below.
To estimate the participation probabilities, we can use quadrature or MCMC methods. The values of $\theta$ depend on the amount of prior knowledge that is available before the recruitment period starts. For instance, we can assume that $(\alpha,\beta)$ is sampled from some flat distribution with large variance when no prior knowledge is available.
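The replacement of non-observed participation indicators by their estimated expected value can be sketched as follows. All array names are hypothetical: `invited` corresponds to $b$, `responded` marks invitees whose answer has arrived, `participated` holds the observed answers, and `phi_hat` holds the current estimates of the participation probabilities.

```python
import numpy as np

def effective_indicators(invited, responded, participated, phi_hat):
    """Participation indicators with pending answers replaced by phi_hat.

    Not invited -> 0, answered -> observed answer, pending -> estimated
    participation probability (the expected value of the indicator).
    """
    invited = np.asarray(invited, dtype=bool)
    responded = np.asarray(responded, dtype=bool)
    participated = np.asarray(participated, dtype=float)
    phi_hat = np.asarray(phi_hat, dtype=float)

    s_eff = np.zeros(invited.shape[0])
    answered = invited & responded
    pending = invited & ~responded
    s_eff[answered] = participated[answered]
    s_eff[pending] = phi_hat[pending]
    return s_eff

print(effective_indicators([1, 1, 0, 1], [1, 0, 0, 1], [1, 0, 0, 0], [0.6, 0.7, 0.5, 0.4]))
```

In this toy call the second individual has been invited but has not yet answered, so the estimate 0.7 stands in for the missing indicator.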
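As a minimal sketch of the quadrature option, the following code computes the posterior mean of the participation probability for a deliberately simplified model with an intercept only, i.e. $\mathrm{invlogit}(\alpha)$ with a normal prior on $\alpha$. The prior standard deviation of 10, the integration range, the grid size and the function name are assumptions made for the illustration; a model with covariates would need a multivariate grid or MCMC instead.

```python
import numpy as np

def posterior_mean_participation(n_yes, n_no, prior_sd=10.0, grid_size=4001):
    """Posterior mean of invlogit(alpha) by a simple grid quadrature.

    n_yes and n_no are the numbers of invitees who accepted and declined.
    """
    alpha = np.linspace(-40.0, 40.0, grid_size)
    log_prior = -0.5 * (alpha / prior_sd) ** 2
    # log invlogit(a) = -log(1 + exp(-a)); log(1 - invlogit(a)) = -log(1 + exp(a))
    log_lik = -n_yes * np.logaddexp(0.0, -alpha) - n_no * np.logaddexp(0.0, alpha)
    log_post = log_prior + log_lik
    weights = np.exp(log_post - log_post.max())  # subtract max for stability
    weights /= weights.sum()
    phi = 1.0 / (1.0 + np.exp(-alpha))
    return float(np.sum(weights * phi))

print(posterior_mean_participation(0, 0))    # no responses observed yet
print(posterior_mean_participation(80, 20))  # after 100 observed responses
```

Before any response is observed, the symmetric prior gives an estimate of one half; as responses accumulate, the estimate moves towards the observed participation rate.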
We illustrated the performance of the adaptive list sequential sampling method with two simulations. In these two simulations, we created populations with unknown heterogeneous willingness to participate and delayed response to the invitation. The first simulation was focused on recruiting a well spread, representative set of participants. In the second simulation, we investigated stratified sampling from a population in which some subgroups were over-represented.
Consider a population $D$ of size $n=4000$ from which we drew a random sample without replacement of size $m=400$ with the adaptive list sequential sampling method. To recruit a representative sample from the population, we assigned equal inclusion probabilities to all individuals from the population; i.e. $\pi_i=m/n=0.1$ for $i=1,\ldots,n$. When the sample is well spread, the distribution of the auxiliary characteristics $x$ should be approximately similar in the population and the sample.
The data were generated as follows. The vector $z_i$ was drawn from a multivariate normal distribution with means zero and covariances zero. The probability of positively responding to the invitation was $p(s_i=1\mid b_i=1,z_i)=\mathrm{invlogit}[\alpha+z_i\beta]$, where invlogit denotes the inverse logit transformation, $\alpha=1$, and $\beta=(0.3,-0.7,0.1,0.4)$. The response was drawn from a Bernoulli distribution with probability $p(s_i=1\mid b_i=1,z_i)$. In addition, for individual $i$, delayed response to the invitation was simulated by drawing a time $t_i$ from a Poisson distribution with expectation 15. Individual $i$ responded to the invitation after the evaluation of individual $i+t_i$. Thus if $t_i=0$, individual $i$ responded immediately to the invitation.
We used an estimated participation probability to deal with non-participation. Two different approaches to estimate the participation probability were evaluated. The first approach was to use all available data to estimate the participation probability, i.e. a logistic model with intercept $\alpha$ and regression weights $\beta$ for $z_i$. With the second approach, we assumed that $z_i$ had no impact on the participation probability, i.e. a logistic model with the intercept $\alpha$ only. The second approach was used to investigate whether misspecifying the participation probability model had a large impact on how well the sample was spread.
We assumed that we had no prior knowledge about the participation probability before the recruitment period started. Therefore flat, non-informative priors were used for $\alpha$ and all regression weights $\beta$ by assuming they followed normal distributions with mean zero and variance 100. Because we assumed zero means, the initial estimated participation probabilities were 50%, i.e. 0.5 for $i=1,\ldots,n$.
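The data-generating process described above can be sketched as follows. The text gives means and covariances of zero but not the variances, so unit variances are assumed here; the function name, the seed and the choice to draw every individual's answer up front (it is only used if the individual is invited) are also assumptions of the sketch.

```python
import numpy as np

def invlogit(u):
    """Inverse logit transformation."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate_population(n=4000, seed=0):
    """Draw covariates, participation probabilities, answers and delays."""
    rng = np.random.default_rng(seed)
    alpha = 1.0
    beta = np.array([0.3, -0.7, 0.1, 0.4])
    z = rng.standard_normal((n, beta.size))  # assumed unit variances, zero covariances
    phi = invlogit(alpha + z @ beta)         # p(s_i = 1 | b_i = 1, z_i)
    would_participate = rng.binomial(1, phi)  # answer if the individual is invited
    delay = rng.poisson(15, size=n)          # t_i: answer arrives after individual i + t_i
    return z, phi, would_participate, delay

z, phi, would_participate, delay = simulate_population()
print(z.shape, round(float(phi.mean()), 3), round(float(delay.mean()), 2))
```

A delay of zero corresponds to an immediate answer, as stated in the text.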
We quantified how well a sample was spread with the following measure based on Voronoi polytopes, suggested by Grafström and Lundström. Let individual $i\in s$, i.e. individual $i$ is included in the set of participants $s$. The Voronoi polytope $v_i$ consists of all individuals $j$ from the population $D$ for which $d(x_i,x_j)\le d(x_k,x_j)$ for all other individuals $k\in s$. Note that when $d(x_i,x_j)=d(x_k,x_j)$, individual $j$ is included in both polytopes $v_i$ and $v_k$, but weighted with 1/2.
where a low $R$ corresponds to a well spread sample. To investigate how well the adaptive list sequential sampling methods performed in recruiting a well spread sample, the simulation was performed 1000 times. We calculated the mean and variance of $R$, and the average number of recruited participants. Note that the best adaptive list sequential sampling method should give us a set of approximately 400 participants with a low $R$ in every simulation.
In simulation 2, we considered a population $D$ of size $n=5000$, in which each individual was described by a categorical auxiliary variable $x_i$ and an unobserved binary outcome of interest $y_i$. The auxiliary variable $x_i$ had five possible values $g$. The main goal of this simulation was to estimate the sum of the outcome $y$ in the population, $\sum_{i=1}^{n}y_i$, with a set of participants in which we can measure $y$. Moreover, we had resources to measure $y$ in a set of participants of size $m=500$. The set of participants was obtained with an adaptive list sequential sampling method where we dealt with non-participation during the recruitment period.
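A spread measure of this kind can be sketched as follows. The exact formula for $R$ is not reproduced in the text, so the sketch assumes the form commonly associated with Grafström and Lundström: sum the inclusion probabilities inside each Voronoi polytope and average the squared deviation of these sums from one over the sampled individuals. The function name, the use of the Manhattan distance and the equal splitting of ties among all tied participants are assumptions.

```python
import numpy as np

def voronoi_spread(x, pi, sample_idx):
    """Mean squared deviation from 1 of the probability mass per polytope."""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    sample_idx = np.asarray(sample_idx)
    centres = x[sample_idx]
    # Manhattan distance from every individual to every sampled individual
    dist = np.abs(x[:, None, :] - centres[None, :, :]).sum(axis=2)
    nearest = np.isclose(dist, dist.min(axis=1, keepdims=True))
    share = nearest / nearest.sum(axis=1, keepdims=True)  # ties are split equally
    v = (share * pi[:, None]).sum(axis=0)  # probability mass in each polytope
    return float(np.mean((v - 1.0) ** 2))

x = np.array([[0.0], [1.0], [2.0], [3.0]])
pi = np.full(4, 0.5)
print(voronoi_spread(x, pi, [1, 2]))  # each polytope holds mass 1
print(voronoi_spread(x, pi, [0, 1]))  # participants clustered at one end
```

In the toy population, a sample whose polytopes each carry a total inclusion probability of one gets the lowest possible value, while a clustered sample gets a higher value.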
where $p(s_i=1\mid b_i=1,x_i=g)$ was the participation probability of individual $i$ given $x_i=g$, i.e. for individual $i$ the probability of participating depended on $x_i$. The response to an invitation was drawn from a Bernoulli distribution with probability $p(s_i=1\mid b_i=1,x_i=g)$. Moreover, .
Note that we could also use stratified sampling to get our desired set of participants because we only have five disjoint groups. However, when we have a large number of groups, stratification becomes impracticable. A large number of groups is no problem for the (adaptive) list sequential sampling design, if it is possible to specify a distance measure between individuals (see (3)). With $\pi^{(0)}$, we expected to have an equal number of individuals for each subgroup $g$ in the set of participants.
where $n_g$ is the number of individuals in group $g$.
where $\beta_g$ is the regression weight for group $g$. Because we assumed we had no a priori information about the participation probabilities, we used non-informative priors for $\beta$ by sampling all five parameters $\beta_g$ from a normal distribution with mean zero and variance 100. For individual $i$, delayed response to the invitation was simulated by drawing a time $t_i$ from a Poisson distribution with expectation 15. Individual $i$ responded to the invitation after the evaluation of individual $i+t_i$.
The simulations were performed 1000 times and we calculated the bias, MSE, and coverage of the estimated population sum of $y$ for both adaptive list sequential methods.
Table: 95% confidence interval of R and the number of participants in simulation 1, per model for the estimated participation probability, for simple random sampling, adjusted sampling 1, and adjusted sampling 2.
Using all the auxiliary characteristics $z_i$ to estimate the participation probability of individual $i$, the simple random sampling approach resulted in a median R of 0.238 (95% confidence interval: 0.192–0.304). The mean number of participants with the simple random sampling approach was about 401 (95% confidence interval: 365–436). For the adjusted sampling 1 approach, approximately similar results were found for R, i.e. on average, the sets of participants obtained with the simple random sampling approach and the adjusted sampling 1 approach were comparable in how well they were spread. With the adjusted sampling 1 approach, the average size of the set of participants was 397 (95% confidence interval: 376–418). However, compared to the simple random sampling approach, the variation in the size of the set of participants was considerably lower with the adjusted sampling 1 approach (standard deviations of 18 and 11, respectively).
On average, a set of participants recruited with the adjusted sampling 2 approach was better spread than with the other two approaches. Not only was the median R lower (0.189), the spread around the median was also smaller than with the other two approaches (95% confidence interval: 0.157–0.225). The mean size of the set of participants with the adjusted sampling 2 approach was 397 (95% confidence interval: 376–418), which was comparable to the adjusted sampling 1 approach.
Interestingly, the performances of all three approaches remained similar when we ignored the auxiliary characteristics $z_i$ in the estimation of the participation probability of individual $i$. Since fitting a model with just an intercept gave comparable results to the more complicated model where we also included $z_i$, the results suggested that the adaptive list sequential sampling method was robust to misspecification of the participation probability model.
In this paper, we developed an adaptive list sequential sampling method for situations in which a random sample from the population is required and the willingness to participate varies between individuals and is not known beforehand. Our adaptive list sequential sampling method requires that the characteristics that are related to the participation probability are known for all individuals. With simulations, we showed that the adaptive list sequential sampling method could successfully deal with unknown heterogeneous participation probabilities.
In our adaptive list sequential sampling method, we evaluate each individual from the population only once. Therefore we only have one opportunity to decide whether to invite an individual or not. When we overestimate the participation probability for all individuals from the population, we end up with a set of participants that is too small. A simple solution for this problem would be to re-evaluate non-invited individuals until the desired number of participants in the study has been reached.
The simulations suggested that the adaptive list sequential sampling method is robust to misspecification of the participation probability model. Just using an intercept term to describe the participation probability seems to work quite well. However, to what extent the adaptive list sequential sampling method can deal with wrong participation probability estimates was not investigated in this paper. In addition, extremely delayed response to the invitation influences the performance of the list sequential sampling method. Further research is necessary to determine in which situations the adaptive list sequential sampling method succeeds and fails to recruit a well spread set of participants.
A problem that was not considered here was the use of multiple invitation techniques in sampling designs. For instance, there could be individuals in the population that have a low willingness to participate when they are invited by a letter, but a much larger willingness when invited by telephone. Our method can be adapted by introducing multiple participation probabilities, i.e. by extending step 3 of our algorithm and estimating multiple logistic regression participation probabilities.
We showed that correcting for heterogeneity in the participation probability during the recruitment period is an effective approach when we have no or partial knowledge on the willingness to participate in population studies. By inviting individuals from the population in stages, the participation probability can be estimated and used in the sampling procedure.
1. Rothman KJ: Epidemiology: An Introduction. 2002, Oxford: Oxford University Press
2. Kruskal W, Mosteller F: Representative sampling, ii: Scientific literature, excluding statistics. Int Stat Rev Rev Int Stat. 1979, 47 (2): 111-127.
3. Kruskal W, Mosteller F: Representative sampling, i: Non-scientific literature. Int Stat Rev Rev Int Stat. 1979, 47 (1): 13-24.
4. Kruskal W, Mosteller F: Representative sampling, iii: The current statistical literature. Int Stat Rev Rev Int Stat. 1979, 47 (3): 245-265.
5. Kruskal W, Mosteller F: Representative sampling, iv: The history of the concept in statistics, 1895-1939. Int Stat Rev Rev Int Stat. 1980, 48 (2): 169-195.
6. Galea S, Tracy M: Participation rates in epidemiologic studies. Ann Epidemiol. 2007, 17 (9): 643-653.
7. Pickery J, Carton A: Oversampling in relation to differential regional response rates. Surv Res Methods. 2008, 2 (2): 83-92.
8. Stronks K, Snijder MB, Peters RJG, Prins M, Schene AH, Zwinderman AH: Unravelling the impact of ethnicity on health in Europe: the HELIUS study. BMC Public Health. 2013, 13: 402. doi:10.1186/1471-2458-13-402
9. Szklo M: Population-based Cohort Studies. Epidemiol Rev. 1998, 20 (1): 81-90.
10. Grafström A, Lundström NLP: Why well spread probability samples are balanced. Open J Stat. 2013, 3: 36-41.
11. Grafström A, Schelin L: How to Select Representative Samples. Scand J Stat. 2014, 41 (2): 277-290. doi:10.1111/sjos.12016
12. Bondesson L, Thorburn D: A list sequential sampling method suitable for real-time sampling. Scand J Stat. 2008, 35 (3): 466-483.
13. Deville J, Tillé Y: Efficient balanced sampling: The cube method. Biometrika. 2004, 91 (4): 893-912.
14. Horvitz DG, Thompson DJ: A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952, 47 (260): 663-685.
15. Tillé Y: Sampling Algorithms. Springer Series in Statistics. 2006, New York: Springer
16. Narain RD: On sampling without replacement with varying probabilities. J Indian Soc Agric Stat. 1951, 3: 169-175.
17. Hájek J: Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat. 1964, 35 (4): 1491-1523.
18. Berger YG: Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pak J Stat. 2011, 27 (4): 407-426.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/81/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.