 TECHNICAL ADVANCE
 Open Access
 Open Peer Review
 Published:
Crude incidence in twophase designs in the presence of competing risks
BMC Medical Research Methodologyvolume 16, Article number: 5 (2016)
Abstract
Background
In many studies, some information might not be available for the whole cohort, some covariates, or even the outcome, might be ascertained in selected subsamples. These studies are part of a broad category termed twophase studies. Common examples include the nested casecontrol and the casecohort designs. For twophase studies, appropriate weighted survival estimates have been derived; however, no estimator of cumulative incidence accounting for competing events has been proposed. This is relevant in the presence of multiple types of events, where estimation of event type specific quantities are needed for evaluating outcome.
Methods
We develop a non parametric estimator of the cumulative incidence function of events accounting for possible competing events. It handles a general sampling design by weights derived from the sampling probabilities. The variance is derived from the influence function of the subdistribution hazard.
Results
The proposed method shows good performance in simulations. It is applied to estimate the crude incidence of relapse in childhood acute lymphoblastic leukemia in groups defined by a genotype not available for everyone in a cohort of nearly 2000 patients, where death due to toxicity acted as a competing event. In a second example the aim was to estimate engagement in care of a cohort of HIV patients in resource limited setting, where for some patients the outcome itself was missing due to lost to followup. A sampling based approach was used to identify outcome in a subsample of lost patients and to obtain a valid estimate of connection to care.
Conclusions
A valid estimator for cumulative incidence of events accounting for competing risks under a general sampling design from an infinite target population is derived.
Background
In many longitudinal studies, some information might not be measured/available for the whole cohort, in fact biomarkers/additional covariates, or even outcome, might be ascertained only in selected subsamples. These studies are part of a broad category termed twophase studies [1], in fact they imply two sampling phases: the first one being usually a random sample from the target population, ending up in the entire cohort (phase I sample), and the second one applying some kind of sampling (e.g. efficient or of convenience) to collect additional information or the selection of subjects with no missing data (phase II sample). Common examples of efficient second phase sampling include the nested casecontrol and the casecohort designs [2–5]. In other situations the outcome itself is collected only for a subsample [6]. Twophase sampling, or more generally, multiphase sampling, is a general design that includes any valid probability sample of the data, in which each subsampling can depend on all the currently observed data at each step [7]. The actual sampling probabilities will depend on the specific design. The acknowledgement of these sampling phases, even in the commonly applied designs, can be very useful to improve efficiency and to allow flexibility in the analysis (e.g. different timescales or different models can be applied) by using information available for the whole cohort [8].
Efficient designs are particularly useful to identify new biomarkers when the combination between large cohorts and expensive new technologies make it infeasible to measure the biomarkers on the entire cohort. The Women’s Health Initiative program, for example, stored serum and plasma from participants and used them for specialized studies [9]. Also the Cardiovascular Health Study collected DNA from most participants to study different genetic factors underlying cardiovascular or other diseases and only subsets of the cohort have been genotyped in different projects [10].
In our first motivating clinical example, the aim was to evaluate the role of different genetic polymorphisms on treatment failure due to relapse in childhood acute lymphoblastic leukemia (ALL) using clinical information and biological samples available from a clinical trial that enrolled nearly 2000 patients. In this situation, a parsimonious use of these specimens motivated the choice of an efficient/optimal twophase sampling design [11]. We present also a further application where the aim was to estimate engagement to care of HIV patients in resource limited settings. Here the outcome itself was missing in a group of patients due to possibly informative loss to followup. The outcome was tracked in a random sample of those lost to followup to obtain a valid estimate of engagement to care [12].
For twophase studies, appropriate weighted survival estimates have been derived, both in the presence of additional covariates measured in the second phase [7, 13, 14], as well as in cohorts where the outcome/followup is not available for everyone [15]. A Cox model adapted for twophase designs has also been derived [14]. However no estimator of cumulative incidence accounting for competing events in the general framework of twophase designs has been proposed, while it has been developed for specific designs, such as nested casecontrol studies [16, 17]. This is relevant in the presence of multiple types of events, such as relapse and (toxic) death in cancer patients, as in the motivating examples presented here, where estimation of event type specific quantities are needed for evaluating outcome.
The aim of this paper is to develop a non parametric estimator of the crude incidence of events accounting for possible competing events in the general framework of twophase designs, where subgroups of analysis might be defined according to explanatory variables ascertained in the phase II sample, or the outcome itself assessed only in the second phase sample.
In the Methods section we propose a weighted crude incidence estimator for application in twophase designs. The theoretical properties of the proposed method are derived in appendix and investigated through simulations under different scenarios, which results are reported in Results section. In this section we also report the examples on childhood ALL and HIV patients. Conclusions is dedicated to the discussion.
Methods
Notation and basics
Let T be the failure time variable and suppose there are K possible causes of failure denoted by ε=1,2,…K. Let the causespecific hazard function of the k ^{th} event be:
and \(\Lambda _{k}(t)={\int _{0}^{t}} \lambda _{k}(s)ds\). Define
as the probability that a failure due to cause k occurs by time t, that is the quantity that we aim to estimate. Define also \(S(t)=P(T>t)=1\sum _{k} F_{k}(t)\) as the probability of surviving from any cause of failure.
A convenient representation of the crude incidence function (1) as product limit estimator naturally arises starting from the subdistribution hazard introduced by Gray [18] and defined as:
This hazard has been shown to be very useful to compare the crude cumulative hazard functions in different groups, since it restores a onetoone relationship between the hazard and the cumulative probability of a particular failure type: , with \(\Lambda _{k}^{*}(t)={\int _{0}^{t}} \lambda _{k}^{*}(s)ds\) and where the product integral notation is used to suggest a limit of finite products \(\prod \) [18, 19]. Of note, the onetoone relationship between the hazard and the cumulative probability is not satisfied from the causespecific hazard in the presence of competing events [20]. The subdistribution hazard can be thought as the hazard of an artificial variable \(T^{*}_{k}=T\cdot I\left \{ \varepsilon =k\right \} +\infty \cdot I\left \{ \varepsilon \neq k \right \}\) that extends to infinity the time to event k when another competing event is observed. In fact, for any finite t, \(T^{*}_{k}\leq t\) is equivalent to T≤t and ε=k; thus, given definition (1), \(P\left (T^{*}_{k}\leq t\right) = F_{k}(t)\). The definition of \(T^{*}_{k}\) is consistent with the argument that when an event other than k occurs as first, the latter will never be observed as first and thus the corresponding time is infinity.
Let (T _{ i },ε _{ i },C _{ i },Z _{ i }), with i=1…N, be N independent replicates of (T,ε,C,Z), where C is the censoring time and Z a vector of covariates. We will refer to these N subjects as the phase I sample. Define X= min(T,C) and Δ=I(T≤C). We will assume that failure and censoring times are conditionally independent, T⊥CZ. Let Y _{ i }(t)=I(X _{ i }≥t), N _{ ik }(t)=I(X _{ i }≤t,Δ _{ i } ε _{ i }=k) and \(N_{i \cdot }(t)=\sum _{k=1}^{K} N_{\textit {ik}}(t)\), where I(·) is the indicator function. Define G(t)=P(C>t) as the probability to remaining uncensored up to t.
Suppose that complete information on (X _{ i },Δ _{ i } ε _{ i },Z _{ i }) is available only for a subset n<N of subjects drawn based on a possibly complex sampling design and let ξ _{ i } indicate whether subject i is selected into this sample. We will refer to the \(n=\sum _{i}\xi _{i}\) subjects as the phase II sample, even if multiple phases of sampling could actually be involved to obtain the final complete sample [7]. Let π _{ i }=P(ξ _{ i }=1X _{ i },Δ _{ i } ε _{ i },Z _{ i }) being the inclusion probability of subject i for the phase II sample, conditional on being selected at the first phase. In a random sample this probability is equal for every subject. However sampling is often stratified on some variables to increase efficiency; in this case, the probability to be selected for the phase II sample is common for all subjects in the same stratum and differs between strata. In particular, it is usually higher for the more informative strata (e.g. strata including subjects with the event of interest as in casecontrol studies). For nested casecontrol designs the sampling probability of cases will be 1, while the one of controls might be derived as the probability that individual i is ever selected as control, following Samuelsen [4]. We denote the pairwise sampling probability for any two subjects (i,j, with i≠j) by π _{ ij }=P(ξ _{ i }=1,ξ _{ j }=1X _{ i },Δ _{ i } ε _{ i },Z _{ i },X _{ j },Δ _{ j } ε _{ j },Z _{ j }). As commonly assumed in survey theory, the sampling method should have the following properties: the sampling probabilities π _{ i } and π _{ ij } must be non zero for all i,j in the population and must be known for each i,j in the sample [7].
Incidence estimation in the presence of competing risks
Overall survival/incidence estimate
Under a twophase design it is common to be interested in estimating survival in subgroups related to variables ascertained only in phase II sample (i.e. biomarkers). Another possible situation is that, instead of covariates, the outcome itself is not available for the whole cohort. Thus, in both cases an estimate of the incidence of event using only the phase II sample is very useful. The total number of events of type k up to t and the total number of persons at risk at time t for the entire phase I sample can be estimated from the phase II sample (accounting for the sample design) by \(\hat {N}_{\cdot k}(t)=\sum _{i=1}^{N} [\!\xi _{i} N_{\textit {ik}}(t)/\pi _{i}]\) and \(\hat {Y}_{\cdot }(t)=\sum _{i=1}^{N} [\!\xi _{i} Y_{i}(t)/\pi _{i}]\), respectively. Note that these estimates are valid under general sampling designs, where π _{ i } and π _{ ij }, the socalled ‘design weights’, are known for the observations actually sampled [21].
The estimate of the overall survival has been shown by several authors in different contexts of complex sampling [13–15]:
where the overall hazard can be obtained by \(\hat {\Lambda }(t)=\sum _{k=1}^{K} \hat {\Lambda }_{k}(t)\) and \(\hat {\Lambda }_{k}(t)= {\int _{0}^{t}} \hat {N}_{\cdot k}(ds)/\hat {Y}_{\cdot }(s)\) [22]. It has been shown that \(\sqrt {N}[\!\hat {\Lambda }(t)\Lambda (t)]\) converges weakly to a zeromean Gaussian process [14, 15].
Competing risk
The goal is to estimate the crude incidence of a given cause k, , using the phase II sample, which is also called subdistribution function and is the probability that a failure due to cause k occurs within t [23, 24]. The estimate of \({\Lambda }_{k}^{*}(t)\) is based on the count of events due to cause k and the count of subjects at risk for \(T^{*}_{k}\), denoted by \(\hat {Y}_{\cdot k}^{*}(s)\) (see Appendix A.1):
The estimate of the cumulative subdistribution hazard in (2) can now be estimated, using only the phase II sample, by:
Note the complement of F _{ k }(t) can be thought as the survival probability of \(T^{*}_{k}\) [18, 20, 25], thus a product limit type estimator can be directly derived as:
Interestingly, this estimator is algebraically equivalent to the AalenJohansen type estimator, shown by [18] for random sampling, and in the Appendix A.2 for general sampling:
It is easy to see that in the absence of competing events, \(\hat {Y}^{*}_{\cdot k}(s)\) in (4) degenerates to the usual risk set \(\hat {Y}_{\cdot }(s)\), thus \(\hat {\Lambda }_{k}^{*}(t)=\hat {\Lambda }(t)\) and \(\hat {F}_{k}(t)\) equals the complement of 1 of the weighted KaplanMeier estimator for twophase studies [13]. Under no censoring, the weight \(\hat {G}(s^{}X_{i})\) becomes 1 and the risk set \(Y^{*}_{\cdot k}\) is eroded in time only by events of type k, therefore \(\hat {F}_{k}(t)\) degenerates into the proportion of events of type k estimated by the phase II sample (weighted number of events of type k out of the estimated total size of the cohort, phase I). If every subject in phase I is sampled (ξ _{ i }=1 ∀i), then (5) becomes the standard subdistribution cumulative hazard [19, 20] and (6) the standard estimator of the crude incidence.
For simplicity of notation, in (6) we estimated the overall incidence regardless of covariates, but the estimator can also be applied on subgroups defined by Z. The censoring probability G(t) should also be estimated in subgroups defined by Z. The overall estimator is reasonable when we make the more restrictive assumption T⊥C, otherwise separate estimators conditional on Z would be more appropriate (and eventually an average, weighted on the frequencies of Z, between the conditional estimates).
Variance and confidence intervals
Following Breslow and Wellner [26], we can express \(\sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\Lambda ^{*}_{k}(t)\right ]=\sqrt {N}\left [\tilde {\Lambda }^{*}_{k}(t)\Lambda ^{*}_{k}(t)\right ]+ \sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\tilde {\Lambda }^{*}_{k}(t)\right ]\) where \(\tilde {\Lambda }^{*}_{k}(t)\) represents the crude cumulative incidence estimator that we would have obtained if complete information (X _{ i },Δ _{ i } ε _{ i },Z _{ i }) was known for all the subjects in phase I sample (i=1…N) [18]. The two terms are asymptotically independent [14, 26]. The first term converges weakly to a zeromean Gaussian process [19] with covariance that we denote as \(\sigma _{k\medskip I}^{2}(t)\). By the arguments in Appendix A.3, the second term converges weakly to a zeromean Gaussian process with covariance \(\sigma _{k\medskip II}^{2}(t)\). Hence, \(\sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\Lambda ^{*}_{k}(t)\right ]\) converges weakly to a zeromean Gaussian process with covariance being the sum of the contribution of each sampling phase: \({\sigma _{k}^{2}}(t)=\sigma _{k\medskip I}^{2}(t)+\sigma _{k\medskip II}^{2}(t)\). The first one represents the irreducible minimum uncertainty that would remain if everyone in phase I would be sampled and the second one accounts for the fact that complete information is available only in the phase II sample [13, 14, 26].
Each contribution to the variance can be estimated by the influence function approach [27]. The influence function of an estimator describes how the estimator changes when single observations are added or removed from the data and has the property that the difference between the estimate and the population quantity can be expressed as the sum of influence functions over all the subjects in the sample. By denoting with \(z_{\textit {ik}}^{*}(t)\) the influence function of subject i on \(\hat {\Lambda }^{*}_{k}(t)\) we have that [28]:
The influence function of subject i on \(\hat {\Lambda }^{*}_{k}(t)\) has been derived in Appendix A.4.
By using the HorvitzThomposon variance [29] on the weighted influence function, the contribution of the variance of phase II will be:
For phase I, the variance \(\hat {\sigma }_{k\medskip I}^{2}(t) \) can also be estimated using (9) by setting sampling probabilities to 1 [13].
Given the onetoone relationship between F _{ k }(t) and \(\Lambda _{k}^{*}(t)\), the variance of the crude cumulative incidence (6) can now be estimated as:
In analogy with the survival estimate for twophase designs, we derived confidence intervals for (6) on the logarithm scale by:
where q _{ α/2} denotes the α/2 quantile of the standard Gaussian distribution.
Software
By using suitable weights for both study design and censoring, any software allowing for time dependent weights can be used to derive the modified risk set and to estimate the crude cumulative incidence function (6). These weights have been implemented in R in function crprep in the mstate package [30] and in STATA in the stcrprep function. However, any software can be used to derive the ingredients for the modified risk set in (4) and these can be used to estimate (6) and its variance by the HorvitzThompson approach.
The complete code to compute this estimate has been developed in R software [31] using the survey package [32] and is available at [33]. An example of the application of this function is given in the subsection Genotype ascertained on a subset of a clinical trial cohort.
Results
Simulations
Simulations protocol
We considered two competing events with independent latent times T _{1} and T _{2} and constant marginal hazard of 0.1, the crude incidences are then \(F_{1}(t)=F_{2}(t)=\frac {1}{2}(1e^{0.1t})\). We focused on the crude incidence of event 1 up to t=2 units of time (i.e. years). This implies a fraction of 82 % with no events at t=2 (administrative censoring) and a crude incidence of about 9 %. The independence between the latent times T _{1} and T _{2} is not restrictive given the non identifiability issue [34]. The censoring time followed an uniform distribution on ranges (0.5,30.5) and (0.5,10.5), leading to around 5 % and 15 % censored before t=2, respectively.
We drew B=1000 random firstphase samples of size N=1000, from which we sampled a phase II sample according to different study designs:

1.
random sample, with n=50,100 units;

2.
casecontrol sampling: we randomly sampled n/2 individuals among those who experienced event 1 (cases) up to time 2 and n/2 individuals among the others (controls), with n=50,100 units;

3.
stratified sampling: we considered the phase I sample divided into 4 strata defined by the variable Z={0,1} (with frequencies 70 % and 30 % for Z=0 and Z=1, respectively) and the occurrence or not of event 1 up to time 2. The hazard rates were assumed to be 0.08 and 0.2 with Z=0 and with Z=1, respectively. An equal number of subjects (n/4) were sampled for each strata (balanced sampling), with n=50,100 units.

4.
nested casecontrol design: we selected all cases and m controls for each case with no events at the time of event of the case, fixing m=1,2. Under this design we cannot fix a total sample size a priori, but we expect around 90 events and 90·m controls. Sampling probabilities for each included subject were derived according to Samuelsen [4].
B = 1000 was chosen in order to get a ±5 % level of accuracy in the estimate of the crude incidence (F _{1}(t),t>0.3) in about 95 % of the samples. For each sample, \(\hat {F}_{1}(t)\) has been computed by (6), with \(\hat {\Lambda }_{1}^{*}(dt)\) estimated by (5), and it has been compared with F _{1}(t) in order to assess bias in each sample: \(\hat {F}_{1b}(t)F_{1}(t)\), b=1…B. Bias has been computed and reported for 20 different time points t=0.1,0.2,…,2. For each simulation, we also computed standard error of \(\hat {F}_{1}(t)\) according to (10) and the 95 % confidence interval (CI) of \(\hat {F}_{1}(t)\) on the logarithm scale (11) to evaluate coverage and length.
Simulations results
Figure 1 compares the average of the estimated standard error of \(\hat {F}_{1}(t)\) in each simulation with the empirical standard error at the 20 different times of observation in the four different scenarios with random censoring of 15 %. They were found to be very close in all scenarios, as expected.
Figure 2 reports the distribution of bias in each one of the 1000 simulated samples under random (panel a), casecontrol (panel b), stratified (panel c) and nested casecontrol sampling (m=1, panel d). Bias fluctuates around 0 in each scenario, but it has more variability in the random sampling compared to other scenarios, resulting also in a higher mean value of absolute bias over the B simulations (still always lower than 0.2 %). The lower performance of the estimator in the first scenario is due to the fact that random sampling is not a convenient design in the simulated setting. In fact, in phase I cohort we expect around 90 events of type 1 (incidence of 9 % and sample size of 1000), thus if we randomly sample 100 subjects from the phase I cohort, we expect to observe only 9 events (in phase II sample). With such a small number of events, unbiasedness is in fact not sufficient to ensure a reasonable behaviour and to get enough information on event incidence. To address this issue, the other study designs (scenarios 2, 3, and 4) are indeed thought to guarantee to sample more events of type 1. Thus we recommend adopting efficient designs accounting for the event of interest. Relative and standardized biases were always lower than 6 % (data not shown). The mean square error, not shown, slightly increases with time, in fact variability is increasing in time (as confirmed by the empirical standard error of \(\hat {F}_{1}(t)\) and by Fig. 2). The average length of the confidence interval was consistently increasing with time, ranging between 7 % and 12 % in the random sampling and between 2 % and 4 % in the casecontrol, stratified and nested casecontrol sampling (data not shown). This comparison underscores the advantages of a careful selection of the subsample.
Figure 3 reports results on coverage for the random, casecontrol, stratified and nested casecontrol sampling. The coverage was very close to the nominal value of 95 %, ranging mostly within a minimum of 94 % and a maximum of 97 %, except for very early times in the random setting.
In the same setting, we also considered a longer followup time, t = 50, with around 500 events of type 1 expected in phase I sample (under no censoring), and confirmed the performance of our estimator in a scenario with higher variability, with similar results for the different sampling schemes (data not shown).
Motivating examples
Genotype ascertained on a subset of a clinical trial cohort
A study on childhood ALL evaluated the role of a genetic polymorphism (glutathione Stransferase θ, GSTT1) on treatment failure due to relapse (in different sites), in the presence of a competing event (toxic death). GSTT1 is a common genetic polymorphism in Caucasians, with 13–26 % of individuals displaying a homozygous deletion of the gene (null genotype). Subjects carrying the null variants fail to express the GSTT1 enzyme, that is involved in drug metabolism. Clinical information were available for a cohort of 1999 consecutive patients (mainly European Caucasians, aged between 1 and 17 years, median age: 5 years) newly diagnosed with ALL in the Italian Associazione Italiana di Ematologia Pediatrica centers between September 2000 and July 2006. Biological samples stored at diagnosis were available, but genotype was ascertained only in a subgroup (phase II sample) for an efficient use of specimens [11, 13]. The interest was to evaluate incidence at different relapse sites by GSTT1, that can only be estimated using phase II data. In order to select the subgroup to be genotyped we adopted an optimal strategy that is carefully described in [11, 13]. Briefly, sampling was done after classifying patients into 6 strata according to the event of interest (relapse/no relapse) and to 3 groups, defined by prognostic features in the treatment protocol, that modulate the intensity of treatment, we will call them treatment protocols (Table 1). Strata were not defined based on the competing event death due to toxicity 58 events  for efficiency reasons given that the event of interest was relapse. Patients were sampled at random without replacement from the 6 strata, with the sampling from each stratum conducted independently (stratified sampling) and with higher probability in the more informative strata according to an optimal design [13]. The full cohort of 1999 patients represents the phase I sample, for which clinical information are available, while genotype is ascertained in the phase II sample only (n=601).
Relapses were classified according to the site, in particular we distinguished relapses involving bonemarrow (BM) from the others (extramedullary). We estimated the crude incidence of BM relapse by GSTT1 deletion using (6) and (10) and found higher relapse incidence for patients with GSTT1 deletion, with 5year crude incidence of 19.3 % (95 % CI: 13.4−27.7 %) versus 12.4 % (95 % CI: 10.7−14.4 % for non deleted patients, Fig. 4 panel a). This was derived accounting for the competing risk of other sites of relapse as well as for death due to toxicity.
We report here the R code used to compute these estimates:library(survey) d.std<twophase(id=list(~upn,~upn), subset=~!is.na(GST_T), strata=list(NULL, ~interaction(rel,elfin)),data=dat) GSTse<svycr(Surv(time,event>0)~GST_T, etype=“BMrelapse”,d.std,se=TRUE)
The twophase function in the survey package describes the design and produces a survey object [32]. The svycr function, available at [33], performs the estimate of crude incidence by the influence approach and uses 3 variables: time is the time of event, event the censoring indicator (1 if an event of any type is observed and 0 otherwise) and BMrelapse indicates whether a BM relapse is observed or not. Details on the survey package can be found in [7, 32].
The right panel of Fig. 4 represents the incidence of extramedullary relapses by GSTT1 showing that the difference in relapse incidence between GSTT1 deleted and other patients is mainly due to relapse involving the BM, that represents the most relevant type of relapse in childhood ALL. A Cox model adapted for twophase design [14], when applied to the cause specific hazard of BM relapse, gives an hazard ratio (HR) of 1.53 (95 % CI 0.98–2.37) for GSTT1 deleted patients versus non deleted; after adjusting for relevant factors (treatment protocol, gender, age), the HR dropped to 1.38 (95 % CI 0.90–2.13). For extramedullary relapses the HR was 1.22 (95 % CI 0.60–2.49). Of note, in order to compare patients with and without deletion of the GSTT1 gene, we used a causespecific model, thus we actually compared the causespecific hazard of relapse. In fact, a subdistribution model accounting for the twophase design is not available. This would be useful to compare the actual incidences of relapse in the two groups, however the causespecific model is still very useful to address the impact of the genotype on relapse by an aetiological point of view.
Outcome ascertained on a subset of patients lost to followup
In the evaluation of the effectiveness of the global effort to provide antiretroviral therapy (ART) for HIVinfected patients in resource limited settings, the estimate of the number of patients who continue to access care after starting ART is essential. This estimate is hampered, however, by the fact that some patients die shortly after their last visit to clinic  a group of individuals who cannot be considered as “stopping care” nor censored for the event of stopping care. In addition, the number of patients who are starting care is large and a high fraction have unknown outcomes (i.e., are lost to followup), generating informative censoring. Given that lost patients could reasonably be not in care, but they could also have changed clinic or be dead, one approach to obtain outcomes estimates has been to identify a numerically small, but random, sample of those who are lost [15], intensively seeking their outcomes, and using them to correct outcomes among the lost.
To illustrate, a cohort of 13,321 HIVinfected adult patients, who initiated ART treatment, were followed from ART initiation to either death, disengagement or administrative database closure (see Fig. 5). Among them, 2451 patients were lost to followup [35], defined as not being seen at the clinic for at least 90 days (after the last return visit). A tracker went into the community to determine the outcome of a random subsample of 428 among the 2451 lost patients and got information on 306 patients (110 patient were found to be in care in other clinics, 80 died while in care and 116 were found to be not treated/disengaged) [12]. The 10,870 patients no lost to followup and the 306 tracked patients can been considered as the second phase sample of the whole cohort, stratified on lost to followup.
We used the methods developed in the Methods section to estimate crude cumulative incidence, where the 306 tracked patients represented the 2451 lost patients by the sampling probability 306/2451, while the other 10,870 had sampling weight one. The crude incidence estimate of disengagement is reported in Fig. 6, the curve starts to rise after 90 days from ART start, that is the earliest possible time of disengagement, by definition. At 1 year, disengagement resulted 6.8 % (CI 95 % 5.7–8.2 %). This was subject to a strong influence of the competing event death in care that resulted 7.7 % (CI 95 % 6.8–8.9 %) at 1 year. A naïve (but less expensive) approach to deal with informative censoring would be to treat all lost patients as events or, contrarily, as censored observations. We plotted the two corresponding curves in Fig. 6, obtaining estimates of crude incidence at 1 year since ART treatment of 18.5 % and 0.2 %, respectively. We can consider that the true incidence will lie between these two estimates (that are however quite far in this context), as in fact it does the estimate we got by tracking a random sample of lost patients and using the proposed estimator.
Availability of supporting data
The R code to compute the proposed estimate of crude cumulative incidence is available at [33]. The results of simulations presented in the Simulations section and the related code are also available at [33].
Conclusions
We have derived an estimator for cumulative incidence of events based on the subdistribution hazard accounting for competing risks under a general sampling design from an infinite target population. The estimator shows good performance in simulations under different scenarios and the variance, derived by the influence function of the subdistribution hazard and the HorvitzThompson theory, was very close to the empirical variance, therefore we expect it to be very close to the one obtainable by replicate weights (e.g. bootstrap) [7]. Confidence intervals, derived on the log scale, provided good coverage in simulations, but alternative confidence intervals might also be considered such as the complementary loglog transformation [36]. The proposed estimator was used to estimate incidence of relapse by genotype in a cohort of childhood ALL patients, where the genotype was ascertained only on a subsample of the cohort chosen by an optimal sampling approach based on relapse as the event of interest. Interestingly, we can also analyze the incidence of the competing event (toxic death) or of the combined endpoint (relapse or toxic death), but the efficiency could be lower unless the subsampling is adapted to this new endpoint by including a further strata on toxic death in the sampling process. This is particularly important since toxic death is a rare event in this context. We should also remember to avoid random sampling when the event of interest is rare, as discussed in the Simulations section.
In the second case, we dealt with a missing data problem, in which the outcome itself was not available for everybody, since some patients were lost to followup. A subsample of lost patients had been tracked to ascertain the outcome, but if this tracking was not possible, a more basic/naïve approach to deal with this informative censoring could have been to identify the variables affecting missingness, poststratify the sample in homogeneous strata and use the missing probabilities for each strata as sort of sampling weights to adjust the incidence estimate. This approach would make an important assumption of missing at random that might be not appropriate and cannot be tested [7, 15, 21, 37]. However this underlines how the proposed estimator could be applied also in the presence of missing data.
The code to compute this estimate has been developed in R software [31] under the survey package [32] and is available at [33]. The survey package is a flexible package for complex surveys including also twophase studies. It provides flexible functions to describe the design of the study and to derive sampling fraction accordingly. The package includes functions to estimate survival and to perform a weighted Cox model with standard error properly adjusted for the design and with the possibility to use general weights, as calibrated weights. Our function takes advantage of the facilities of the package (see an example of use in Genotype ascertained on a subset of a clinical trial cohort).
In order to recover the representativeness of the subcohort (phase II) for the entire cohort, we used weights related to the inverse of the probability to be sampled, similarly to the weights of Barlow for casecohort studies [38]. More general weights can be used, such as calibration weights [7, 39, 40]. The use of calibration weights is advantageous when there is availability of phase I variables that are strongly related to the additional variables ascertained in phase II. This would provide results more representative of phase I data and increase precision. When phase II variables are common genetic polymorphisms, as in our first example, it is unlikely to find any strong relation between phase I and II variables, therefore no big advantage would be expected by calibration.
The estimator can also be extended to a situation where an individual may move among a finite number of states to estimate the AalenJohansen probabilities of transition among each state in a multistate framework [41] in the presence of general sampling design.
In order to derive a modelbased estimate of incidence (adjusted for possible covariates) two main approaches have been followed in the context of competing risk, the first one based on the causespecific hazard inspired by Benichou and Gail [16, 42, 43] and the other one based on the subdistribution hazard [19, 44]. The crude cumulative estimator developed by Kovalchik and Pfeiffer [45] for twophase studies for finite population follows the first approach, and the Cox model for twophase designs [14] could be used to extend it for infinite population. Under this model we can estimate the effect of a covariate on the causespecific hazard to address its impact on the event by an aetiological point of view. However it is well known that this does not reflect the impact of the variable on the crude cumulative incidence. The latter effect, even if affected by the incidence of the other competing events, could still be of interest for a public health prospective. Future work will concern the development of a regression model to assess the effect of a covariate on the crude cumulative incidence. The Fine and Gray regression model [19] could be extended to complex sampling by weighting the estimating function of the parameter of interest and working out their influence function.
A Appendix
A.1 Derivation of the risk set for \(T^{*}_{k}\)
The risk set for the usual survival time T at s is commonly obtained in standard analysis by counting the observed times greater then s. It can be also written as:
where \(\hat {G}(s)\) is the probability to be free of censoring up to s and is estimated considering censored observations as events and viceversa according to (3). This can proved also in the case of a twophase design by the following:
where \(\hat {N}_{\cdot }^{c}(t)=\sum _{i=1}^{N} \left [\xi _{i} I(X_{i}\le t,\Delta _{i}=0)/\pi _{i}\right ]\) denotes the number of censoring up to time t and \(\hat {N}_{\cdot \cdot }(t)=\sum _{i=1}^{n} \hat {N}_{\textit {i}\cdot }(t)\) the total count of events observed up to time t.
The derivation of the risk set for \(T^{*}_{k}\) at s can be obtained by:
By writing \(\hat {F}_{l}(s^{})\) as empirical cumulative distribution function \(\hat {F}_{l}(s^{})=\frac {1}{\hat {Y}_{\cdot }(0)}\sum _{i=1}^{N} \frac {\xi _{i}}{\pi _{i}}\frac {I(X_{i}\le s^{};\varepsilon _{i}=l)}{\hat {G}(X_{i}^{}\wedge s^{})} \) [25, 46], the risk set becomes:
that can be simplified to
that is equivalent to (4).
The first summation estimates the usual total number of subjects at risk at s, where the condition X _{ i }= min(T _{ i },C _{ i })≥s is satisfied. This in fact implies \(\min (T_{\textit {ki}}^{*},C_{i})\geq s\), i.e. being at risk for T implies being also at risk for \(T^{*}_{k}\). The second summation estimates the number of subjects who had other events before s, satisfying the condition X _{ i }= min(T _{ i },C _{ i })<s, Δ _{ i }=1 and ε _{ i }≠k which implies \(T_{\textit {ki}}^{*}=\infty > s\) and completes the number at risk at s for \(T^{*}_{k}\). While the first part is exposed to censoring, the contribute of each subject observed to fail of cause l≠k, \(\sum _{l\neq k} N_{\textit {il}}(s^{})\), would remain equal to 1 up to ∞, thus ignoring possible censoring, given that C _{ i } is (usually) not observable if T _{ i }<C _{ i }. A possible way to deal with this inconsistency is to mimic the presence of random censoring acting on the infinite times, by weighting the unitary contributions \(\sum _{l\neq k}N_{\textit {il}}(s^{})\) by the estimate \(\hat {G}(s^{}X_{i}^{})\) of \(P(C>s^{}C>X_{i}^{})=G(s^{}X_{i})\), where G(t)=P(C>t) is estimated by \(\label {eq:G}\hat {G}(t)=\prod _{s\leq t} \left [1 \frac {\sum _{i=1}^{N}\xi _{i} {N_{i}^{c}}(ds)/\pi _{i}}{\sum _{i=1}^{N}\xi _{i} Y_{i}(s)/\pi _{i}}\right ]\), with \({N_{i}^{c}}(s)=I(X_{i}\leq s,\Delta _{i}=0)\). This weight assumes value 1 before X _{ i } and decreases afterword according to the censoring distribution.
Of note, an alternative expression for \(\hat {Y}^{*}_{\cdot k}(s)\) derives substituting \(\hat {Y}_{\cdot }(0)\hat {G}(s^{})=\frac {\hat {Y}_{\cdot }(s)}{\hat {S}(s^{})}\) from (12) in (14):
This shows as \(\hat {Y}_{\cdot }(s)\) is upweighted by a multiplier that gets greater as the action of competing events gets larger, accounting for the fact that subjects that experienced events of type l≠k will never experience event k as first.
A.2 Proof of equivalence (7)
The equality between the cumulative incidence of the artificial variable T ^{∗} in (6) and the AalenJohansen type estimator (that for the purpose of this proof will be denoted as \(\hat {F}^{AJ}_{k}(t)\)) holds true if and only if, ∀t:
which can be proved by induction. Both quantities are step functions, changing value at each occurrence of type k events. At the time t where the first event of type k is observed, \(\hat {\Lambda }_{k}(dt)=\hat {\Lambda }^{*}_{k}(dt)\) being from (4) \(\hat {Y}_{\cdot }(t)=\hat {Y}_{\cdot k}^{*}(t)\). If this was the first event overall, then \(\hat {S}(0)=1\) and Eq. 18 is satisfied, otherwise \(\hat {F}^{AJ}_{k}(t)=[\!11/\hat {Y}_{\cdot }(0)]\cdot 1/[\!\hat {Y}_{\cdot }(0)1]=1/\hat {Y}_{\cdot }(0)=1[\!11/\hat {Y}_{\cdot }(0)]=\hat {F}_{k}(t)\).
Now, assuming that (18) holds true for a given t ^{−}, this implies \(\hat {F}^{AJ}_{k}(t)=\hat {F}_{k}(t)\) if and only if, from (18):
and using the equality at t ^{−}:
That is proved by (17).
A.3 Weak convergence of \(\hat {\Lambda }_{k}^{*}(t)\)
Lin showed that the normalised HorvitzThompson estimators of the number of events \(\sqrt {N}\left [\hat {N}_{\cdot }(t)N_{\cdot }(t)\right ]\), number at risk \(\sqrt {N}\left [\hat {Y}_{\cdot }(t)Y_{\cdot }(t)\right ]\), cumulative hazard \(\sqrt {N}[\hat {\Lambda }(t)\Lambda (t)]\) and survival function \(\sqrt {N}\left [\hat {S}(t)S(t)\right ]\) (and analogously \(\sqrt {N}\left [\hat {G}(t)G(t)\right ]\)) are asymptotically multivariate zeromean normal [14]. Firstly, we concentrate on the normalised HorvitzThompson estimators of the modified at risk process: \(\sqrt {N} \left [\hat {Y}^{*}_{\cdot k}(t)Y^{*}_{\cdot k}(t)\right ]=\sqrt {N}\sum _{i=1}^{N}\frac {\xi _{i}\pi _{i}}{\pi _{i}}Y^{*}_{\cdot k}(t) =\sqrt {N}\sum _{i=1}^{N} \frac {\xi _{i}\pi _{i}}{\pi _{i}}Y_{\cdot k}(t)+\sqrt {N}\sum _{i=1}^{N}\frac {\xi _{i}\pi _{i}}{\pi _{i}}\sum _{l\neq k}N_{i l}(t^{}) \hat {G}(t^{}X_{i})\). The first term represents the estimator of the number at risk, that Lin showed to be asymptotically multivariate zeromean normal [14]. We concentrate now on the second term:
that using Lemma 1 in [14] also converges to a zeromean Gaussian process.
We want to prove that also \(\sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\Lambda ^{*}_{k}(t)\right ]=\sqrt {N}\left [\tilde {\Lambda }^{*}_{k}(t)\Lambda ^{*}_{k}(t)\right ] +\sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\tilde {\Lambda }^{*}_{k}(t)\right ]\) converges to a zeromean normal, where \(\tilde {\Lambda }^{*}_{k}(t)\) represents the crude cumulative incidence estimator that we would have obtained if complete information (X _{ i },Δ _{ i } ε _{ i },Z _{ i }) was known for all the subjects in phase I sample (i=1…N) [18]. Fine and Gray proved that the first term converges weakly to a zeromean Gaussian process [19].
The second term results:
It then follows that also \(\sqrt {N}\left [\hat {\Lambda }^{*}_{k}(t)\tilde {\Lambda }^{*}_{k}(t)\right ]\) converges weakly to a zeromean Gaussian process.
A.4 Influence function for \(\hat {\Lambda }_{k}^{*}(t)\)
The estimator \(\hat {\Lambda }_{k}^{*}(t)\) can be expressed as a differentiable function g of the estimated total number of events of type k and total number at risk for T ^{∗} up to t:
where w _{ i }=1/π _{ i } and ξ _{ i } indicates whether subject i is withdrawn in the phase II sample.
The difference between the true and estimated cumulative hazard can be expressed as a sum of influence functions: \(\hat {\Lambda }_{k}^{*}(t)\Lambda ^{*}_{k}(t)=\sum _{i=1}^{N} z_{\textit {ik}}^{*}+o(1/\sqrt {N})\), where \(z_{\textit {ik}}^{*}\) is the influence function of the i ^{th} subject. Demnati and Rao [27] proved that we can express the influence function of subject i as \(z_{\textit {ik}}^{*}(t)=\frac {\partial g \left (\hat {N}_{\cdot k}(dt),\hat {Y}^{*}_{\cdot k}(t)\right)}{\partial w_{i}}\). The influence function of \(\hat {\Lambda }^{*}_{k}(t)\) of the i ^{th} subject can thus be derived as:
Being \(\hat {Y}^{*}_{\cdot k}(s)=\sum _{i=1}^{N} \xi _{i} w_{i}\cdot \left [ Y_{i}(s)+\frac {I(X_{i}\le s^{};\varepsilon _{i}\neq k)\hat {G}(s^{})}{\hat {G}(X_{i}^{}\wedge s^{})} \right ]\) and being , the derivative of \(\hat {Y}^{*}_{\cdot k}(s)\) with respect to w _{ i } results:
Please note that the last addendum accounts for the fact that \(\hat {G}(t)\) is estimated using information of subject i. For v≤u:
where the superscript c indicates the quantities related to the censoring process, i.e. \({N_{i}^{c}}(u)=I(X_{i}\leq u,\Delta _{i}=0)\) is the indicator of censoring for subject i up to time u and \(\hat {\lambda }^{c}(u)=\hat {N}_{\cdot }^{c}(du)/\hat {Y}_{\cdot }(u)\) the instantaneous hazard of censoring. The derivative of \(\hat {Y}^{*}_{\cdot k}(u)\) becomes:
Thus, the influence function of \(\hat {\Lambda }^{*}_{k}(t)\) of the i ^{th} subject results:
where w _{ j }=1/π _{ j }. By defining \({M_{i}^{c}}(s,t)={\int _{s}^{t}} \frac {{N_{i}^{c}}(du)\hat {\lambda }^{c}(u)Y_{i}(u)}{\hat {Y}_{\cdot } (u)}\) as the influence function of subject i on censoring, \(v_{j}=\frac {\xi _{j}}{\pi _{j}} \sum _{l \neq k}N_{\textit {jl}}(s^{})\hat {G}(s^{}X_{j}^{})\), and \(M_{\textit {ik}}(s)=\left [N_{\textit {ik}}(s)\hat {\Lambda }_{k}^{*}(s)Y^{*}_{\textit {ik}}(s)\right ]/\hat {Y}^{*}_{\cdot k}(s)\).
Thus it can be reduced as:
Abbreviations
 ALL:

Acute lymphoblastic leukemia
 ART:

Antiretroviral therapy
 BM:

Bonemarrow
 CI:

Confidence interval
 GSTT1:

Glutathione Stransferase 𝜃
 HR:

Hazard ratio
References
 1
Neyman J. Contribution to the theory of sampling human populations. J Am Stat Assoc. 1938; 33(201):101–16.
 2
Borgan Ø, Samuelsen SO. A review of cohort sampling designs for Cox’s regression model: Potentials in epidemiology. Norsk Epidemiol. 2003; 13:239–48.
 3
Prentice RL. A casecohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986; 73(1):1–11.
 4
Samuelsen SO. A psudolikelihood approach to analysis of nested casecontrol studies. Biometrika. 1997; 84(2):379–94.
 5
Langholz B, Borgan Ø. Countermatching: a stratified nested casecontrol sampling method. Biometrika. 1995; 82(1):69–79.
 6
Rudolph KE, Gary SW, Stuart EA, Glass TA, Marques AH, Duncko R, et al. The association between cortisol and neighborhood disadvantage in a US populationbased sample of adolescents. Health Place. 2014; 25:68–77.
 7
Lumley TS. Complex Surveys: A Guide to Analysis Using R. 1st ed. Inc JWS, editor. Wiley Series in Survey Methodology. Hoboken, New Jersey: John Wiley & Sons; 2010.
 8
Breslow N, Lumley T, Ballantyne C, Chambless L, Kulich M. Using the Whole Cohort in the Analysis of CaseCohort Data. Am J Epidemiol. 2009; 169(11):1398–405.
 9
Anderson GL, Manson J, Wallace R, Lund B, Hall D, Davis S, et al. Implementation of the women’s health initiative study design. Ann Epidemiol. 2003; 13(9, Supplement):S5—17.
 10
Fried LP, Borhani NO, Enright P, Furberg CD, Gardin JM, Kronmal RA, et al. The Cardiovascular Health Study: design and rationale. Ann Epidemiol. 1991; 1(3):263–76.
 11
Franca R, Rebora P, Basso G, Biondi A, Cazzaniga G, Crovella S, et al. Glutathione Stransferase homozygous deletions and relapse in childhood acute lymphoblastic leukemia: a novel study design in a large Italian AIEOP cohort. Pharmacogenomics. 2012; 13(16):1905–16.
 12
Geng E, Emenyonu N, Bwana M, Glidden D, Martin J. Samplingbased approach to determining outcomes of patients lost to followup in antiretroviral therapy scaleup programs in Africa. JAMA. 2008; 300(5):506–7. Available from: doi:10.1001/jama.300.5.506.
 13
Rebora P, Valsecchi MG. Survival estimation in twophase cohort studies with application to biomarkers evaluation. Stat Methods Med Res. 2014. in press. doi:10.1177/0962280214534411.
 14
Lin DY. On fitting Cox’s proportional hazards models to survey data. Biometrika. 2000; 87(1):37–47.
 15
Frangakis CE, Rubin DB. Addressing an idiosyncrasy in estimating survival curves using double sampling in the presence of selfselected right censoring. Biometrics. 2001; 57(2):333–42.
 16
Borgan Ø. Estimation of covariatedependent Markov transition probabilities from nested casecontrol data. Stat Methods Med Res. 2002; 11(2):183–202.
 17
Aalen OO, Borgan Ø, Fekjær H. Covariate adjustment of event histories estimated from Markov chains: the additive approach. Biometrics. 2001; 57(4):993–1001.
 18
Gray RJ. A class of Ksample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics. 1988; 16(3):1141–54.
 19
Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999; 94(446):496–509.
 20
Antolini L, Biganzoli EM, Boracchi P. Crude cumulative incidence in the form of a HorvitzThompson like and KaplanMeier like estimator. COBRA Preprint Series. 2006. 10 http://biostats.bepress.com/cobra/art10.
 21
Särndal C, Swensson B. A general view of estimation for two phases of selection with applications to twophase sampling and nonresponse. Int Stat Rev. 1987; 55(3):279–94.
 22
Kang S, Cai J. Marginal hazards model for casecohort studies with multiple disease outcomes. Biometrika. 2009; 96(4):887–901.
 23
Marubini E, Valsecchi MG. Analysing survival data from clinical trials and observational studies. Chichester, England: WileyInterscience; 2004.
 24
Bernasconi D, Antolini L. Description of survival data extended to the case of competing risks: a teaching approach based on frequency tables. Epidemiol Biostat Public Heal. 2013; 11(1). e8874–1:e8874–10.
 25
Satten GA, Datta S. The KaplanMeier estimator as an inverseprobabilityofcensoring weighted average. Am Stat. 2001; 55(3):207–10.
 26
Breslow NE, Wellner JA. Weighted likelihood for semiparametric models and twophase stratified samples, with application to cox regression. Scand J Stat. 2007; 34(1):86–102.
 27
Demnati A, Rao JNK. Linearization variance estimators for model parameters from complex survey data. Surv Methodol. 2010; 36:193–201.
 28
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Improved Horvitz–Thompson estimation of model parameters from twophase stratified samples: applications in epidemiology. Stat Biosci. 2009; 1(1):32–49.
 29
Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952; 47(260):663–85.
 30
de Wreede LC, Fiocco M, Putter H. mstate: An R Package for the Analysis of Competing Risks and MultiState Models. J Stat Softw. 2011; 38(7):1–30. Available from: http://www.jstatsoft.org/v38/i07/.
 31
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. Available from: http://www.Rproject.org.
 32
Lumley T. Analysis of complex survey samples. J Stat Softw. 2004; 9(8):1–19.
 33
Rebora P. R code to estimate crude incidence in twophase designs. 2015. Accessed: 20151027. doi:10.6070/H4F18WRG.
 34
Tsiatis AA. A nonidentifiability aspect of the problem of competing risks. Proc Natl Acad Sci U S A. 1975; 72(1):20–2.
 35
Geng EH, Glidden DV, Bwana MB, Musinguzi N, Emenyonu N, Muyindike W, et al. Retention in care and connection to care among HIVinfected patients on antiretroviral therapy in Africa: estimation via a samplingbased approach. PLoS One. 2011; e21797:7.
 36
Glidden DV. Robust inference for event probabilities with NonMarkov event data. Biometrics. 2002; 58(2):361–68.
 37
Lin DY, Ying Z. Cox regression with incomplete covariate measurements. J Am Stat Assoc. 1993; 88:1341–9.
 38
Barlow W, Ichikawa L, Rosner D, Izum iS. Analysis of casecohort designs. J Clin Epidemiol. 1999; 52(12):1165–72.
 39
Scott AJ, Wild CJ. Fitting regression models with responsebiased samples. Can J Stat. 2011; 39(3):519–36.
 40
Rudolph KE, Diaz I, Rosenblum M, Stuart EA. Estimating population treatment effects from a survey subsample. Am J Epidemiol. 2014; 180:737–48.
 41
Aalen OO, Johansen S. An empirical transition matrix for nonhomogeneous Markov chains based on censored observations. Scand J Stat. 1978; 5(3):141–50. Available from: http://www.jstor.org/stable/4615704.
 42
Benichou J, Gail MH. Estimates of absolute causespecific risk in cohort studies. Biometrics. 1990; 46(3):813–26. Available from: http://www.jstor.org/stable/2532098.
 43
Langholz B, Borgan Ø. Estimation of absolute risk from nested casecontrol data. Biometrics. 1997; 53(2):767–74.
 44
Wolkewitz M, Cooper BS, PalomarMartinez M, OlaecheaAstigarraga P, AlvarezLerma F, Schumacher M. Nested casecontrol studies in cohorts with competing events. Epidemiology. 2014; 25(1):122–5.
 45
Kovalchik S, Pfeiffer R. Populationbased absolute risk estimation with survey data. Lifetime Data Anal. 2014; 20(2):252–75.
 46
Jewell NP, Lei X, Ghani AC, Donnelly CA, Leung GM, Ho LM, et al. Nonparametric estimation of the case fatality ratio with competing risks data: an application to Severe Acute Respiratory Syndrome (SARS). Stat Med. 2007; 26(9):1982–98.
Acknowledgements
The authors thank Thomas Lumley for useful comments, Raffaella Franca and Marco Rabusin for the ALL data and Jeff Martin and Elvin Geng for the HIV data. PR was supported by the grant SIR RBSI14LOVD of the Italian Ministry of Education, University and Research. This research was partially supported by AIRC (201314634, MGV) and by NIH (U01 AI069911).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
PR performed the theoretical derivation, analyses and simulation studies and drafted the manuscript. LA, DVG, MGV contributed to the analyses and to the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Twophase design
 Competing risks
 Crude incidence
 Casecontrol
 Casecohort
 Missing data
 Subdistribution hazard