Crude incidence in two-phase designs in the presence of competing risks

Background In many studies, some information might not be available for the whole cohort, some covariates, or even the outcome, might be ascertained in selected subsamples. These studies are part of a broad category termed two-phase studies. Common examples include the nested case-control and the case-cohort designs. For two-phase studies, appropriate weighted survival estimates have been derived; however, no estimator of cumulative incidence accounting for competing events has been proposed. This is relevant in the presence of multiple types of events, where estimation of event type specific quantities are needed for evaluating outcome. Methods We develop a non parametric estimator of the cumulative incidence function of events accounting for possible competing events. It handles a general sampling design by weights derived from the sampling probabilities. The variance is derived from the influence function of the subdistribution hazard. Results The proposed method shows good performance in simulations. It is applied to estimate the crude incidence of relapse in childhood acute lymphoblastic leukemia in groups defined by a genotype not available for everyone in a cohort of nearly 2000 patients, where death due to toxicity acted as a competing event. In a second example the aim was to estimate engagement in care of a cohort of HIV patients in resource limited setting, where for some patients the outcome itself was missing due to lost to follow-up. A sampling based approach was used to identify outcome in a subsample of lost patients and to obtain a valid estimate of connection to care. Conclusions A valid estimator for cumulative incidence of events accounting for competing risks under a general sampling design from an infinite target population is derived.


Background
In many longitudinal studies, some information might not be measured/available for the whole cohort, in fact biomarkers/additional covariates, or even outcome, might be ascertained only in selected subsamples. These studies are part of a broad category termed two-phase studies [1], in fact they imply two sampling phases: the first one being usually a random sample from the target population, ending up in the entire cohort (phase I sample), and the second one applying some kind of sampling (e.g. efficient or of convenience) to collect additional information or the selection of subjects with no missing data and expensive new technologies make it infeasible to measure the biomarkers on the entire cohort. The Women's Health Initiative program, for example, stored serum and plasma from participants and used them for specialized studies [9]. Also the Cardiovascular Health Study collected DNA from most participants to study different genetic factors underlying cardiovascular or other diseases and only subsets of the cohort have been genotyped in different projects [10].
In our first motivating clinical example, the aim was to evaluate the role of different genetic polymorphisms on treatment failure due to relapse in childhood acute lymphoblastic leukemia (ALL) using clinical information and biological samples available from a clinical trial that enrolled nearly 2000 patients. In this situation, a parsimonious use of these specimens motivated the choice of an efficient/optimal two-phase sampling design [11]. We present also a further application where the aim was to estimate engagement to care of HIV patients in resource limited settings. Here the outcome itself was missing in a group of patients due to possibly informative loss to follow-up. The outcome was tracked in a random sample of those lost to follow-up to obtain a valid estimate of engagement to care [12].
For two-phase studies, appropriate weighted survival estimates have been derived, both in the presence of additional covariates measured in the second phase [7,13,14], as well as in cohorts where the outcome/follow-up is not available for everyone [15]. A Cox model adapted for twophase designs has also been derived [14]. However no estimator of cumulative incidence accounting for competing events in the general framework of two-phase designs has been proposed, while it has been developed for specific designs, such as nested case-control studies [16,17]. This is relevant in the presence of multiple types of events, such as relapse and (toxic) death in cancer patients, as in the motivating examples presented here, where estimation of event type specific quantities are needed for evaluating outcome.
The aim of this paper is to develop a non parametric estimator of the crude incidence of events accounting for possible competing events in the general framework of two-phase designs, where subgroups of analysis might be defined according to explanatory variables ascertained in the phase II sample, or the outcome itself assessed only in the second phase sample.
In the Methods section we propose a weighted crude incidence estimator for application in two-phase designs. The theoretical properties of the proposed method are derived in appendix and investigated through simulations under different scenarios, which results are reported in Results section. In this section we also report the examples on childhood ALL and HIV patients. Conclusions is dedicated to the discussion.

Notation and basics
Let T be the failure time variable and suppose there are K possible causes of failure denoted by ε = 1, 2, . . . K. Let the cause-specific hazard function of the k th event be: and as the probability that a failure due to cause k occurs by time t, that is the quantity that we aim to estimate. Define also S(t) = P(T > t) = 1 − k F k (t) as the probability of surviving from any cause of failure. A convenient representation of the crude incidence function (1) as product limit estimator naturally arises starting from the subdistribution hazard introduced by Gray [18] and defined as: This hazard has been shown to be very useful to compare the crude cumulative hazard functions in different groups, since it restores a one-to-one relationship between the hazard and the cumulative probability of a particular failure type: s)ds and where the product integral notation is used to suggest a limit of finite products [18,19]. Of note, the one-to-one relationship between the hazard and the cumulative probability is not satisfied from the cause-specific hazard in the presence of competing events [20]. The subdistribution hazard can be thought as the hazard of an artificial variable T * k = T · I {ε = k} + ∞ · I {ε = k} that extends to infinity the time to event k when another competing event is observed. In fact, for any finite t, T * k ≤ t is equivalent to T ≤ t and ε = k; thus, given definition (1), The definition of T * k is consistent with the argument that when an event other than k occurs as first, the latter will never be observed as first and thus the corresponding time is infinity. Let be N independent replicates of (T, ε, C, Z), where C is the censoring time and Z a vector of covariates. We will refer to these N subjects as the phase I sample. Define X = min(T, C) and = I(T ≤ C). We will assume that failure and censoring times are conditionally independent, where I(·) is the indicator function. Define G(t) = P(C > t) as the probability to remaining uncensored up to t.
Suppose that complete information on (X i , i ε i , Z i ) is available only for a subset n < N of subjects drawn based on a possibly complex sampling design and let ξ i indicate whether subject i is selected into this sample. We will refer to the n = i ξ i subjects as the phase II sample, even if multiple phases of sampling could actually be involved to obtain the final complete sample [7]. Let π i = P(ξ i = 1|X i , i ε i , Z i ) being the inclusion probability of subject i for the phase II sample, conditional on being selected at the first phase. In a random sample this probability is equal for every subject. However sampling is often stratified on some variables to increase efficiency; in this case, the probability to be selected for the phase II sample is common for all subjects in the same stratum and differs between strata. In particular, it is usually higher for the more informative strata (e.g. strata including subjects with the event of interest as in case-control studies). For nested case-control designs the sampling probability of cases will be 1, while the one of controls might be derived as the probability that individual i is ever selected as control, following Samuelsen [4]. We denote the pairwise sampling probability for any two subjects (i, j, As commonly assumed in survey theory, the sampling method should have the following properties: the sampling probabilities π i and π ij must be non zero for all i, j in the population and must be known for each i, j in the sample [7].

Incidence estimation in the presence of competing risks Overall survival/incidence estimate
Under a two-phase design it is common to be interested in estimating survival in subgroups related to variables ascertained only in phase II sample (i.e. biomarkers). Another possible situation is that, instead of covariates, the outcome itself is not available for the whole cohort. Thus, in both cases an estimate of the incidence of event using only the phase II sample is very useful. The total number of events of type k up to t and the total number of persons at risk at time t for the entire phase I sample can be estimated from the phase II sample (accounting for the sample design) byN , respectively. Note that these estimates are valid under general sampling designs, where π i and π ij , the so-called 'design weights' , are known for the observations actually sampled [21].
The estimate of the overall survival has been shown by several authors in different contexts of complex sampling [13][14][15]: where the overall hazard can be obtained byˆ [22]. It has been shown that √ N[ˆ (t) − (t)] converges weakly to a zero-mean Gaussian process [14,15].

Competing risk
The goal is to estimate the crude incidence of a given cause k, F k (t) = 1 − s≤t 1 − * k (ds) , using the phase II sample, which is also called subdistribution function and is the probability that a failure due to cause k occurs within t [23,24]. The estimate of * k (t) is based on the count of events due to cause k and the count of subjects at risk for T * k , denoted byŶ * ·k (s) (see Appendix A.1): The estimate of the cumulative subdistribution hazard in (2) can now be estimated, using only the phase II sample, by: * Note the complement of F k (t) can be thought as the survival probability of T * k [18,20,25], thus a product limit type estimator can be directly derived as: Interestingly, this estimator is algebraically equivalent to the Aalen-Johansen type estimator, shown by [18] for random sampling, and in the Appendix A.2 for general sampling: It is easy to see that in the absence of competing events, Y * ·k (s) in (4) degenerates to the usual risk setŶ · (s), thuŝ * k (t) =ˆ (t) andF k (t) equals the complement of 1 of the weighted Kaplan-Meier estimator for two-phase studies [13]. Under no censoring, the weightĜ(s − |X i ) becomes 1 and the risk set Y * ·k is eroded in time only by events of type k, thereforeF k (t) degenerates into the proportion of events of type k estimated by the phase II sample (weighted number of events of type k out of the estimated total size of the cohort, phase I). If every subject in phase I is sampled (ξ i = 1 ∀i), then (5) becomes the standard subdistribution cumulative hazard [19,20] and (6) the standard estimator of the crude incidence.
For simplicity of notation, in (6) we estimated the overall incidence regardless of covariates, but the estimator can also be applied on subgroups defined by Z. The censoring probability G(t) should also be estimated in subgroups defined by Z. The overall estimator is reasonable when we make the more restrictive assumption T ⊥ C, otherwise separate estimators conditional on Z would be more appropriate (and eventually an average, weighted on the frequencies of Z, between the conditional estimates).

Variance and confidence intervals
Following Breslow and Wellner [26], we can express represents the crude cumulative incidence estimator that we would have obtained if complete information (X i , i ε i , Z i ) was known for all the subjects in phase I sample (i = 1 . . . N) [18]. The two terms are asymptotically independent [14,26]. The first term converges weakly to a zero-mean Gaussian process [19] with covariance that we denote as σ 2 kI (t). By the arguments in Appendix A.3, the second term converges weakly to a zero-mean Gaussian process with covariance σ 2 kII (t). Hence, converges weakly to a zero-mean Gaussian process with covariance being the sum of the contribution of each sampling phase: The first one represents the irreducible minimum uncertainty that would remain if everyone in phase I would be sampled and the second one accounts for the fact that complete information is available only in the phase II sample [13,14,26].
Each contribution to the variance can be estimated by the influence function approach [27]. The influence function of an estimator describes how the estimator changes when single observations are added or removed from the data and has the property that the difference between the estimate and the population quantity can be expressed as the sum of influence functions over all the subjects in the sample. By denoting with z * ik (t) the influence function of subject i onˆ * k (t) we have that [28]: * The influence function of subject i onˆ * k (t) has been derived in Appendix A.4.
By using the Horvitz-Thomposon variance [29] on the weighted influence function, the contribution of the variance of phase II will be: For phase I, the varianceσ 2 kI (t) can also be estimated using (9) by setting sampling probabilities to 1 [13].
Given the one-to-one relationship between F k (t) and * k (t), the variance of the crude cumulative incidence (6) can now be estimated as: In analogy with the survival estimate for two-phase designs, we derived confidence intervals for (6) on the logarithm scale by: where q α/2 denotes the α/2 quantile of the standard Gaussian distribution.

Software
By using suitable weights for both study design and censoring, any software allowing for time dependent weights can be used to derive the modified risk set and to estimate the crude cumulative incidence function (6). These weights have been implemented in R in function crprep in the mstate package [30] and in STATA in the stcrprep function. However, any software can be used to derive the ingredients for the modified risk set in (4) and these can be used to estimate (6) and its variance by the Horvitz-Thompson approach.
The complete code to compute this estimate has been developed in R software [31] using the survey package [32] and is available at [33]. An example of the application of this function is given in the subsection Genotype ascertained on a subset of a clinical trial cohort.

Simulations protocol
We considered two competing events with independent latent times T 1 and T 2 and constant marginal hazard of 0.1, the crude incidences are then We focused on the crude incidence of event 1 up to t = 2 units of time (i.e. years). This implies a fraction of 82 % with no events at t = 2 (administrative censoring) and a crude incidence of about 9 %. The independence between the latent times T 1 and T 2 is not restrictive given the non identifiability issue [34]. The censoring time followed an uniform distribution on ranges (0.5,30.5) and (0.5,10.5), leading to around 5 % and 15 % censored before t = 2, respectively.
We drew B = 1000 random first-phase samples of size N = 1000, from which we sampled a phase II sample according to different study designs: 1. random sample, with n = 50, 100 units; 2. case-control sampling: we randomly sampled n/2 individuals among those who experienced event 1 (cases) up to time 2 and n/2 individuals among the others (controls), with n = 50, 100 units; 3. stratified sampling: we considered the phase I sample divided into 4 strata defined by the variable Z = {0, 1} (with frequencies 70 % and 30 % for Z = 0 and Z = 1, respectively) and the occurrence or not of event 1 up to time 2. The hazard rates were assumed to be 0.08 and 0.2 with Z = 0 and with Z = 1, respectively. An equal number of subjects (n/4) were sampled for each strata (balanced sampling), with n = 50, 100 units. 4. nested case-control design: we selected all cases and m controls for each case with no events at the time of event of the case, fixing m = 1, 2. Under this design we cannot fix a total sample size a priori, but we expect around 90 events and 90 · m controls. Sampling probabilities for each included subject were derived according to Samuelsen [4].
B = 1000 was chosen in order to get a ±5 % level of accuracy in the estimate of the crude incidence (F 1 (t), t > 0.3) in about 95 % of the samples. For each sample,F 1 (t) has been computed by (6), withˆ * 1 (dt) estimated by (5), and it has been compared with F 1 (t) in order to assess bias in each sample: Bias has been computed and reported for 20 different time points t = 0.1, 0.2, . . . , 2. For each simulation, we also computed standard error ofF 1 (t) according to (10) and the 95 % confidence interval (CI) ofF 1 (t) on the logarithm scale (11) to evaluate coverage and length. Figure 1 compares the average of the estimated standard error ofF 1 (t) in each simulation with the empirical standard error at the 20 different times of observation in the four different scenarios with random censoring of 15 %. They were found to be very close in all scenarios, as expected. Figure 2 reports the distribution of bias in each one of the 1000 simulated samples under random (panel a), case-control (panel b), stratified (panel c) and nested casecontrol sampling (m = 1, panel d). Bias fluctuates around 0 in each scenario, but it has more variability in the random sampling compared to other scenarios, resulting also in a higher mean value of absolute bias over the B simulations (still always lower than 0.2 %). The lower performance of the estimator in the first scenario is due to the fact that random sampling is not a convenient design in the simulated setting. In fact, in phase I cohort we expect around 90 events of type 1 (incidence of 9 % and sample size of 1000), thus if we randomly sample 100 subjects from the phase I cohort, we expect to observe only 9 events (in phase II sample). With such a small number of events, unbiasedness is in fact not sufficient to ensure a reasonable behaviour and to get enough information on event incidence. To address this issue, the other study designs (scenarios 2, 3, and 4) are indeed thought to guarantee to sample more events of type 1. Thus we recommend adopting efficient designs accounting for the event of interest. Relative and standardized biases were always lower than 6 % (data not shown). The mean square error, not shown, slightly increases with time, in fact variability is increasing in time (as confirmed by the empirical standard error ofF 1 (t) and by Fig. 2). The average length of the confidence interval was consistently increasing with time, ranging between 7 % and 12 % in the random sampling and between 2 % and 4 % in the case-control, stratified and nested case-control sampling (data not shown). This comparison underscores the advantages of a careful selection of the subsample. Figure 3 reports results on coverage for the random, case-control, stratified and nested case-control sampling. The coverage was very close to the nominal value of 95 %, ranging mostly within a minimum of 94 % and a maximum of 97 %, except for very early times in the random setting.

Simulations results
In the same setting, we also considered a longer followup time, t = 50, with around 500 events of type 1 expected in phase I sample (under no censoring), and confirmed the performance of our estimator in a scenario with higher variability, with similar results for the different sampling schemes (data not shown).

Motivating examples Genotype ascertained on a subset of a clinical trial cohort
A study on childhood ALL evaluated the role of a genetic polymorphism (glutathione S-transferase-θ, GST-T1) on treatment failure due to relapse (in different sites), in the presence of a competing event (toxic death). GST-T1 is a common genetic polymorphism in Caucasians, with 13-26 % of individuals displaying a homozygous deletion of the gene (null genotype). Subjects carrying the null variants fail to express the GST-T1 enzyme, that is involved in drug metabolism. Clinical information were available for a cohort of 1999 consecutive patients (mainly European Caucasians, aged between 1 and 17 years, median age: 5 years) newly diagnosed with ALL in the Italian Associazione Italiana di Ematologia Pediatrica centers between September 2000 and July 2006. Biological samples stored at diagnosis were available, but genotype was ascertained only in a subgroup (phase II sample) for an efficient use of specimens [11,13]. The interest was to evaluate incidence at different relapse sites by GST-T1, that can only be estimated using phase II data. In order to select the subgroup to be genotyped we adopted an optimal strategy that is carefully described in [11,13]. Briefly, sampling was done after classifying patients into 6 strata according to the event of interest (relapse/no relapse) and to 3 groups, defined by prognostic features in the treatment protocol, that modulate the intensity of treatment, we will call them  1, panel d). For the first three scenarios a sample size of 100 was used, while in the nested case-control sampling a mean of 180 subjects was considered. Data were subject to a random censoring of 15 % (plus administrative censoring). Dashed lines represent the main bisector corresponding to the equality between estimated and empirical reference treatment protocols (Table 1). Strata were not defined based on the competing event death due to toxicity-58 events -for efficiency reasons given that the event of interest was relapse. Patients were sampled at random without replacement from the 6 strata, with the sampling from each stratum conducted independently (stratified sampling) and with higher probability in the more informative strata according to an optimal design [13]. The full cohort of 1999 patients represents the phase I sample, for which clinical information are available, while genotype is ascertained in the phase II sample only (n = 601).
Relapses were classified according to the site, in particular we distinguished relapses involving bone-marrow (BM) from the others (extramedullary). We estimated the crude incidence of BM relapse by GST-T1 deletion using (6) and (10) and found higher relapse incidence for patients with GST-T1 deletion, with 5-year crude incidence of 19 Fig. 4 panel a).
This was derived accounting for the competing risk of other sites of relapse as well as for death due to toxicity.
We report here the R code used to compute these estimates: library(survey) d.std<-twophase(id=list(˜upn,˜upn), subset=˜!is.na(GST_T), strata=list (NULL,˜interaction(rel,elfin)),data=dat) GSTse<-svycr(Surv(time,event>0)˜GST_T, etype="BMrelapse",d.std,se=TRUE) The twophase function in the survey package describes the design and produces a survey object [32]. The svycr function, available at [33], performs the estimate of crude incidence by the influence approach and uses 3 variables: time is the time of event, event the censoring indicator (1 if an event of any type is observed and 0 otherwise) and BMrelapse indicates whether a BM relapse is observed or not. Details on the survey package can be found in [7,32].  1, panel d). For the first three scenarios a sample size of 100 was used, while in the nested case-control sampling a mean of 180 subjects was considered. Data were subject to a random censoring of 15 % (over administrative censoring). The box represent the first and third quartile, the black line the median and the empty dots represent outliers defined as bias more than 1.5 times the interquartile range above the third quartile (or more than 1.5 times the interquartile range below the first quartile). Dotted lines represent the reference for bias (bias equal to 0) The right panel of Fig. 4 represents the incidence of extramedullary relapses by GST-T1 showing that the difference in relapse incidence between GST-T1 deleted and other patients is mainly due to relapse involving the BM, that represents the most relevant type of relapse in childhood ALL. A Cox model adapted for two-phase design [14], when applied to the cause specific hazard of BM relapse, gives an hazard ratio (HR) of 1.53 (95 % CI 0.98-2.37) for GST-T1 deleted patients versus non deleted; after adjusting for relevant factors (treatment protocol, gender, age), the HR dropped to 1.38 (95 % CI 0.90-2.13). For extramedullary relapses the HR was 1.22 (95 % CI 0.60-2.49). Of note, in order to compare patients with and without deletion of the GST-T1 gene, we used a cause-specific model, thus we actually compared the cause-specific hazard of relapse. In fact, a subdistribution model accounting for the two-phase design is not available. This would be useful to compare the actual incidences of relapse in the two groups, however the cause-specific model is still very useful to address the impact of the genotype on relapse by an aetiological point of view.

Outcome ascertained on a subset of patients lost to follow-up
In the evaluation of the effectiveness of the global effort to provide antiretroviral therapy (ART) for HIV-infected patients in resource limited settings, the estimate of the number of patients who continue to access care after starting ART is essential. This estimate is hampered, however, by the fact that some patients die shortly after their last visit to clinic -a group of individuals who cannot be considered as "stopping care" nor censored for the event of stopping care. In addition, the number of patients who are starting care is large and a high fraction have unknown outcomes (i.e., are lost to follow-up), generating informative censoring. Given that lost patients could reasonably be not in care, but they could also have changed clinic or  1, panel d). For the first three scenarios a sample size of 100 was used, while in the nested case control sampling a mean of 180 subjects was considered. Data were subject to a random censoring of 15 % (over administrative censoring). Dotted lines represent the reference for CI coverage (nominal 95 %) be dead, one approach to obtain outcomes estimates has been to identify a numerically small, but random, sample of those who are lost [15], intensively seeking their outcomes, and using them to correct outcomes among the lost. To illustrate, a cohort of 13,321 HIV-infected adult patients, who initiated ART treatment, were followed from ART initiation to either death, disengagement or administrative database closure (see Fig. 5). Among them, 2451 patients were lost to follow-up [35], defined as not being seen at the clinic for at least 90 days (after the last return visit). A tracker went into the community to determine the outcome of a random subsample of 428 among the 2451 lost patients and got information on 306 patients (110 patient were found to be in care in other clinics, 80 died while in care and 116 were found to be not treated/disengaged) [12]. The 10,870 patients no lost to follow-up and the 306 tracked patients can been considered as the second phase sample of the whole cohort, stratified on lost to follow-up.
We used the methods developed in the Methods section to estimate crude cumulative incidence, where the 306 tracked patients represented the 2451 lost patients by the Fig. 4 Crude cumulative incidence of relapse in ALL data. Panel a reports the crude cumulative incidence estimate of relapse involving the bone marrow in patients with normal (estimate and confidence limits reported with solid lines) and deleted (dashed lines) GST-T1 gene. Panel b reports the crude cumulative incidence estimate of relapse in extramedullary sites in patients with normal (solid lines) and deleted (dashed lines) GST-T1 gene sampling probability 306/2451, while the other 10,870 had sampling weight one. The crude incidence estimate of disengagement is reported in Fig. 6, the curve starts to rise after 90 days from ART start, that is the earliest possible time of disengagement, by definition. At 1 year, disengagement resulted 6.8 % (CI 95 % 5.7-8.2 %). This was subject to a strong influence of the competing event death in care that resulted 7.7 % (CI 95 % 6.8-8.9 %) at 1 year. A naïve (but less expensive) approach to deal with informative censoring would be to treat all lost patients as events or, contrarily, as censored observations. We plotted the two corresponding curves in Fig. 6, obtaining estimates of crude incidence at 1 year since ART treatment of 18.5 % and 0.2 %, respectively. We can consider that the true incidence will lie between these two estimates (that are however quite far in this context), as in fact it does

Availability of supporting data
The R code to compute the proposed estimate of crude cumulative incidence is available at [33]. The results of simulations presented in the Simulations section and the related code are also available at [33].

Conclusions
We have derived an estimator for cumulative incidence of events based on the subdistribution hazard accounting for competing risks under a general sampling design from an infinite target population. The estimator shows good performance in simulations under different scenarios and the variance, derived by the influence function of the subdistribution hazard and the Horvitz-Thompson theory, was very close to the empirical variance, therefore we expect it to be very close to the one obtainable by replicate weights (e.g. bootstrap) [7]. Confidence intervals, derived on the log scale, provided good coverage in simulations, but alternative confidence intervals might also be considered such as the complementary log-log transformation [36]. The proposed estimator was used to estimate incidence of relapse by genotype in a cohort of childhood ALL patients, where the genotype was ascertained only on a subsample of the cohort chosen by an optimal sampling approach based on relapse as the event of interest. Interestingly, we can also analyze the incidence of the competing event (toxic death) or of the combined endpoint (relapse or toxic death), but the efficiency could be lower unless the subsampling is adapted to this new endpoint by including a further strata on toxic death in the sampling process. This is particularly important since toxic death is a rare event in this context. We should also remember to avoid random sampling when the event of interest is rare, as discussed in the Simulations section.
In the second case, we dealt with a missing data problem, in which the outcome itself was not available for everybody, since some patients were lost to follow-up. A subsample of lost patients had been tracked to ascertain the outcome, but if this tracking was not possible, a more basic/naïve approach to deal with this informative censoring could have been to identify the variables affecting missingness, post-stratify the sample in homogeneous strata and use the missing probabilities for each strata as sort of sampling weights to adjust the incidence estimate. This approach would make an important assumption of missing at random that might be not appropriate and cannot be tested [7,15,21,37]. However this underlines how the proposed estimator could be applied also in the presence of missing data.
The code to compute this estimate has been developed in R software [31] under the survey package [32] and is available at [33]. The survey package is a flexible package for complex surveys including also two-phase studies. It provides flexible functions to describe the design of the study and to derive sampling fraction accordingly. The package includes functions to estimate survival and to perform a weighted Cox model with standard error properly adjusted for the design and with the possibility to use general weights, as calibrated weights. Our function takes advantage of the facilities of the package (see an example of use in Genotype ascertained on a subset of a clinical trial cohort).
In order to recover the representativeness of the subcohort (phase II) for the entire cohort, we used weights related to the inverse of the probability to be sampled, similarly to the weights of Barlow for case-cohort studies [38]. More general weights can be used, such as calibration weights [7,39,40]. The use of calibration weights is advantageous when there is availability of phase I variables that are strongly related to the additional variables ascertained in phase II. This would provide results more representative of phase I data and increase precision. When phase II variables are common genetic polymorphisms, as in our first example, it is unlikely to find any strong relation between phase I and II variables, therefore no big advantage would be expected by calibration.
The estimator can also be extended to a situation where an individual may move among a finite number of states to estimate the Aalen-Johansen probabilities of transition among each state in a multistate framework [41] in the presence of general sampling design.
In order to derive a model-based estimate of incidence (adjusted for possible covariates) two main approaches have been followed in the context of competing risk, the first one based on the cause-specific hazard inspired by Benichou and Gail [16,42,43] and the other one based on the subdistribution hazard [19,44]. The crude cumulative estimator developed by Kovalchik and Pfeiffer [45] for two-phase studies for finite population follows the first approach, and the Cox model for two-phase designs [14] could be used to extend it for infinite population. Under this model we can estimate the effect of a covariate on the cause-specific hazard to address its impact on the event by an aetiological point of view. However it is well known that this does not reflect the impact of the variable on the crude cumulative incidence. The latter effect, even if affected by the incidence of the other competing events, could still be of interest for a public health prospective. Future work will concern the development of a regression model to assess the effect of a covariate on the crude cumulative incidence. The Fine and Gray regression model [19] could be extended to complex sampling by weighting the estimating function of the parameter of interest and working out their influence function.

A Appendix
A.1 Derivation of the risk set for T * k The risk set for the usual survival time T at s is commonly obtained in standard analysis by counting the observed times greater then s. It can be also written as: (12) whereĜ(s) is the probability to be free of censoring up to s and is estimated considering censored observations as events and viceversa according to (3). This can proved also in the case of a two-phase design by the following: whereN c denotes the number of censoring up to time t andN ·· (t) = n i=1N i· (t) the total count of events observed up to time t.
The derivation of the risk set for T * k at s can be obtained by: By writingF l (s − ) as empirical cumulative distribution functionF l (s − ) = 1 [25,46], the risk set becomes: that can be simplified to that is equivalent to (4). The first summation estimates the usual total number of subjects at risk at s, where the condition X i = min(T i , C i ) ≥ s is satisfied. This in fact implies min(T * ki , C i ) ≥ s, i.e. being at risk for T implies being also at risk for T * k . The second summation estimates the number of subjects who had other events before s, satisfying the condition X i = min(T i , C i ) < s, i = 1 and ε i = k which implies T * ki = ∞ > s and completes the number at risk at s for T * k . While the first part is exposed to censoring, the contribute of each subject observed to fail of cause l = k, l =k N il (s − ), would remain equal to 1 up to ∞, thus ignoring possible censoring, given that C i is (usually) not observable if T i < C i . A possible way to deal with this inconsistency is to mimic the presence of random censoring acting on the infinite times, by weighting the unitary . This weight assumes value 1 before X i and decreases afterword according to the censoring distribution.
Of note, an alternative expression forŶ * ·k (s) derives substitutingŶ · (0)Ĝ(s − ) =Ŷ · (s) S(s − ) from (12) in (14): This shows asŶ · (s) is upweighted by a multiplier that gets greater as the action of competing events gets larger, accounting for the fact that subjects that experienced events of type l = k will never experience event k as first.

A.2 Proof of equivalence (7)
The equality between the cumulative incidence of the artificial variable T * in (6) and the Aalen-Johansen type estimator (that for the purpose of this proof will be denoted asF AJ k (t)) holds true if and only if, ∀t: which can be proved by induction. Both quantities are step functions, changing value at each occurrence of type k events. At the time t where the first event of type k is observed,ˆ k (dt) =ˆ * k (dt) being from (4)Ŷ · (t) =Ŷ * ·k (t). If this was the first event overall, thenŜ(0) = 1 and Eq. 18 Now, assuming that (18) holds true for a given t − , this impliesF AJ k (t) =F k (t) if and only if, from (18): and using the equality at t − : That is proved by (17).

A.3 Weak convergence ofˆ * k (t)
Lin showed that the normalised Horvitz-Thompson estimators of the number of events ) are asymptotically multivariate zero-mean normal [14]. Firstly, we concentrate on the normalised Horvitz-Thompson estimators of the modified at risk process: The first term represents the estimator of the number at risk, that Lin showed to be asymptotically multivariate zero-mean normal [14]. We concentrate now on the second term: that using Lemma 1 in [14] also converges to a zero-mean Gaussian process. We want to prove that also converges to a zero-mean normal, where˜ * k (t) represents the crude cumulative incidence estimator that we would have obtained if complete information (X i , i ε i , Z i ) was known for all the subjects in phase I sample (i = 1 . . . N) [18]. Fine and Gray proved that the first term converges weakly to a zero-mean Gaussian process [19].
The second term results: It then follows that also √ N ˆ * k (t) −˜ * k (t) converges weakly to a zero-mean Gaussian process.

A.4 Influence function forˆ * k (t)
The estimatorˆ * k (t) can be expressed as a differentiable function g of the estimated total number of events of type k and total number at risk for T * up to t: * where w i = 1/π i and ξ i indicates whether subject i is withdrawn in the phase II sample. The difference between the true and estimated cumulative hazard can be expressed as a sum of influence functions: is the influence function of the i th subject. Demnati and Rao [27] proved that we can express the influence function of subject i as z * ik (t) = ∂g N ·k (dt),Ŷ * ·k (t) ∂w i . The influence function ofˆ * k (t) of the i th subject can thus be derived as: