Simulating recurrent event data with hazard functions defined on a total time scale

Background In medical studies with recurrent event data a total time scale perspective is often needed to adequately reflect disease mechanisms. This means that the hazard process is defined on the time since some starting point, e.g. the beginning of some disease, in contrast to a gap time scale where the hazard process restarts after each event. While techniques such as the Andersen-Gill model have been developed for analyzing data from a total time perspective, techniques for the simulation of such data, e.g. for sample size planning, have not been investigated so far. Methods We have derived a simulation algorithm covering the Andersen-Gill model that can be used for sample size planning in clinical trials as well as the investigation of modeling techniques. Specifically, we allow for fixed and/or random covariates and an arbitrary hazard function defined on a total time scale. Furthermore we take into account that individuals may be temporarily insusceptible to a recurrent incidence of the event. The methods are based on conditional distributions of the inter-event times conditional on the total time of the preceeding event or study start. Closed form solutions are provided for common distributions. The derived methods have been implemented in a readily accessible R script. Results The proposed techniques are illustrated by planning the sample size for a clinical trial with complex recurrent event data. The required sample size is shown to be affected not only by censoring and intra-patient correlation, but also by the presence of risk-free intervals. This demonstrates the need for a simulation algorithm that particularly allows for complex study designs where no analytical sample size formulas might exist. Conclusions The derived simulation algorithm is seen to be useful for the simulation of recurrent event data that follow an Andersen-Gill model. Next to the use of a total time scale, it allows for intra-patient correlation and risk-free intervals as are often observed in clinical trial data. Its application therefore allows the simulation of data that closely resemble real settings and thus can improve the use of simulation studies for designing and analysing studies. Electronic supplementary material The online version of this article (doi:10.1186/s12874-015-0005-2) contains supplementary material, which is available to authorized users.

where X i defines the covariate vector and β the regression coefficient vector. λ 0 (t) denotes the baseline hazard, being a function of the total/calendar time t, and Y i (t) the predictable process that equals one as long as individual i is under observation and at risk for experiencing events. Z i denotes the frailty variable with (Z i ) i iid with E(Z i ) = 1 and V ar(Z i ) = θ. The parameter θ describes the degree of between-subject-heterogeneity. Data output is in the counting process format. simrec(N, fu.min, fu.max, cens.prob = 0, dist.x = "binomial", par.x = 0, beta = 0, dist.z = "gamma", par.z = 0, dist.rec, par.rec, pfree = 0, dfree = 0) cens.prob Gives the probability of being censored due to loss to follow-up before fu.max. For a random set of individuals defined by a B(N,cens.prob)-distribution, the time to censoring is generated from a uniform distribution on [0, fu.max]. Default is cens.prob=0, i.e. no censoring due to loss to follow-up.
dist.x Distribution of the covariate(s) X. If there is more than one covariate, dist.x must be a vector of distributions with one entry for each covariate. Possible values are "binomial" and "normal", default is dist.x="binomial".
par.x Parameters of the covariate distribution(s). For "binomial", par.x is the probability for x = 1. For "normal", par.x=c(µ, σ) where µ is the mean and σ is the standard deviation of a normal distribution. If one of the covariates is defined to be normally distributed, par.x must be a list, e.g. dist.
beta Regression coefficient(s) for the covariate(s) x. If there is more than one covariate, beta must be a vector of coefficients with one entry for each covariate. simrec generates as many covariates as there are entries in beta. Default is beta=0, corresponding to no effect of the covariate x.
dist.z Distribution of the frailty variable Z with E(Z) = 1 and V ar(Z) = θ. Possible values are "gamma" for a Gamma distributed frailty and "lognormal" for a lognormal distributed frailty. Default is dist.z="gamma".
par.z Parameter θ for the frailty distribution: this parameter gives the variance of the frailty variable Z. Default is par.z=0, which causes Z ≡ 1, i.e. no frailty effect.
dist.rec Form of the baseline hazard function. Possible values are "weibull" or "gompertz" or "lognormal".
par.rec Parameters for the distribution of the event data. If dist.rec="weibull" the hazard function is where λ > 0 is the scale and ν > 0 is the shape parameter. Then par.rec=c(λ, ν). A special case of this is the exponential distribution for ν = 1. If dist.rec="gompertz", the hazard function is where λ > 0 is the scale and α ∈ (−∞, +∞) is the shape parameter. Then par.rec=c(λ, α). If dist.rec="lognormal", the hazard function is where φ is the probability density function and Φ is the cumulative distribution function of the standard normal distribution, µ ∈ (−∞, +∞) is a location parameter and σ > 0 is a shape parameter. Then par.rec=c(µ, σ). Please note, that specifying dist.rec="lognormal" together with some covariates does not specify the usual lognormal model (with covariates specified as effects on the parameters of the lognormal distribution resulting in non-proportional hazards), but only defines the baseline hazard and incorporates covariate effects using the proportional hazard assumtion.
pfree Probability that after experiencing an event the individual is not at risk for experiencing further events for a length of dfree time units. Default is pfree=0.
dfree Length of the risk-free interval. Must be in the same time unit as fu.max. Default is dfree=0, i.e. the individual is continously at risk for experiencing events until end of follow-up.

Output
The output is a data.frame consisting of the columns: id An integer number for identification of each individual x or x.V1, x.V2, ... -depending on the covariate matrix. Contains the randomly generated value of the covariate(s) X for each individual.
z Contains the randomly generated value of the frailty variable Z for each individual.
start The start of interval [start, stop], when the individual starts to be at risk for a next event. For each individual there are as many lines as it experiences events, plus one line if being censored. The data format corresponds to the counting process format.

Details
Data are simulated by extending the methods proposed by Bender et al [2] to the multiplicative intensity model.