### Notations

Let *T* denote the random variable for the time to event. Assume there are *n* clusters in the study, and the *i*th cluster (*i* = 1, 2, ⋯, *n*) has *j* = 1, 2, ⋯, *n*
_{
i
}observations. In the HR-HPV example, *T* is the time to clearance of an HR-HPV infection, and each participant is a cluster. For a person with single genotype HR-HPV infection, *n*
_{
i
}= 1; while if a person has multiple HR-HPV genotype infections, *n*
_{
i
}> 1, and the clearance of these multiple infections is not independent. For the *j*th HR-HPV genotype infection (*j* = 1, 2, ⋯, *n*
_{
i
}) on the *i*th participant, *t*
_{
ij
}is the actual clearance time which is not exactly observed. Let (*a*
_{
ij
}, *b*
_{
ij
}] denote the interval where the clearance occurs, i.e. *a*
_{
ij
}<*t*
_{
ij
}≤ *b*
_{
ij
}. If a prevalent infection of genotype *j* for person *i* is observed to be cleared at first follow-up, then *a*
_{
ij
}= 0, and *t*
_{
ij
}≤ *b*
_{
ij
}= 6 months, which corresponds to a left censored observation; if it is never cleared during the trial period, then *a*
_{
ij
}= 24 months (the last follow-up time), and *a*
_{
ij
}<*t*
_{
ij
}< ∞ (the upper bound *b*
_{
ij
}can be viewed as ∞), which is a right censored observation.

Suppose there are *p* explanatory variables, which can include the variable indicating treatment arm and other covariates that may be of interest, and let **x**
_{
ij
}denote the vector of explanatory variables for the *j*th infection in person *i*. Without loss of generality, we only consider one explanatory variable for treatment arm, and *x*
_{
ij
}= 1 denotes intervention and *x*
_{
ij
}= 0 denotes control. Note that treatment arm for the fourteen HR-HPV genotypes is identical for each participant, although in general applications, **x**
_{
ij
}can be different for different observations belonging to the same cluster.

### Parametric Proportional Hazards Frailty Model

Let *h*(*t*) denote the hazard function representing the "hazard" (i.e. the instantaneous rate) of clearance at time *t*. To examine the treatment effect on clearance, we can model the hazard to be a function of the explanatory variables, while at the same time including a random effect to account for the correlation between the multiple HR-HPV genotype infections on such infected participants. For example, as in [7], for clearance of the *j*th infection on person *i*, we can use the conditional proportional hazards model (or semiparametric frailty model):

Without the ξ_{
i
}term, expression 1 is just the ordinary Cox proportional hazards model, where *h*
_{
ij
}(*t;* **x**) is the "hazard" for clearance of the *j*th genotype infection in person *i*, *h*
_{0}(*t*) is the baseline hazard for a person with all explanatory variables being zero. The *ξ*
_{
i
}term in expression 1 represents a random effect realized on the *i*th person. It is assumed to follow a prior distribution, such as a normal distribution, or equivalently exp(*ξ*
_{
i
}) ~ a log-normal distribution. The use of the normal random effect *ξ*
_{
i
}on *t* (through the hazard function) is one way of introducing correlation within the *i*th cluster, and is similar to that in linear mixed effects or generalized linear mixed effects models. Because the same realization value of *ξ*
_{
i
}(from its prior distribution) is shared by observations on the multiple HR-HPV genotype infections within the *i*th person (thus its' subscript does not depend on *j* which indexes genotype), it is therefore taken into account of the dependence between the clearance times of these multiple HR-HPV infections. One advantage of using normally distributed random effect is that more complicated correlation structures between observations, such as multi-level correlations, can be handled naturally by extending the use of a univariate normal random effect to multivariate normal random effects [11, 12].

Direct estimation of the coefficient parameters from model 1 by maximizing the likelihood may not be available with standard software such as SAS. To reduce the computation complexity, we further assume a parametric form on the clearance time for participants in the control arm, for example, conditional on the random effect *ξ*for a person in the control arm, let time to clearance of any HR-HPV infection have a Weibull distribution. That is *h*
_{0}(*t*) = *γα*
_{0}
*t*
^{γ-1}, where *γ*(*γ* > 0) and *α*
_{0} are the shape and scale parameters respectively for the Weibull distribution. The hazard function for Weibull distribution is a monotone function of *t* [13] (Figure 1a). The Gompertz distribution introduced for describing human mortality [4] is another parameteric distribution that has the proportional hazards property, and can be considered in the proportional hazards frailty model 1 .

Plugging *h*
_{0}(*t*) into expression 1, for the *j*th HR-HPV infection in person *i* with explanatory variables *x*
_{
ij
}, *j* = 1, 2, ⋯, *n*
_{
i
}, *i* = 1, 2, ⋯, *n*, the Weibull frailty model for clearance is:

where the random frailty effect is assumed to follow a normal distribution with zero mean (i.e. exp (*ξ*
_{
i
}) ~log-normal distribution). Model 2 implies that conditional on the random frailty *ξ*
_{
i
}, the clearance of any HR-HPV infection follows a Weibull distribution with shape parameter *γ* and scale parameter , where circumcision's effect on clearance is quantified by its' coefficient parameter *β*. *ξ*
_{
i
}is the random effect shared by the multiple infections within the same person. Given a value of *ξ*
_{
i
}, the hazard ratio of HR-HPV clearance between the intervention and control arm is exp (*β*). Therefore, if *β* estimated from the data is significantly larger than 0, then exp(*β*) > 1, indicating the instantaneous clearance rate is larger if a participant was circumcised than not circumcised.

### Parametric Accelerated Failure Time Frailty Model

The Weibull frailty model given in 2 can be equivalently expressed in terms of the survivor function of the Weibull distribution as:

where *S*
_{0}(*t*) is the baseline survivor function of the conditional Weibull distribution, that is, *S*
_{0}(*t*) = exp(-*α*
_{0}
*t*
^{γ}), where *α*
_{0} = exp(*β*
_{0}). Thus we have .

Therefore, for a specific participant, the clearance process for an uncircumcised man is e^{β}times of the clearance process if the man was circumcised, implying circumcision accelerates HR-HPV clearance with a factor of exp(
*β*).

Conditional on the random effect, expression 4 corresponding to Weibull distribution in fact belongs to the family of parametric accelerated failure time (AFT) models [4]. For clustered survival data, the *general* AFT model with normal random frailty effect can be written as:

where *ξ*
_{
i
}~Normal (0, *σ*
^{2}), *S*
_{0}(·) is the baseline survivor function of a parametric survival distribution, such as Weibull distribution, log-normal distribution, generalized gamma distribution, log-logistic distribution, generalized *F* distribution [13], and inverse Gaussian distribution [4]. Compared to parametric proportional hazards models where only few distributions have the proportional hazard property, this family of models comprehends a broader class of parametric survival distributions and allows for more flexibility on the shape of the conditional hazard function. It can be shown that AFT models can be expressed in log-linear model form [4], where it is easy to see that the predictor variables act additively on the logarithm of the survival time *T*, or equivalently multiplicatively on *T* itself. Conditional on a given value of the frailty effect *ξ*
_{
i
}in the AFT model 5, exp (*β*) has the interpretation of the ratio of the median (or any percentile) survival times between intervention and control arms. Moreover, Additional file 1 shows that the AFT frailty model also provides a marginal interpretation of the treatment effect, where -*β* is the average log ratio of the clearance times between intervention and control arms, i.e. .

It is important to notice that, however, except for the Weibull distribution (including the exponential distribution), other aforementioned distributions for AFT frailty models do not have the proportional hazards property and thus cannot be modeled as a proportional hazards frailty model given in 1, exp (*β*) consequently does not have the interpretation of conditional hazard ratio.

For the HR-HPV data, the proportional hazard frailty model 2 with Weibull distribution assumption implicitly imposes a constant instant clearance rate ratio (conditional on the random effect) over time between intervention arm and control arm (Figure 1b). The clearance rate ratio may however change over time. From Table 1, the majority of infections had cleared by year 2 in either arm, thus the clearance rate ratio should be close to 1; whereas the rate ratio by month 6 may be different from 1. The log-logistic distribution was also fit for the survivor function *S*(*t*|*ξ*
_{
i
}) in the AFT frailty model 5. The hazard of log-logistic distribution is a unimodal function of *t* when its shape parameter is larger than 1 (Figure 1a), which may better capture the natural history of HPV infection. It also allows the clearance rate ratio between arms to change over time(Figure 1b). The log-logistic AFT random frailty model is:

where *a*
_{
ij
}= exp(*β*
_{0} + *x*
_{
ij
}β + *ξ*
_{
i
}), and *ξ*
_{
i
}~Normal (0, *σ*
^{2}).

### Estimation of model parameters

With parametric assumptions in model 1 or model 5, parameters (including fixed effects *β* and the variance of the random effect *σ*
^{2}) can be estimated using the maximum likelihood method. Recall that for the *j*th infection in the *i*th participant, the clearance time is observed to be in the interval of (*a*
_{
ij
}, *b*
_{
ij
}], where *a*
_{
ij
}= 0 for left censored observations and *b*
_{
ij
}= ∞ for right censored observations. For the *n*
_{
i
}genotype HR-HPV infections in person *i*, their clearances can either be left, right, or interval censored. Without loss of generality, let *l*
_{
i
}denote the number of left censored observations and *r*
_{
i
}denote the number of right censored observations, and hence *n*
_{
i
}-*l*
_{
i
}-*r*
_{
i
}is the number of interval censored observations on person *i*. Thus the full likelihood on all participants can be written as:

where *S*
_{
ij
}(*t*|*ξ*
_{
i
}) is the conditional survivor function for the *j*th observation in person *i*, and given *ξ*
_{
i
}, the conditional survivor functions corresponding to the multiple infections within person *i* are independent. is the density function for the normal random effect *ξ*
_{
i
}. The full likelihood is then obtained by integrating over all the possible values of the random effect.

The maximum likelihood estimate (MLE), denoted as and , can be attained by maximizing the likelihood in expression 7 using iterative procedures, and the variance-covariance matrix is estimated as the inverse Hessian matrix. To test the significance of the explanatory variable, i.e. *H*
_{0} : *β*
_{
k
}= 0, the likelihood ratio (LR) statistic (difference in -2Log(L) between the hierarchical models containing and not containing the variable) can be used by comparing it to a *χ*
^{2} distribution with 1 degree of freedom [14]. Alternatively, Wald type inference can be drawn with its MLE and variance. For the random effect parameter *σ*
^{2}, the null hypothesis *H*
_{0}: *σ*
^{2} = 0 corresponds to the situation where all of the observations within a cluster are independent, and thus may be of interest of testing. However, since *σ*
^{2} ≥ 0, = 0 is the boundary of the parameter space of *σ*
^{2}. [15] showed that the asymptotic distribution of the LR statistic (comparing between the models containing and not containing the random effect) is a 50:50 mixture of the and distribution. Thus a simple rule of computing P-value for testing *H*
_{0}: *σ*
^{2} = 0 is that if the LR statistic is 0, then *P* = 1; otherwise, the P-value is half of the P-value obtained from comparing the LR statistic to distribution [16].

Since we assume a normal prior on the random frailty effect, the likelihood function normally does not have a close analytical form, although SAS PROC NLMIXED can numerically compute the integrals and maximize the approximated likelihood. The procedure provides commonly estimated statistics such as the MLE, Wald-type confidence intervals, and -2log-likelihood. In order to perform likelihood ratio inference for a variable, the hierarchical models with and without the variable have to be estimated respectively. The SAS code for analyzing the HR-HPV clearance data using model 2 is listed in Additional file 2 as an illustration, and relevant computation details are also discussed. The format of the input dataset and some useful options needed when calling the procedure are also described.