Parametric frailty models for clustered data with arbitrary censoring: application to effect of male circumcision on HPV clearance

Background In epidemiological studies, subjects are often followed for a period during which study outcomes are measured at selected time points, such as by diagnostic testing performed on biological samples collected at each visit. Although test results may indicate the presence or absence of a disease or condition, they cannot provide information on when exactly it occurred. Such study designs generate arbitrarily censored time-to-event data, which can include left, interval and right censoring. Adding to this complexity, the data may be clustered such that observations within the same cluster are not independent, such as time to recovery of an infectious disease of family or community members. This data structure is observed when evaluating circumcision's effect on clearance of penile high risk human papillomavirus (HR-HPV) infections using data collected from the male circumcision(MC) trial conducted in Rakai, Uganda, where the multiple infections within individual and HPV testings performed at trial follow-up visits gave rise to the clustered data with arbitrary censoring. Methods We describe the use of parametric proportional hazards frailty models and accelerated failure time frailty models to examine the relationship between explanatory variables and the survival outcomes that are subject to arbitrary censoring, while accounting for the correlation within clusters. Standard software such as SAS can be used for parameter estimation. Results Circumcision's effect on HPV infection was a secondary end point in the Rakai MC trial, and HPV genotyping was conducted for penile samples of a subset of trial participants collected at enrollment, 6, 12 and 24-month follow up visits. At enrollment, 36.7% intervention arm men (immediate circumcision) and 36.6% control arm men (delayed circumcision at 2 years) were infected with HR-HPV, with the number of infections per man being 1-5. The proposed models were used to examine whether MC facilitated clearance of the prevalent infections. Results show that clearance of multiple infections within each man is highly correlated, and clearance was 60% faster if a man was circumcised. Conclusions Parametric frailty models provide viable ways to study the relationship between exposure variables and clustered survival outcome that is subject to arbitrary censoring, as is often observed in HPV epidemiology studies.


Background
In epidemiological studies, subjects are often followed over time and study outcomes are measured at selected time points. The study outcomes may be diagnostic testing results based on biological samples such as tissue, blood, or urine samples. The testing at each time point can detect the presence or absence of a condition (for example, infection of an infectious disease), but it does not provide the exact information of when the infection or condition occurred. The best knowledge about the actual event time is that it occurred during the interval where discordant test results are observed at the start and end of the interval, yielding so-called interval censoring of time-to-event (survival) data [1]. If the event is observed at the first follow-up, we only know that the event occurred before the first scheduled testing time, and this generates left censored data. On the other hand, if a subject drops out of the study or remains event free at the end of the study, the time to event could only be after the last observed testing time, which corresponds to the well-known right censoring. When it is of interest to evaluate the effect of treatment on time to event occurrence, many analysts use either the end point [2] or the midpoint of the interval where discordant results were observed as the actual event times. The former way introduces "time aggregation" bias when estimating the hazard rate, while the latter mid-point method reduces timeaggregation bias under certain conditions [3]. For unbiased estimation of the treatment or exposure effect, approaches that directly model the likelihood of the arbitrarily censored data (including left, interval and right censoring) can be used [4]. Other than the presence of arbitrarily censored data, what can further complicate the analysis is the presence of clustered data where the study subjects are correlated within each cluster, such as patients visiting the same clinic in a multi-site study, members within the same family or community, or repeated testing on the same subject. The Cox regression model has been extended for clustered time-to-event data to model the marginal distributions without full specification of the correlation structure between the clustered observations [5,6], though it has only been established for data with non-informative right censoring. For clustered data with interval censoring, [7] introduces the use of conditional proportional hazards model (i.e. semiparametric frailty model), and briefly discusses the advantages of parametric frailty models. In this paper we describe the use of parametric frailty models to assess treatment or exposure effect on time to a well-defined event for clustered survival data with arbitrary censoring, which includes left, right, as well as interval censoring. The model estimation can be carried out using existing software, such as SAS (SAS Institute, Inc., Cary, NC) PROC NLMIXED. The method is illustrated through the application on data collected from a randomized clinical trial of male circumcision (MC) conducted in Rakai, Uganda, and the current study purpose is to evaluate the effect of MC on clearance of penile high risk human papillomavirus (HR-HPV) in HIV-negative men.
The following section provides a brief introduction about the HR-HPV data to exemplify the problem of interest. Statistical notations and the proposed model are then presented through the context of the example. Parameter estimation can be realized using SAS and the SAS code for this example is provided. The analytical result on the HR-HPV data is subsequently presented. We conclude with discussion about the general use of the proposed model for clustered survival data with arbitrary censoring.

Example: Effect of Male Circumcision on HR-HPV clearance
A clinical trial of MC was conducted on initially HIVnegative uncircumcised men aged 15-49 years in Rakai District of Uganda during 2003-2006 [8,9]. Approximately 5000 men were enrolled in the trial and randomized to either immediate circumcision (intervention arm) or delayed circumcision after 24 months (control arm). At enrollment and follow-ups scheduled at 6, 12, and 24 months, variables about participant sociodemographic characteristics, sexual risk behaviors and symptoms of sexually transmitted infections were recorded using questionnaires. Penile swabs were collected by clinicians from the preputial cavity of uncircumcised men and from the coronal sulcus/glans of circumcised men. Circumcision effect on HPV was a secondary endpoint in the trial and HPV testing was performed restricted to consistently HIV-negative married men with concurrently enrollment wives. HPV testing was performed on samples collected at all four visits only for a subset of randomly selected 330 such men (39.5%) in the intervention arm and 314 men (39.1%) in the control arm due to resource limitation. Roche HPV Linear Array (Roche Diagnostics, Indianapolis, IN) was used for HPV genotyping. The fourteen genotypes 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 66, and 68 were considered HR-HPV, that is, carcinogenic viral genotypes. Therefore, for each participant at each visit, there are fourteen binary indicators indicating the presence or absence of the fourteen HR-HPV genotypes, respectively. At enrollment, the intervention and control arm were comparable in sociodemographic and sexual behavioral characteristics [10]; the prevalence of HR-HPV was also comparable, with 36.7% men being positive on at least one HR-HPV genotype in the intervention arm and 36.6% in the control arm. One particular interest with this dataset is to study whether MC facilitates HR-HPV clearance process since foreskin provides a reservoire for viral protein expression. We studied circumcision's effect on clearance of enrollment prevalent HR-HPV infections, and the prevalent infection profile is summarized in Table 1 (upper panel).
Clustered data structure arises in this dataset, as each participant (a cluster) had testing results for the fourteen HR-HPV genotypes. A person with multiple infections may contribute multiple events (i.e. clearances) that are likely to be highly correlated. The exact clearance time point is unavailable, but it is known to be either before the first subsequent visit (left censored), or lie within an interval between two visits (interval censored), or after the last follow-up visit (right-censored).

Notations
Let T denote the random variable for the time to event. Assume there are n clusters in the study, and the ith cluster (i = 1, 2, ʜ, n) has j = 1, 2, ʜ, n i observations. In the HR-HPV example, T is the time to clearance of an HR-HPV infection, and each participant is a cluster. For a person with single genotype HR-HPV infection, n i = 1; while if a person has multiple HR-HPV genotype infections, n i > 1, and the clearance of these multiple infections is not independent. For the jth HR-HPV genotype infection (j = 1, 2, ʜ, n i ) on the ith participant, t ij is the actual clearance time which is not exactly observed. Let (a ij , b ij ] denote the interval where the clearance occurs, i.e. a ij <t ij ≤ b ij . If a prevalent infection of genotype j for person i is observed to be cleared at first follow-up, then a ij = 0, and t ij ≤ b ij = 6 months, which corresponds to a left censored observation; if it is never cleared during the trial period, then a ij = 24 months (the last follow-up time), and a ij <t ij < ∞ (the upper bound b ij can be viewed as ∞), which is a right censored observation.
Suppose there are p explanatory variables, which can include the variable indicating treatment arm and other covariates that may be of interest, and let x ij denote the vector of explanatory variables for the jth infection in person i. Without loss of generality, we only consider one explanatory variable for treatment arm, and x ij = 1 denotes intervention and x ij = 0 denotes control. Note that treatment arm for the fourteen HR-HPV genotypes is identical for each participant, although in general applications, x ij can be different for different observations belonging to the same cluster.

Parametric Proportional Hazards Frailty Model
Let h(t) denote the hazard function representing the "hazard" (i.e. the instantaneous rate) of clearance at time t. To examine the treatment effect on clearance, we can model the hazard to be a function of the explanatory variables, while at the same time including a random effect to account for the correlation between the multiple HR-HPV genotype infections on such infected participants.
For example, as in [7], for clearance of the jth infection on person i, we can use the conditional proportional hazards model (or semiparametric frailty model): Without the ξ i term, expression 1 is just the ordinary Cox proportional hazards model, where h ij (t;x) is the "hazard" for clearance of the jth genotype infection in person i, h 0 (t) is the baseline hazard for a person with all explanatory variables being zero. The ξ i term in expression 1 represents a random effect realized on the ith person. It is assumed to follow a prior distribution, such as a normal distribution, or equivalently exp(ξ i ) ~ a log-normal distribution. The use of the normal random effect ξ i on t (through the hazard function) is one way of introducing correlation within the ith cluster, and is similar to that in linear mixed effects or generalized linear mixed effects models. Because the same realization value of ξ i (from its prior distribution) is shared by observations on the multiple HR-HPV genotype infections within the ith person (thus its' subscript does not depend on j which indexes genotype), it is therefore taken into account of the dependence between the clearance times of these multiple HR-HPV infections. One advantage of using normally distributed random effect is that more complicated correlation structures between observations, such as multi-level correlations, can be handled naturally by extending the use of a univariate normal random effect to multivariate normal random effects [11,12].
Direct estimation of the coefficient parameters from model 1 by maximizing the likelihood may not be available with standard software such as SAS. To reduce the computation complexity, we further assume a parametric form on the clearance time for participants in the control arm, for example, conditional on the random effect ξfor a person in the control arm, let time to clearance of any HR-HPV infection have a Weibull distribution. That is h 0 (t) = γα 0 t γ-1 , where γ(γ > 0) and α 0 are the shape and scale parameters respectively for the Weibull distribution. The hazard function for Weibull distribution is a monotone function of t [13] (Figure 1a). The Gompertz distribution introduced for describing human mortality [4] is another parameteric distribution that has the proportional hazards property, and can be considered in the proportional hazards frailty model 1 .
Plugging h 0 (t) into expression 1, for the jth HR-HPV infection in person i with explanatory variables x ij , j = 1, 2, ʜ, n i , i = 1, 2, ʜ, n, the Weibull frailty model for clearance is: where the random frailty effect is assumed to follow a normal distribution with zero mean (i.e. exp (ξ i ) ~lognormal distribution). Model 2 implies that conditional on the random frailty ξ i , the clearance of any HR-HPV infec-tion follows a Weibull distribution with shape parameter γ and scale parameter , where circumcision's effect on clearance is quantified by its' coefficient parameter β. ξ i is the random effect shared by the multiple infections within the same person. Given a value of ξ i , the hazard ratio of HR-HPV clearance between the intervention and control arm is exp (β).
Therefore, if β estimated from the data is significantly larger than 0, then exp(β) > 1, indicating the instantaneous clearance rate is larger if a participant was circumcised than not circumcised.

Parametric Accelerated Failure Time Frailty Model
The Weibull frailty model given in 2 can be equivalently expressed in terms of the survivor function of the Weibull distribution as: Conditional on the random effect, expression 4 corresponding to Weibull distribution in fact belongs to the family of parametric accelerated failure time (AFT) models [4]. For clustered survival data, the general AFT model with normal random frailty effect can be written as: where ξ i ~Normal (0, σ 2 ), S 0 (·) is the baseline survivor function of a parametric survival distribution, such as Weibull distribution, log-normal distribution, generalized gamma distribution, log-logistic distribution, generalized F distribution [13], and inverse Gaussian distribution [4]. It is important to notice that, however, except for the Weibull distribution (including the exponential distribution), other aforementioned distributions for AFT frailty models do not have the proportional hazards property and thus cannot be modeled as a proportional hazards frailty model given in 1, exp (β) consequently does not have the interpretation of conditional hazard ratio.
For the HR-HPV data, the proportional hazard frailty model 2 with Weibull distribution assumption implicitly imposes a constant instant clearance rate ratio (conditional on the random effect) over time between intervention arm and control arm (Figure 1b). The clearance rate ratio may however change over time. From Table 1, the majority of infections had cleared by year 2 in either arm, thus the clearance rate ratio should be close to 1; whereas the rate ratio by month 6 may be different from 1. The log-logistic distribution was also fit for the survivor function S(t|ξ i ) in the AFT frailty model 5. The hazard of loglogistic distribution is a unimodal function of t when its shape parameter is larger than 1 (Figure 1a), which may better capture the natural history of HPV infection. It also allows the clearance rate ratio between arms to change over time (Figure 1b). The log-logistic AFT random frailty model is: where a ij = exp(β 0 + x ij β + ξ i ), and ξ i ~Normal (0, σ 2 ).

Estimation of model parameters
With parametric assumptions in model 1 or model 5, parameters (including fixed effects β and the variance of the random effect σ 2 ) can be estimated using the maximum likelihood method. Recall that for the jth infection in the ith participant, the clearance time is observed to be in the interval of (a ij , b ij ], where a ij = 0 for left censored observations and b ij = ∞ for right censored observations. For the n i genotype HR-HPV infections in person i, their clearances can either be left, right, or interval censored. Without loss of generality, let l i denote the number of left censored observations and r i denote the number of right censored observations, and hence n i -l i -r i is the number of interval censored observations on person i. Thus the full likelihood on all participants can be written as: Thus a simple rule of computing P-value for testing H 0 : σ 2 = 0 is that if the LR statistic is 0, then P = 1; otherwise, the P-value is half of the P-value obtained from comparing the LR statistic to distribution [16]. Since we assume a normal prior on the random frailty effect, the likelihood function normally does not have a close analytical form, although SAS PROC NLMIXED can numerically compute the integrals and maximize the approximated likelihood. The procedure provides commonly estimated statistics such as the MLE, Wald-type confidence intervals, and -2log-likelihood. In order to perform likelihood ratio inference for a variable, the hierarchical models with and without the variable have to be estimated respectively. The SAS code for analyzing the HR-HPV clearance data using model 2 is listed in Additional file 2 as an illustration, and relevant computation details are also discussed. The format of the input dataset and some useful options needed when calling the procedure are also described. Table 1  There are several limitations of this application, however. One is that it was found that at the follow-up visits, a significant higher proportion of samples collected on circumcised men could not be amplified rendering more missing HPV results in intervention arm. In this illustration, it was assumed that the infection persisted during the interval if the testing result was missing for the fol-low-up visit at the end of the interval. This may underestimate circumcision's effect. Techniques for dealing with missing data, such as multiple imputation [18,19], can be utilized for this specific dataset. Another limitation for this study is that the follow-up intervals of 6, 12 and 24 months were too long to capture the clearance of HPV infection. It has been known that the median duration of genital HPV infection in woman is 8 months, and persistent detectable infection rate is approximately 30% after 1 year and 9% after 2 years [17]. In a recent observational study on men [20], it is reported that the median time to clearance was 5.9 months (95% CI, 5.7-6.1 months), and 75% infections had cleared by 24 months after initial detection, implying the clearance rate is low when t is large. With the effective immune response to HPV, most infections had cleared by 6 months as observed in the trial participants, meaning that there are no data describing the early phase of the clearance process. The limited data determine that there is "little to choose between alternative distributional models" [4] and different models may yield similar results in the range where data are observed. It is suggested to adopt the model "most convenient for the purpose in hand " [4]. However, any extrapolation on the functional form of the clearance process from the estimated model should be conducted with caution. This trial was primarily designed to study male circumcision's effect on preventing HIV acquisition. If it is of interest to examine HPV clearance in a future study, the study design should allow for a more frequent testing

Results of Study on Male Circumcision Effect on HR-HPV Clearance
interval in order to capture the whole process. The presented parametric frailty models implicitly assume the same conditional baseline hazard functions for the clearance of different genotype infections within an individual. The different genotypes all belong to the papillomavirus family, and the mechanism of immune response is the same when fighting against the different type of HPV infections. Therefore it is reasonable to apply the same form of conditional baseline hazard functions for the clearance of different genotype infections.

Conclusion and Discussion
Arbitrarily censored survival data is not uncommon in epidemiological studies, and the censoring nature should be considered during analysis to reduce estimation bias, or when the disease onset and diagnosis are two steps that need to be differentiated [7]. Moreover, clustered data may arise and the correlation within each cluster should also be accounted for. In the current study we particularly describe the use of parametric frailty models to explore treatment effect's on survival when data are clustered and subject to arbitrary censoring. The two main classes of models are parametric proportional hazards frailty model and accelerated failure time frailty model. Most commonly used survival distributions can be used in these models, providing abundant choices of parametric forms to appropriately model the data of interest. For example, Weibull distribution and Gamma distribution can be used for survival problems where the hazard monotonically changes with time, and log-logistic and log-normal distribution can be used when the hazard is a unimodal function of time. The main advantage of adopting a parametric form is for computational ease. On the other hand, with the presence of arbitrary censoring and clustering, it is difficult to perform model diagnostics on the assumption of the parametric form. A clear understanding of the scientific nature of the problem to be addressed is essential for choosing an appropriate parametric distribution in analysis. When using normal random effect, the presented models can be estimated using SAS PROC NLIMIXED, and the code for analyzing the example HPV dataset is provided in the Additional file 2. Models with random effects following other distributions may be estimated using PROC NLIMIXED by transforming the normal random effect using appropriate probability transformation function provided by SAS [21,22]. Alternatively, for gamma frailty model or log-t proportional hazards frailty model for data with arbitrary censoring, the "frailty()" function provided in the R package "survival" can be used. Genital HPV infection has high prevalence in both men and women, and the high risk types of HPV are well known to be associated with anogenital cancers, especially cervical cancer [17]. Current diagnostic tools allow for simultaneous detection of multiple HPV genotypes, though the actual infection or clearance time is unknown. Therefore clustered data with arbitrary censoring are normally generated from such studies. The presented modeling approach can be used to study factors associated with HPV clearance (or persistence), or to compare the clearance process between different genotypes to examine type-specific persistence. However, as pointed earlier, the design for such studies need to allow for appropriate short testing intervals to capture the entire process.