Comparison of statistical methods for the analysis of recurrent adverse events in the presence of non-proportional hazards and unobserved heterogeneity: a simulation study

Background In preventive drug trials such as intermittent preventive treatment for malaria prevention during pregnancy (IPTp), where there is repeated treatment administration, recurrence of adverse events (AEs) is expected. Challenges in modelling the risk of the AEs include accounting for time-to-AE and within-patient-correlation, beyond the conventional methods. The correlation comes from two sources; (a) individual patient unobserved heterogeneity (i.e. frailty) and (b) the dependence between AEs characterised by time-dependent treatment effects. Potential AE-dependence can be modelled via time-dependent treatment effects, event-specific baseline and event-specific random effect, while heterogeneity can be modelled via subject-specific random effect. Methods that can improve the estimation of both the unobserved heterogeneity and treatment effects can be useful in understanding the evolution of risk of AEs, especially in preventive trials where time-dependent treatment effect is expected. Methods Using both a simulation study and the Chloroquine for Malaria in Pregnancy (NCT01443130) trial data to demonstrate the application of the models, we investigated whether the lognormal shared frailty models with restricted cubic splines and non-proportional hazards (LSF-NPH) assumption can improve estimates for both frailty variance and treatment effect compared to the conventional inverse Gaussian shared frailty model with proportional hazard (ISF-PH), in the presence of time-dependent treatment effects and unobserved patient heterogeneity. We assessed the bias, precision gain and coverage probability of 95% confidence interval of the frailty variance estimates for the models under varying known unobserved heterogeneity, sample sizes and time-dependent effects. Results The ISF-PH model provided a better coverage probability of 95% confidence interval, less bias and less precise frailty variance estimates compared to the LSF-NPH models. The LSF-NPH models yielded unbiased hazard ratio estimates at the expense of imprecision and high mean square error compared to the ISF-PH model. Conclusion The choice of the shared frailty model for the recurrent AEs analysis should be driven by the study objective. Using the LSF-NPH models is appropriate if unbiased hazard ratio estimation is of primary interest in the presence of time-dependent treatment effects. However, ISF-PH model is appropriate if unbiased frailty variance estimation is of primary interest. Trial registration ClinicalTrials.gov; NCT01443130


Background
The development of a comprehensive drug safety profile in randomized controlled trials (RCTs) requires statistical methods that can adequately characterize the adverse event (AE) occurrence process (e.g. time of onset, recurrence and duration) [1]. Traditionally, analysis of AEs in IPTp trials, like all clinical trials, is predominantly descriptive and non-parametric such that advanced methods (e.g. parametric and semi-parametric) are rarely used [2][3][4]. The use of descriptive and non-parametric methods can lead to poor estimates of the effect of treatment on AE recurrence. The descriptive and non-parametric methods may not account for essential aspects of the data structure such as censoring, potential timevarying treatment effect and the AE-dependence. For example, mean cumulative function is the simple and popular non-parametric method used to facilitate the development of trajectories of AEs profiles over the study period, where a patient may experience more than one AEs [5]. However, the method fails to explicitly account for the potential dependence of the AEs or individual patient heterogeneity. The utility of the existing recurrent event methods towards the AEs analysis has been recently considered [6]. However, adopting and interpreting such recurrent events methods results in drug safety assessment setting remains a challenge. This motivates a need for further investigations on the clinical value and practical applicability of the advanced methods [7]. In the framework of benefit-risk assessment, two models (the Andersen Gill model and Prentice, Williams and Peterson model) have been demonstrated to be useful in providing both direct and indirect effects of treatment on AE recurrence [6]. However, these models yield unbiased estimates if the AEs are uncorrelated and do not efficiently capture (i.e. account for) both potential time-dependent treatment effects and unobserved heterogeneity. This motivates the need to consider using flexible parametric shared frailty models that can optimally capture both the time-dependent effects and unobserved heterogeneity, to improve drug safety estimates.
In IPTp clinical trials, the potential of recurrent AEs is high because the preventive antimalarial drugs are repeatedly administered. Recurrent AEs may induce within-patient correlation that originates from two sources: the individual patient unobserved heterogeneity and the dependence between events [8]. The individual patient heterogeneity cannot be explained by the observed/known factors [9,10], while the between-AEdependence can be explained based on the observed AEs. The unobserved heterogeneity arises from unmeasured information that partly explains the treatment effect. For example, in IPTp trials, unobserved heterogeneity can arise from the unmeasured inherent characteristics (e.g. physiological changes associated with foetus development). The underlying individual patient-specific unobserved heterogeneity can influence the susceptibility of patients to AE recurrence. In IPTp trials, as the case with some drug trials, the treatment effect depends on follow-up time. Such treatment effect that is a function of follow-up time is called time-varying or time-dependent treatment effect. In the presence of the time-dependent treatment effect, the relative effect of treatment changes over the follow-up time such that the proportional assumption is violated (on log hazard scale). Ignoring the time-dependent effects can lead to biased estimates. Therefore, accounting for the potential time-dependent treatment effects in the analysis of recurrent events is necessary. However, in practice, identifying a model that can account for both unobserved heterogeneity and time-dependent effects can present a challenge. Since the recurrence of AEs can be a reflection of both timedependent treatment effects and underlying individual patient unobserved heterogeneity. Therefore, it is vital to consider models that account for both issues.
Currently, shared frailty is one of the prevalent models for the analysis of recurrent events. This model can account for both the observed and unobserved heterogeneity. However, the common current applications of the shared frailty model assume that the treatment effect estimate (i.e. hazard ratio) is constant over the follow-up time [11]. In preventive drug trial settings, where there is repeated treatment administration and a long patient follow-up time, the proportional hazard assumption can be invalid when treatment hazard rates, between/across arms, for recurrent events can vary over the follow-up time. For example, in the context of IPTp drug safety assessment, more recurrent AEs are expected close to the day(s) of taking the days. The potential presence of time-dependent treatment effects can pose a challenge in applying the shared frailty model to analyse recurrent presence of time-dependent treatment effects. However, ISF-PH model is appropriate if unbiased frailty variance estimation is of primary interest. events since it becomes complicated to separate timedependent effects from unobserved heterogeneity [12]. Recent developments in flexible parametric survival models can effectively capture both time-dependent effects and unobserved heterogeneity [13]. Researchers have established, mainly in univariate survival data context, that flexible parametric survival models with restricted cubic splines can accurately model timedependent effects [14,15]. However, there is limited literature investigating the efficiency and utility of such methods in the context of multivariate survival data and drug safety assessment. Within multivariate survival data setting, Gasparini et al. recently compared the standard and spline-based shared frailty models under varying heterogeneity and baseline hazard specification assuming proportional hazards [16]. The shared parametric models with splines emerged to be more robust to model misspecification than the standard approaches. However, their work is limited to the context of the proportional hazards assumption. As highlighted above, a challenge arises in applying the shared frailty model when there are time-dependent effects. Using the standard semiparametric shared frailty models, Balan and Putter demonstrated that the bias of frailty variance estimate, in the presence of time-dependent effects, can be mitigated as the cluster size (i.e. number of recurrent events within an individual) increases [12]. Based on the available literature, it is not clear whether the use of flexible parametric shared frailty with splines models can improve the frailty estimates compared to the standard shared frailty models in the presence of both time-dependent treatment and unobserved heterogeneity.
Our work aims at investigating whether the flexible parametric shared frailty model with non-proportional hazards and restricted cubic splines can improve both the frailty variance and hazard ratio estimates compared to the standard parametric shared frailty models, in the presence of both unobserved heterogeneity and timedependent treatment effects. The optimal shared frailty model can efficiently capture both unobserved heterogeneity and time-dependent effects. We evaluated the following models; (a) inverse Gaussian shared frailty (ISF-PH) model with proportional hazards and Weibull baseline hazard (b) Lognormal shared frailty model with non-proportional hazard and restricted cubic splines where we model the baseline hazard function with 1 degree of freedom and model the time-dependent treatment effect with 1 degree of freedom (LSF-NPHS1) and (c) Lognormal shared frailty model with non-proportional hazard and restricted cubic splines where we model the baseline hazard function with 3 degrees of freedom and model the time-dependent treatment effect with 1 degree of freedom (LSF-NPHS3). We compare the models using a simulation study assuming an application area of recurrent AEs in IPTp trials. Understanding the statistical properties would enhance the informed choice of statistical methods for the analysis of recurrent AEs. The simulation offers an opportunity to effectively understand the properties of the statistical models since the true data distribution is known, unlike in a practical clinical trial setting. We present the remaining sections of the paper as follows: In the methods section, we provide a detailed review of the models under study, outline the simulation study design and describe the analysis approach. In the results section, we present the findings of our simulation study and apply the models to the analysis of recurrent AEs observed in Chloroquine for Malaria in pregnancy trial (NCT01443130). In the discussion section, we discuss the key findings of our work and provide a conclusion.

Methods
This section outlines notation and provides an overview of the three models that were compared in our simulation study. Specifically, we briefly introduce and discuss the general inferential methodological framework of the models, focussing on model specification, parameter estimation and respective assumptions. The presentation of this section is customized to the context of drug safety assessment in clinical trials. For readers interested in detailed concepts, on recurrent events analysis, they can consult excellent introductory books on recurrent events analysis and frailty models methodology [10,17,18], written for a wider audience.

Notation
We assume that the observation begins at enrolment, denoted as time t = 0. Let T ik be the total time of the k th event for the i th patient where i = 1, 2, ………n, k = 0, 1, 2, ……. p and we assume T i1 < T i2 < … < T ik . The AEs are assumed to be recorded over a period of time interval [0, τ] such that the total number of AEs recorded per patient i over that time is = ∞ k=0 I(T ik ≤ t) . The history of the recorded number of events over given time t, denoted as N(t), 0 ≤ t, is defined as a counting process; this forms the basis for easy-to-implement recurrent events analysis. The frailty is defined as φ i and interpreted as an unobserved covariate common for all recurrent AEs within the patient. Covariate and regression coefficients vectors are defined as X ik and β respectively. The covariates from the covariates vector are considered to be fixed. Under the simulation study design described below, one covariate is considered (i.e. treatment). For the non-proportional hazard models, the treatment effect is allowed to be dependent on time (i.e. β(t)). These are further discussed in context in the section below.

Statistical methods for recurrent events
Recurrent events modelling is popularly based on three main approaches; event counts, event intensity function and joint distribution of gap times between successive events. Event counts-based models are suitable when there are frequent events and event occurrence does not impact the occurrence of the subsequent events [17,19]. In the context of AEs, some mild AEs may not alter the event process (e.g. mild headache). When events are infrequent and the prediction of the next event is of interest, gap times methods are appropriate [20,21].
To ensure that the event process history is appropriately incorporated, the intensity function is used; the function defines the instantaneous probability of an event occurring at time t given the event process history [17].
Where ∆N(t) is the number of events in the time interval [t, ∆t), H(t) is the history of the event process at time t. Since event counts (i.e. Poisson process) modelling approach assumes that the events are independent such that the event history process at t, H(t), does not affect the instantaneous probability of an event at time t, the intensity function is reduced to the rate function ρ(t) below; This implies; ∆tρ(t) = E(∆N(t)). Hence if we let the expected cumulative rate at time t to be μ(t), we can derive the expected cumulative rate by integrating the rate function over the interval [0, t).
The gap/waiting times method is another alternative approach that focusses on modelling time between consecutive events. Methods based on gap time are suitable mainly in setting where the occurrence of the events is considered relatively infrequent [17]. The work in this paper is confined to multiplicative intensity-based models because it permits deriving and interpretation of failure intensity relative to the treatment under investigation. Shared frailty models are considered as important intensity-based models for the analysis of recurrent events that can account for both measured and unmeasured heterogeneity.

The shared frailty models
In practice, the common approach to account for unobserved heterogeneity in recurrent events analysis is to use a robust variance (i.e. robust sandwich covariance matrix is used in estimating the log hazard ratio). Adjusting for within-subject unobserved heterogeneity using robust variance has been shown to be inadequate in some settings [22]. One of the solutions to this is explicitly modelling the correlation using patient-specific random effect. Such a model is defined as a frailty model. Frailty models are hazard models having a multiplicative frailty factor [10]; these are conditional models. Overall, the model has frailty, linear predictor and the baseline hazard function as the multiplicative factors. The random effect introduced in frailty models describes the excess risk that cannot be explained by the measured/observed variables [10]. The unmeasured observations may introduce heterogeneity between the analysis units e.g. for recurrent AEs, observations within a patient may be more similar compared to different patients. Hence frailty model attempts to account for the within-patient correlation. Ignoring the potential heterogeneity in modelling recurrent events can lead to underestimation of treatment effect [23,24].
For recurrent events, when the frailty is modelled as a common random effect for all events within a patient, it is defined as a shared frailty model (see model M4). The dependence accounted for in the shared frailty model is only due to the unobserved individual heterogeneity, φ i . This measure of the extent of heterogeneity among the patients can follow several distributions [18]. Gamma and inverse Gaussian distributions are the common frailty distribution specifications, possibly due to their mathematical tractability and software availability. Across all the patients, the respective random effects are assumed to be independent and identically distributed.
If the baseline hazard, λ 0 (t ij ), is unspecified, the model is called a semi-parametric frailty model. Specifying the baseline hazard function ensures that an appropriate shape of the hazard function is captured. However, miss-specification of the baseline hazard distribution may induce bias on regression coefficients [16]. Therefore, caution needs to be exercised on the choice of the baseline hazard distribution by paying attention to the most clinically plausible hazard function. Traditionally, the hazards are assumed proportional over the whole follow-up time. Within the flexible parametric survival methods framework, the proportional hazards assumption can be relaxed to accommodate time-dependent treatment effects (i.e. non-proportional hazards [13]). This can be implemented by interacting treatment with splines of log time as part of the linear predictor. Fitting such flexible non-proportional shared frailty model on log hazard scale is helpful since it can effectively accommodate interpretation and presentation of results. The shared frailty model with non-proportional hazards and restricted cubic splines can be written as; The treatment x ik is interacted with restricted cubic function of log time, s{log(t)| γ, k 0 }, such that γ is parameter vector and k 0 is knot vector. Similarly, s{log(t)| δ f , k f } represents a spline function of log time for the f th time-dependent effect with parameter vector δ f and knots vector k f . Since the number of restricted cubic splines can determine the most appropriate model to use we considered to evaluate two flexible parametric shared frailty models as follows; (i) LSF-NPHS1: We considered a shared frailty model assuming lognormal frailty distribution and nonproportional hazards with restricted cubic splines where the baseline hazard function is modelled using 1 degree of freedom and the time-dependent treatment effect is modelled using 1 degree of freedom. (ii) LSF-NPHS3: We also considered a shared frailty model was fitted assuming lognormal frailty distribution and non-proportional hazards with restricted cubic splines where the baseline hazard function is modelled using 3 degrees of freedom and the time-dependent treatment effect is modelled using 1 degree of freedom.
Generally, under shared frailty models, depending on the model specification, parameter estimation can be done using expectation-maximization algorithm, maximum likelihood penalized partial likelihood and Markov chain Monte-Carlo methods [18]. Since in this paper we focus on parametric shared frailty models from a Frequentist perspective, the parameter estimation is based on maximum likelihood estimation. It should be noted that in drug clinical trials not all patients may experience the AEs (i.e. some patients are censored). Censoring in this paper is considered independent of the AE occurrence. Specifically, the model parameters can be estimated by maximizing marginal log-likelihood as shown below; Let Z be the vector containing the observed patient-specific information such that (M5) where y ij , ξ ij and x ij are time to AE occurrence, AE occurrence indicator (1 for AE occurrence and 0 otherwise) and fixed covariate for patient i and jth AE occurrence, respectively. Let U be a vector containing the latent information (the frailty terms) for n patients.
therefore, the marginal log-likelihood of the observed data Z can be written as: where d i represents the total number of events for patient and L (v) (.) represents the v th derivative of the Laplace transform for the derivative. Availability of the Laplace transform simplifies the maximization of the log-likelihood in order to compute the desired model parameters. Under the time-dependent treatment effect, the fixed treatment effect β can be replaced by the time-dependent treatment effect β(t).

Simulation study
This simulation study was conducted and reported based on "Aims, Data generating process, Method of analysis, Estimands and Performance" approach [25] since it provides a scientifically coherent structured framework for designing, interpreting and reporting simulation studies.

Aim
There is a dearth of literature highlighting the performance of recurrent statistical methods when both AEdependence and heterogeneity is accounted for in analysis of recurrent AEs in clinical trials. The aim of the simulation study was to evaluate whether the use of Lognormal shared frailty model with restricted splines and non-proportional hazards improves both the frailty variance and hazard ratio estimates in the presence of time-dependent effects and unobserved heterogeneity under varying unobserved heterogeneity levels and sample sizes. The performance of this flexible model was compared against the standard proportional hazards-based inverse Gaussian shared frailty model with proportional hazard and Weibull baseline hazard distribution assumptions.

Data generating mechanisms
The data generating mechanism (DGM) in this simulation study focussed on the specification and estimation of frailty variance and hazard ratio in the presence of time-dependent treatment effects and unobserved heterogeneity (Table 1). For each scenario, we considered 1000 simulations, lognormal frailty distribution, Weibull baseline hazard, fixed effect treatment effect of log hazard ratio 0.5 and up to a maximum of 4 AEs experienced by a patient. The choice of possible maximum of 4 AEs per patient was aimed at mimicking a setting for moderately high occurrence of recurrent AEs, in IPTp trials. The choice of lognormal frailty distribution was driven by the mathematical tractability of the distribution, availability of software for implementation and interpretability of the results in the context of AEs. The DGMs varied sample size (n = 300, n = 500), frailty distribution variance (0.25, 0.50, 0.75) and time-dependent effect (0.03, − 0.03). The choice of the sample size was intended to reflect sample sizes that are commonly employed in most IPTp trials. Choice of frailty variance (i.e. a measure of the strength of the unobserved heterogeneity within an individual) intended to represent a setting 0.25, 0.5 and 0.75 represented weak variance, moderate variance and heavy variance respectively. The time-dependent treatment effect allowed the fixed effect treatment effect of log hazard ratio 0.5 to decay over time (when set at − 0.03) or increase over time (when set at 0.03) or remain constant when set at 0.0). All the DGMs were parametrically based, to accommodate the investigation of several scenarios unlike resampling that confines investigation to data with properties only for the application data under study [25]. Hence, we simulated recurrent AE data similar to common scenarios that are practically and theoretically expected in IPTp trials. In all the simulation scenarios, the data generation mimicked a randomized controlled trial with two treatment arms. In all scenarios under study, we assumed that there were no substantial competing risks observed during the trial such that censoring was considered independent. The individual follow-up times were simulated based on uniform distribution and maximum follow-up time of 12 months (i.e. administrative censoring was set at 1 year), using Survsim package in stata. The choice of 12 months of follow-up time was motivated by the fact that most IPTp trials follow-up the mother for 12 months from enrolment to beyond delivery. The hazard functions were simulated using inversion method, under Weibull baseline hazard function, accommodating both time-varying treatment effect and unobserved heterogeneity with lognormal frailty distribution. The inversion method enabled sampling from a given distribution using uniform distribution and quantile function [26] where uniform distribution was assumed to have minimum 0 and maximum of 1. An overview of the investigated scenarios appears in Table 1 below.
During the data generation process, convergence of the models was also continuously monitored. After the data generation descriptive summary statistics were also done to check for any abnormal observations. For each scenario, graphical exploration of the distribution of the estimates was done to ascertain the assumed underlying distribution of the generated data.

Simulation study targets
Of primary interest in this study was shared frailty models` ability to effectively capture the unobserved heterogeneity. Therefore, the primary estimand of interest was log frailty variance φ i for each of the investigated models. Since our second interest was to assess the ability of the models in capturing time-dependent treatment effect, the log hazard ratio β computed at 3 months of follow-up time was considered as a secondary estimand. We opted for log hazard ratio at the specific follow-up time in order to ensure that the estimates from both the proportional and non-proportional hazard models are comparable.

Methods for data analysis
Each simulated data set was analysed using the three methods under investigation: ISF-PH, LSF-NPHS1 and LSF-NPHS3. Based on our simulation scenarios, the ISF-PH model miss-specified the proportionality of the hazards although it correctly specified the baseline hazard function (i.e. the baseline hazard function for the ISF-PH was modelled using Weibull distribution). The LSF-NPHS1 and LSF-NPHS3 models correctly specified the proportionality of the hazards although the baseline hazard function is approximated by the restricted cubic splines. When there are non-proportional hazards, reporting of a single average HR is considered not practically useful. Therefore, in order to ensure that the performance measures for the HR estimates are comparable across the three models, we computed hazard ratio at 3 months of follow-up time since enrolment (i.e. LSF-NPHS1 and LSF-NPHS3 are non-proportional hazard models and ISF-PH is a proportional hazard model). The choice of the time at which the HR was calculated was motivated by our practical experience where most patients take approximately half of their scheduled IPTp doses by third month of follow-up time.
The three shared frailty models (ISF-PH, LSF-NPHS1 and LSF-NPHS3) are also used in the analysis of recurrent adverse events that were observed in the Chloroquine for Malaria Prevention in Pregnancy trial (NCT01443130) data. The motivating data description, data analysis and results appear in the application section of this paper, after simulation results presentation.

Performance measures
In this paper, our model performance measures of interest were defined as described in Morris et al [25]. Bias and MSE were main performance measures for our estimands. For frailty variance estimand, we assessed the model performance based on precision, bias, MSE and coverage probability of 95% confidence interval of the frailty variance estimates. For the hazard ratio estimand, we assessed the model performance based on precision, bias and MSE. Point estimates of performance are reported including Monte Carlo standard errors in order to highlight the certainty of our point estimates. We used bias in order to quantify whether the estimator of interest targeted the true value of interest. Bias is defined as the difference between the expected value of an estimator and the true parameter value. Secondly, in order to integrate bias and variance of an estimator into one performance measure, we computed the mean square error (MSE). MSE is defined as sum of the squared bias and variance of the estimator. In defining relative precision gain, ISF-PH model was considered as a reference model. Therefore, precision gain is defined as the inverse squared ratio of the empirical SE of the respective flexible parametric model (i.e. LSF-NPHS1 or LSF-NPHS3) to the empirical SE of the ISF-PH model. Additionally, coverage was also computed, defined as proportion of confidence intervals for an estimator that contain the true parameter value.

Software and computer hardware specification
The simulation study was conducted using user-written survsim and merlin packages in Stata version 15.1 [27][28][29]. The analyses of the simulated and application data were also done using streg built-in function in Stata for the ISF-PH model.

Ethical considerations
The Chloroquine for malaria in pregnancy trial was conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. The trial obtained ethical approval from the Institutional Review Board at the University of Maryland, the College of Medicine Research and Ethics Committee at the University of

Frailty variance estimates coverage probability, bias and MSE across the shared frailty models
Based on 1000 simulations for each scenario, we observed a 100% convergence rate of all the models that were fitted. Across all the scenarios, the magnitude of bias for the frailty variance estimates increased as the known unobserved heterogeneity variance increased from 0.25 to 0.75 whether the sample size was 300 or 500 ( Table 1). The observed bias across all the scenarios was negative suggesting underestimation of the unobserved heterogeneity. The magnitude of bias for the frailty variance estimates were slightly higher when the sample size was 300 compared to the sample size of 500. Overall, the coverage probability of the frailty variance estimates consistently decreased as the known unobserved heterogeneity variance increased from 0.25 to 0.75. The coverage probability across all the models was slightly above 95% for scenarios where the unobserved heterogeneity variance was 0.25, suggesting some negligible over-coverage. For example, for scenario 1, coverage (MCSE) was 96.40 (0.5891), 95.90 (0.6270) and 96.20 (0.6046) for ISF-PH, LSF-NPHS1 and LSF-NPHS3 models respectively. Overall, substantial under-coverage was observed for scenarios where the unobserved heterogeneity variance was 0.50 or 0.75 among the flexible parametric models (LSF-NPHS1 and LSF-NPHS3). The under-coverage worsened as the sample size increased from 300 to 500. Such under-coverage suggested inaccuracy of the frailty variance estimates obtained using the LSF-NPHS1 and LSF-NPHS3 models. As expected, overall, among the non-proportional hazard models with restricted cubic splines, the LSF-NPHS3 had lower bias and higher coverage of the frailty variance estimates compared to the LSF-NPHS1. This result suggests that for the non-proportional hazard models with restricted cubic splines, increasing the number of degrees of freedom modelling the baseline hazard function from 1 to 3 improves the frailty variance estimates. For the scenarios where the known unobserved heterogeneity was 0.50 or 0.75, we observed a higher magnitude of bias of frailty variance estimates for the non-proportional hazard models with restricted cubic splines models compared to the ISF-PH model. For the scenarios where the known unobserved heterogeneity variance was 0.25, we observed the lowest magnitude of bias for the frailty variance estimates in LSF-NPHS3 model followed by ISF-PH model then LSF-NPHS1 model.. For example, under scenario 1, the frailty variance estimate bias (MCSE) was − 0.0389 (0.0064), − 0.0556 (0.0060), − 0.0263 (0.0061) for ISF-PH, LSF-NPHS1 and LSF-NPHS3 models respectively. Pertaining to coverage, although the coverage probability decreased with increase in unobserved heterogeneity variance, decreasing the sample size from 500 to 300 yielded results with higher coverage. The decrease in coverage probability for the frailty variance estimates with increased unobserved heterogeneity was more drastic among the non-proportional hazards with restricted cubic splines models (LSF-NPHS1 or LSF-NPHS3) compared to the ISF-PH model. Overall, the ISF-PH model consistently yielded coverage probability closer to the nominal level of 95% compared to the non-proportional hazards with restricted cubic splines models (LSF-NPHS1 or LSF-NPHS3).
Overall, the MSE for the ISF-PH model frailty variance estimates were lower compared to the LSF-NPHS1 and LSF-NPHS3 models, except when the known unobserved heterogeneity variance was 0.25. Across all the 18 scenarios studied, we observed that the MSE estimates for the ISF-PH model decreased as the unobserved heterogeneity variance increased. However, the MSE estimates for the LSF-NPHS1 and LSF-NPHS1 models increased as the unobserved heterogeneity variance increased: Notably, the MSE for the LSF-NPHS3 model yielded lower MSE estimates compared to the LSF-NPHS1 model. As expected, increasing the sample size from 300 to 500 reduced the MSE estimates for the corresponding scenarios.
Our investigation showed that LSF-NPHS1 and LSF-NPHS3 models were more efficient than the ISF-PH model as demonstrated by the lower mean of modelbased standard errors (Fig. 1). The difference in modelbased standard errors between that LSF-NPHS1 and LSF-NPHS3 models was negligible. As expected, increasing sample size from 300 to 500 improved the efficiency of the shared frailty models under study such that the model-based standard errors for the frailty variance estimates decreased.

Percentage gain in precision for frailty variance estimates of non-proportional hazards shared frailty models with restricted cubic splines relative to ISF-PH model
Overall, as the unobserved heterogeneity increased, percentage gain in precision for frailty variance estimates of non-proportional hazards shared frailty models with restricted cubic splines relative ISF-PH model also increased ( Table 1). The gain in precision was higher when time-dependent treatment effects were increasing (i.e. TDE = 0.03) compared to the decaying time-dependent treatment effects (TDE = -0.03). Across the models, the gain in precision was higher in the LSF-NPHS1 model compared to LSF-NPHS3 model. The observed gain in precision for LSF-NPHS1 and LSF-NPHS3 models compared to ISF-PH model under proportional hazard scenarios (i.e. TDE = 0) was similar to the scenarios when time-dependent treatment effects were increasing (i.e. TDE = 0.03).

Bias and MSE for the estimated log hazard ratio estimates across the shared frailty models under varying unobserved heterogeneity and sample size
Overall, across all the three models under study, the MSE for the HR estimates increased as the unobserved heterogeneity variance increased ( Table 2). The ISF-PH model consistently yielded lowest MSE estimates for the HRs across the three models regardless of the nature of the time-dependent treatment effect. Increasing the sample size from 300 to 500 also led to reduction in the estimated MSE for the HRs. We observed a loss in precision of HR estimates for the ISF-PH model compared to the LSF-NPHS1 and LSF-NPHS3 model. The precision loss ranged from 9 to 20%. The highest precision loss was observed when the unobserved heterogeneity was low (i.e. frailty variance equal to 0.25). The precision loss estimates were similar for both LSF-NPHS1 and LSF-NPHS3 models. There was low precision loss in the absence of time-dependent treatment effects (TDE = 0). In the presence of time-dependent treatment effects (i.e. TDE = 0.03 and TDE = -0.03), we observed higher magnitude of bias in the ISF-PH model compared to the LSF-NPHS1 and LSF-NPHS3 models. In the absence of the time-dependent treatment effects, the bias for the HR estimates was lower in the ISF-PH model compared to the LSF-NPHS1 and LSF-NPHS3 models. However, in the absence of the time-dependent treatment effects, the bias across all the three models was relatively lower (i.e. under 0.01) compared to scenarios when time-dependent effects are present.

Example data application: Chloroquine for malaria in pregnancy trial
We illustrated the application of the three investigated shared frailty models above (ISF-PH, LSF-NPHS1 and LSF-NPHS3) to data on recurrent AEs collected from a clinical trial evaluating antimalarial drugs to prevent The current modelling focussed on investigating the effect of treatment on recurrent AEs. We included a population of 600 pregnant women in their first or second trimesters who were randomized to receive IPTp of (a) sulfadoxine-pyrimethamine (SP, two doses, 4 weeks apart), (b) chloroquine (CQ, 600 mg on day 1, 600 mg on day 2, and 300 mg on day 3 two days apart). Although the original trial had three treatment arms, in the current analysis, we dropped the arm of CQ prophylaxis (to focus on IPTp arms) since its treatment schedule was radically different from the IPTp arms. Full details of the trial are published elsewhere [30].
Of the 600 pregnant women analysed in the current analysis, 474 (79%) experienced at least one AE. Cumulatively, more AEs were experienced in the CQ arm (895 AEs) than those in the SP arm (650 AEs). We fitted ISF-PH model followed by the LSF-NPHS1 and LSF-NPHS3 models where IPTp treatment was an exposure of interest. The SP treatment was considered as a reference treatment in the interpretation of the results. ISF-PH assumed the baseline hazard function followed Weibull distribution, unobserved heterogeneity followed inverse Gaussian distribution and the hazard rates between the CQ and SP treatments were proportional over the follow-up time. The non-proportional hazard models with restricted models (NPHS1 and LSF-NPHS3) assumed that unobserved heterogeneity followed lognormal distribution and time-dependent effects could be modelled using one degree of freedom. NPHS1 and LSF-NPHS3 approximated the baseline hazard function using 1 and 2 degrees of freedom respectively. We fitted adjusted and unadjusted model for all the three model under investigation. The adjusted model included the following covariates: treatment, maternal age, maternal body mass index and gravidity (i.e. whether the pregnant woman had previous pregnancy or not). We adjusted for the same covariates for the three models to ensure that the estimates were comparable. Since the results from the adjusted and unadjusted models were similar, our results reported in Table 3 below are based the unadjusted models since we considered them more parsimonious. The reported hazard ratios across the three models are based on marginal hazard ratio, predicted hazard ratio at 14 days and 60 days of follow-up. Additionally, we plotted the hazard ratio estimates over the follow-up time.
In this clinical example for recurrent AEs data analysis, the estimated marginal log hazard ratios were 0.2281 (0.0499)0.2328(SE: 0.0517), 0.3090 (0.0769), 0.3360 (SE:0.0850) and 0.3024(0.0746) for ISF-PH, LSF-NPHS1, LSF-NPHS2, LSF-NPHS3 and LSF-NPHS4 models respectively, indicating an increased risk of AEs among the women who took the CQ treatment arm compared to those in the SP treatment arm (Table 3). At 14 days, the Log HR across the non-proportional hazard models were similar ranging between 0.43 to 0.47 and were higher compared to the ISF-PH model. However, at 60 days only LSF-NPHS2, LSF-NPHS3 and yielded similar log HR estimates approximately 0.20 and were lower compared to the ISF-PH model estimates. The frailty variance estimates generally higher among the non-proportional hazard models compared to the ISF-PH model. Among the non-proportional hazards models with restricted cubic splines, the AIC estimates show that the model fit was good when the baseline hazard function was approximated with minimum of 2 degrees.

Discussion
The analysis of recurrent AEs in clinical trials rarely account for potential unobserved heterogeneity and time-dependent effects [2]. Using a simulation study, we investigated the performance of three statistical methods for recurrent events that can be adapted towards the analysis of recurrent AEs in clinical trials, in the presence of both time-dependent effects and unobserved heterogeneity. Our paper mainly focused on evaluating the ability of the flexible parametric shared frailty models with non-proportional hazards and restricted cubic splines to capture both the unobserved heterogeneity and timedependent treatment effects. We established that the inverse Gaussian shared frailty with proportional hazard model consistently yielded good coverage closer to the nominal level of 95%, lower bias and lower MSE of the frailty variance estimates at the expense of imprecision compared to the flexible shared frailty models with nonproportional hazards and restricted cubic splines. Interestingly, the frailty variance estimates coverage and bias, for the shared frailty models with non-proportional hazards and restricted cubic splines, improved upon increasing the number of degrees of freedom used to model the baseline hazard function. As expected, the flexible parametric shared frailty model with non-proportional hazards and restricted cubic splines yielded better hazard ratio estimates, reflecting the time-dependent treatment effects, compared to the inverse Gaussian shared frailty with proportional hazard model. The flexible parametric shared frailty models with non-proportional hazards and restricted cubic splines yielded unbiased estimates of HRs at the expense of imprecision and inaccuracy (i.e. high MSE) compared to the ISF-PH model. Therefore, our findings show that the more precise estimators were biased and imprecise estimators were unbiased. In drug safety assessment using methods that yield an unbiased hazard ratio would be of primary interest such that the flexible parametric shared frailty model should be Table 3 Comparing sharing frailty models for analysis of recurrent AEs among pregnant women on IPTp treatment in Malawi a Lognormal shared frailty model with non-proportional hazard and restricted cubic splines where the baseline hazard is modelled with 2 degrees of freedom and the time-dependent treatment effects is modelled with 1 degree of freedom b Lognormal shared frailty model with non-proportional hazard and restricted cubic splines where the baseline hazard is modelled with 4 degrees of freedom and the time-dependent treatment effects is modelled with 1 degree of freedom considered if there is the presence of both unobserved heterogeneity and time-dependent effects. However, such a decision should be made at the cost of imprecision and loss of accuracy. Hence, we emphasize that the flexible shared frailty with non-proportional hazards and restricted cubic splines should be applied only when the proportional hazards assumption is violated. If interest is in estimating frailty variance, the inverse Gaussian model with proportional hazards can be considered sufficient even in the presence of both time-dependent treatment effect and unobserved heterogeneity. The inverse Gaussian shared frailty with proportional hazard probably effectively captured the unobserved heterogeneity because we did not mis-specify the baseline hazard (i.e. Weibull distribution). Secondly, the longer follow-up period of 12 months made the population more homogeneous with time since the unobserved heterogeneity for inverse Gaussian frailty decays over the follow-up time [31]. Thirdly, the inverse Gaussian frailty is practically close to the lognormal frailty where the simulated data arose. The structure of the inverse Gaussian frailty would be considered more appropriate in the context of recurrent AEs in antimalarial drug trials where there are more frequent AEs early in the clinical trial [18]. Unfortunately, the conditional hazard ratio derived from the inverse Gaussian shared frailty with proportional hazard is constant over the follow-up time which does not reflect the nonproportionality of the hazards in the presence of timedependent treatment effect.
The interpretation of our results is confined to scenarios where the censoring is assumed to be noninformative. In the presence of informative censoring, it will be interesting to understand how the models perform. Failure to account for time to medication presents a limitation in all the models considered in this work. Another key limitation of our study is relying on the assumption of constant unobserved heterogeneity. Future work can consider evaluating the performance of the shared frailty models accounting for both actual time to medication and time to AE occurrence under complicated scenarios where there is both time-varying treatment effects and time-varying unobserved heterogeneity.
Generally, in drug clinical trials, understanding drug safety profile requires efficient modelling of the recurrent AEs including accounting for both unobserved heterogeneity and any non-linear effects of the investigated drug. The methods studied in this paper provide key knowledge on choosing the most appropriate model for analysing recurrent AE data presenting the unobserved heterogeneity and time-dependent effects issues. The flexible parametric shared methods accounting for both time-dependent effects and unobserved heterogeneity are directly useful in IPTp trials since physiological changes that occur during pregnancy are mostly unobserved and can contribute to the heterogeneity of the treatment effect.
Based on our findings, we provide some practical recommendations on the studied shared frailty models to analyse the recurrent AEs when the hazards are nonproportional and the potential unobserved heterogeneity. Overall, the choice of the most appropriate model should be driven by the objective of the analysis. For instance, if interest is mainly in estimating hazard ratios, a shared frailty model with non-proportional hazards and restricted cubic splines is appropriate. When interest is in obtaining the unobserved heterogeneity estimates, inverse Gaussian with proportional hazard can serve the purpose. In the absence of time-dependent effects, using a shared frailty model with non-proportional hazards and restricted cubic splines should be avoided since it leads to biased estimates due to model overfitting.