Heterogeneity and event dependence in the analysis of sickness absence

Background Sickness absence (SA) is an important social, economic and public health issue. Identifying and understanding the determinants, whether biological, regulatory or, health services-related, of variability in SA duration is essential for better management of SA. The conditional frailty model (CFM) is useful when repeated SA events occur within the same individual, as it allows simultaneous analysis of event dependence and heterogeneity due to unknown, unmeasured, or unmeasurable factors. However, its use may encounter computational limitations when applied to very large data sets, as may frequently occur in the analysis of SA duration. Methods To overcome the computational issue, we propose a Poisson-based conditional frailty model (CFPM) for repeated SA events that accounts for both event dependence and heterogeneity. To demonstrate the usefulness of the model proposed in the SA duration context, we used data from all non-work-related SA episodes that occurred in Catalonia (Spain) in 2007, initiated by either a diagnosis of neoplasm or mental and behavioral disorders. Results As expected, the CFPM results were very similar to those of the CFM for both diagnosis groups. The CPU time for the CFPM was substantially shorter than the CFM. Conclusions The CFPM is an suitable alternative to the CFM in survival analysis with recurrent events, especially with large databases.


Background
Sickness absence (SA) is a complex phenomenon with great economic and social impact, and is considered a major occupational and public health issue [1][2][3]. SA is defined as a temporary situation in which a worker is unable to perform his/her usual work, either because of illness or injury [4]. The duration of SA affects the individual worker's quality of life, and have a great impact in his/her family, employer and society overall [5]. Knowing what factors are associated with how long a sickness absence episode lasts is of great importance in trying to reduce the SA duration. Sickness absence duration has been examined using a number of statistical techniques, most frequently survival analysis techniques [6][7][8]. Generally, survival studies analyze the time until the occurrence of a certain event of interest (e.g., death) [9]. However, in the context of sickness absence, some individuals may be more prone to experience multiple events, whether due to new illnesses or injuries, or recurrence of the same event. Repeated events can create within-subject correlation in event times [8,[10][11][12], arising from two sources: 1) event dependence; and 2) heterogeneity across individuals [11]. Event dependence occurs when the risk of a particular event depends on events previously experienced, whereas heterogeneity occurs when some individuals have a higher or lower risk of experiencing the events due to unknown, unmeasured or unmeasurable factors. Consequently, analytical approaches to modeling of sickness absence duration should take into account both event dependence and heterogeneity to avoid obtaining biased estimates of the parameters of interest [11,12]. The conditional frailty model (CFM) proposed by Box-Steffensmeier and De Boef [11], which can be viewed as an extension of the Cox model, simultaneously captures event dependence and heterogeneity [11], and has been used previously in political sciences research [12]. The computational applicability of the CFM maybe limited when dealing with very large datasets such a sickness absence registries, numbering hundreds of thousands or millions of individuals and/or episodes. For example, in Catalonia for the year 2007, the Catalan Institute of Medical and Health Evaluations (ICAMS, by its Spanish acronym) recorded 800,464 sickness episodes in 580,959 persons. It is well established that Poisson regression is a possible alternative to Cox regression [13,14]. Specifically, when a Cox model is confronted with computational limitations in analyzing large databases, a Poisson regression model maybe a reasonable alternative [15].
The goal of this paper is to propose a Poisson-based conditional frailty model (CFPM) that accounts for event dependence and heterogeneity for a large data analysis of sickness absence.

Methods
In the first section we will introduce the CFM and explain the proposed CFPM. In the following section we will explain the methods used to empirically compare the CFM and CFPM.

Conditional frailty and conditional frailty Poisson models
The conditional frailty model The CFM models the dependence of events and heterogeneity by stratifying the baseline hazard function by event order and incorporating random effects for individuals, respectively. The formulation of the model is in gap time so that time at risk is reset after each event. Let λ ik (t) the hazard of k-th event occurring in the i-th individual, the CFM is defined as where t k-1 is the time of occurrence of (k-1) th event, λ 0k (t − t k − 1 ) is the baseline hazard rate for the k-th event, β is the vector of parameters associated with covariates X and ω i is the random effect or "frailty" of the i-th individual that follows a gamma distribution. Considering right-censored failures, the parameters are interpreted as the log hazard ratio estimates associated with covariates for an event since the previous event, due to the gap time data structure incorporated in (t − t k − 1 ). More details about the CFM can be found in Box-Steffensmeier et al. [11,12].

The conditional frailty Poisson model
The CFPM considers λ Ã ik t ð Þ to be the hazard of k-th event at time t occurring in the i-th individual, as Following the piecewise exponential model formulation [16], the baseline hazard for the k-th event is defined as τ ∞ ] which are J non-zero, nonoverlapping intervals, with τ 1 = 0. The model captures event dependence (i.e., the dependence of the risk of a subsequent event on the occurrences of previous events) by allowing the baseline hazard to vary by event orders using an index "k" for the baseline hazard λ Ã 0k for the k th event. The heterogeneity is controlled by including an ω i random effect for the i-th individual. We consider a gamma distribution for the random effect.
Let n jik and d jik denote the time at risk and a covariate indicator of an event (d jik = 1) or non-event (d jik = 0), in the j-th time interval, for i-th individual and k-th event.
The proposed Poisson regression model assumes a Poisson distribution on d jik |ω i with the following loglinear mean, Note that the observed duration of SA ("time at risk under observation") is include as an offset term in the Poisson model which starts on the day of SA certification and ends on the day the worker returns to work or the day the worker's SA status becomes unknown (e.g., due to retirement, death, emigration), whichever is earlier.

Empirical comparison between conditional frailty models Description of the data
The CFM and CFPM were compared empirically using data from all episodes of non work-related SA that occurred in Catalonia (Spain) in 2007 (n = 800,464). Specifically, we assessed the influence of certain covariates of interest on SA duration, where the end of the episode of SA is considered the event of interest. A same individual may suffer more than one SA during the study period and therefore SAs are repeatable events.
The data were recorded through the Integrated Management System for Sickness Absence (SIGIT, by its Spanish acronym) at the ICAMS, a computerized registry and connected to all physicians in Catalonia responsible for certifying SA episodes.
For each episode, the diagnosis at case closure was available, coded according to the International Classification of Diseases, 10th Edition (ICD-10). We separately analyzed two large ICD-10 diagnosis groups selected to reflect frequent SA diagnoses (mental and behavioral disorders, codes F00-F99) and SA diagnoses with typically long duration times (neoplasms, codes C00-D48). Mental and behavioural disorders accounted for 3,268,075 days from 59,647 episodes in 53,238 individuals with a median duration of 10 days (25th percentile, 25 days; 75th percentile, 67 days); and neoplasms accounted for 516,676 days from 7,431 episodes in 6,975 individuals with median duration of 11 days (25th percentile, 28 days; 75 th percentile, 80 days). Approximately 10% of individuals had repeated events. For neoplasms, repeated events occur in 5% of individuals. Problems with convergence may emerge if there are too many event-order strata and/or a small number of episodes per stratum in both CFM [12] and CFPM. Therefore, we collapsed the event number so that any number of repeated episodes greater than 5 was set equal to 5.

Empirical comparison
We empirically compared the hazard ratio (HR) and 95% confidence intervals (95% CI) obtained by the CFM and the proposed CFPM. To define the baseline hazard function in the CFPM following the piecewise exponential model, we chopped time into 90-day-length nonoverlapping.
To explore the source of correlation existing in the data and to better assess the proposed CFPM as a reliable alternative to the CFM, we also computed the HR and 95% CI, with models which: 1) only take into account the event dependence; or 2) only take into account for heterogeneity. The former models were based on a gap time conditional model (CM) [17] which takes into account the event dependence by stratifying the baseline hazard function according to event order [18]. The CM is similar to CFM but does not include the individual random effect term. We also ran a conditional Poisson model (CPM) with the same expression as the CFPM, but without the random effect term by individual. With respect to models that control only for heterogeneity we considered a frailty model (FM), which is similar to the CFM but without stratifying the baseline hazard functions by event order and controls for the heterogeneity by including random effects for individuals.
Finally, we ran a Poisson model that takes into account only heterogeneity (FPM). The FPM presents a similar expression to the CFPM, but without the interaction between event order and the baseline hazard function.
Based on Box-Steffensmeier and De Boef [11] we hypothesized that when event dependence is strong, the event-dependence-only models (CM and CPM) should give estimates of the effects which are closed to the CFM, than models that do not control for the dependence of events (FM and FPM). Similarly, if heterogeneity is strong, the results of frailty models (FM and FPM) should be closer to the CFM than the models which only take into account dependence of events (CM and CPM). For both cases, i.e., regardless of the cause of correlation that predominates (event dependence or heterogeneity), we should expect that the estimates of CFPM will be closer to the CFM than the other models that only control for event dependence or only for heterogeneity. Thus, the comparison of the different models with the CFM serves to evaluate the suitability of CFPM when there is event dependence and/or heterogeneity.
The results between models were compared using the % relative bias (%RB) in point estimate and the % relative width difference in confidence interval (%RW), using the CFM as reference [15]. These measures are defined as where HR Other and HR CFM are the hazard ratio under a specific model (CM, FM, CPM, FPM, CFPM) and the CFM, respectively, and U Other and L Other are the respective upper and lower confidence interval endpoints under a specific model, and U CFM and L CFM represent the upper and lower confidence interval endpoints for the CFM, respectively. To compare the time for obtaining the parameter estimates from CFPM and CFM, their respective CPU time was measured on the Windows XP operating system on Intel® Core™2 CPU machine. The CFM and CFPM were fitted using R version 2.8.1 and Stata version 11, respectively. The Stata code for the CFPM, FPM and CPM is provided in the Additional file 1. Specifically we used the poisson command for the CPM, and the xtpoisson command for the CFPM and the FPM. Information about these commands can be found in the book written by Rabe-Hesketh S and Skrondal A [19].
This study was approved by the Parc de Salut Mar Clinical Research Ethical Committee of Barcelona, Spain (number 2011/4229/I).

Results
Tables 1 and 2 summarize the HR estimates and 95% CIs, for the six models, adjusted for covariates, for mental or behavioural disorders and neoplasms, respectively.
The six models we considered showed associations that were in the same direction (for a specific group of a covariate the HR were above 1 (or below 1) in the six models). Although the associations for all six models were in the same direction, there were differences in the magnitude of HR across the models. The CFPM results were very similar to those of the CFM for both diagnosis groups (Tables 3 and 4).
For neoplasms, the %RB for the CFPM ranged from 0% to 6.9% (absolute values), and the %RW from 0% to Table 1 Hazard ratio and 95% confidence interval for selected covariates from episodes of non work-related sickness absence occurred in Catalonia         CFM. In terms of %RW in general the CFPM presents the lowest percentages, but they can be up to 15%. In the case of CM and FM, the %RW can reach 30-40%. The CPU time for the CFPM was much shorter than the CFM. Using R version 2.8.1. on the Windows XP operating system on Intel® Core™2 CPU machine, the CFM took 124,877.67 (2,081.30 minutes) and 647.53 seconds (10.80 minutes) CPU time for mental health disorders and neoplasm data analysis, respectively. Using Stata version 11, on the same operating system and Table 3 Percentage of relative bias in point estimates and percentage of relative width differences in confidence intervals for selected covariates for episodes of non work-related sickness absence due to mental or behavioural disorders occurring in Catalonia (Spain) in 2007 Age at onset of absence episode (years)

Discussion
We proposed for the first time a Poisson-based conditional frailty model that accounts for both event dependence and heterogeneity. The CPU time required for the CFPM was substantially shorter than that for the CFM. In addition, as expected, the CFPM results were very similar to those of the CFM for both diagnosis groups. The similarity of results between the CFM and CFPM, and the differences noted with models that do not include event dependence and/or heterogeneity reinforces Age at onset of absence episode (years) the usefulness of the CFPM. In the case of neoplasms, the %RB for frailty models is closer to the CFM than for conditional models, suggesting that the dependence that dominates the data is heterogeneity. Conversely, in the case of mental health disorders, the %RB is smaller in the CM than that of the FM, indicating a greater influence of event dependence. The choice of time intervals may influence the model fit result. The key issue is to sufficiently approximate the underlying hazard function over time by a set of piecewiseconstant hazards in Poisson models. The shorter we make the time intervals of the piecewise-constant hazards the closer Poisson models get to Cox models. If data in each time interval become sparse by making the intervals shorter, however, parameter estimation becomes unstable, which in turn affect the estimation of the covariates' effects of interest. As Michael Friedman pointed out "precise practical guidelines for choosing the number of intervals have not been formulated" [20]. Choosing different cutpoints has a trade-off. It will be helpful to explore the form of the underlying hazard function and also assess the availability of data in each interval. In addition, performing a sensitivity analysis choosing different cut-points is useful for assessing changes in the parameters estimates of interest.
To avoid convergence problems we treated repeated episodes greater than 5 as equal to 5. The percentage of individuals with more than 5 repeated episodes for neoplasms is 0.52%, and for mental and behavioural disorders is 0.35%. Due to the very low percentages of individuals with more than 5 episodes, treating episodes greater than 5 as equal to 5 do not change the results.
A key advantage of the CFPM over the CFM is the reduction of computational time when analyzing large databases. This may be particularly important for institutions in countries where computers with high computational speed are not readily available. Currently, the CFM can only be run using R version 2.8.1. software [21]. The CFPM, though, can easily be run using other, statistical software such as Stata [22].

Conclusions
In summary, assuming that within-subject correlation is a result of event dependence will result in biased estimates when, in fact, it is due to heterogeneity in the data. Conversely, assuming correlation in event times is due to heterogeneity will also result in biased inferences when, in fact, the source is event dependence [12]. For this reason, we recommend incorporating both sources of correlation when fitting a model. To achieve this, the CFPM is an attractive alternative to the CFM in survival analysis with recurrent events, especially with large databases, such as those that may exist for the analysis of sickness absence data.