Research article | Open | Open Peer Review | Published:
Heterogeneity and event dependence in the analysis of sickness absence
BMC Medical Research Methodologyvolume 13, Article number: 114 (2013)
Sickness absence (SA) is an important social, economic and public health issue. Identifying and understanding the determinants, whether biological, regulatory or, health services-related, of variability in SA duration is essential for better management of SA. The conditional frailty model (CFM) is useful when repeated SA events occur within the same individual, as it allows simultaneous analysis of event dependence and heterogeneity due to unknown, unmeasured, or unmeasurable factors. However, its use may encounter computational limitations when applied to very large data sets, as may frequently occur in the analysis of SA duration.
To overcome the computational issue, we propose a Poisson-based conditional frailty model (CFPM) for repeated SA events that accounts for both event dependence and heterogeneity. To demonstrate the usefulness of the model proposed in the SA duration context, we used data from all non-work-related SA episodes that occurred in Catalonia (Spain) in 2007, initiated by either a diagnosis of neoplasm or mental and behavioral disorders.
As expected, the CFPM results were very similar to those of the CFM for both diagnosis groups. The CPU time for the CFPM was substantially shorter than the CFM.
The CFPM is an suitable alternative to the CFM in survival analysis with recurrent events, especially with large databases.
Sickness absence (SA) is a complex phenomenon with great economic and social impact, and is considered a major occupational and public health issue [1–3]. SA is defined as a temporary situation in which a worker is unable to perform his/her usual work, either because of illness or injury . The duration of SA affects the individual worker’s quality of life, and have a great impact in his/her family, employer and society overall . Knowing what factors are associated with how long a sickness absence episode lasts is of great importance in trying to reduce the SA duration. Sickness absence duration has been examined using a number of statistical techniques, most frequently survival analysis techniques [6–8]. Generally, survival studies analyze the time until the occurrence of a certain event of interest (e.g., death) . However, in the context of sickness absence, some individuals may be more prone to experience multiple events, whether due to new illnesses or injuries, or recurrence of the same event. Repeated events can create within-subject correlation in event times [8, 10–12], arising from two sources: 1) event dependence; and 2) heterogeneity across individuals . Event dependence occurs when the risk of a particular event depends on events previously experienced, whereas heterogeneity occurs when some individuals have a higher or lower risk of experiencing the events due to unknown, unmeasured or unmeasurable factors. Consequently, analytical approaches to modeling of sickness absence duration should take into account both event dependence and heterogeneity to avoid obtaining biased estimates of the parameters of interest [11, 12].
The conditional frailty model (CFM) proposed by Box-Steffensmeier and De Boef , which can be viewed as an extension of the Cox model, simultaneously captures event dependence and heterogeneity , and has been used previously in political sciences research . The computational applicability of the CFM maybe limited when dealing with very large datasets such a sickness absence registries, numbering hundreds of thousands or millions of individuals and/or episodes. For example, in Catalonia for the year 2007, the Catalan Institute of Medical and Health Evaluations (ICAMS, by its Spanish acronym) recorded 800,464 sickness episodes in 580,959 persons. It is well established that Poisson regression is a possible alternative to Cox regression [13, 14]. Specifically, when a Cox model is confronted with computational limitations in analyzing large databases, a Poisson regression model maybe a reasonable alternative .
The goal of this paper is to propose a Poisson-based conditional frailty model (CFPM) that accounts for event dependence and heterogeneity for a large data analysis of sickness absence.
In the first section we will introduce the CFM and explain the proposed CFPM. In the following section we will explain the methods used to empirically compare the CFM and CFPM.
Conditional frailty and conditional frailty Poisson models
The conditional frailty model
The CFM models the dependence of events and heterogeneity by stratifying the baseline hazard function by event order and incorporating random effects for individuals, respectively. The formulation of the model is in gap time so that time at risk is reset after each event. Let λ ik (t) the hazard of k-th event occurring in the i-th individual, the CFM is defined as
where tk-1 is the time of occurrence of (k-1)th event, λ 0k(t - t k - 1) is the baseline hazard rate for the k-th event, β is the vector of parameters associated with covariates X and ω i is the random effect or “frailty” of the i-th individual that follows a gamma distribution. Considering right-censored failures, the parameters are interpreted as the log hazard ratio estimates associated with covariates for an event since the previous event, due to the gap time data structure incorporated in (t - t k - 1). More details about the CFM can be found in Box-Steffensmeier et al.[11, 12].
The conditional frailty Poisson model
The CFPM considers to be the hazard of k-th event at time t occurring in the i-th individual, as
Following the piecewise exponential model formulation , the baseline hazard for the k-th event is defined as
with divisions of time scale into (τ 1, τ 2], (τ 2, τ 3], …, (τ J, τ ∞ ] which are J non-zero, nonoverlapping intervals, with τ 1 = 0. The model captures event dependence (i.e., the dependence of the risk of a subsequent event on the occurrences of previous events) by allowing the baseline hazard to vary by event orders using an index “k” for the baseline hazard for the k th event. The heterogeneity is controlled by including an ωi random effect for the i-th individual. We consider a gamma distribution for the random effect.
Let n jik and d jik denote the time at risk and a covariate indicator of an event (d jik = 1) or non-event (d jik = 0), in the j-th time interval, for i-th individual and k-th event. The proposed Poisson regression model assumes a Poisson distribution on d jik |ω i with the following log-linear mean,
Note that the observed duration of SA (“time at risk under observation”) is include as an offset term in the Poisson model which starts on the day of SA certification and ends on the day the worker returns to work or the day the worker’s SA status becomes unknown (e.g., due to retirement, death, emigration), whichever is earlier.
Empirical comparison between conditional frailty models
Description of the data
The CFM and CFPM were compared empirically using data from all episodes of non work-related SA that occurred in Catalonia (Spain) in 2007 (n = 800,464). Specifically, we assessed the influence of certain covariates of interest on SA duration, where the end of the episode of SA is considered the event of interest. A same individual may suffer more than one SA during the study period and therefore SAs are repeatable events.
The data were recorded through the Integrated Management System for Sickness Absence (SIGIT, by its Spanish acronym) at the ICAMS, a computerized registry and connected to all physicians in Catalonia responsible for certifying SA episodes.
For each episode, the diagnosis at case closure was available, coded according to the International Classification of Diseases, 10th Edition (ICD-10). We separately analyzed two large ICD-10 diagnosis groups selected to reflect frequent SA diagnoses (mental and behavioral disorders, codes F00-F99) and SA diagnoses with typically long duration times (neoplasms, codes C00-D48). Mental and behavioural disorders accounted for 3,268,075 days from 59,647 episodes in 53,238 individuals with a median duration of 10 days (25th percentile, 25 days; 75th percentile, 67 days); and neoplasms accounted for 516,676 days from 7,431 episodes in 6,975 individuals with median duration of 11 days (25th percentile, 28 days; 75th percentile, 80 days). Approximately 10% of individuals had repeated events. For neoplasms, repeated events occur in 5% of individuals. Problems with convergence may emerge if there are too many event-order strata and/or a small number of episodes per stratum in both CFM  and CFPM. Therefore, we collapsed the event number so that any number of repeated episodes greater than 5 was set equal to 5.
Other covariates of interest were sex, age (16–28, 29–35, 36–45, >45 years), economic activity (11 branches), Catalonian health region, entity responsible for case management (National Institute of Social Security or a mutual insurance company), and employment status (salaried or self-employed).
We empirically compared the hazard ratio (HR) and 95% confidence intervals (95% CI) obtained by the CFM and the proposed CFPM. To define the baseline hazard function in the CFPM following the piecewise exponential model, we chopped time into 90-day-length non-overlapping.
To explore the source of correlation existing in the data and to better assess the proposed CFPM as a reliable alternative to the CFM, we also computed the HR and 95% CI, with models which: 1) only take into account the event dependence; or 2) only take into account for heterogeneity. The former models were based on a gap time conditional model (CM)  which takes into account the event dependence by stratifying the baseline hazard function according to event order . The CM is similar to CFM but does not include the individual random effect term. We also ran a conditional Poisson model (CPM) with the same expression as the CFPM, but without the random effect term by individual. With respect to models that control only for heterogeneity we considered a frailty model (FM), which is similar to the CFM but without stratifying the baseline hazard functions by event order and controls for the heterogeneity by including random effects for individuals. Finally, we ran a Poisson model that takes into account only heterogeneity (FPM). The FPM presents a similar expression to the CFPM, but without the interaction between event order and the baseline hazard function.
Based on Box-Steffensmeier and De Boef  we hypothesized that when event dependence is strong, the event-dependence-only models (CM and CPM) should give estimates of the effects which are closed to the CFM, than models that do not control for the dependence of events (FM and FPM). Similarly, if heterogeneity is strong, the results of frailty models (FM and FPM) should be closer to the CFM than the models which only take into account dependence of events (CM and CPM). For both cases, i.e., regardless of the cause of correlation that predominates (event dependence or heterogeneity), we should expect that the estimates of CFPM will be closer to the CFM than the other models that only control for event dependence or only for heterogeneity. Thus, the comparison of the different models with the CFM serves to evaluate the suitability of CFPM when there is event dependence and/or heterogeneity.
The results between models were compared using the % relative bias (%RB) in point estimate and the % relative width difference in confidence interval (%RW), using the CFM as reference . These measures are defined as
where HROther and HRCFM are the hazard ratio under a specific model (CM, FM, CPM, FPM, CFPM) and the CFM, respectively, and UOther and LOther are the respective upper and lower confidence interval endpoints under a specific model, and UCFM and LCFM represent the upper and lower confidence interval endpoints for the CFM, respectively.
To compare the time for obtaining the parameter estimates from CFPM and CFM, their respective CPU time was measured on the Windows XP operating system on Intel® Core™2 CPU machine. The CFM and CFPM were fitted using R version 2.8.1 and Stata version 11, respectively. The Stata code for the CFPM, FPM and CPM is provided in the Additional file 1. Specifically we used the poisson command for the CPM, and the xtpoisson command for the CFPM and the FPM. Information about these commands can be found in the book written by Rabe-Hesketh S and Skrondal A .
This study was approved by the Parc de Salut Mar Clinical Research Ethical Committee of Barcelona, Spain (number 2011/4229/I).
The six models we considered showed associations that were in the same direction (for a specific group of a covariate the HR were above 1 (or below 1) in the six models). Although the associations for all six models were in the same direction, there were differences in the magnitude of HR across the models. The CFPM results were very similar to those of the CFM for both diagnosis groups (Tables 3 and 4).
For neoplasms, the %RB for the CFPM ranged from 0% to 6.9% (absolute values), and the %RW from 0% to 7.1%. For the FPM and FM, these results were not as close to the CFM as the CFPM (10.3% in the %RB and 14.3% in the %RW for age > 45 in the FPM, and -17.9% in the %RB for extraterritorial agencies, and 33.3% in the %RW for Camp de Tarragona Health Region in the FM). The results for the CM and CPM were further apart from the CFM as compared to the FPM and FM, in some cases %RB reaching the 20%, 40% or, in the case of CM, 60% and%RW exceeding the 20%.
For mental or behavioural disorders, the CFPM, CPM and FPM behaved similar and were better than the CM and FM, and the CFPM behaved very closely to the CFM. In terms of %RW in general the CFPM presents the lowest percentages, but they can be up to 15%. In the case of CM and FM, the %RW can reach 30-40%.
The CPU time for the CFPM was much shorter than the CFM. Using R version 2.8.1. on the Windows XP operating system on Intel® Core™2 CPU machine, the CFM took 124,877.67 (2,081.30 minutes) and 647.53 seconds (10.80 minutes) CPU time for mental health disorders and neoplasm data analysis, respectively. Using Stata version 11, on the same operating system and hardware, the CFPM took 260.56 (4.34 minutes) and 35.77 (0.60 minutes) seconds for mental health disorders and neoplasm, respectively.
We proposed for the first time a Poisson-based conditional frailty model that accounts for both event dependence and heterogeneity. The CPU time required for the CFPM was substantially shorter than that for the CFM. In addition, as expected, the CFPM results were very similar to those of the CFM for both diagnosis groups.
The similarity of results between the CFM and CFPM, and the differences noted with models that do not include event dependence and/or heterogeneity reinforces the usefulness of the CFPM. In the case of neoplasms, the %RB for frailty models is closer to the CFM than for conditional models, suggesting that the dependence that dominates the data is heterogeneity. Conversely, in the case of mental health disorders, the %RB is smaller in the CM than that of the FM, indicating a greater influence of event dependence.
The choice of time intervals may influence the model fit result. The key issue is to sufficiently approximate the underlying hazard function over time by a set of piecewise-constant hazards in Poisson models. The shorter we make the time intervals of the piecewise-constant hazards the closer Poisson models get to Cox models. If data in each time interval become sparse by making the intervals shorter, however, parameter estimation becomes unstable, which in turn affect the estimation of the covariates’ effects of interest. As Michael Friedman pointed out “precise practical guidelines for choosing the number of intervals have not been formulated” . Choosing different cut-points has a trade-off. It will be helpful to explore the form of the underlying hazard function and also assess the availability of data in each interval. In addition, performing a sensitivity analysis choosing different cut-points is useful for assessing changes in the parameters estimates of interest.
To avoid convergence problems we treated repeated episodes greater than 5 as equal to 5. The percentage of individuals with more than 5 repeated episodes for neoplasms is 0.52%, and for mental and behavioural disorders is 0.35%. Due to the very low percentages of individuals with more than 5 episodes, treating episodes greater than 5 as equal to 5 do not change the results.
A key advantage of the CFPM over the CFM is the reduction of computational time when analyzing large databases. This may be particularly important for institutions in countries where computers with high computational speed are not readily available. Currently, the CFM can only be run using R version 2.8.1. software . The CFPM, though, can easily be run using other, statistical software such as Stata .
In summary, assuming that within-subject correlation is a result of event dependence will result in biased estimates when, in fact, it is due to heterogeneity in the data. Conversely, assuming correlation in event times is due to heterogeneity will also result in biased inferences when, in fact, the source is event dependence . For this reason, we recommend incorporating both sources of correlation when fitting a model. To achieve this, the CFPM is an attractive alternative to the CFM in survival analysis with recurrent events, especially with large databases, such as those that may exist for the analysis of sickness absence data.
- ICAMS (by its Spanish acronym):
Catalan Institute of Medical and Health Evaluations
Conditional frailty model
Poisson-based conditional frailty model
- SIGIT (by its Spanish acronym):
Integrated management system for sickness absence
International classification of diseases 10th edition
Conditional poisson model
Frailty poisson model
% Relative bias
% Relative width difference in confidence interval
Central processing unit.
Moncada S, Navarro A, Cortès I, Molinero E, Artazcoz L: Sickness leave, administrative category and gender: results from the “Casa Gran” project. Scand J Public Health Suppl. 2002, 59: 26-33.
Whitaker SC: The management of sickness absence. Occup Environ Med. 2001, 58: 420-424. 10.1136/oem.58.6.420.
Gimeno D, Benavides FG, Benach J, Amick BC: Distribution of sickness absence in the European union countries. Occup Environ Med. 2004, 61: 867-869. 10.1136/oem.2003.010074.
Fábrega O, Company A: La gestión de la incapacidad temporal en Gestión de Atención Primaria. Curso a distancia de Gestión de Atención Primaria. Edited by: Gené Badia J, Grego Recasens JM. 1999, Barcelona: semFYC y Formación Continuada Les Hueres – Universitat de Barcelona. Fundació Bosch Gimpera, 95-103.
Alexanderson K: Sickness absence: a review of performed studies with focused on levels of exposures and theories utilized. Scand J Soc Med. 1998, 26: 241-249.
Navarro A, J Reis R, Martin M: Some alternatives in the statistical analysis of sickness absence. Am J Ind Med. 2009, 52: 811-816. 10.1002/ajim.20739.
González JR, Peña EA: Non-parametric estimation of survival function for recurrent events data. Rev Esp Salud Publica. 2004, 78: 189-199. 10.1590/S1135-57272004000200006.
Roelen CA, Koopmans PC, Anema JR, van der Beek AJ: Recurrence of medically certified sickness absence according to diagnosis: a sickness absence register study. J Occup Rehabil. 2010, 20: 113-121. 10.1007/s10926-009-9226-8.
Barceló MA: Marginal and conditional models in multivariate survival analysis. Gac Sanit. 2002, 16: 59-68.
Christensen KB, Andersen PK, Smith-Hansen L, Nielsen ML, Kristensen TS: Analyzing sickness absence with statistical models for survival data. Scand J Work Environ Health. 2007, 33: 233-239. 10.5271/sjweh.1132.
Box-Steffensmeier JM, De Boef S: Repeated events survival models: the conditional frailty model. Stat Med. 2006, 25: 3518-3533. 10.1002/sim.2434.
Box-Steffensmeier JM, De Boef S, Joyce KA: Event dependence and heterogeneity in duration models: the conditional frailty model. Polit Anal. 2007, 15: 237-256.
Aitkin M, Clayton D: The fitting of exponential, weibull and extreme value distributions to complex censored survival data using GLIM. Appl Statist. 1980, 29: 156-163. 10.2307/2986301.
Laird N, Olivier D: Covariance analysis of censored survival data using log-linear analysis techniques. J Am Stat Assoc. 1981, 76: 231-240. 10.1080/01621459.1981.10477634.
Callas PW, Pastides H, Hosmer DW: Empirical comparisons of proportional hazards, Poisson, and logistic regression modeling of occupational cohort data. Am J Ind Med. 1998, 33: 33-47. 10.1002/(SICI)1097-0274(199801)33:1<33::AID-AJIM5>3.0.CO;2-X.
Breslow N: Covariance analysis of censored survival data. Biometrics. 1974, 30: 89-99. 10.2307/2529620.
Therneau TM, Grambsch PM: Modeling survival data: extending the Cox model. statistics for biology and health. 2000, New York: Springer
Kelly PJ, Lim LL: Survival analysis for recurrent event data: an application to childhood infectious diseases. Stat Med. 2000, 19: 13-33. 10.1002/(SICI)1097-0258(20000115)19:1<13::AID-SIM279>3.0.CO;2-5.
Rabe-Hesketh S, Skrondal A: Multilevel and longitudinal modelling using Stata. 2008, Stata Press, 4905 Lakeway Drive, College Station, Texas, 2
Friedman M: Piecewise exponential models for survival data with covariates. Ann Stat. 1982, 10: 101-103. 10.1214/aos/1176345693.
R Development Corp Team: R: a language and environment for statistical computing. 2003, Vienna: R Fundation for Statistical Computing
StataCorp: College station. Stata statistical software: release 12. 2011, Texas: StataCorp MP
The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/13/114/prepub
This work was supported by grants from the Fondo de Investigación Sanitaria [PI11/01470], Operating Grant of the Canadian Institute of Health Research entitled 'Statistical Methods for Epidemiologic Investigations’ and the Institut Català d’Avaluacions Mèdiques. The authors would like to thanks the Institut Català d’Avaluacions Mèdiques i Sanitàries (ICAMS) for providing the database.
The authors declare that they have no competing interests.
ITR, YY, and JMM conceived the idea and worked on in the statistical and public health issues of the method, conducted the data analysis, and wrote the initial draft. DG, GD, and FGB gave critiques from public-health points of view and study design. JJ, RM, and CA provided discussion based on sickness absence. CA assisted the data analyses. All authors reviewed numerous drafts of the manuscript, are in agreement with the text and findings, and we have all approved this final version.
Yutaka Yasui and José Miguel Martínez contributed equally to this work.